
MASSIVELY UNIVARIATE ANALYSIS AND MULTIPLE TESTING
We start with a brief refresher of the conventional statistical framework typically adopted in neuroimaging. Statistical testing starts from the null hypothesis, which is rejected in favor of the alternative hypothesis if the observed data for the effect of interest (e.g., task A vs. task B), or more extreme observations, would be unlikely to occur under the assumption that the effect is exactly zero. Because the basic data unit is the voxel, one faces the problem of performing tens of thousands of inferences across space simultaneously. As the spatial units are not independent of one another, an adjustment such as Bonferroni’s is unreasonably conservative. Instead, the field has gradually settled on a cluster-based approach: how large must a spatial cluster be for it to be unlikely to occur under the null scenario?
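As a rough illustration of how stringent such an adjustment becomes, the sketch below computes the Bonferroni-corrected voxelwise threshold for an assumed 100,000 simultaneous tests; the voxel count and significance level are illustrative figures, not values taken from any particular study.

    # Illustrative only: how a Bonferroni adjustment scales with the number of voxels.
    from scipy.stats import norm

    alpha = 0.05          # desired family-wise error rate
    n_voxels = 100_000    # assumed number of voxels tested simultaneously

    alpha_per_voxel = alpha / n_voxels            # Bonferroni-adjusted per-voxel alpha
    z_nominal = norm.isf(alpha / 2)               # two-sided threshold with no adjustment (~1.96)
    z_bonferroni = norm.isf(alpha_per_voxel / 2)  # two-sided threshold after adjustment (~5.0)

    print(f"per-voxel alpha: {alpha_per_voxel:.1e}")
    print(f"z threshold, unadjusted: {z_nominal:.2f}, Bonferroni: {z_bonferroni:.2f}")

Because neighboring voxels are spatially correlated, this per-voxel penalty is far harsher than the data warrant, which motivates the cluster-based alternative described next.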
Accordingly, a two-step procedure is utilized: first, threshold the voxelwise statistical evidence at a particular (or a range of) voxelwise p-value (e.g., 0.001), and then consider only contiguous clusters of evidence (Fig. 1). Several adjustment methods have been developed to address multiple testing by leveraging the spatial relatedness among neighboring voxels. The stringency of these procedures has been extensively debated over the past decades, with the overall probability of observing clusters of a minimum spatial extent under a null effect estimated by two common approaches: a parametric method3,4 and a permutation-based adjustment.5 For the former, recent recommendations have resulted in the convention of adopting a primary threshold of voxelwise p = 0.001 followed by cluster size determination6,7; for the latter, the threshold is based on the integration between a range of statistical evidence and the associated spatial extent.5
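To make the two steps concrete, the sketch below applies a cluster-forming threshold to a synthetic statistic map and then labels contiguous suprathreshold voxels; the degrees of freedom, the cluster-forming p-value, and the cluster-extent cutoff are placeholders, since in practice the cutoff would be derived from a parametric or permutation-based null model rather than chosen by hand.

    # Illustrative only: voxelwise thresholding followed by cluster labeling.
    import numpy as np
    from scipy import ndimage
    from scipy.stats import t as t_dist

    rng = np.random.default_rng(0)
    tmap = rng.standard_normal((40, 48, 40))   # stand-in for a voxelwise t-statistic map
    df = 30                                    # assumed degrees of freedom
    p_voxel = 0.001                            # cluster-forming (primary) threshold
    k_min = 20                                 # assumed cluster-extent cutoff, in voxels

    t_thresh = t_dist.isf(p_voxel, df)         # step 1: one-sided t threshold at p = 0.001
    supra = tmap > t_thresh
    labels, n_clusters = ndimage.label(supra)  # step 2: group contiguous suprathreshold voxels
    sizes = ndimage.sum(supra, labels, index=range(1, n_clusters + 1))
    surviving = [i + 1 for i, s in enumerate(sizes) if s >= k_min]

    print(f"{n_clusters} clusters found; {len(surviving)} survive the extent cutoff k >= {k_min}")

Only the clusters whose extent exceeds the cutoff are reported as statistically significant at the whole-brain level.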
Problems of multiple testing adjustments
At least five limitations are associated with multiple testing adjustments that leverage spatial extent.8
1. Conceptual inconsistency. Consider that the staples of neuroimaging research are the maps of statistical evidence and the associated tables. Both typically present only the statistic (e.g., t) values. However, this focus is inconsistent with cluster-based inference: after multiple testing adjustment, the proper unit of inference is the cluster, not the voxel. Once “significant” clusters are determined, one should speak only of clusters, and the voxels inside each cluster should no longer be considered meaningful inferentially. In other words, the statistical evidence for each surviving cluster is deemed at the “significance” level of 0.05, and the voxelwise statistic values within it should not be interpreted individually.
Conventional neuroimaging inferences follow the null hypothesis significance testing framework, where the decision procedure dichotomizes the available evidence into two categories at the end. Thus, one part of the evidence survives an adjusted threshold at the whole-brain level and is considered statistically significant (informally interpreted as a “true” effect); the other part is ignored (often misinterpreted as “not true”) and by convention omitted and hidden from public view (i.e., the file drawer problem).
A recent study1 (referred to as NARPS hereafter) offers a salient opportunity for the neuroimaging community to reflect on common practices in statistical modeling and the communication of study findings. The study recruited 70 teams charged with the task of analyzing a particular FMRI dataset and reporting results; the teams were simply asked to follow the data analyses routinely employed in their labs at the whole-brain voxel level (although nine specific research hypotheses were restricted to only three brain regions). NARPS found large variability in the reported decisions, which were deemed to be sensitive to analysis choices ranging from preprocessing steps (e.g., spatial smoothing, head motion correction) to the specific approach used to handle multiple testing. Based on these findings, NARPS outlined potential recommendations for the field of neuroimaging research.
Despite the useful lessons revealed by the NARPS investigation, the project also exemplifies the common approach in neuroimaging of generating categorical inferential conclusions, as encapsulated by the “significant versus nonsignificant” maxim. In this context, we address the following questions:
1. Are conventional multiple testing adjustment methods informationally wasteful?
2. NARPS suggested that there was “substantial variability” in reported results across teams of investigators analyzing the same dataset. Is this conclusion dependent, at least in part, on the common practice of ignoring spatial hierarchy at the global level and drawing binary inferences (i.e., “significant” vs. “nonsignificant”)?
3. What changes can the neuroimaging field make in modeling and result reporting to improve reproducibility?
In this context, we consider inferential procedures not strictly couched in the standard null hypothesis significance testing framework. Rather, we suggest that multilevel models, particularly when constructed within a Bayesian framework, provide powerful tools for the analysis of neuroimaging studies given the data’s inherent hierarchical structure. As our paper focuses on hierarchical modeling and dichotomous thinking in neuroimaging, we do not discuss the broader literature on Bayesian methods applied to FMRI.2
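As a concrete, if minimal, illustration of the kind of hierarchical structure such models exploit, the sketch below fits a Bayesian partial-pooling model of region-level effects with PyMC; the synthetic data, the region count, the priors, and the choice of library are assumptions made for illustration and do not represent the specific models developed in this paper.

    # Illustrative only: a minimal Bayesian multilevel (partial-pooling) model
    # of region-level effect estimates pooled across an assumed set of regions.
    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(1)
    n_subjects, n_regions = 20, 8
    region_idx = np.tile(np.arange(n_regions), n_subjects)       # region label per observation
    effects = rng.normal(0.2, 0.5, size=n_subjects * n_regions)  # stand-in effect estimates

    with pm.Model():
        mu = pm.Normal("mu", 0.0, 1.0)                           # population-level effect
        tau = pm.HalfNormal("tau", 1.0)                          # cross-region variability
        theta = pm.Normal("theta", mu, tau, shape=n_regions)     # region effects, partially pooled
        sigma = pm.HalfNormal("sigma", 1.0)                      # residual noise
        pm.Normal("y", theta[region_idx], sigma, observed=effects)
        idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

The posterior distribution for each region then provides graded evidence rather than a binary significant-versus-nonsignificant verdict.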