Sources of Information Waste in Neuroimaging:
Mishandling Structures, Thinking Dichotomously,
and Over-Reducing Data
Gang Chen,a,* Paul A. Taylor,a Joel Stoddard,b Robert W. Cox,a Peter A. Bandettini,c and Luiz Pessoad
aScientific and Statistical Computing Core, NIMH, National Institutes of Health, Bethesda, MD, USA
bDepartment of Psychiatry, University of Colorado, Aurora, CO, USA
cSection on Functional Imaging Methods, NIMH, National Institutes of Health, Bethesda, MD, USA
dDepartment of Psychology, Department of Electrical and Computer Engineering, and Maryland Neuroimaging Center,
University of Maryland, College Park, MD, USA
Chen etal. This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 IGO License, which permits the copy
and redistribution of the material in any medium or format provided the original work and author are properly credited. In any reproduction of this article there should not be any
suggestion that APERTURE NEURO or this article endorse any speci c organization or products. The use of the APERTURE NEURO logo is not permitted. This notice should be
preserved along with the article’s original URL. Open access logo and text by PLoS, under the Creative Commons Attribution-Share Alike 4.0 Unported license.
: 2022, Volume 2 - 1 - CC-BY: © Chen et al.
Neuroimaging relies on separate statistical inferences at tens of thousands of spatial locations. Such massively univariate analysis
typically requires an adjustment for multiple testing in an attempt to maintain the family-wise error rate at a nominal level of 5%.
First, we examine three sources of substantial information loss that are associated with the common practice under the massively
univariate framework: (a) the hierarchical data structures (spatial units and trials) are not well maintained in the modeling process;
(b) the adjustment for multiple testing leads to an artificial step of strict thresholding; (c) information is excessively reduced during
both modeling and result reporting. These sources of information loss have far-reaching impacts on result interpretability as well
as reproducibility in neuroimaging. Second, to improve inference efficiency, predictive accuracy, and generalizability, we propose
a Bayesian multilevel modeling framework that closely characterizes the data hierarchies across spatial units and experimental
trials. Rather than analyzing the data in a way that first creates multiplicity and then resorts to a post hoc solution to address it,
we suggest directly incorporating the cross-space information into one single model under the Bayesian framework (so there is
no multiplicity issue). Third, regardless of the modeling framework one adopts, we make four actionable suggestions to alleviate
information waste and to improve reproducibility: (1) model data hierarchies, (2) quantify effects, (3) abandon strict dichotomiza-
tion, and (4) report full results. We provide examples for all of these points using both demo and real studies, including the recent
Neuroimaging Analysis Replication and Prediction Study (NARPS).
Keywords: Bayesian multilevel modeling; data hierarchy; dichotomization; effect magnitude; information waste; multiple testing problem; result reporting
Correspondence: Gang Chen, Scientific and Statistical Computing Core, NIMH, National Institutes of Health, Bethesda, MD, USA. Email: gangchen@mail.nih.gov
Received: May 12, 2021
Accepted: November 08, 2021
DOI: 10.52294/2e179dbf-5e37-4338-a639-9ceb92b055ea
Functional magnetic resonance imaging (FMRI) is a
mainstay technique of human neuroscience, which al-
lows the study of the neural correlates of many functions,
including perception, emotion, and cognition. The basic
spatial unit of FMRI data is a voxel ranging from 1 to 3 mm
on each side. As data are collected across time when
a participant performs tasks or remains at “rest,” FMRI
datasets contain a time series at each voxel. Typically,
tens of thousands of voxels are analyzed simultaneously.
Such a “divide and conquer” approach through massive-
ly univariate analysis necessitates some form of multiple
testing adjustment via procedures based on Bonferroni’s
inequality or false discovery rate.
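The contrast between these two families of adjustment can be illustrated with a toy example (the p-values below are hypothetical, and this is a bare sketch rather than a production implementation):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject where p < alpha / m; controls the family-wise error rate."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose sorted p-value satisfies p <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]
print(bonferroni(pvals))          # [True, True, False, False, False, False]
print(benjamini_hochberg(pvals))  # [True, True, True, False, False, False]
```

With six tests, Bonferroni compares each p-value against 0.05/6 ≈ 0.0083 and rejects two, while the FDR procedure's rank-dependent thresholds admit a third: the usual trade-off between family-wise error control and discovery-rate control.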
Statisticians classically asked the wrong question –
and were willing to answer with a lie. They asked “Are
the effects of A and B different?” and they were will-
ing to answer “no.”
All we know about the world teaches us that the ef-
fects of A and B are always different – in some decimal
place for any A and B. Thus asking “are the effects
different?” is foolish.
John W. Tukey, “The Philosophy of Multiple
Comparisons,” Statistical Science
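Tukey's point can be illustrated numerically: hold a trivially small true difference fixed, and the dichotomous verdict flips purely as a function of sample size. The effect size, noise level, and sample sizes below are arbitrary illustrations, and sampling noise is ignored for simplicity (the sample mean is taken to equal the true mean):

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def one_sample_z_p(effect, sd, n):
    """p-value for a true mean difference `effect` in noise of SD `sd`,
    idealized so that the observed mean equals the true mean."""
    z = effect / (sd / math.sqrt(n))
    return two_sided_p_from_z(z)

# A "difference" of 0.02 units (noise SD = 1) is never exactly zero...
for n in (100, 10_000, 1_000_000):
    p = one_sample_z_p(effect=0.02, sd=1.0, n=n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n={n:>9,}  p={p:.4f}  -> {verdict}")
```

The same fixed, negligible effect is "not significant" at n = 100 (p ≈ 0.84) but "significant" at n = 10,000 (p ≈ 0.046): the binary answer reflects the amount of data, not the scientific relevance of the effect.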
We start with a brief refresher of the conventional sta-
tistical framework typically adopted in neuroimaging.
Statistical testing begins by assuming the null hypothesis,
then rejecting it in favor of the alternative hypothesis
if the current data for the effect of interest (e.g., task A
vs. task B) or potentially more extreme observations are
unlikely to occur under the assumption that the effect is
absolutely zero. Because the basic data unit is the voxel,
one faces the problem of performing tens of thousands
of inferences across space simultaneously. As the spatial
units are not independent of one another, adopting an
adjustment such as Bonferroni’s is unreasonably conser-
vative. Instead, the  eld has gradually settled into em-
ploying a cluster-based approach: what is the size of a
spatial cluster that would be unlikely to occur under the
null scenario?
Accordingly, a two-step procedure is utilized: first
threshold the voxelwise statistical evidence at a par-
ticular (or a range of) voxelwise p-value (e.g., 0.001)
and then consider only contiguous clusters of evidence (Fig. 1). Several adjustment methods have been
developed to address multiple testing by leveraging
the spatial relatedness among neighboring voxels.
The stringency of the procedures has been extensively
debated over the past decades, with the overall prob-
ability of having clusters of a minimum spatial extent
given a null effect estimated by two common approach-
es: a parametric method3,4 and a permutation-based
adjustment.5 For the former, recent recommendations
have resulted in the convention of adopting a primary
threshold of voxelwise p = 0.001 followed by cluster
size determination6,7; for the latter, the threshold is
based on the integration between a range of statistical
evidence and the associated spatial extent.5
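A minimal sketch of the two-step cluster procedure on a toy 2D "p-value map" follows. The map, the primary threshold, and the cluster-size cutoff are all illustrative; in real pipelines the cutoff is derived from the null distribution of cluster sizes (parametrically or by permutation), not chosen by hand:

```python
from collections import deque

def clusters_surviving(pmap, p_thresh=0.001, min_size=3):
    """Step 1: threshold voxels at p < p_thresh.
    Step 2: keep only 4-connected clusters with >= min_size voxels."""
    nrow, ncol = len(pmap), len(pmap[0])
    supra = {(i, j) for i in range(nrow) for j in range(ncol)
             if pmap[i][j] < p_thresh}
    seen, clusters = set(), []
    for seed in supra:
        if seed in seen:
            continue
        # Breadth-first search to collect one connected component.
        comp, queue = [], deque([seed])
        seen.add(seed)
        while queue:
            i, j = queue.popleft()
            comp.append((i, j))
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (i + di, j + dj)
                if nb in supra and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(comp) >= min_size:
            clusters.append(sorted(comp))
    return clusters

# Toy 4x4 map: one 3-voxel cluster survives; an isolated voxel does not.
pmap = [
    [0.0002, 0.0004, 0.5, 0.9],
    [0.0007, 0.2,    0.8, 0.7],
    [0.6,    0.9,    0.3, 0.0001],  # isolated suprathreshold voxel
    [0.4,    0.7,    0.5, 0.6],
]
print(clusters_surviving(pmap))  # -> [[(0, 0), (0, 1), (1, 0)]]
```

Note how the isolated voxel at (2, 3), despite stronger voxelwise evidence (p = 0.0001) than any voxel in the surviving cluster, is discarded entirely: the spatial-extent criterion is where the information loss discussed below enters.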
Problems of multiple testing adjustments
At least  ve limitations are associated with multiple
testing adjustments leveraged through spatial extent.8
1. Conceptual inconsistency. Consider that the sta-
ples of neuroimaging research are the maps of
statistical evidence and associated tables. Both
typically present only the statistic (e.g., t) values.
However, this change of focus is inconsistent with
cluster-based inference: after multiple testing
adjustment, the proper unit of inference is the
cluster, not the voxel. Once “significant” clusters
are determined, one should only speak of clus-
ters and the voxels inside each cluster should no
longer be considered meaningful inferentially.
In other words, the statistical evidence for each
surviving cluster is deemed at the “significance”
level of 0.05, and the voxelwise statistic values
within it no longer carry individual inferential meaning.
Conventional neuroimaging inferences follow the null
hypothesis signi cance testing framework, where the
decision procedure dichotomizes the available evidence
into two categories at the end. Thus, one part of the ev-
idence survives an adjusted threshold at the whole-brain
level and is considered statistically significant (informally
interpreted as a “true” effect); the other part is ignored
(often misinterpreted as “not true”) and by convention
omitted and hidden from public view (i.e., the file drawer problem).
A recent study1 (referred to as NARPS hereafter) of-
fers a salient opportunity for the neuroimaging com-
munity to re ect about common practices in statistical
modeling and the communication of study  ndings.
The study recruited 70 teams charged with the task
of analyzing a particular FMRI dataset and reporting
results; the teams simply were asked to follow data
analyses routinely employed in their labs at the whole-
brain voxel level (but note that nine specific research
hypotheses were restricted to only three brain regions).
NARPS found large variability in reported decisions,
which were deemed to be sensitive to analysis choices
ranging from preprocessing steps (e.g., spatial smoothing, head motion correction) to the specific approach used to handle multiple testing. Based on these findings, NARPS outlined potential recommendations for the field of neuroimaging research.
Despite useful lessons revealed by the NARPS investigation, the project also exemplifies the common approach in neuroimaging of generating categorical inferential conclusions, as encapsulated by the “significant versus nonsignificant” maxim. In this context, we address the following questions:
1. Are conventional multiple testing adjustment
methods informationally wasteful?
2. NARPS suggested that there was “substantial vari-
ability” in reported results across teams of investi-
gators analyzing the same dataset. Is this conclusion
dependent, at least in part, on the common practice
of ignoring spatial hierarchy at the global level and
drawing inferences binarily (i.e., “significant” vs.
“nonsignificant”)?
3. What changes can the neuroimaging field make
in modeling and result reporting to improve
reproducibility?
In this context, we consider inferential procedures not
strictly couched in the standard null hypothesis significance testing framework. Rather, we suggest that multilevel models, particularly when constructed within a
Bayesian framework, provide powerful tools for the anal-
ysis of neuroimaging studies given the data’s inherent
hierarchical structure. As our paper focuses on hierarchi-
cal modeling and dichotomous thinking in neuroimag-
ing, we do not discuss the broader literature on Bayesian
methods applied to FMRI.2
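As a preview of the kind of hierarchy meant here, consider a minimal two-level Gaussian sketch (our illustrative notation, not the full model developed by the authors). For spatial units s = 1, ..., S, let y_s denote the effect estimate at unit s:

```latex
\begin{aligned}
y_s &= b_0 + \theta_s + \epsilon_s, \\
\theta_s &\sim \mathcal{N}(0, \tau^2), \qquad
\epsilon_s \sim \mathcal{N}(0, \sigma_s^2),
\end{aligned}
```

where b_0 is the population-level effect, the unit-specific deviations \theta_s share a common distribution with cross-unit variability \tau, and \sigma_s captures within-unit uncertainty. Because all S units are estimated jointly in one model, with their effects partially pooled toward b_0, no separate inference is performed per unit and thus no post hoc multiplicity adjustment arises.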