R E S E A R C H A R T I C L E

Sources of Information Waste in Neuroimaging: Mishandling Structures, Thinking Dichotomously, and Over-Reducing Data

Gang Chen,a,* Paul A. Taylor,a Joel Stoddard,b Robert W. Cox,a Peter A. Bandettini,c and Luiz Pessoad

aScientific and Statistical Computing Core, NIMH, National Institutes of Health, Bethesda, MD, USA
bDepartment of Psychiatry, University of Colorado, Aurora, CO, USA
cSection on Functional Imaging Methods, NIMH, National Institutes of Health, Bethesda, MD, USA
dDepartment of Psychology, Department of Electrical and Computer Engineering, and Maryland Neuroimaging Center, University of Maryland, College Park, MD, USA

Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 IGO License, which permits the copy and redistribution of the material in any medium or format provided the original work and author are properly credited. In any reproduction of this article there should not be any suggestion that APERTURE NEURO or this article endorse any specific organization or products. The use of the APERTURE NEURO logo is not permitted. This notice should be preserved along with the article's original URL. Open access logo and text by PLoS, under the Creative Commons Attribution-Share Alike 4.0 Unported license.

: 2022, Volume 2 - 1 - CC-BY: © Chen et al.

ABSTRACT

Neuroimaging relies on separate statistical inferences at tens of thousands of spatial locations. Such massively univariate analysis typically requires an adjustment for multiple testing in an attempt to maintain the family-wise error rate at a nominal level of 5%. First, we examine three sources of substantial information loss that are associated with the common practice under the massively univariate framework: (a) the hierarchical data structures (spatial units and trials) are not well maintained in the modeling process; (b) the adjustment for multiple testing leads to an artificial step of strict thresholding; (c) information is excessively reduced during both modeling and result reporting. These sources of information loss have far-reaching impacts on result interpretability as well as reproducibility in neuroimaging. Second, to improve inference efficiency, predictive accuracy, and generalizability, we propose a Bayesian multilevel modeling framework that closely characterizes the data hierarchies across spatial units and experimental trials. Rather than analyzing the data in a way that first creates multiplicity and then resorts to a post hoc solution to address it, we suggest directly incorporating the cross-space information into one single model under the Bayesian framework (so there is no multiplicity issue). Third, regardless of the modeling framework one adopts, we make four actionable suggestions to alleviate information waste and to improve reproducibility: (1) model data hierarchies, (2) quantify effects, (3) abandon strict dichotomization, and (4) report full results. We provide examples for all of these points using both demo and real studies, including the recent Neuroimaging Analysis Replication and Prediction Study (NARPS).

Keywords: Bayesian multilevel modeling; data hierarchy; dichotomization; effect magnitude; information waste; multiple testing problem; result reporting

Correspondence: Gang Chen, Scientific and Statistical Computing Core, NIMH, National Institutes of Health, Bethesda, MD, USA. Email: gangchen@mail.nih.gov

Received: May 12, 2021
Accepted: November 08, 2021
DOI: 10.52294/2e179dbf-5e37-4338-a639-9ceb92b055ea

INTRODUCTION

Statisticians classically asked the wrong question – and were willing to answer with a lie. They asked “Are the effects of A and B different?” and they were willing to answer “no.”

All we know about the world teaches us that the effects of A and B are always different – in some decimal place – for any A and B. Thus asking “are the effects different?” is foolish.

John W. Tukey, “The Philosophy of Multiple Comparisons,” Statistical Science (1991)

Functional magnetic resonance imaging (FMRI) is a mainstay technique of human neuroscience, which allows the study of the neural correlates of many functions, including perception, emotion, and cognition. The basic spatial unit of FMRI data is a voxel ranging from 1 to 3 mm on each side. As data are collected across time when a participant performs tasks or remains at “rest,” FMRI datasets contain a time series at each voxel. Typically, tens of thousands of voxels are analyzed simultaneously. Such a “divide and conquer” approach through massively univariate analysis necessitates some form of multiple testing adjustment via procedures based on Bonferroni’s inequality or false discovery rate.
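As an illustration, the massively univariate approach amounts to running one independent test per voxel. The sketch below does this on a toy dataset (all sizes and the 0.8 effect magnitude are hypothetical, and numpy is assumed); it also previews why thresholding each voxel at p < 0.05 without adjustment is problematic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "brain": 1,000 voxels, 20 subjects; a true effect in the first 100 voxels.
n_subjects, n_voxels = 20, 1000
data = rng.normal(0.0, 1.0, size=(n_subjects, n_voxels))
data[:, :100] += 0.8  # hypothetical effect size

# Massively univariate analysis: one-sample t-test at every voxel.
mean = data.mean(axis=0)
sem = data.std(axis=0, ddof=1) / np.sqrt(n_subjects)
t = mean / sem  # one t-value per voxel; tens of thousands in real FMRI

# Unadjusted voxelwise threshold: |t| > 2.093 (two-sided p = 0.05, df = 19).
hits = np.abs(t) > 2.093
print("detection rate in true-effect voxels:", hits[:100].mean())
print("false positive rate in null voxels:  ", hits[100:].mean())
```

With 900 null voxels tested at the 5% level, roughly 45 false positives are expected, which is what motivates the multiple testing adjustments discussed next.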

MASSIVELY UNIVARIATE ANALYSIS AND MULTIPLE TESTING

We start with a brief refresher of the conventional statistical framework typically adopted in neuroimaging. Statistical testing begins by accepting the null hypothesis but then rejecting it in favor of the alternative hypothesis if the current data for the effect of interest (e.g., task A vs. task B) or potentially more extreme observations are unlikely to occur under the assumption that the effect is absolutely zero. Because the basic data unit is the voxel, one faces the problem of performing tens of thousands of inferences across space simultaneously. As the spatial units are not independent of one another, adopting an adjustment such as Bonferroni’s is unreasonably conservative. Instead, the field has gradually settled into employing a cluster-based approach: what is the size of a spatial cluster that would be unlikely to occur under the null scenario?
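The scale of the problem follows from simple arithmetic: if each of m independent tests is run at level α, the family-wise error rate is 1 − (1 − α)^m. The sketch below (numpy assumed) computes this alongside the Bonferroni-adjusted per-test threshold α/m; note that the independence assumption in this calculation is precisely what makes Bonferroni conservative for spatially correlated voxels:

```python
import numpy as np

alpha = 0.05
m = np.array([1, 10, 100, 10_000])  # number of simultaneous voxelwise tests

# Family-wise error rate if every test uses alpha, assuming independence:
# the chance of at least one false positive approaches 1 as m grows.
fwer_unadjusted = 1 - (1 - alpha) ** m

# Bonferroni bounds the FWER at alpha by testing each voxel at alpha / m,
# e.g., p < 0.000005 when m = 10,000 -- a very stringent per-voxel bar.
bonferroni_p = alpha / m

print("unadjusted FWER:", fwer_unadjusted)
print("Bonferroni per-test threshold:", bonferroni_p)
```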

Accordingly, a two-step procedure is utilized: first threshold the voxelwise statistical evidence at a particular (or a range of) voxelwise p-value (e.g., 0.001) and then consider only contiguous clusters of evidence (Fig. 1). Several adjustment methods have been developed to address multiple testing by leveraging the spatial relatedness among neighboring voxels. The stringency of the procedures has been extensively debated over the past decades, with the overall probability of having clusters of a minimum spatial extent given a null effect estimated by two common approaches: a parametric method3,4 and a permutation-based adjustment.5 For the former, recent recommendations have resulted in the convention of adopting a primary threshold of voxelwise p = 0.001 followed by cluster size determination6,7; for the latter, the threshold is based on the integration between a range of statistical evidence and the associated spatial extent.5
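The two-step procedure can be sketched on a toy 2D statistic map (scipy and numpy assumed; the primary threshold of 3.1 roughly corresponds to two-sided p = 0.001 for a z-map, and the cluster-extent cutoff of 20 voxels is a made-up stand-in for a value that would normally be derived from a null model):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)

# Toy 2D "statistic map": Gaussian noise plus one 8x8 block of true signal.
tmap = rng.normal(size=(64, 64))
tmap[20:28, 20:28] += 5.0

# Step 1: voxelwise primary threshold (|z| > 3.1, ~two-sided p = 0.001).
primary = np.abs(tmap) > 3.1

# Step 2: label contiguous suprathreshold voxels; keep clusters of size >= k.
labels, n_clusters = ndimage.label(primary)
sizes = np.bincount(labels.ravel())[1:]   # voxel count for each cluster
keep = np.flatnonzero(sizes >= 20) + 1    # cluster labels that survive
surviving = np.isin(labels, keep)

# Scattered noise voxels pass step 1 but fall below the extent threshold.
print("suprathreshold voxels:", int(primary.sum()))
print("voxels in surviving clusters:", int(surviving.sum()))
```

The inferential unit after step 2 is the surviving cluster as a whole, a point that matters for the interpretability issues discussed below.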

Problems of multiple testing adjustments

At least five limitations are associated with multiple testing adjustments leveraged through spatial extent.8

1. Conceptual inconsistency. Consider that the staples of neuroimaging research are the maps of statistical evidence and associated tables. Both typically present only the statistic (e.g., t) values. However, this change of focus is inconsistent with cluster-based inference: after multiple testing adjustment, the proper unit of inference is the cluster, not the voxel. Once “significant” clusters are determined, one should only speak of clusters, and the voxels inside each cluster should no longer be considered meaningful inferentially. In other words, the statistical evidence for each surviving cluster is deemed at the “significance” level of 0.05 and the voxelwise statistic values

Conventional neuroimaging inferences follow the null hypothesis significance testing framework, where the decision procedure dichotomizes the available evidence into two categories at the end. Thus, one part of the evidence survives an adjusted threshold at the whole-brain level and is considered statistically significant (informally interpreted as a “true” effect); the other part is ignored (often misinterpreted as “not true”) and by convention omitted and hidden from public view (i.e., the file drawer problem).

A recent study1 (referred to as NARPS hereafter) offers a salient opportunity for the neuroimaging community to reflect on common practices in statistical modeling and the communication of study findings. The study recruited 70 teams charged with the task of analyzing a particular FMRI dataset and reporting results; the teams were simply asked to follow the data analyses routinely employed in their labs at the whole-brain voxel level (but note that nine specific research hypotheses were restricted to only three brain regions). NARPS found large variability in reported decisions, which were deemed to be sensitive to analysis choices ranging from preprocessing steps (e.g., spatial smoothing, head motion correction) to the specific approach used to handle multiple testing. Based on these findings, NARPS outlined potential recommendations for the field of neuroimaging research.

Despite useful lessons revealed by the NARPS investigation, the project also exemplifies the common approach in neuroimaging of generating categorical inferential conclusions as encapsulated by the “significant versus nonsignificant” maxim. In this context, we address the following questions:

1. Are conventional multiple testing adjustment methods informationally wasteful?

2. NARPS suggested that there was “substantial variability” in reported results across teams of investigators analyzing the same dataset. Is this conclusion dependent, at least in part, on the common practice of ignoring spatial hierarchy at the global level and drawing inferences binarily (i.e., “significant” vs. “nonsignificant”)?

3. What changes can the neuroimaging field make in modeling and result reporting to improve reproducibility?

In this context, we consider inferential procedures not strictly couched in the standard null hypothesis significance testing framework. Rather, we suggest that multilevel models, particularly when constructed within a Bayesian framework, provide powerful tools for the analysis of neuroimaging studies given the data’s inherent hierarchical structure. As our paper focuses on hierarchical modeling and dichotomous thinking in neuroimaging, we do not discuss the broader literature on Bayesian methods applied to FMRI.2
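To give a flavor of what multilevel modeling buys, the sketch below applies partial pooling to noisy per-region effect estimates under a simple normal-normal model, with the population mean and variance estimated from the data. This is an empirical-Bayes shorthand for illustration only, not the full Bayesian multilevel framework developed in this paper; all numbers are hypothetical and numpy is assumed:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical scenario: noisy effect estimates for 50 regions, each with a
# known standard error, where the true effects come from a common population.
n_regions = 50
true_effects = rng.normal(0.3, 0.4, size=n_regions)
se = 0.5
estimates = true_effects + rng.normal(0.0, se, size=n_regions)

# Normal-normal partial pooling: shrink each estimate toward the population
# mean, with weight determined by the between-region variance tau^2.
mu = estimates.mean()
tau2 = max(estimates.var(ddof=1) - se**2, 1e-6)  # method-of-moments estimate
weight = tau2 / (tau2 + se**2)                   # 0 = full pooling, 1 = none
pooled = mu + weight * (estimates - mu)

# Shrinkage reduces total error relative to the unpooled per-region estimates.
err_unpooled = np.mean((estimates - true_effects) ** 2)
err_pooled = np.mean((pooled - true_effects) ** 2)
print("unpooled MSE:", err_unpooled, "pooled MSE:", err_pooled)
```

Because all regions are handled inside one model, there is no family of separate tests to adjust for; the shrinkage itself calibrates the evidence across space.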