A Multi-Dataset Evaluation of Frame Censoring
for Motion Correction in Task-Based fMRI
Michael S. Jones, Zhenchen Zhu*, Aahana Bajracharya, Austin Luor, Jonathan E. Peelle
Department of Otolaryngology, Washington University in St. Louis, St. Louis, MO, USA
*Current affiliation: MD Program, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Jones etal. This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 IGO License, which permits the copy
and redistribution of the material in any medium or format provided the original work and author are properly credited. In any reproduction of this article there should not be any
suggestion that APERTURE NEURO or this article endorse any specic organization or products. The use of the APERTURE NEURO logo is not permitted. This notice should be
preserved along with the article’s original URL. Open access logo and text by PLoS, under the Creative Commons Attribution-Share Alike 4.0 Unported license.
: 2022, Volume 2 - 1 - CC-BY: © Jones etal.
Subject motion during fMRI can affect our ability to accurately measure signals of interest. In recent years, frame censoring—that
is, statistically excluding motion-contaminated data within the general linear model using nuisance regressors—has appeared
in several task-based fMRI studies as a mitigation strategy. However, there have been few systematic investigations quantifying
its efcacy. In the present study, we compared the performance of frame censoring to several other common motion correction
approaches for task-based fMRI using open data and reproducible workows. We analyzed eight publicly available datasets repre-
senting 11 distinct tasks in child, adolescent, and adult participants. Performance was quantied using maximum t-values in group
analyses, and region of interest–based mean activation and split-half reliability in single subjects. We compared frame censoring
across several thresholds to the use of 6 and 24 canonical motion regressors, wavelet despiking, robust weighted least squares,
and untrained ICA-based denoising, for a total of 240 separate analyses. Thresholds used to identify censored frames were based
on both motion estimates (FD) and image intensity changes (DVARS). Relative to standard motion regressors, we found consistent
improvements for modest amounts of frame censoring (e.g., 1–2% data loss), although these gains were frequently comparable to
what could be achieved using other techniques. Importantly, no single approach consistently outperformed the others across all
datasets and tasks. These ndings suggest that the choice of a motion mitigation strategy depends on both the dataset and the
outcome metric of interest.
Keywords: motion correction, head movement, frame censoring, scrubbing, FD, DVARS, task-based fMRI
Correspondence: Dr. Michael Jones, Department of Otolaryngology, Washington University in St. Louis, 660 South Euclid, Box 8115, St. Louis, MO 63110, email: jones.mike@wustl.
edu; Dr. Jonathan Peelle, Center for Cognitive and Brain Health, Northeastern University, 360 Huntington, Boston, MA 02115, email: j.peelle@northeastern.edu
Date Received: October 14, 2021
Date Accepted: May 4, 2022
DOI: 10.52294/ApertureNeuro.2022.2.NXOR2026
High-quality neuroimaging analysis depends in part on
minimizing artifacts. Although advancements in hard-
ware and pulse sequence design have reduced many
types of noise inherent to functional MRI, other sources
remain (Bianciardi et al. 2009). One prominent challenge
is artifacts caused by subject head motion. Among other
effects, head motion changes the part of the brain sam-
pled by a particular voxel and can introduce changes in
signal intensity through interactions with the magnetic
eld, which add noise to the data and make it harder to
identify signals of interest.
The effects of head motion have received recent scru-
tiny in the context of resting-state functional connectivi-
ty. Because motion-related artifacts occur in many voxels
simultaneously, they can introduce correlations in fMRI
time series that are unrelated to BOLD activity, leading
to inaccurate estimates of functional connectivity (Power
et al. 2015; Satterthwaite et al. 2019). However, spurious
activation is also of concern in task-based functional neu-
roimaging, where it can lead to both false positives or
a lower signal-to-noise ratio that can make it harder to
detect a true activation of interest. As such, motion in
task-based fMRI potentially introduces a combination of
both Type I and Type II errors.
Rigid body realignment—a mainstay of fMRI analysis
for decades—goes some way toward improving corre-
spondence across images (Ashburner and Friston 2004)
but does not remove extraneous signal components in-
troduced by movement (Friston et al. 1996). A common
approach for mitigating motion-related artifacts is to
: 2022, Volume 2 - 2 - CC-BY: © Jones et al.
include the six realignment parameters (translation and
rotation around the X, Y, and Z axes, reecting estimated
participant motion) as nuisance regressors in rst-level
Beyond motion parameter inclusion, several data-
driven strategies have been developed to reduce the
inuence of high-motion scans on estimated activa-
tions. Wavelet decomposition identies artifacts by ex-
ploiting their non-stationarity across different temporal
scales (Patel et al. 2014). The method has been applied
in resting-state studies but is also applicable to task-
based data. Independent component analysis (Pruim
et al. 2015) identies artifacts based on the spatial dis-
tribution of shared variance. In robust weighted least
squares (Diedrichsen and Shadmehr 2005), a two-pass
modeling procedure is used to produce a collection of
nuisance regressors that are then included in the nal
analysis to weight frames by the inverse of their variance
(that is, downweighting frames with high error).
An alternative motion correction strategy is “scrub-
bing” or “frame censoring” (Lemieux et al. 2007; Siegel
et al. 2014). In this approach, bad scans are identied
and excluded from statistical analysis. One approach is
to do so by modeling them in the general linear model
using nuisance regressors (i.e., “scan-nulling regressors”
or “one-hot encoding”). Although frame censoring has
received considerable interest in resting-state fMRI over
the past several years (Power et al. 2012; Gratton et al.
2020a), it has not seen widespread use in the task-based
fMRI literature. Censoring approaches involve some ef-
fective data loss, in that censored frames do not con-
tribute to the task-related parameter estimates, and that
columns introduced to the design matrix to perform cen-
soring reduce the available degrees of freedom. There
are different ways to quantify “bad” scans, and choos-
ing both an appropriate metric and associated thresh-
old can also be challenging. Thus, additional information
over what threshold should be used for identifying bad
frames—and relatedly, how much data are lost versus
retained—is necessary to make informed decisions.
Although several published studies compare differing
correction strategies (Ardekani et al. 2001; Oakes et al.
2005; Johnstone et al. 2006), a drawback of prior work is
that evaluation was often limited to a single dataset (see
Supplemental Table 1). The degree to which an optimal
strategy for one dataset generalizes to other acquisition
schemes, tasks, or populations is not clear. With the in-
creased public availability of neuroimaging datasets
(Poldrack et al. 2013; Markiewicz et al. 2021), the possibil-
ity of evaluating motion correction approaches across a
range of data has become more feasible.
In the present work, we sought to compare the per-
formance of identical pipelines on a diverse selection of
tasks, using data from different sites, scanners, and par-
ticipant populations. Although our primary interest was
frame censoring, we considered seven different motion
correction approaches:
1. six canonical head motion (i.e., “realignment pa-
rameter”) estimates (RP6)
2. 24-term expansions of head motion estimates
3. wavelet despiking (WDS)
4. robust weighted least squares (rWLS)
5. untrained independent component analysis (uICA)
6. frame censoring based on frame displacement
7. frame censoring based on variance differentiation
This list is not exhaustive but is representative of ap-
proaches that are currently used and feasible to include
in an automated processing pipeline.
Because it is impossible to determine a “ground truth”
result with which to compare the effectiveness of these
approaches, we instead considered four complementary
outcome metrics: (1) the maximum group t-statistic both
across the whole brain and in a region of interest (ROI)
relevant to the task; (2) the average parameter estimates
from within the same ROI; (3) the degree of test–retest
consistency exhibited by subject-level parametric maps;
and (4) the spatial overlap of thresholded group-level
statistical maps. These metrics are simple to dene yet
functionally meaningful and can be applied to data from
almost any fMRI study. In our view, Dice quanties repli-
cability, the mean ROI value quanties effect size (signal),
and maximum-t quanties signal to noise (effect size pe-
nalized by variance).
We analyzed eight studies obtained from OpenNeuro
(Markiewicz et al. 2021), several of which included mul-
tiple tasks or multiple participant groups. As such, the
eight selected studies provided a total of 15 datasets.
The selection process was informal, but studies given
priority included: (1) a clearly dened task; (2) a sufcient
number of subjects to allow second-level modeling;
(3) sufcient data to make test–retest evaluation possible;
and (4) a publication associated with the data describing
a result to which we could compare our own analysis.
A summary of the eight datasets selected is shown in
Table 1 (acquisition details are provided in Supplemental
Table 2). Additional information, including task de-
tails, modeling/contrast descriptions compiled from
: 2022, Volume 2 - 3 - CC-BY: © Jones et al.
publication(s) associated with a given study, and any data
irregularities encountered during analysis, is provided in
the Supplemental Materials.
All scripts used in the study are available at https://osf
.io/n5v3w/. Analysis was performed using Automatic
Analysis version 5.4.0 (Cusack et al. 2015; RRID:
SCR_003560), which scripted a combination of SPM12
(Wellcome Trust Centre for Neuroimaging) version 7487
(RRID: SCR_007037) and the FMRIB Software Library
(FSL; FMRIB Analysis Group) (Jenkinson et al. 2012) ver-
sion 6.0.1 (RRID: SCR_002823). BrainWavelet Toolbox
v2.0 (Patel et al. 2014) was used for wavelet despiking
and rWLS version 4.0 (Diedrichsen and Shadmehr 2005)
for robust weighted least squares.
To the extent possible, we used the same preprocess-
ing pipeline for all datasets (Figure 1a). Briey, structural
and functional images were translated to the center of
the scanned volume, and the rst four frames of each
session were removed in functional images to allow for
signal stabilization. This was followed by bias correc-
tion of the structural image, realignment, coregistration
of the functional and structural images, normalization
into MNI space using a unied segmentation approach
(Ashburner and Friston 2005) resampled to 2 mm isotro-
pic voxels and smoothing of the functional images using
an 8-mm FWHM Gaussian kernel.
Functional images were corrected for motion artifacts
using each of the following approaches: (1) inclusion of
six canonical motion estimates in the rst-level model as
nuisance regressors, (2) inclusion of 24 nuisance regres-
sors based on a second-order expansion of the motion
estimates and rst derivatives, (3) wavelet despiking,
(4) robust weighted least squares, (5) untrained ICA
denoising, (6) frame censoring based on framewise
displacement (FD), or (7) differential variance (DVARS)
thresholding (FD/DVARS thresholding is described later).
Statistical modeling was performed in SPM for all
motion correction approaches. First-level modeling in-
cluded a contrast of interest described in a publication
associated with the dataset for evaluation, followed by
second-level analysis to produce group-level statistical
maps. All rst- and second-level t-maps were threshold-
ed at a voxelwise threshold of p < 0.001 (uncorrected).
Minor pipeline modications were required for robust
weighted least squares, wavelet despiking, and untrained
ICA denoising. As recommended by developers of the
rWLS toolbox, unsmoothed data were used for variance
estimation and contrast maps were smoothed after mod-
eling. For wavelet despiking, functional images were
Table 1. Summary of datasets analyzed
Dataset Reference Task and design
Age range
(median ± SD)
Frames per
ds000102 Kelly et al. (2008) Flanker (E) 22 22–50 0.11 ± 0.12 284
ds000107 Duncan et al. (2009) 1-back (B) 43 19–38 0.08 ± 0.14 323
ds000114 Gorgolewski et al. (2013b) Motor (B) 10 50–58 0.14 ± 0.16 360
Covert verb (B) 10 50–58 0.11 ± 0.11 338
Overt word (B) 10 50–58 0.13 ± 0.12 144
Line bisection (B) 9 50–58 0.13 ± 0.18 468
ds000228 Richardson et al. (2018) Film viewing (E) 122 3.5–12 0.21 ± 0.93 164
33 18–39 0.18 ± 0.27 164
ds001497 Lewis-Peacock and Postle (2008) Face perception (E) 10 19–32 0.11 ± 0.12 1146
ds001534 Courtney et al. (2018) Food images (E) 42 18–22 0.10 ± 0.16 552
ds001748 Fynes-Clinton et al. (2019) Memory retrieval (E) 21 10–12 0.16 ± 0.36 438
20 14–16 0.12 ± 0.17 438
21 20–35 0.08 ± 0.17 438
ds002382 Rogers et al. (2020) Word recognition (E) 29 19–30 0.14 ± 0.35 710
32 65–81 0.30 ± 0.34 710
B = blocked design; E = event-related design.
: 2022, Volume 2 - 4 - CC-BY: © Jones et al.
Replicability was quantied as the Dice coefcient of
thresholded rst-level t-maps (p < 0.001, uncorrected) in
each subject (restricted to the ROI).
FD and DVARS thresholding
Motion correction approaches based on frame censor-
ing required quantication of motion artifacts which
could then be subjected to thresholding. Both framew-
ise displacement (FD) and differential variance (DVARS)
were used. Framewise displacement was calculated as
the sum of the six head motion estimates obtained from
realignment, with a dimensional conversion of the three
rotations assuming the head is a 50-mm sphere (Power
et al. 2012). DVARS was calculated as the root-mean-
squared of the time difference in the BOLD signal cal-
culated across the entire brain (Smyser et al. 2011). As
shown in Figure 2a, both metrics closely tracked arti-
facts apparent in voxel intensities and also each other.
Although FD and DVARS in a given session tended to
be correlated (Figure 2b), they were not identical and
could exhibit slightly different time courses and relative
peak amplitudes (Supplemental Figure S1). As such, we
explored the use of both measures.
Thresholds were determined by calculating FD and
DVARS across all sessions in all subjects, which allowed
values to be identied that resulted in 1%, 2%, 5%,
10%, and 20% frame violations across the entire dataset
(Figure 2c). We adopted this strategy rather than using a
xed value of FD or DVARS for several reasons. First, FD
and DVARS magnitudes change with the TR of the data,
rescaled to a whole-brain median of 1000 across all frames
before processing. The default toolbox settings (wave-
let: d4, threshold: 10, boundary: reection, chain search:
moderate, scale number: liberal) were used. Finally, un-
trained ICA-based denoising was implemented using
ICA-AROMA (Pruim et al. 2015) with additional process-
ing steps performed within FSL. Briey, the unsmoothed
coregistered functional image was demeaned, detrend-
ed, smoothed, and then nonlinearly warped to the FSL
2 mm MNI152 template using FNIRT. The normalized
functional image was then passed to AROMA for denois-
ing. This ICA implementation is not based on training
data, and so we refer to it as “untrained” ICA to distin-
guish it from other ICA-based denoising approaches.
Evaluation of motion correction performance
Three measures were used to quantify the perfor-
mance of each motion correction strategy, illustrated in
Figure 1b: (1) maximum t-value, (2) effect size, and
(3) subject replicability. In the rst measure, the maximum
t-value occurring in the group level parametric map was
extracted both at the whole-brain level and also within
a region of interest relevant to the task. The effect size
was quantied as the mean of all voxels within the ROI
for each subject using the rst-level beta maps. To eval-
uate subject replicability, multisession data were treat-
ed as a test–retest paradigm (the rst session statistical
map was compared to the second session in studies
having fewer than three sessions; even-numbered ver-
sus odd-numbered sessions were compared otherwise).
Fig. 1. Schematic of processing pipeline and outcome measures. (a) Summary of preprocessing and model steps in com-
mon and differing across motion correction strategies. (b) Following statistical modeling, outcomes are summarized in
mean parameter estimates and Dice overlap of thresholded single-subject maps (top) and maximum t-value from the group
analysis (bottom). Dashed lines represent values obtained without motion correction.
: 2022, Volume 2 - 5 - CC-BY: © Jones et al.
because the TR is the sampling rate (for a given move-
ment, sampling more rapidly will give smaller FD values,
even though the total motion is the same). Second, differ-
ent calculations of FD provide different values (Jenkinson
et al. 2002; Power et al. 2012; Van Dijk et al. 2012), and thus
any absolute threshold would necessarily be metric spe-
cic. Finally, datasets differ in their tasks and populations,
and we anticipated that a single threshold would not be
suitable for all datasets. We, therefore, employed the
frame-percent thresholding strategy to obtain a reason-
able range of results in all studies examined. To be clear,
we do not propose a xed-percent data loss approach as
a “production” strategy. Indeed, a conventional denois-
ing approach would be to select an absolute threshold of
acceptable motion based on experience (or preliminary
Fig. 2. Calculation of censoring thresholds in an example dataset (ds000228,
children). (a) Representative grayplot (bottom; Power 2017) showing 500 ran-
domly selected gray matter voxels. DVARS and FD for this session are plot-
ted above. Spikes in the metrics identify frames contaminated by artifacts.
(b) DVARS and FD are correlated but exhibit differing amplitudes and time cours-
es. As such, the use of both measures was explored. (c) Metric values (here shown
for DVARS) used for censoring were determined by plotting frameloss for each
subject as a function of threshold (thin blue traces). Interpolation of the mean re-
sponse (thick black trace) gives metric values corresponding to a data loss of 1%,
2%, 5%, 10%, or 20%. Box plot (inset) summarizes the results across all subjects at
each threshold (box: 25–75% percentiles; crosses: >3 SD outliers). (d) Histograms
of data loss at each FD or DVARS target value across all subjects in this dataset.
Red line indicates 50% frameloss.
data) from the target subject pool, task, scanner, and
so on. However, given the variety of datasets examined
here, we had no a priori guide as to what threshold val-
ues to use. Any xed selection might censor no frames in
some studies and too many in others. Therefore, we em-
ployed the frame-percent thresholding strategy to obtain
an informative range of results in all datasets. The thresh-
old values that resulted from percent data loss targeting
in these datasets are shown in Supplemental Figure S2
and listed in Supplemental Table 3. The amount of data
censored in each participant in a single study is shown in
Figure 2d, and for all studies in Supplemental Figure S3.
To implement frame censoring, rst-level modeling
was repeated for each threshold with a separate delta
function (i.e., a scan-nulling regressor) included in the
design matrix at the location of each violation, which ef-
fectively removes the contribution of the targeted frame
from the analysis. Although some prior studies of motion
correction have censored one or more frames before or
following threshold violations (e.g., “augmentation” of
Siegel et al. 2014), we did not pursue such variations to
avoid further expanding what was already a rather large
parameter space.
Region of interest denition
A task-relevant ROI for each study/task was dened in
one of three ways: (1) a 5-mm sphere (or spheres) cen-
tered at coordinates reported in a publication associated
with the dataset; (2) a whole-brain Z-mask generated by
a task-relevant search term (e.g., “incongruent task”) in
NeuroQuery (Dockès et al. 2020) and thresholded Z > 3;
or (3) a binarized probability map in the SPM Anatomy
Toolbox (Eickhoff et al. 2005) for a task-relevant brain
structure or anatomical region (e.g., “V2”). Additional
details on the ROI denition used in each analysis are
provided in the Supplemental Materials.
Performance of the motion correction strategies orga-
nized by dataset is shown in Figure 3. Each panel in-
cludes a second-level thresholded t-map at the upper
left (p < 0.001, uncorrected) using the “RP6” approach
(six canonical motion parameters included as nuisance
regressors). A contrast descriptor is given below the
map. The ROI used for evaluation is shown at lower left
with the source listed under the rendered image.
These results show there is a substantial variability in mo-
tion correction approaches, with performance depending
both on the data under consideration and the chosen
performance metric. However, some general trends are
apparent. Wavelet despiking tends to offer the best maxi-
mum t-value in both the whole-brain and ROI-constrained
evaluation, with robust weighted least squares also ex-
hibiting good performance (note the ROI-constrained