2021, Volume 1 - 1 - CC-BY: © Kruper et al.
Evaluating the Reliability of Human Brain
White Matter Tractometry
John Kruper,
Jason D. Yeatman,
Adam Richie-Halford,
David Bloom,
Mareike Grotheer,
Sendy Caffarra,
Gregory Kiar,
Iliana I. Karipidis,
Ethan Roy,
Bramsh Q. Chandio,
Eleftherios Garyfallidis,
and Ariel Rokem *
Department of Psychology, University of Washington, Seattle, WA, 98195, USA
eScience Institute, University of Washington, Seattle, WA, 98195, USA
Graduate School of Education, Stanford University, Stanford, CA, 94305, USA
Division of Developmental-Behavioral Pediatrics, Stanford University School of Medicine, Stanford,
CA, 94305, USA
Center for Mind, Brain and Behavior – CMBB, Hans-Meerwein-Straße 6, Marburg 35032, Germany
Department of Psychology, University of Marburg, Marburg 35039, Germany
Basque Center on Cognition, Brain and Language, BCBL, 20009, Spain
Department of Biomedical Engineering, McGill University, Montreal, H3A 0E9, Canada
Center for Interdisciplinary Brain Sciences Research, Department of Psychiatry and Behavioral Sciences,
Stanford University School of Medicine, Stanford, CA, 94305, USA
Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing and Engineering,
Indiana University Bloomington, Bloomington, IN, 47408, USA
The validity of research results depends on the reliability of analysis methods. In recent years, there have been concerns about
the validity of research that uses diffusion-weighted MRI (dMRI) to understand human brain white matter connections in vivo,
in part based on the reliability of analysis methods used in this field. We defined and assessed three dimensions of reliability in
dMRI-based tractometry, an analysis technique that assesses the physical properties of white matter pathways: (1) reproducibility,
(2) test-retest reliability, and (3) robustness. To facilitate reproducibility, we provide software that automates tractometry (https://
yeatmanlab.github.io/pyAFQ). In measurements from the Human Connectome Project, as well as clinical-grade measurements, we
find that tractometry has high test-retest reliability that is comparable to most standardized clinical assessment tools. We find that
tractometry is also robust, showing high reliability with different choices of analysis algorithms. Taken together, our results suggest
that tractometry is a reliable approach to analysis of white matter connections. The overall approach taken here both demonstrates
the specific trustworthiness of tractometry analysis and outlines what researchers can do to establish the reliability of computational
analysis pipelines in neuroimaging.
Keywords: Diffusion MRI, Brain Connectivity, Tractography, Reproducibility, Robustness
Correspondence: arokem@uw.edu
Received: February 26, 2021
Accepted: June 24, 2021
DOI: 10.52294/e6198273-b8e3-4b63-babb-6e6b0da10669
(3, 4). Collections of streamlines that match the location
and direction of major white matter pathways within an
individual can be generated with different strategies:
using probabilistic (5, 6) or streamline-based (7, 8) at-
lases or known anatomical landmarks (9–12). Because
these are models of the anatomy, we refer to these esti-
mates as bundles to distinguish them from the anatomi-
cal pathways themselves. The delineation of well-known
anatomical pathways overcomes many of the concerns
about confounds in dMRI-based tractography (13, 14),
because “brain connections derived from diffusion MRI
The white matter of the brain contains the long-range
connections between distant cortical regions. The inte-
gration and coordination of brain activity through the
fascicles containing these connections are important for
information processing and for brain health (1, 2). Using
voxel-specific directional diffusion information from dif-
fusion-weighted MRI (dMRI), computational tractography
produces three-dimensional trajectories through the white
matter within the MRI volume that are called streamlines
Kruper et al. This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 IGO License, which permits the copy
and redistribution of the material in any medium or format provided the original work and author are properly credited. In any reproduction of this article there should not be any
suggestion that APERTURE NEURO or this article endorse any specific organization or products. The use of the APERTURE NEURO logo is not permitted. This notice should be
preserved along with the article’s original URL. Open access logo and text by PLoS, under the Creative Commons Attribution-Share Alike 4.0 Unported license.
across different models of the diffusion in individual vox-
els, across different bundle recognition approaches, and
across different implementations.
We developed an open-source tractometry software li-
brary to support computational reproducibility: pyAFQ.
The software relies heavily on methods implemented in
Diffusion Imaging in Python (DIPY) (28). Our implementa-
tion was also guided by a previous MATLAB implementa-
tion of tractometry (mAFQ) (9). More details are available in
the “Automated Fiber Quantification in Python (pyAFQ)”
section of Supplementary Methods.
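pyAFQ writes its per-node results as "tidy" CSV tables that are amenable to standard dataframe tooling. The following is a minimal sketch of that kind of downstream analysis using pandas; the column names (`subjectID`, `tractID`, `nodeID`, `dki_fa`) are assumptions for illustration and may differ across pyAFQ versions.

```python
import io

import pandas as pd

# A minimal tidy (long-format) profile table of the kind described in
# the text: one row per subject, bundle, and node.
csv = io.StringIO(
    "subjectID,tractID,nodeID,dki_fa\n"
    "sub-01,CST_L,0,0.45\n"
    "sub-01,CST_L,1,0.47\n"
    "sub-02,CST_L,0,0.41\n"
    "sub-02,CST_L,1,0.43\n"
)
profiles = pd.read_csv(csv)

# Group-mean FA profile per bundle and node -- the kind of statistical
# summary the tidy output is meant to support.
group_profile = (
    profiles.groupby(["tractID", "nodeID"])["dki_fa"].mean().reset_index()
)
print(group_profile)
```

Because each row is a single observation, the same table feeds directly into plotting and mixed-effects modeling tools without reshaping.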
The pyAFQ software is configurable, allowing users to
specify methods and parameters for different stages of
the analysis (Fig. S2). Here, we will describe the default
setting. In the first step, computational tractography
methods, implemented in DIPY (28), are used to gener-
ate streamlines throughout the brain white matter (Fig.
S1A). Next, the T1-weighted Montreal Neurological
Institute (MNI) template (29, 30) is registered to the
anisotropic power map (APM) (31, 32) computed from
the diffusion data, which has a T1-like contrast (Fig. S1B),
using the symmetric image normalization method (33)
implemented in DIPY (28). The next step is to perform
bundle recognition, where each tractography streamline
is classified as either belonging to a particular bundle
or discarded. We use the transformation found during
registration to bring canonical anatomical landmarks,
such as waypoint regions of interest (ROIs) and proba-
bility maps, from template space to the individual sub-
ject’s native space. Waypoint ROIs are used to delineate
the trajectory of the bundles (34). See Table S1 for the
bundle abbreviations we use in this paper. Streamlines
that pass through inclusion waypoint ROIs for a particu-
lar bundle, and do not pass through an exclusion ROI, are
selected as candidates to include in the bundle. In ad-
dition, a probabilistic atlas (35) is used as a tiebreaker to
determine whether a streamline is more likely to belong
to one bundle or another (in cases where the streamline
matches the criteria for inclusion in either). For example,
the corticospinal tract is identified by finding streamlines
that pass through an axial waypoint ROI in the brainstem
and another ROI axially oriented in the white matter
of the corona radiata but that do not pass through the
midline (Fig. S1C). The final step is to extract the tract
profile: each streamline is resampled to a fixed number
of points, and the mean value of a diffusion-derived sca-
lar (e.g., fractional anisotropy (FA) and mean diffusivity
tractography can be highly anatomically accurate – if we
know where white matter pathways start, where they end,
and where they do not go” (15).
The physical properties of brain tissue affect the diffu-
sion of water, and the microstructure of tissue within the
white matter along the length of computationally gener-
ated bundles can be assessed using a variety of models
(16, 17). Taken together, computational tractography,
bundle recognition, and diffusion modeling provide so-
called tract profiles: estimates of microstructural proper-
ties of tissue along the length of major pathways. This is
the basis of tractometry: statistical analysis that compares
different groups or assesses individual variability in brain
connection structure (9, 18–21). For the inferences made
from tractometry to be valid and useful, tract profiles
need to be reliable.
In the present work, we provide an assessment of three
different ways in which scientific results can be reliable: re-
producibility, test-retest reliability (TRR), and robustness.
These terms are often debated, and conflicting defini-
tions for these terms have been proposed (22, 23). Here,
we use the definitions proposed in (24). Reproducibility
is defined as the case in which data and methods are
fully accessible and usable: running the same code with
the same data should produce an identical result. Use of
different data (e.g., in a test-retest experiment) resulting
inquantitatively comparable results would denote TRR.
In clinical science and psychology in general, TRR (e.g., in
the form of inter-rater reliability) is considered a key met-
ric of the reliability of a measurement. Use of a different
analysis approach or different analysis system (e.g., dif-
ferent software implementation of the same ideas) could
result in similar conclusions, denoting their robustness to
implementation details. The recent findings of Botvinik-
Nezer et al. (25) show that even when full computational
reproducibility is achieved, the results of analyzing a sin-
gle functional MRI (fMRI) dataset can vary significantly be-
tween teams and analysis pipelines, demonstrating issues
of robustness.
The contribution of the present work is three-fold:
to support reproducible research using tractometry,
we developed an open-source software library called
Automated Fiber Quantification in Python (pyAFQ;
https://yeatmanlab.github.io/pyAFQ). Given dMRI data
that has undergone standard preprocessing (e.g., using
QSIprep (26)), pyAFQ automatically performs tractogra-
phy, classifies streamlines into bundles representing the
major tracts, and extracts tract profiles of diffusion prop-
erties along those bundles, producing “tidy” CSV output
files (27) that are amenable to further statistical analysis
(Fig. S1). The library implements the major functionality
provided by a previous MATLAB implementation of trac-
tometry analysis (9) and offers a menu of configurable
algorithms allowing researchers to tune the pipeline to
their specific scientific questions (Fig. S2). Second, we
use pyAFQ to assess TRR of tractometry results. Third,
we assess robustness of tractometry results to variations
reliability across measurements, we would call that
“subject TRR,” and if we calculated the subject reliabil-
ity across analysis methods, we would call that “subject
robustness.” We explain profile and subject reliability
in more detail below; we explain wDSC and ACIP in
more detail in equations 1 and 2 in the “Measures of
Reliability” section of the Supplementary Methods.
Profile reliability
We use profile reliability to compare the shapes of pro-
files per bundle and per scalar. Given two sets of data
(either from test-retest analysis or from different anal-
yses), we first calculate the ICC between tract profiles
for each subject in a given bundle and scalar. Then,
we take the mean of those correlations. We do this for
every bundle and for every scalar. We call this pro le
reliability because larger differences in the overall val-
ues along the pro les will result in a smaller mean of
the ICC. Consistent pro le shapes are important for
distinguishing bundles. Pro le reliability provides an
assessment of the overall reliability of the tract pro les,
summarizing over the full length of the bundle, for a
particular scalar. We calculate the 95% con dence in-
terval on pro le reliabilities using the standard error of
the measurement.
In some cases, there is low between-subject variance in
tract profile shape (e.g., this is often the case in the cortico-
spinal tract (CST)). We use ICC to account for this, as ICC
will penalize low between-subject variance in addition to
rewarding low within-subject variance. Profile reliability
is a way of quantifying the agreement between profiles.
Qualitatively, we use four descriptions for profile reliabil-
ity: excellent (ICC > 0.75), good (ICC = 0.60 to 0.74), fair
(ICC = 0.40 to 0.59), and poor (ICC < 0.40) (40).
Subject reliability
We calculate subject reliability to compare individual
differences in profiles, per bundle and per scalar, follow-
ing (41). Given two measurements for each subject, we
first take the mean of each profile within each individual,
measurement and scalar. Then, we calculate Spearman’s
ρ from the means from different subjects for a given bun-
dle and scalar across the measurements. High subject re-
liability means the ordering of an individual’s tract profile
mean among other individuals is consistent across mea-
surements or methods. This is akin to test reliability that
is computed for any clinical measure.
One downside of subject reliability is that the shape
of the extracted pro le is not considered. Additionally,
if one measurement or method produces higher values
for all subjects uniformly, subject reliability would not be
affected. Instead, the intent of subject reliability is to
summarize how well relative differences between individ-
uals in mean tract profiles are preserved. In other words,
subject reliability quantifies the consistency of mean pro-
files. The 95% confidence interval on subject reliabilities
is parametric.
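The subject reliability computation described above can be sketched as follows; the per-subject mean values here are hypothetical, invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-subject mean FA values for one bundle in two
# sessions. Subject reliability asks: is each subject's rank among
# the others preserved across measurements?
session_1 = np.array([0.42, 0.55, 0.48, 0.61, 0.50, 0.44, 0.58, 0.53])
session_2 = session_1 + 0.03  # a uniform offset preserves the ordering

rho, _ = spearmanr(session_1, session_2)
print(rho)  # rank order is unchanged, so Spearman's rho is 1
```

This also makes the limitation stated above concrete: a uniform shift applied to every subject leaves Spearman's ρ, and therefore subject reliability, unchanged.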
(MD)) is found for each one of these nodes. The values
are summarized by weighting the contribution of each
streamline, based on how concordant the trajectory of
this streamline is with respect to the other streamlines
in the bundle (Fig. S1D). To make sure that profiles rep-
resent properties of the core white matter, we remove
the first and last five nodes of the profile, then further
remove any nodes where either the FA is less than 0.2 or
the MD is greater than 0.002. This removes nodes that
contain partial volume artifacts (16).
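The resampling and profile-cleaning steps described above can be sketched in a few lines of NumPy. This is not pyAFQ's own implementation (pyAFQ relies on DIPY's streamline utilities); it is a simplified illustration of the same operations.

```python
import numpy as np


def resample(points, n_nodes=100):
    """Resample a streamline, given as an (N, 3) coordinate array, to a
    fixed number of nodes spaced evenly along its arc length."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    dist = np.concatenate([[0.0], np.cumsum(seg)])
    new_dist = np.linspace(0.0, dist[-1], n_nodes)
    return np.column_stack(
        [np.interp(new_dist, dist, points[:, i]) for i in range(3)]
    )


def clean_profile(fa, md, trim=5, fa_min=0.2, md_max=0.002):
    """Drop the first and last `trim` nodes, then mask nodes where
    FA < 0.2 or MD > 0.002, as described in the text."""
    fa, md = fa[trim:-trim], md[trim:-trim]
    keep = (fa >= fa_min) & (md <= md_max)
    return fa[keep], md[keep]


# A straight streamline with unevenly spaced input points:
line = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
print(resample(line, n_nodes=5)[:, 0])  # evenly spaced x-coordinates
```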
We used two datasets with test-retest measurements. We
used Human Connectome Project test-retest (HCP-TR)
measurements of dMRI for 44 neurologically healthy sub-
jects aged 22–35 (36). The other is an experimental data-
set, with dMRI from 48 children, aged 5 years, collected
at the University of Washington (UW-PREK). More details
about the measurement are available in the “Data” section
of Supplementary Methods.
HCP-TR configurations
We processed HCP-TR with three different pyAFQ con-
figurations. In the first configuration, we used the diffu-
sional kurtosis imaging (DKI) model as the orientation
distribution function (ODF) model. In the second con-
figuration, we used constrained spherical deconvolution
(CSD) as the ODF model. For the final configuration, we
used RecoBundles (8) for bundle recognition instead of
the default waypoint ROI approach, and DKI as the ODF
model. More details are available in the “Configurations”
section of Supplementary Methods.
Measures of reliability
Tract recognition of each bundle was compared across
measurements and methods using the Dice coefficient,
weighted by streamline count (wDSC) (37). Tract profiles
were compared with three measures: (1) profile reliabil-
ity: mean intraclass correlation coefficient (ICC) across
points in different tract profiles for different data, which
quantifies the agreement of tract profiles (38, 39); (2) sub-
ject reliability: Spearman’s rank correlation coefficient
(Spearman’s ρ) between the means of the tract profiles
across individuals, which quantifies the consistency of
the mean of tract profiles; and (3) an adjusted contrast
index profile (ACIP): to directly compare the values of
individual nodes in the tract profiles in different mea-
surements. To estimate TRR, the above measures were
calculated for each individual across different mea-
surements, and to estimate robustness, these were
calculated for each individual across different analysis
methods. For example, if we calculated the subject
bundles with a range of 0.57 ± 0.24 to 0.85 ± 0.12 and a
median across bundles of 0.73. We can see that subject
TRR is lower than profile TRR (Fig. 2). This trend is consis-
tent for MD (Fig. S5) as well as for another dataset (Fig. 3C).
TRR of tractometry in different implementations,
datasets, and tractography methods
We compared TRR across datasets and implementations.
In both datasets, we found high TRR in the results of trac-
tography and bundle recognition: wDSC was larger than
0.7 for all but one bundle (Fig. 3A): the delineation of the
anterior forceps (FA bundle) seems relatively unreliable
using pyAFQ in the UW-PREK dataset (using the FA sca-
lar, pyAFQ subject TRR is only 0.37 ± 0.28 compared to
mAFQ’s 0.84 ± 0.10). We found overall high profile TRR
that did not always translate to high subject TRR (Fig.
3B–G). For example, for FA in UW-PREK, median pro-
file TRRs are 0.75 for pyAFQ and 0.77 for mAFQ, while
median subject TRRs are 0.70 for pyAFQ and 0.75 for
mAFQ. Note that profile and subject TRRs have differ-
ent denominators (e.g., subjects that have similar mean
profiles to each other would have low subject TRR, even
if the profiles are reliable, because it is harder to distin-
guish between subjects in this case). mAFQ is one of the
most popular software pipelines currently available for
tractometry analysis, so it provides an important point
for comparison. In comparing different software imple-
mentations, we found that mAFQ has higher subject TRR
Tractometry using pyAFQ classifies streamlines into
bundles that represent major anatomical pathways. The
streamlines are used to sample dMRI-derived scalars into
bundle profiles that are calculated for every individual
and can be summarized for a group of subjects. An exam-
ple of the process and result of the tract profile extraction
process is shown in Fig. S3 together with the results of this
process across the 18 major white matter pathways for all
subjects in the HCP-TR dataset.
Assessing TRR of tractometry
In datasets with scan-rescan data, we can assess TRR at
several different levels of tractometry. For example, the
correlation between two pro les provides a measure of
the reliability of the overall tract pro le in that subject.
Analyzing the HCP-TR dataset, we find that for FA calcu-
lated using DKI, the values of profile reliability vary across
subjects (Fig. 1A), but they overall tend to be rather high,
with the average value within each bundle in the range of
0.77 ± 0.05 to 0.92 ± 0.02 and a median across bundles
of 0.86 (Fig. 1B). We find similar results for MD (Fig. S4)
and replicate similar results in a second dataset (Fig. 3B).
Subject reliability assesses the reliability of mean tract
pro les across individuals. Subject FA TRR in the HCP-
TR also tends to be high, but the values vary more across
Fig. 1. Fractional anisotropy (FA) profile test-retest reliability (TRR). (A) Histograms
of individual subject intraclass correlation coefficient (ICC) between the FA tract
profiles across sessions for a given bundle. Colors encode the bundles, matching
the diagram showing the rough anatomical positions of the bundles for the left side
of the brain (center). (B) Mean (± 95% confidence interval) TRR for each bundle,
color-coded to match the histograms and the bundles diagram, with median across
bundles in red.
Fig. 2. Subject test-retest reliability. (A) Mean tract profiles for a given bundle
and the fractional anisotropy (FA) scalar for each subject using the first and sec-
ond session of Human Connectome Project test-retest (HCP-TR). Colors encode
bundle information, matching the core of the bundles (center). (B) Subject reli-
ability is calculated from the Spearman’s ρ of these distributions, with median
across bundles in red (± 95% confidence interval).
discrepancies between ILF R recognized with waypoint
ROIs and with RecoBundles. Despite this bundle, we find
high robustness overall. For MD, the first quartile subject
robustness is 0.82 (Fig. 5C, D).
Tractometry results are robust to differences
in software implementation
Overall, we found that robustness of tractometry across
these different software implementations is high in most
white matter bundles. In the mAFQ/pyAFQ comparison,
most bundles have a wDSC around or above 0.8, except
the two callosal bundles (FA bundle and forceps pos-
terior (FP)), which have a much lower overlap (Fig. 6A).
Consistent with this pattern, profile and subject robust-
ness are also overall rather high (Fig. 6B, C). The median
values across bundles are 0.71 and 0.77 for FA profile and
subject robustness, respectively.
For some bundles, like the right and left uncinate (UNC
R and UNC L), there is large agreement between pyAFQ
and mAFQ (for subject FA: UNC L ρ = 0.90 ± 0.07, UNC
R ρ = 0.89 ± 0.08). However, the callosal bundles have
particularly low MD profile robustness (0.07 ± 0.09 for FP,
0.18 ± 0.09 for FA) (Fig. 6B).
The robustness of tractometry to the differences be-
tween the pyAFQ and mAFQ implementation depends
on the bundle, scalar, and reliability metric. In addition, for
many bundles, the ACIP between mAFQ and pyAFQ re-
sults is very close to 0, indicating no systematic differenc-
es (Fig.6D). In some bundles – the CST and the anterior
thalamic radiations (ATR) – there are small systematic dif-
ferences between mAFQ and pyAFQ. In the forceps pos-
terior (FP), pyAFQ consistently finds smaller FA values than
mAFQ in a section on the left side. Notice that the forceps
anterior has an ACIP that deviates only slightly from 0, even
though the forceps recognitions did not have as much
overlap as other bundle recognitions (see Fig. 6A).
Previous work has called into question the reliability of
neuroimaging analysis (e.g., (25, 45, 46)). We assessed
the reliability of a specific approach, tractometry, which
is grounded in decades of anatomical knowledge, and
we demonstrated that this approach is reproducible,
reliable, and robust. A tractometry analysis typically
combines the outputs of tractography with diffusion
reconstruction at the level of the individual voxels
within each bundle. One of the major challenges fac-
ing researchers who use tractometry is that there are
many ways to analyze diffusion data, including differ-
ent models of diffusion at the level of individual voxels;
techniques to connect voxels through tractography;
and approaches to classify tractography results into
major white matter bundles. Here, we analyzed the re-
liability of tractometry analysis at several different lev-
els. We analyzed both TRR of tractometry results and
relative to pyAFQ in the UW-PREK dataset, when TRR
is relatively low for pyAFQ (see the FA bundle, CST L,
and ATR L in Fig. 3C). On the other hand, in the HCP-TR
dataset, we used the Reproducible Tract Profile
(RTP) pipeline (42, 43), which is an extension of mAFQ,
and found that pyAFQ tends to have slightly higher pro-
file TRR than RTP for MD but slightly lower profile TRR for
FA (Fig. 3D). The pyAFQ and RTP subject TRRs are highly
comparable (Fig. 3E). The median pyAFQ subject TRR
for FA is 0.76, while the median RTP subject TRR is
0.74. Comparing different ODF models in pyAFQ, we
found that the DKI and CSD ODF models have highly
similar TRR, both at the level of wDSC (Fig. 3A) and at the
level of pro le and subject TRRs (Fig. 3F, G).
Robustness: comparison between
distinct tractography models and
bundle recognition algorithms
To assess the robustness of tractometry results to differ-
ent models and algorithms, we used the same measures
that were used to calculate TRR.
Tractometry results can be robust to differences in
ODFmodels used in tractography
We compared two algorithms: tractography using DKI-
and CSD-derived ODFs. The weighted Dice similarity co-
ef cient (wDSC) for this comparison can be rather high in
some cases (e.g., the uncinate and corticospinal tracts,
Fig. 4A) but produces results that appear very different for
some bundles, such as the arcuate and superior longitu-
dinal fasciculi (ARC and SLF) (see also Fig. 4D). Despite
these discrepancies, profile and subject robustness are
high for most bundles (median FA of 0.77 and 0.75, re-
spectively) (Fig. 4B, C). In contrast to the results found in
TRR, MD subject robustness is consistently higher than
FA subject robustness. The two bundles with the most
marked differences between the two ODF models are
the SLF and ARC (Fig. 4D). These bundles have low wDSC
and pro le robustness, yet their subject robustness re-
mains remarkably high (in FA, 0.75 ± 0.17 for ARC R and
0.88 ± 0.09 for SLF R) (Fig. 4C). These differences are par-
tially explained by the fact that there are systematic
biases in the sampling of white matter by bundles gen-
erated with these two ODF models, as demonstrated by
the non-zero ACIP between the two models (Fig. 4E).
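The streamline-count-weighted Dice coefficient (wDSC) used throughout these comparisons can be sketched as follows, under one common formulation in which each voxel is weighted by how many streamlines visit it; the exact weighting in (37) may differ in detail.

```python
def wdsc(counts_a, counts_b):
    """Weighted Dice coefficient between two bundles, each represented
    as a mapping from voxel coordinates to streamline visitation counts.
    Voxels visited by many streamlines dominate the score."""
    shared = counts_a.keys() & counts_b.keys()
    overlap = sum(counts_a[v] + counts_b[v] for v in shared)
    total = sum(counts_a.values()) + sum(counts_b.values())
    return overlap / total


# Two toy bundle maps that overlap in two of three voxels:
a = {(1, 2, 3): 10, (1, 2, 4): 5, (2, 2, 4): 1}
b = {(1, 2, 3): 8, (1, 2, 4): 6, (9, 9, 9): 2}
print(wdsc(a, b))  # (10 + 8 + 5 + 6) / (16 + 16) = 0.90625
```

Note that the non-overlapping voxels in this example carry few streamlines, so the score stays high; disagreement in a bundle's densely visited core would lower it much more.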
Most white matter bundles are highly robust
across bundle recognition methods
We compared bundle recognition with the same tractog-
raphy results using two different approaches: the default
waypoint ROI approach (9) and an alternative approach
(RecoBundles) that uses atlas templates in the space of
the streamlines (44). Between these algorithms, wDSC
is around or above 0.6 for all but one bundle, Right
Inferior Longitudinal Fasciculus (ILF R) (Fig. 5). There is
an asymmetry in the ILF atlas bundle (7), which results in