Bias in data-driven estimates of the reproducibility of univariate brain-wide association studies.

doi:10.21203/rs.3.rs-4457116/v1

Download PDF

Article

Bias in data-driven estimates of the reproducibility of univariate brain-wide association studies.

https://doi.org/10.21203/rs.3.rs-4457116/v1

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Recent studies have leveraged consortium neuroimaging data to answer an important question: how many subjects are required for reproducible brain-wide association studies? These data-driven approaches could be considered a framework for testing the reproducibility of several neuroimaging models and measures. Here we test part of this framework, namely estimates of statistical errors of univariate brain-behaviour associations obtained from resampling large datasets with replacement. We demonstrate that reported estimates of statistical errors are largely a consequence of bias introduced by random effects when sampling with replacement close to the full sample size. We show that future meta-analyses can largely avoid these biases by only resampling up to 10% of the full sample size. We discuss implications that reproducing mass-univariate association studies requires tens-of-thousands of participants, urging researchers to adopt other methodological approaches.

Biological sciences/Neuroscience

Biological sciences/Computational biology and bioinformatics/Communication and replication

Biological sciences/Computational biology and bioinformatics

Biological sciences/Computational biology and bioinformatics/Statistical methods

Physical sciences/Mathematics and computing/Statistics

The question of scientific reliability of brain-wide association studies (BWAS) was brought to the attention of many^1,2 by Marek, Tervo-Clemmens et al.³, reigniting discussions^4–7 about the ongoing reproducibility crisis in neuroscience and psychology^8–12. Independent researchers are failing to reproduce (same results using the same methods and data) and replicate (similar results using the same methods and new data) many published findings. In the field of neuroimaging, reproducibility issues are further exacerbated by variability in the methods used by researchers, which can lead to conflicting results¹³. Our trust in the scientific field therefore relies on how well we can estimate its reproducibility.

For any given study design, a reliable way of increasing the likelihood of replication is to recruit more subjects, which will reduce the sampling variability and in turn increase statistical power⁸. For a brain-wide association study (BWAS) which aims to characterise associations between brain measures and behaviours, collecting data is expensive. So how many subjects are required? How do we know? Thousands are required^3,14, according to data-driven approaches which quantify the issue of reproducibility for BWAS using large neuroimaging datasets from the Human Connectome Project¹⁵ (HCP with n = 1,200), the Adolescent Brain Cognitive Development study¹⁶ (ABCD with n = 11,874), and the UK Biobank¹⁷ (UKB with n = 35,735). Among numerous analyses in their study, Marek, Tervo-Clemmens et al.³ estimated statistical errors of univariate BWAS as a function of sample size. Such univariate BWAS often involve tens of thousands of correlations between a brain measure and a behavioural measure, most of which fail to replicate even with thousands of participants. These replication failures can be explained by statistical errors of a study design such as false positive rates¹¹ and low statistical power^8,18–22. To estimate statistical errors in univariate BWAS, Marek, Tervo-Clemmens et al.³ treated a large discovery dataset as a population and then drew replication samples by resampling with replacement (henceforth resampling) from that population. Here, we report the first test of validity of this data-driven approach by using simulated data as ground truth.

Resampling methods strongly bias statistical error estimates when there are no true effects

First, we simulated a discovery null sample with n = 1,000 subjects each with 1,225 brain connectivity measures (random Pearson correlations) and a single behavioural measure (normally distributed across participants). We correlated each brain connectivity measure with the behaviour across all subjects to obtain 1,225 brain-behaviour correlations. Since brain connectivity estimates and behavioural factors were simulated independently from each other, any resulting brain-behaviour correlations were entirely random. Data dimensions were chosen to be computationally feasible for reproducibility, however we invite readers to adjust these and re-run analyses using the openly available code (analyses recoded in R with supporting packages^23–25 for open-source accessibility https://github.com/charlesdgburns/rwr/). We then resampled our null-sample for 100 iterations across logarithmically spaced sample size bins (n = 25, to 1,000) and estimated statistical errors, following the methods described in Marek, Tervo-Clemmens et al.1. Surprisingly, we saw the same trends of statistical errors and reproducibility as those reported by Marek, Tervo-Clemmens et al.1 but with random data (see Fig. 1), with strongly biased statistical power estimates.

These trends in statistical errors thus do not depend on absolute sample size, but the resample size relative to the full sample size. By repeatedly generating new null-samples, rather than resampling from a single null-sample, we verified that these statistical error estimates are indeed biased under the null as the resample size approaches the full sample size (Fig. 2). For example, uncorrected (α = .05), statistical power was estimated to be 63% when resampling at the full sample size (n = 1,000, Fig. 1. d), rather than the expected 5% obtained when generating new null- samples (n = 1,000, Fig. 2. d). One concern is that power is the most inflated while also being the most relevant for failed replications^8,18,19, which could potentially result in misleading meta-science.

Compounding sampling variability underlies biased statistical errors under the null

To explain why biases arise under the null, we investigated the underlying brain-behaviour correlations used in the calculation of statistical errors. Here we focused on resampling at the full sample size (n = 1,000) where these biases are most dramatic. As indicated by the false positive rate (Fig. 1f.), the null distribution of brain-behaviour correlations is not preserved when resampling at the full sample size (Fig. 3). Instead, resampling subjects and computing correlations again results in a distribution wider than expected (comparing Fig. 3a. and c.). This is because resampling involves two sources of sampling variability, first at the level of the discovery sample and again for the resampled replication sample. For instance, if a correlation in the discovery sample is randomly observed to be r = 0.11, then resampling participants and computing the same correlation again results in a correlation which varies around r = 0.11 (Fig. 3e.).

We can formalise this mathematically as nested distributions²⁶, or a convolution²⁷ of two probability distributions, here approximating Pearson null distributions with normal distributions for analytical simplicity. It then follows that given a discovery sample X ~ N(µ, σ₁²), for each observation in our original sample, x_i ∈ X, resampling participants and recomputing correlations corresponds to sampling from several distributions X_i ~ N(x_i, σ₂²), resulting in a final set of correlations distributed according to X* ~ N(µ, σ₁² + σ₂²). Note that σ₁² depends on the size of the discovery sample, while σ₂² is determined by the resample size.

The influence on statistical error estimates such as statistical power is two-fold. First, random correlations in the tail of a discovery sample are more likely to be in the tail of correlations in a resampled replication sample. This inflates power when estimated as the proportion of significant effects in the discovery sample which are significant again in the resampled replication sample (1 – false negative rates). Second, increased sampling variability alone leads to a wider-than-expected distribution of correlations with more extreme tails. These more extreme tails lead to an inflation of P values close to 0 in our resample (compare Fig. 3b. and d.) when calculated using a standard correlation function (e.g., ‘corr’ in MATLAB). We note that simply correcting for this widened null distribution will over-correct for bias in statistical error estimates when true effects are present (see Supplementary Information).

Bias in ground truth simulations depends on statistical power of the full sample size

While we have shown clear biases when there are no true effects, this does not directly imply biases when true effects are present. We note that Marek, Tervo-Clemmens et al.³ have already shown that the largest univariate effect is highly replicable even for moderate sample sizes, so there are at least some true BWAS effects in the real world. However, as the average true effect size remains unknown, we systematically simulated a range of discovery samples, each representing a study where the size of the underlying true effects corresponded with different levels of statistical power. We note that since the bias under the null is driven by the false rejection of null hypotheses, here we adopt a fixed significance threshold after Bonferroni correction which controls for at least one false positive (family-wise error rate). Focusing on statistical power estimates, we show that the bias near the full sample size depends on the true statistical power of the discovery sample (Fig. 4). Power estimates are inflated if the discovery sample is underpowered, but on the other hand a highly powered discovery sample may give conservative power estimates. Note that regardless of power at the full sample size, bias in statistical power is largely avoided when subsampling up to around 10% of the full sample size (see also Supplementary Information for subject-level simulations).

Accurately estimating reproducibility of scientific methods is critical for guiding researcher’s methodological decisions. Our results demonstrate that estimating statistical errors by resampling with replacement from random data results in large biases when resampling near the full sample size. We explain this fully by compounding sampling variability of test statistics when resampling and its knock-on effects on estimated statistical errors. We further simulate ground truth data with true effects to show that statistical power is inflated when the true power of the discovery sample is low and slightly deflated when true power is high. This could lead to circular reasoning in cases where we must assume we have high statistical power before we can rely on the estimation that we have high statistical power. Lastly, we show that this bias is largely avoided when subsampling only up to 10% of the full sample size after Bonferroni correction. This 10% rule of thumb is consistent with the use of resampling techniques in a recent evaluation of statistical power and false discovery rates for genome-wide association studies with hundreds-of-thousands of participants²⁸, as well as recommendations for 10-fold cross-validation to reduce prediction error in machine-learning²⁹.

What are the implications for the results presented by Marek, Tervo-Clemmens et al.1? For the strictly denoised Adolescent Brain Cognitive Development (ABCD) sample (n = 3,928), they report around 68% power at n = 3,928 after Bonferroni correction when resampling at the full sample size (Marek, Tervo-Clemmens et al.1 Fig. 3d.). Our true effect simulation results indicate that this estimate could be inflated from a true average power anywhere between 1% and 40%. Furthermore, when subsampling from the UK Biobank with a full sample size of n = 32,572 Marek, Tervo-Clemmens et al.1 report around 1% power for n = 4,000 and α = 10 − 7. We therefore argue that the 68% power reported for the full ABCD sample (n = 3,928, α = 10 − 7) more likely reflects methodological bias, rather than a result of increased signal after strict denoising of brain data. While the largest BWAS effects may be highly reproducible with 4,000 participants, the average univariate BWAS effect is most likely not reproducible. On the other hand, our true effect simulations (Fig. 4.) also indicate that the UK Biobank estimates at the full sample size are more reliable, with an underlying power likely between 70% and 90% at n = 32,572 after Bonferroni correction. Ultimately, our results suggest that replicating the univariate BWAS tested in Marek, Tervo-Clemmens et al. requires tens-of-thousands of individuals.

Our results only have direct implications for mass univariate association studies, however it is worth noting how methodological decisions could influence reproducibility in neuroimaging. For example, it should be noted that inter-individual correlation studies offer “as little as 5%-10% of the power” of within-subject t-test studies with the same number of participants4. Other methodological choices, such as data modelling, should also be carefully considered. The lack of power in univariate BWAS considered by Marek, Tervo-Clemmens et al. could also be influenced by the choice of a group-averaged brain parcellation³⁰, which fails to account for individual level variations in resting state functional connectivity^31,32. Brain models³³ which do account for such individual variability generalise better, as demonstrated by stronger out-of-sample prediction^31,34, and could also lead to higher replication rates in null-hypothesis significance tests. Note also that how we model null distributions³⁵ of brain-wide statistics has a large influence on resulting P values. With this in mind, one could consider a predictive framework rather than an explanatory one³⁶, which could be replicable with only hundreds of participants^37,38

It is clear that investigations of reproducibility of wider BWAS methods are required. We urge such meta-analyses to evaluate their meta-analytic methods, for example with null data, so they may reliably evaluate the reproducibility of scientific methods used in research.

Simulating null data at subject level

We simulated random phenotype associations with simulated functional connectivity measures. We generated a null-sample with n = 1,000 subjects each with 1,225 edges (random Pearson correlations between 50 random time series) and a single behavioural factor (normally distributed across participants). We correlated each edge with the behaviour across all subjects to obtain 1,225 brain-behaviour correlations. By generating edge connectivity estimates and behavioural factors independently from each other, we ensured that any resulting brain-behaviour correlations are entirely random (ρ = 0), hence obtaining a sample where the null hypothesis is true (i.e., a null-sample).

Estimating statistical errors

We closely followed the methods of Marek, Tervo-Clemmens et al.³, first running analyses on MATLAB using their code 'abcd_edgewise_correlation_iterative_reliability_single_factor.m' and 'abcd_statisticalerrors.m' (https://gitlab.com/DosenbachGreene/bwas). These analyses were then independently recoded in R with supporting packages^23–25 for open-source accessibility https://github.com/charlesdgburns/rwr/. Notably, statistical error estimations involve two-tailed P values derived from parametric null distributions on a given resample size.

Simulating ground truth data with known statistical power

At this stage we take a computationally more efficient approach and simulate summary statistics rather than subject-level data, which allows us to simulate many more true effect scenarios so we can compare estimates with true statistical errors. This approach also lets us increase the number of effects, so we now simulate samples with 55,278 (333 choose 2) effects, the number of resting-state functional connectivity measures which feature in Marek, Tervo-Clemmens et al. Figure 2³. The size of true effects was determined by an inverse power analysis³⁹ with a fixed sample size (n = 1,000) and Bonferroni corrected significance threshold (α = 0.05/55278), which involved a Fisher z-transformation for calculating the critical Pearson r for a given power level (power = 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 99%). The Fisher z-transformation⁴⁰, F(r) = atanh(r) = z was also used to sample Pearson correlations using the approximation that the z-statistic is asymptotically normally distributed with mean F(r) and standard deviation \(1/\sqrt{\left(n-3\right)}\). The z-statistic was then transformed into Pearson correlations to simulate brain-behaviour correlations. For each power level, we simulated a discovery sample by first drawing 55,278 random effects (ρ = 0) and afterwards replacing 500 of those with true effects, where ρ corresponds to the critical r for a given power level computed earlier. The choice of 500 true effects was somewhat arbitrary, being moderate enough to seem probable but also sufficiently many true effects to reduce noise in bias estimates. Note that while real world effect sizes of a single BWAS may vary, we instead simulate several effects with the same underlying effect size. This should not be an issue, since the statistical error summary statistics are given as an average across effects, so in this simulation we can think of the underlying effects having an average effect size according to a given power level.

Simulating statistical power estimations from resampling with replacement

Given only summary statistics rather than individual subjects, we cannot resample participants and recompute P values, but instead also simulate obtaining estimates by resampling with replacement as in subject-level analyses³. First, we generate ground truth data with known statistical power, which we treat as our discovery sample. Then, we follow the assumption that the observed effects in a discovery sample are the population effects: for a given resample size n, a resampled effect size was drawn from a normal distribution \(N\left(\text{a}\text{t}\text{a}\text{n}\text{h}\left({r}^{*}\right),1/\sqrt{n-3}\right)\), where r* is a given Pearson correlation from the discovery sample, and then Fisher z-transformed. We then derived P values from an uncorrected null distribution of Pearson correlations with degrees of freedom computed relative to the resample size. We resampled across the same range of sample sizes as in previous analyses (n = 25, … 1,000). We continued to estimate statistical power across 1,000 iterations of resampled brain-behaviour correlations as in Marek, Tervo-Clemmens et al.¹, specifically as the proportion of significant effects in a discovery sample which were significant again in a resample (1 – false negative rates, with α = 0.05/55278). These were then compared to analytical power curves³⁹ computed using Fisher z-transformations for varying sample sizes and effect sizes corresponding to critical r for power levels at the full sample size (n = 1,000) computed earlier.

Competing interests

None.

Author Contribution

C.D.G.B.: Conceptualisation, design, implementation, analysis, interpretation, writing - original draft. A.F.: Interpretation of results, writing - review & editing. G.A.R.: Conceptualisation, design, interpretation, writing - review & editing, supervision.

Acknowledgements

A.F. is supported by a grant from the Biotechnology and Biology research council (BBSRC, grant number: BB/S006605/1) and the Fundação Bial, Fundação Bial Grants Programme 2020/21, A- 29315, number 203/2020, grant edition: G-15516.

Data Availability

R and MATLAB code used for data simulation, statistical analyses, and plotting is available on GitHub: https://github.com/charlesdgburns/rwr/.

Callaway, E. Can brain scans reveal behaviour? Bombshell study says not yet. Nature 603, 777–778 (2022).
Richtel, M. Brain-Imaging Studies Hampered by Small Data Sets, Study Finds. The New York Times (2022).
Marek, S. et al. Reproducible brain-wide association studies require thousands of individuals. Nature 603, 654–660 (2022).
Gratton, C., Nelson, S. M. & Gordon, E. M. Brain-behavior correlations: Two paths toward reliability. Neuron 110, 1446–1449 (2022).
Rosenberg, M. D. & Finn, E. S. How to establish robust brain–behavior relationships without thousands of individuals. Nat. Neurosci. 25, 835–837 (2022).
Botvinik-Nezer, R. & Wager, T. D. Reproducibility in Neuroimaging Analysis: Challenges and Solutions. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 8, 780–788 (2023).
Helwegen, K., Libedinsky, I. & van den Heuvel, M. P. Statistical power in network neuroscience. Trends Cogn. Sci. 27, 282–301 (2023).
Button, K. S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013).
Munafò, M. R. et al. A manifesto for reproducible science. Nat. Hum. Behav. 1, 1–9 (2017).
Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
Ioannidis, J. P. A. Why Most Published Research Findings Are False. PLOS Med. 2, e124 (2005).
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F. & Baker, C. I. Circular analysis in systems neuroscience: the dangers of double dipping. Nat. Neurosci. 12, 535–540 (2009).
Botvinik-Nezer, R. et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88 (2020).
Liu, S., Abdellaoui, A., Verweij, K. J. H. & van Wingen, G. A. Replicable brain–phenotype associations require large-scale neuroimaging data. Nat. Hum. Behav. 7, 1344–1356 (2023).
Van Essen, D. C. et al. The WU-Minn Human Connectome Project: An overview. NeuroImage 80, 62–79 (2013).
Casey, B. J. et al. The Adolescent Brain Cognitive Development (ABCD) study: Imaging acquisition across 21 sites. Dev. Cogn. Neurosci. 32, 43–54 (2018).
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Ingre, M. Why small low-powered studies are worse than large high-powered studies and how to protect against “trivial” findings in research: Comment on Friston (2012). NeuroImage 81, 496–498 (2013).
Yarkoni, T. Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power—Commentary on Vul et al. (2009). Perspect. Psychol. Sci. 4, 294–298 (2009).
Cremers, H. R., Wager, T. D. & Yarkoni, T. The relation between statistical power and inference in fMRI. PLOS ONE 12, e0184923 (2017).
Szucs, D. & Ioannidis, J. PA. Sample size evolution in neuroimaging research: An evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals. NeuroImage 221, 117164 (2020).
Poldrack, R. A. et al. Scanning the horizon: towards transparent and reproducible neuroimaging research. Nat. Rev. Neurosci. 18, 115–126 (2017).
Ripley, B. et al. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. (2023).
Wickham, H. et al. Welcome to the Tidyverse. J. Open Source Softw. 4, 1686 (2019).
Kassambara, A. ggpubr: ‘ggplot2’ Based Publication Ready Plots. (2022).
El Otmani, S. & Maul, A. Probability distributions arising from nested Gaussians. Comptes Rendus Math. 347, 201–204 (2009).
Convolution of Gaussians is Gaussian. https://jeremy9959.net/Math-5800-Spring-2020/notebooks/convolution_of_gaussians.html.
Chen, Z., Boehnke, M., Wen, X. & Mukherjee, B. Revisiting the genome-wide significance threshold for common variant GWAS. G3 GenesGenomesGenetics 11, jkaa056 (2021).
Witten, I. H., Frank, E., Hall, M. A., Pal, C. J. & DATA, M. Practical machine learning tools and techniques. Data Min. Fourth Ed. Elsevier Publ. (2017).
Gordon, E. M. et al. Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations. Cereb. Cortex 26, 288–303 (2016).
Kong, R. et al. Individual-Specific Areal-Level Parcellations Improve Functional Connectivity Prediction of Behavior. Cereb. Cortex 31, 4477–4500 (2021).
Gordon, E. M. et al. Precision Functional Mapping of Individual Human Brains. Neuron 95, 791–807.e7 (2017).
Bijsterbosch, J. D., Valk, S. L., Wang, D. & Glasser, M. F. Recent developments in representations of the connectome. NeuroImage 243, 118533 (2021).
Farahibozorg, S.-R. et al. Hierarchical modelling of functional brain networks in population and individuals from big fMRI data. NeuroImage 243, 118513 (2021).
Markello, R. D. & Misic, B. Comparing spatial null models for brain maps. NeuroImage 236, 118052 (2021).
Yarkoni, T. & Westfall, J. Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspect. Psychol. Sci. 12, 1100–1122 (2017).
Spisak, T., Bingel, U. & Wager, T. D. Multivariate BWAS can be replicable with moderate sample sizes. Nature 615, E4–E7 (2023).
Chen, J. et al. Relationship between prediction accuracy and feature importance reliability: An empirical and theoretical study. NeuroImage 274, 120115 (2023).
Designing Clinical Research. (Wolters Kluwer/Lippincott Williams & Wilkins, Philadelphia, 2013).
Fisher, R. A. Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population. Biometrika 10, 507–521 (1915).

No competing interests reported.

Supplementarydraft.docx

Download PDF

Editorial decision: Revision requested
22 Jul, 2024
Reviews received at journal
19 Jul, 2024
Reviews received at journal
10 Jul, 2024
Reviewers agreed at journal
14 Jun, 2024
Reviewers agreed at journal
13 Jun, 2024
Reviewers invited by journal
08 Jun, 2024
Editor assigned by journal
04 Jun, 2024
Editor invited by journal
29 May, 2024
Submission checks completed at journal
27 May, 2024
First submitted to journal
21 May, 2024

You are reading this latest preprint version

Bias in data-driven estimates of the reproducibility of univariate brain-wide association studies.

Status:

Version 1

Abstract

Figures

Introduction

Results

Resampling methods strongly bias statistical error estimates when there are no true effects

Compounding sampling variability underlies biased statistical errors under the null

Bias in ground truth simulations depends on statistical power of the full sample size

Discussion

Methods

Simulating null data at subject level

Estimating statistical errors

Simulating ground truth data with known statistical power

Simulating statistical power estimations from resampling with replacement

Declarations

Competing interests

Author Contribution

Acknowledgements

Data Availability

References

Additional Declarations

Supplementary Files

Status:

Version 1