This study contextualizes previous reports of widespread group mean sex differences in early adolescence (21, 23, 31, 83–86) by comparing within- and between-sex variance and by quantifying the neuroanatomical similarities between the sexes at ages 9 to 11 years. In line with previous research in the developing brain (53, 55, 87), we detected significant inhomogeneity of variance between male and female youths. Moreover, we observed extensive overlap between male and female distributions and found between-sex and within-sex ranked differences to be similar in magnitude for all global and regional measures examined. We conclude that mean group sex differences in early adolescent brain structure are considerably smaller than the sex similarities, and therefore do not reflect distinct sex-based phenotypes (i.e., sexual dimorphism). Overall, these results underscore the importance of accounting for within-group variance and inhomogeneity of variance when probing sex differences in brain morphology.
To assess similarity, we calculated the overlap (OVL) between male and female distributions in each global and regional measure. The OVL was invariably greater than 0.5, illustrating that across all structural metrics examined, more than half of all youths fell within the overlapping portion of the male and female distributions. In other words, there were substantial similarities between males and females throughout the brain. Similar results have been shown in adults, where “extensive overlap” has been reported between male and female distributions in all brain regions examined (40). While male and female total brain volume (TBV) distributions showed more similarity than difference (raw OVL = 0.585; corrected OVL = 0.682), TBV showed the least overlap between sex distributions of any measure examined, both before and after adjustment. This further supports its status as the largest and most replicable sex difference in pediatric brain structure (12, 31, 88–90). However, brain size is related to overall body size (91, 92), so this difference may simply be a reflection of overall body size differences between male and female adolescents. Unadjusted regional overlap was lower for cortical and subcortical volume than for cortical thickness, FA, and MD - which had median regional OVLs greater than 0.9 before adjustment. After adjustment, overlap increased in most regions - particularly for regional volumes - and a minimum of 89.6% of the data fell within the overlap between male and female distributions for all adjusted regional measures. These findings further demonstrate that the brains of male and female youth appear very similar after accounting for additional sources of variance in the data. Therefore, our results extend the conclusions of Joel et al. (2015) to early adolescents and reaffirm that human brain macrostructure does not exist in binary, sexually dimorphic categories associated with sex, nor does it appear to exist on a continuum between male and female extremes.
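As an illustration of what an OVL value measures, the following Python sketch estimates the overlap between two samples by binning both on a shared grid and summing the smaller of the two bin proportions. It is not the procedure used in this study: the function name `overlap_coefficient`, the histogram estimator, and the default of 20 bins are assumptions made for illustration only.

```python
def overlap_coefficient(sample_a, sample_b, bins=20):
    """Histogram estimate of the overlap (OVL) between two samples.

    Hypothetical helper: both samples are binned on one shared grid and
    the smaller bin proportion is summed across bins. The result lies
    between 0 (no shared bins) and 1 (identical bin proportions).
    """
    lo = min(min(sample_a), min(sample_b))
    hi = max(max(sample_a), max(sample_b))
    if hi == lo:
        return 1.0  # every observation is identical
    width = (hi - lo) / bins

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # The maximum value falls on the upper edge; keep it in the last bin.
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(sample) for c in counts]

    prop_a = proportions(sample_a)
    prop_b = proportions(sample_b)
    return sum(min(a, b) for a, b in zip(prop_a, prop_b))
```

Under this estimator, two samples with no shared bins give 0, and two identical samples give 1. The estimate depends on the number of bins, so a kernel density approach may be preferable for real data.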
This work expands upon previous findings of sex differences in within-sex variability in childhood (53, 55). Wierenga et al. reported greater male variability in gray matter volume, whereas Bottenhorn et al. found greater male variability in white matter change over time, but greater female variability in cortical macro- and micro-structural change over time. After adjustment, we found significant sex differences in variance for TBV, average FA, average MD, and all regional volumes, with large inhomogeneity in the parietal lobe, basal ganglia, and limbic regions. Male variance exceeded female variance in all gray matter volume regions both before and after adjustment. Higher male variability in volume and diffusivity may be due, in part, to random X chromosome inactivation: heterozygous females express two different alleles of a single gene in a mosaic pattern throughout the brain, whereas homozygous females and males with a single X chromosome exhibit uniform expression (93, 94). Consequently, if two alleles of an X-chromosome gene have opposite effects, males and homozygous females will exhibit one of two extreme phenotypes, while heterozygous females will exhibit a mixed phenotype, decreasing the average trait variability among females. These results suggest that male structural variability is greater than female structural variability in gray matter volume and white matter microstructure, whereas female variability exceeds male variability in cortical thickness. Therefore, future research should examine the link between X-chromosome genes and regional gray matter volumes, while other sources of sex-related variance - such as estrogen and testosterone differences (95, 96), BMI (97), aerobic fitness (98, 99), or eating behaviors (100) - should be explored with regard to cortical thickness variance.
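The variance argument for X chromosome inactivation can be made concrete with a toy calculation. The Python sketch below assumes a single X-linked gene with two alleles of equal and opposite effect, random mating, and a heterozygous mosaic phenotype that averages to zero. These are simplifying assumptions for illustration, not a model fitted in this study, and the names `trait_variance` and `x_linked_variances` are hypothetical.

```python
def trait_variance(outcomes):
    """Variance of a discrete trait given (value, probability) pairs."""
    mean = sum(value * prob for value, prob in outcomes)
    return sum(prob * (value - mean) ** 2 for value, prob in outcomes)


def x_linked_variances(p, effect=1.0):
    """Return (male_variance, female_variance) for one X-linked locus.

    Assumptions (illustrative only):
      - allele A has frequency p and effect +effect; allele B has -effect
      - males carry one X, so they express +effect or -effect
      - females carry two X chromosomes drawn independently; heterozygotes
        are a 50:50 mosaic whose effects cancel, giving a trait value of 0
    """
    q = 1.0 - p
    male = [(+effect, p), (-effect, q)]
    female = [(+effect, p * p), (0.0, 2 * p * q), (-effect, q * q)]
    return trait_variance(male), trait_variance(female)


male_var, female_var = x_linked_variances(p=0.3)
print(male_var, female_var)
```

Under these assumptions the male variance is 4pq·effect² and the female variance is 2pq·effect², so the female variance is half the male variance at any allele frequency.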
Many univariate methods of comparison (e.g., t-tests, ANOVA) rely on the assumption of homogeneity of variance. Consequently, such tests are inappropriate for comparing sexes on measures with significant inhomogeneity of variance between sexes, such as gray matter volume. Given the combination of large within-sex variance and high overlap between distributions of male and female youth, it is important to instead test whether between-sex differences surpass within-sex differences. Thus, we used ANOSIM to assess the relative magnitude of all pairwise differences between subjects and test for significant differences between the within-group and between-group pairings. Although permutation tests indicated that in some regions we could reject the null hypothesis (i.e., within-sex and between-sex variances do not differ), it is possible for a statistical result to be “significantly different from zero yet inconsequentially small” in a sufficiently large sample (101, 102). For example, in the adjusted data, ANOSIM indicated that between-sex pairings were significantly different from within-sex pairings in 33% of ROIs, yet the maximum observed ANOSIM R statistic in the corrected regional data was 0.0156 (adjusted R range: -0.0013–0.0156). ANOSIM R statistics less than 0.1 indicate that the size of the difference between two adolescents of the same sex is similar to the size of the difference between two adolescents of the opposite sex (3, 103–105). The fact that the results were significantly different from 0, yet very close to 0, suggests that the sample size is sufficiently large to produce results with statistical significance but little practical or clinical significance. The ubiquity of the high overlap and low R statistic demonstrates that high similarity exists even in the measures with the highest mean sex differences. For instance, the effect size of sex for TBV (f² = 0.243) would be considered medium-sized by Cohen’s standards (106) and “extremely above average” for the ABCD dataset (101, 107). Nonetheless, the TBV overlap was still greater than the difference (corrected OVL = 0.683) and the within-sex and between-sex differences were similar in size (corrected ANOSIM R statistic = 0.10). This highlights the fact that it is possible to have a relatively large, statistically significant sex effect even when subjects of the same sex differ about as much as subjects of different sexes. It is therefore critical for future analyses of sex to account for the mean-variance relationship and consider non-parametric methods that do not assume homogeneity of variance between sexes.
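To show how an ANOSIM R statistic compares ranked within-group and between-group differences, the following Python sketch computes R for a single measure. It is an illustration under stated assumptions rather than the analysis pipeline used here: the absolute difference between two subjects' values is assumed as the dissimilarity, the function names are hypothetical, and the permutation test for significance is omitted.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to len(values); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def anosim_r(values, groups):
    """ANOSIM R for one measure, using |x_i - x_j| as the dissimilarity (an assumption).

    R = (mean between-group rank - mean within-group rank) / (n * (n - 1) / 4).
    Requires at least one within-group pair and one between-group pair.
    """
    n = len(values)
    pairs = list(combinations(range(n), 2))
    distances = [abs(values[i] - values[j]) for i, j in pairs]
    ranks = average_ranks(distances)
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    mean_between = sum(between) / len(between)
    mean_within = sum(within) / len(within)
    return (mean_between - mean_within) / (n * (n - 1) / 4)
```

With fully separated groups, such as values `[1, 2, 3, 11, 12, 13]` labeled `["M", "M", "M", "F", "F", "F"]`, every between-group distance outranks every within-group distance and R equals 1. Values near 0 indicate that between-group pairs differ about as much as within-group pairs.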
Taken together, these results contradict claims of sexual dimorphism in pediatric brain structure and contextualize the discussion of sex differences. This distinction between sexual dimorphism and sex differences is meaningful not just in theory, but also in practice. The putative sexual dimorphism of the developing brain has been cited in arguments for single-sex education (108–110) and as evidence in court cases regarding the rights of juveniles (111, 112). Yet, the large overlap between male and female distributions, small ratio of between-sex to within-sex differences, and significant inhomogeneity of variance reported here indicate that average pediatric sex differences are likely due to disparities in variability rather than two distinct phenotypes with a large mean difference. This lends credence to arguments that conventional methods for preclinical and clinical research of sex differences are not well-designed for application to personalized medicine and are insufficient to address health disparities between males and females (113–115). Future research designs should employ more robust statistical methods and focus on precise sex-linked variables, such as hormones, chromosomes, gene expression, body size and composition, or social determinants of health.
Limitations
Due to the cross-sectional nature of this study and the narrow age range of the participants, our results are limited in scope. As such, they should not be assumed to generalize to brain structure in early childhood, later adolescence, or adulthood, or to longitudinal trajectories of brain development. Instead, they offer an in-depth look at the neuroanatomy of children between 9 and 11 years old. Furthermore, although sex is multifaceted and encompasses multiple hormonal, genetic, and gross anatomical features, we chose to focus on the presence or absence of a Y chromosome for our operational definition of sex. Consequently, it is unclear to what extent factors like hormone levels, gene expression, or X-chromosome inactivation play a role in our results. Additionally, because this is a non-experimental study, we cannot provide evidence of a causal link between sex chromosomes and variance. Since few studies examine the influence of social and environmental factors on neuroanatomical sex differences, some authors instead use the term “sex/gender” (116, 117). While our previous work with data from the ABCD Study showed that felt-gender did not explain a significant amount of variance in gray or white matter structure (23), we cannot rule out the possible influence of other sociocultural factors that may be correlated with sex.
Although this study discusses significance in terms of p-values (corrected for multiple comparisons), statisticians increasingly warn against dichotomous interpretations of results (i.e., “significant” or “nonsignificant”) (118, 119) and overreliance on statistical significance to infer practical significance (120, 121). The frequency of small but significant f² and ANOSIM R statistics found in this study further suggests that in such a large, diverse sample, p-values may not be reliable indicators of practical significance. This underscores the danger of dichotomous interpretation of statistical tests in large samples. As such, the significance of the inhomogeneity of variance results should also be interpreted with caution.
Moreover, the results may not be directly comparable between brain regions or metrics with very different mean outcomes (e.g., cerebellum volume vs. pars orbitalis volume, average cortical thickness vs. average FA). While this issue is frequently circumvented with standardization, we did not use this technique because it would have altered the variance we sought to characterize. Scaling was similarly rejected because of the associated reduction in significant digits for some measures. For example, when large values (such as TBV in mm³) are reduced to a smaller value (such as TBV in m³), the loss of precision could lead to more ties when rank-ordering the pairwise distances, ultimately impacting the ANOSIM results. Therefore, because of the regional differences in scale and the intrinsic link between the mean and variance, caution is urged when comparing results between different brain region outcomes.