Participants
The present study was conducted using data from the S1200 release of the Human Connectome Project (HCP), which includes structural magnetic resonance imaging (MRI) data from 1,113 healthy young adult participants 47. The HCP dataset contains unequal numbers of females (n = 606) and males (n = 507), who also differ in age (Meanfemales = 29.56, Meanmales = 27.90; t(1111) = 7.63, p = 4.94 × 10−14). Therefore, we used a self-built algorithm to randomly select a sex-balanced sample of participants (438 females, 438 males) who did not differ in age. This sample was subsequently split into training and testing subsamples (see below).
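A minimal sketch of this kind of rejection-sampling selection is given below in Python (the original self-built algorithm is not reproduced here; the function name select_balanced_sample, the toy ages, and the 0.25-year matching tolerance are all illustrative assumptions):

```python
import random
from statistics import mean

def select_balanced_sample(females, males, n_per_sex, max_age_diff=0.25,
                           seed=0, max_iter=10_000):
    """Randomly draw n_per_sex participants per sex until mean ages match.

    `females`/`males` are lists of (participant_id, age) tuples. This is an
    illustrative stand-in for the self-built selection algorithm, not the
    original code.
    """
    rng = random.Random(seed)
    for _ in range(max_iter):
        f = rng.sample(females, n_per_sex)
        m = rng.sample(males, n_per_sex)
        # accept the draw only when the sexes are age-matched
        if abs(mean(a for _, a in f) - mean(a for _, a in m)) <= max_age_diff:
            return f, m
    raise RuntimeError("no age-matched sample found")

# toy data: females slightly older than males on average, as in the HCP sample
gen = random.Random(42)
females = [(f"F{i}", gen.uniform(22, 37) + 0.3) for i in range(600)]
males = [(f"M{i}", gen.uniform(22, 37)) for i in range(520)]
f_sel, m_sel = select_balanced_sample(females, males, 438)
```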
Imaging and data preprocessing
MRI acquisition and image preprocessing
The MRI acquisition details for the HCP-sample can be found in the reference manual of the S1200 release of the HCP (https://www.humanconnectome.org/storage/app/media/documentation/s1200/HCP_S1200_Release_Reference_Manual.pdf).
Images were preprocessed with the CAT12 toolbox (http://www.neuro.uni-jena.de/cat/, version r1184) of the SPM12 (http://www.fil.ion.ucl.ac.uk/spm/software/spm12/, version 6906) software. CAT12 preprocessing was conducted following the standard default procedure suggested in the manual. Briefly, it includes the following steps: (1) segmentation of the images into gray matter, white matter, and cerebrospinal fluid; (2) registration to a standard template provided by the International Consortium of Brain Mapping (ICBM); (3) DARTEL normalization of the gray matter segments to the MNI template; (4) modulation of the normalized data via the “affine + non-linear” algorithm; and (5) data quality check (in which no outliers or incorrectly aligned cases were detected). Images were not smoothed because we were only interested in the modulated images.
After applying this procedure, which does not include any correction for overall head size, voxels were mapped into 116 regions of the Automated Anatomical Labeling (AAL) atlas22 by calculating the total gray matter volume (GMVOL) of each volume of interest (VOI) for each participant via a MATLAB script (https://www0.cs.ucl.ac.uk/staff/g.ridgway/vbm/get_totals.m). Total intracranial volume (TIV) was estimated using the native-space tissue maps obtained in the segmentation step. More specifically, TIV was calculated as the sum of the GM, WM, and CSF total values multiplied by the voxel size and divided by 1,000 to obtain a measurement in milliliters (ml). The GMVOL estimates for these 116 regions were employed as the predictors for the machine-learning algorithms described below.
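The TIV rule just described (sum of GM, WM, and CSF totals, multiplied by voxel size and divided by 1,000) can be sketched as follows (an illustrative Python simplification; real tissue maps are 3D images and the actual computation was done in MATLAB/CAT12):

```python
def estimate_tiv(gm, wm, csf, voxel_volume_mm3):
    """Estimate total intracranial volume (TIV) in ml:
    (GM + WM + CSF totals) x voxel volume / 1,000.

    `gm`, `wm`, `csf` are flat lists of per-voxel tissue values from the
    native-space segmentation maps (a simplification for illustration).
    """
    total = sum(gm) + sum(wm) + sum(csf)
    return total * voxel_volume_mm3 / 1000.0

# toy example: 1.5 mm isotropic voxels (1.5**3 = 3.375 mm^3 each)
tiv_ml = estimate_tiv([1.0] * 1000, [1.0] * 500, [1.0] * 500, 1.5 ** 3)
```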
Training/testing subsets and data standardization
Following current recommendations 48,49, classification algorithms were fitted and tested in two separate groups of participants randomly extracted from the previously described main sample. The training subsample included 288 females and 288 males, whereas the testing subsample included 150 females and 150 males, hence avoiding classification distortions due to between-class imbalance 50,51. In both subsets, females and males were very similar in age (range of means = 28.4–28.6) and showed similar differences in TIV (Cohen’s d = 1.8 in both cases; see further details in Supplementary Tables 1A and 1B).
Before being used as predictors, all volumetric variables were transformed into z-scores to avoid distortions due to their different ranges 48,52. Standardization was initially performed in the training subset, and the exact same scaling parameters were subsequently used to standardize the testing subset.
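This train-then-apply standardization logic can be sketched as follows (illustrative Python; the actual analyses used R, and fit_scaler/apply_scaler are hypothetical names):

```python
from statistics import mean, stdev

def fit_scaler(train_features):
    """Estimate per-feature means and SDs on the training subset only."""
    return [(mean(col), stdev(col)) for col in train_features]

def apply_scaler(features, params):
    """z-score features (training or testing) with the training parameters,
    so the testing subset is never used to estimate scaling parameters."""
    return [[(x - mu) / sd for x in col]
            for col, (mu, sd) in zip(features, params)]

params = fit_scaler([[1.0, 2.0, 3.0]])           # one feature, three training cases
train_z = apply_scaler([[1.0, 2.0, 3.0]], params)
test_z = apply_scaler([[4.0]], params)           # testing data reuse training mu/sd
```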
TIV adjustment: The raw and the PCP datasets.
Previous studies have shown that the estimates of univariate and multivariate sex differences are largely dependent on TIV variation, but also that not all the currently used methods are equally effective and valid for removing TIV variation 5,18. Therefore, in the present study, all analyses were conducted twice in the same subjects, without introducing any TIV adjustment (“raw” dataset) and after removing TIV variation with the well-validated power-corrected proportions (PCP) method21. The PCP method improves the traditional proportions approach by introducing an exponential correcting parameter in the denominator. More specifically, the adjusted volume of interest (VOI) is calculated as VOIadj = VOI/TIV^b, where the parameter b corresponds to the slope of the log(VOI) ~ log(TIV) regression line calculated for the entire sample of participants 21.
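The PCP adjustment can be sketched as follows (an illustrative Python translation of the formula above, not the original implementation):

```python
import math

def pcp_adjust(vois, tivs):
    """Power-corrected proportions (PCP): VOI_adj = VOI / TIV**b, where b is
    the slope of the log(VOI) ~ log(TIV) regression over the whole sample."""
    lx = [math.log(t) for t in tivs]
    ly = [math.log(v) for v in vois]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    # ordinary least-squares slope of the log-log regression
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    return [v / t ** b for v, t in zip(vois, tivs)]

# toy check: volumes built with b = 0.9 are fully TIV-adjusted afterwards
tivs = [1200.0, 1400.0, 1600.0]
vois = [2.0 * t ** 0.9 for t in tivs]
adjusted = pcp_adjust(vois, tivs)
```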
Machine-learning algorithms
We report and compare the outcomes of five classification algorithms that differ in their assumptions (Supplementary Table 1C) and that provide an adequate representation of the principal “families” of machine-learning classifiers.
Testing several ML algorithms is important because the algorithms’ internal operations depend heavily on these assumptions 48,53 and may therefore lead to different outcomes. Moreover, comparing the outcomes of different ML algorithms (see the Between-algorithm PCAM variation subheading) is necessary to ensure that the results obtained describe method-independent findings and that the conclusions drawn are truly generalizable. The outcomes considered included common classification metrics (such as the percentage of correctly classified cases, %CC) as well as novel and alternative ones that were obtained by using the posterior classification probabilities yielded by the ML algorithms (in this case, operationalized as the probability of being classified as male, PCAM) as a continuous variable (see details in the Assessing multivariate sex differences in GMVOL subheading).
All the ML classifiers were implemented and cross-validated (5 folds; 10 repeats) using the interface provided by the caret package for R. In alphabetical order, the predictive algorithms used in the present study were:
-Linear Discriminant Analysis (LDA): Implemented by the default options of the lda function from the MASS package 54.
-Logistic Regression (LR): Implemented by the glm function (family= “binomial”) of the stats package natively included in R 55.
-Multiple Adaptive Regression Splines (MARS): Implemented by the earth function of the earth package for R (Milborrow, 2019). The hyper-parameters of the model were determined by a cross-validated grid search assessing 30 possible combinations (degree: 1–3; nprune: 2–116, length.out = 10).
-Random Forest (RF): Implemented by the rf function of the randomForest package 57, built up by aggregating 500 classification trees, each of them using 10 randomly selected predictors.
-Support Vector Machine with a radial kernel (SVM): Implemented using the svmRadial function of the kernlab package for R 58. The tune function (tenfold cross-validation) was used to automatically select the optimal values for the regularization and kernel-width hyper-parameters.
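The 5-fold, 10-repeat cross-validation scheme handled by caret can be sketched as follows (illustrative Python; repeated_kfold_indices is a hypothetical stand-in for caret's trainControl(method = "repeatedcv")):

```python
import random

def repeated_kfold_indices(n, k=5, repeats=10, seed=0):
    """Yield (train, test) index splits for repeated k-fold cross-validation,
    reshuffling before every repeat (here: 5 folds x 10 repeats, as above)."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test

splits = list(repeated_kfold_indices(20))  # 5 folds x 10 repeats = 50 splits
```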
Statistical analyses
All statistical analyses were conducted in the testing subsamples of the raw and the PCP datasets using different packages for R55. Statistical analyses focused on description and effect size estimation rather than merely testing statistical significance59. All effect size estimates were accompanied by 95% confidence intervals (CI), and, when appropriate, these effects were also reported in terms of their percentage of the maximum possible (POMP) score (see 35,60). Moreover, when statistical significance was tested, p values were corrected for multiple comparisons with the FDR61 or -when comparing deciles (see below)- with the Hochberg62 method.
Algorithms’ performance: Predictive accuracy.
Algorithms’ performance was initially measured as the percentage of correctly classified cases (%CC) and its 95% CI. These %CC scores were compared with the chance-expected value of 0.5 using one-sided binomial tests, and with each other by means of McNemar’s test. Classification bias (i.e., whether females or males had higher chances of being misclassified) was also assessed using McNemar’s test. The details of these analyses are presented in the Supplementary Material.
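The one-sided binomial comparison of %CC with the 0.5 chance level can be sketched as follows (illustrative Python, not the R code actually used):

```python
from math import comb

def binom_p_greater(successes, n, p0=0.5):
    """Exact one-sided binomial test: P(X >= successes) under chance level p0.
    A plain-Python stand-in for the binomial tests applied to the %CC scores."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(successes, n + 1))
```

For example, with 5 or more correct classifications out of 10 the upper tail sums the binomial probabilities for k = 5, ..., 10; classifying everything correctly drives the p value toward zero.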
As discussed in the main text, the calculation of the %CC scores requires a dichotomization of the continuous outcome provided by ML algorithms. To illustrate the statistical and interpretative distortions15–17 introduced by this dichotomization, we calculated new %CC scores after dividing this outcome (PCAM) into three tertiles (as in some previous studies14,30) instead of into two halves. These scores were compared to the chance-expected value of 0.33 by means of chi-squared tests. McNemar’s test was used to compare the dichotomization- and trichotomization-based %CC scores, as well as the trichotomization-based %CC scores yielded by each algorithm in each dataset.
Assessing multivariate sex differences in GMVOL
As in some previous studies14,32–34, we used the a posteriori probabilities yielded by the ML algorithms (in this case, operationalized as the probability of being classified as male, PCAM) as a continuous and information-rich dependent variable. The male and female PCAM distributions yielded by each algorithm in each dataset were first described through bootstrap estimates of appropriate statistics (skewness, kurtosis, deciles, inter-quantile range, and variance; repetitions = 10,000). In a second step, PCAM scores were used to quantify the multivariate sex differences in GMVOL at different levels. More specifically:
Possible sex differences in PCAM dispersion measures (variances and inter-quantile ranges, IQR) were assessed through the original version and a customized version of the comvar2 function included in the freely accessible Rallfun-v38 file (https://dornsife.usc.edu/labs/rwilcox/software/).
The overall degree of similarity between the male and female PCAM density distributions was quantified using the \(\widehat{\eta }\) overlap index. The \(\widehat{\eta }\) index measures the area intersected by two probability density functions and is conceptually related to other measures of overlap, such as the Kullback-Leibler divergence and the Bhattacharyya distance. However, unlike these overlap metrics, \(\widehat{\eta }\) can be estimated in the absence of symmetry, unimodality, or any other distributional assumption63. In the present study, kernel density estimation (KDE) and \(\widehat{\eta }\) were obtained through the boot.overlap function (10,000 repetitions) of the overlapping package for R64. A second, complementary estimate of these sex differences at the distribution level was obtained by calculating the probability of superiority (PS). The PS is defined as the probability that a randomly sampled member of group A will have a higher score than a randomly sampled member of group B. More specifically, the probability that males’ PCAM scores would be higher (PSM), equal to, or lower than those of females (PSF), along with Cliff’s d statistic65 and its 95% CI, was obtained through the cidv2 function of the rogme package23.
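The PS and Cliff's d definitions above can be sketched with a brute-force pair count (illustrative Python; the study itself used the cidv2 function of the rogme package):

```python
def probability_of_superiority(a, b):
    """Return (P(a > b), P(tie), P(a < b), Cliff's d) over all a x b pairs.
    Cliff's d is P(a > b) - P(a < b); a brute-force illustration only."""
    gt = sum(1 for x in a for y in b if x > y)
    lt = sum(1 for x in a for y in b if x < y)
    n = len(a) * len(b)
    return gt / n, 1.0 - (gt + lt) / n, lt / n, (gt - lt) / n

# toy example: of the four pairs, three favor `a` and one is a tie
ps = probability_of_superiority([2, 3], [1, 2])
```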
Because no single score can properly summarize the differences between two distributions23,25,35–39, male-female differences in the PCAM continuum were characterized by comparing their cumulative distribution functions (CDF; 25,35). CDFs make it possible to directly estimate the proportion of cases in each group with PCAM values equal to or lower than any possible cutoff, as well as the proportion of subjects in one group that have PCAM values equal to or lower than those of a given proportion of cases in the other group25,35. Within each CDF, sex-based comparisons were conducted at each decile with the shifthd_pbci function (bootstrap: 10,000 repetitions) of the rogme package23. The shifthd_pbci function -and the other functions of this package described below- use the Harrell-Davis quantile estimator in conjunction with a percentile bootstrap approach to calculate the deciles and the between-group differences at those deciles. Unlike traditional parametric methods, this approach ensures that the estimates fall within the bounds of the PCAM distribution [0,1], thus preventing inappropriate inferences. Moreover, during the calculation of these decile differences, the corresponding 95% CIs are computed, and the significance level is adjusted for multiple comparisons using the Hochberg method. Thus, when one of these CIs does not include zero, the difference can be declared statistically significant at α < 0.05 without concern about Type I error23.
With the decile estimates obtained, the so-called shift functions24 were also calculated. The shift function plots the between-group decile differences against the deciles of one group, thus providing a complete picture of how, and by how much, the score distribution of one group should be re-arranged to match that of the other group (for a detailed description, see23). Finally, we also assessed whether the estimated size of the female-male difference at D5 (the median) differed between algorithms, and whether it varied across the deciles of the PCAM distributions obtained with each algorithm. These comparisons were conducted with the original bwquantile function (see the Acknowledgements section) and with a customized version of this function, respectively.
Following current recommendations23,24, we also estimated the size of the typical difference between any given male and any given female in each PCAM distribution of the raw and PCP datasets. These bootstrapped estimations were conducted using the allpdiff_hdpbci function (bootstrap: 10,000 repetitions) of the rogme package 23, which uses the Harrell-Davis estimator to compute the deciles (and their 95% CIs) of the empirical distribution of all (in this case, 22,500) pair-wise differences between the members of two independent groups. We also calculated the CDFs of these pair-wise differences in the raw and PCP datasets, and between-dataset decile-based comparisons were then conducted with the shiftdhd_pbci function (bootstrap: 10,000 repetitions) of the rogme package23. Finally, we employed the Dqcomhd function (bootstrap: 50,000 repetitions) of the WRS2 package for R to ascertain whether the deciles of these male-female pair-wise differences varied significantly between algorithms in each dataset.
Interpreting multivariate sex differences in GMVOL
For obvious reasons, interpretability has become a major issue in ML applications 66. In the particular case of the study of multivariate sex differences, knowledge about the brains of females and males is only gained when the complex and numerical output of ML algorithms is decomposed and the brain features that contribute the most to the groups’ distinguishability are identified. To provide this information, we extracted global, post-hoc, model-agnostic explanations of the five ML algorithms tested in this study by modeling their outputs through the use of interpretable surrogate models (for a discussion about the different types of interpretability and their associated methods, see 66–68). More specifically, we employed boosted beta regression procedures to identify the brain features that best predicted the PCAM scores observed with each algorithm in each dataset.
Boosted beta regression analyses and between-algorithms’ agreement.
In the statistical literature, beta regression has been established as a powerful and readily interpretable procedure for modeling bounded (0,1) distributions69. However, because fitting classical beta regression models might become challenging when using a large number of predictors, boosted beta regression models have been developed27. Boosted beta regression is based on the gamboostLSS algorithm, which performs reliable variable selection during the iterative fitting process (for a comprehensive description of boosted beta regression, see27). In the present study, boosted beta regressions were implemented through the betaboost package for R70, using the PCAM scores observed with each algorithm in each dataset as the response variable and the volumetric scores of the testing subsample in the raw/PCP datasets as predictors. The number of iterations that most reduced the risk was established through cross-validation, and the contribution of each predictor was estimated from the obtained mu-coefficient values and by constructing a relative importance measure: relative importance = 100 × (accumulated risk reduction attributable to a predictor / total risk reduction in the model).
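The relative importance formula can be sketched as follows (illustrative Python; the predictor names and risk-reduction values are made up for the example):

```python
def relative_importance(risk_reduction):
    """relative importance = 100 * (accumulated risk reduction attributable to
    a predictor / total risk reduction in the model), per the formula above.
    `risk_reduction` maps predictor names (hypothetical) to risk reductions."""
    total = sum(risk_reduction.values())
    return {name: 100.0 * r / total for name, r in risk_reduction.items()}

# hypothetical risk reductions for three AAL regions
ri = relative_importance({"Amygdala_L": 2.0, "Cerebelum_6_R": 3.0,
                          "Precuneus_R": 5.0})
```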
In a second step, the degree of agreement between the boosted beta regression models obtained was assessed. These comparisons were conducted between datasets and between algorithms within each dataset. For each of these two sets of comparisons, we first assessed whether boosted beta regression models included the same brain features as relevant predictors. More specifically, R-wise agreement (coincidence between all models) was estimated by means of the Hubert’s Kappa index71 and the multi-rater delta index (which, unlike other more commonly used agreement metrics, is not affected by the ratings’ marginal distributions72,73). These two agreement indexes were calculated using software specifically developed for this purpose and freely available at https://www.ugr.es/~bioest/software/cmd.php?seccion=agreement.
We also assessed the degree of agreement on the relative importance attributed to each predictor by using Lin’s concordance correlation coefficient 74, Kendall’s W agreement coefficient 75, the mean of bivariate Spearman's rho rank correlations 76, and the intraclass correlation coefficient (two-way ANOVA, random effects, single and average ratings 77,78). In the case of comparisons between algorithms within each dataset, agreement was assessed at the interval and at the ordinal level by inputting the obtained coefficients’ values and their ordinal rank positions, respectively. In the case of between-dataset comparisons, agreement was assessed by using each predictor’s ranking mean (RM) or its position in a multiplicative “rank of ranks” (RRP, 79) across algorithms. These two measures were also used in a correlational analysis assessing whether the relative importance of these predictors was associated with TIV variation. Thus, as in 5,18, linear regression analyses were conducted to obtain an estimate (r2) of the TIV-explained variance in each brain region, and the r2 scores corresponding to the brain regions identified as relevant predictors in each dataset were correlated with their corresponding RM and RRP values.
Between-algorithm PCAM variation
To assess the degree of similarity of the PCAM scores obtained with different algorithms, three complementary approaches were used. First, for each individual, the minimum PCAM score obtained across algorithms was subtracted from the maximum PCAM score within each dataset, and the CDFs depicting this maximal PCAM variation in each dataset were built and described by several summary statistics (minimum, average, deciles, and maximum). Second, the same statistics were used to describe the degree of PCAM variation for each pair of algorithms within the raw and the PCP datasets. Finally, to assess the degree of ordinal similarity between the PCAM scores yielded by different algorithms, zero-order Spearman’s rho between-algorithm correlation matrices were calculated in the raw and PCP datasets. Because we corroborated a significant contribution of TIV to PCAM scores in the raw dataset, this correlational analysis was also conducted using partial (TIV-removed) Spearman correlations.
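The first of these three approaches (per-individual maximal PCAM variation) can be sketched as follows (illustrative Python; the algorithm labels and scores are made up):

```python
def max_pcam_variation(pcam_by_algorithm):
    """For each individual, subtract the minimum from the maximum PCAM score
    obtained across algorithms (one score list per algorithm, same order of
    individuals in every list)."""
    per_individual = zip(*pcam_by_algorithm.values())
    return [max(scores) - min(scores) for scores in per_individual]

# toy example: two individuals scored by three algorithms
variation = max_pcam_variation({"LDA": [0.2, 0.9],
                                "RF": [0.4, 0.5],
                                "SVM": [0.1, 0.8]})
```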
Hierarchical clustering
As described above, the high accuracy observed in previous sex classification studies has often been interpreted as showing the ability of ML algorithms to identify two clearly distinguishable brain types, one typical of males and the other typical of females. However, proving that these brain types actually exist requires confirming that the brains of females and males substantially differ in a specific and reproducible pattern of brain features. To assess whether or not these distinctive brain profiles could be found in our data, agglomerative hierarchical clustering methods (average linkage) were used.
Specifically, hierarchical clustering analyses were performed with the hclust function of the stats package55. Initially, the included features were the volumetric z-scores of those brain areas identified as relevant predictors of the PCAM scores yielded by each ML algorithm in each dataset (see the boosted beta regression section). Dissimilarity was measured in terms of Euclidean and Spearman distances; Euclidean distances served to quantify the individuals’ disparity in terms of the magnitude of their accumulated differences in GMVOL, whereas Spearman distances measured the discordance in the shape of their brain profiles. Each resulting dendrogram was cut at appropriate heights to obtain 2 to 10 clusters, and in each of these alternative partitions the size (number of subjects) and composition (proportion of females) of the resulting clusters were assessed. The robustness of the obtained results was corroborated by repeating the same analyses with the volumetric z-scores of the 116 brain areas of the AAL atlas and also with the top five predictors of the PCAM scores yielded by each ML algorithm in each dataset.
Complementarily, a series of analyses of similarity (ANOSIM) were conducted. ANOSIM is an ANOVA-like, non-parametric test that operates on distance matrices and assesses the null hypothesis that the distances between the members of two or more predefined groups (in this case, males/females) are the same as the distances within these groups80. Because in the present study this assessment involved a large number of instances (44,850), statistical significance was almost guaranteed and, consequently, not truly informative81. Therefore, we focused on estimating the value of the R statistic (and its 95% CI), which compares the mean of ranked dissimilarities between groups to the mean of ranked dissimilarities within groups, and whose meaningful range lies between 0 (when within-group similarity is the same as between-group similarity) and 1 (when all samples within groups are less dissimilar to each other than to any pair of samples from different groups)80. More specifically, Euclidean and Spearman distance matrices were calculated in 10,000 bootstrap samples using the get_dist function of the factoextra82 package. In each of these distance matrices, an ANOSIM test was conducted with the anosim function of the vegan package83 and the corresponding R value was obtained. In a second step, the 95% confidence intervals of these estimates (normal approximation and percentile method) were obtained through the boot.ci function of the boot package84. Again, these calculations were performed using as features the volumetric z-scores of the 116 brain areas of the AAL atlas, and of all or just the top five predictors of the PCAM scores yielded by each ML algorithm in each dataset.