Study Population
PROGRESS is an ongoing prospective birth cohort study conducted in Mexico City, Mexico. The study enrolled 948 women affiliated with the Mexican Social Security Institute (IMSS) during early pregnancy, and closely tracked the development of the infants during their early years, with assessments every six months initially and later biannually. Comprehensive longitudinal follow-up included surveys, physical examinations, and psychological/behavioral evaluations. Biological samples from mothers and children, including blood, were collected and archived at each visit. Additionally, a convenience sampled subset of participants (n = 123) contributed stool samples when the children reached ages 9–11 years.26 Of the 123 participants with stool samples, 112 also had complete outcome data. The research protocols for both the main PROGRESS study and its microbiome sub-study underwent thorough review and were granted approval by the Institutional Review Board at the Icahn School of Medicine at Mount Sinai in New York and the National Institute of Public Health in Cuernavaca, Mexico.
Blood metal measurement
During the 2nd and 3rd trimester visits (18.3 and 31.6 weeks of gestation, respectively), maternal blood samples were collected using standard venipuncture. 66 All blood specimens were drawn using tubes free from trace metals and were stored at temperatures between 2°C and 6°C until analysis. Metal concentrations, including lead (Pb), arsenic (As), cadmium (Cd), chromium (Cr), zinc (Zn), selenium (Se), antimony (Sb), copper (Cu), cesium (Cs), cobalt (Co), and manganese (Mn), were determined using the Agilent 8800 ICP Triple Quad (ICP-QQQ) in MS/MS mode at the trace metals laboratory of the Icahn School of Medicine at Mount Sinai. Measurements were taken in five replicates and reported as an average. For the purpose of QC, all lab recovery rates by this method were 90 to 110%, and inter-day and intra-day precision (given as %relative standard deviation) is less than 6% for samples with concentrations greater than the limit of quantification. 67 The limits of detection for each metal were: 0.391 ug/L for As, 0.113 ug/L for Cd, 0.117 ug/L for Co, 0.901 ug/L for Cr, 0.122 ug/L for Cs, 1.862 ug/L for Cu, 0.442 ug/L for Mn, 0.377 ug/L for Pb, 0.159 ug/L for Sb, 0.665 ug/L for Se, and 5.053 ug/L for Zn (See Midya, 2024 for more details).21
Akkermansia muciniphila measurement
Details of the gut microbiome sampling and processing procedures have been previously published.26 Briefly, participants were recruited during the 9–11-year PROGRESS study visit. Stool samples were collected by participants at home, processed using the Fast method at the clinic in Mexico City, and promptly stored -70°C. Frozen samples were then shipped to the Microbiome Translational Center at Mount Sinai for microbiome sequencing analysis. The subsequent steps involved the processing and sequencing of the samples in two distinct batches: the first batch containing 50 samples and the 2nd containing 73 samples. For shotgun metagenomic sequencing, the NEBNext DNA Library Prep kit was used, and purified DNA sequencing using the Illumina HiSeq platform. The sequencing reads underwent trimming using Trimmomatic.68 To eliminate any human-associated reads, alignment to a human reference genome was conducted using bowtie2. 69 The remaining reads underwent analysis with MetaPhlAn2 and StrainPhlAn to ascertain detailed microbial taxonomy down to the species and strain levels.70,71 Results of the sequencing process can be found in Eggers, et al, 2023. Only children with no antibiotic use within the past month were included in this sub-study. For this analysis, we used the presence/absence of A. muciniphila from this sequencing data.
Depressive Symptom Measurement
As a part of the neurobehavioral follow-up, childhood depressive symptoms were assessed using the Childhood Depression Inventory (CDI) 2 Self-Report Short Version at ages 9-11 years.72 Part of the CDI form was administered to mothers about their children, and part of the form was administered directly to the children. The CDI is a questionnaire appropriate for use with participants aged 7-17 years and validated in Spanish.73 Items are scored on a scale normalized from 0 to 100, with higher scores indicating worse depression symptoms.
Covariate data
We followed a similar set of minimal covariates in line with our previous work. 27,29,30 Potential confounding variables used in this analysis encompassed a range of factors identified in previous literature and prioritized using DAGs. These included the child’s reported sex at birth, age at the time of stool sample collection, the socioeconomic status (SES) of the mother during pregnancy, the mother’s age at childbirth, the mother’s 2nd-trimester body mass index (BMI) while pregnant, and the batch of microbiome analysis (two batches). The mother’s height and weight were measured during the 2nd trimester using a professional digital scale and a stadiometer. From these measurements, the BMI was calculated and subsequently treated as a continuous covariate for regression analyses. The 1994 Mexican Association of Intelligence Agencies Market and Opinion (AMAI) rule was employed to assess maternal socioeconomic status during pregnancy. This classification system places families into six levels based on responses to 13 questions concerning household characteristics. As most families in the study belonged to the low to middle SES bracket, these six categories were consolidated into three broader classifications: lower, middle, and higher.74 All of these covariates were adjusted for in the statistical models. This sub-study utilized convenience sampling due to proximity to study location, availability of mothers and their children at a given time, or their willingness to participate in the study. Although there is potential selection bias and residual confounding bias, we utilized advanced causal inference techniques to address such issues to the best of our ability. Our analyses utilized covariate-balancing techniques and negative control outcomes to address any systematic and potential selection and residual confounding biases.
Statistical Analysis
Statistical analyses were conducted in R (version 4.2.3). The Pearson correlation coefficient was used to estimate the correlation between prenatal metal exposures. The outcome, t-scored CDI, was log-transformed (log tCDI) for all the main analyses (later in the sensitivity analyses, similar results were replicated with the non-log-transformed t-scored CDI). Some covariates (mother’s age and child’s age at the time of stool collection) included less than 5% missing values; therefore, under the assumption of missing at random, we imputed the missing values using predictive mean matching by the multiple imputation chained equations as implemented in the “MICE” R package.75 A false discovery rate was used to adjust for multiple comparison errors. The statistical analysis was conducted in three stages. First, we estimated the association with each individual metal and log-tCDI using linear regression models. The results were presented through a forest plot with beta estimates and corresponding 95%CI. Second, we estimated the association between the presence of A. muciniphila and log-tCDI using linear regression. Third, using a combination of interpretable machine-learning algorithm and regression-based framework, we identified the most frequently occurring metal-cliques and then estimated their associations with log-tCDI.
Although the details can be found in previous works, we present and discuss this method comprehensively for interpretability and further replicability.27,29,30 A metal-clique is an indicator of a multi-ordered combination of metals. For example, consider a three-metal combination consisting of metals A, B, and C, denoted as A+B+C-. Here, the metal-clique A+B+C- implies higher concentrations of metals A and B (above certain thresholds) and a lower concentration of metal C in the sample. This binarized indicator form of metal-clique A+B+C- denotes an underlying sub-sample (or subgroup) satisfying the conditions of the clique. We used the repeated holdout signed-iterated Random Forest (rh-SiRF),27,29,30 treating the concentrations of metal exposures during both the second and third trimesters as predictors and log(t-CDI) as the outcome. The rh-SiRF algorithm utilizes a combination of “Iterative Random Forests” with “Random Intersection Trees” to search for metal combinations predictive of log(t-CDI) scores. 76–78These predictive combinations of metals were chosen following the branches in the decision trees. Moreover, instead of searching for all possible metal combinations (231 two-components, 1540 three-components, for example), rh-SiRF finds the most frequently occurring combinations on the decision path. Further, bagging and repeated-holdout stages were introduced to estimate the “stability” of the discovered combinations. The algorithm was repeated 500 times on a training/test data partitioning of 60%/40%, with 250 bootstraps implemented in each iteration. From the list of most stable metal combinations, we chose the top three. We presented a closed-loop network of this combination through the Fruchterman-Reingold Layout 79 implemented through the igraph package in R.80,81 Finally, this combination is transformed into a binarized indicator using a quantile-based threshold-finding algorithm. A schematic of this algorithm and R code with illustrations on a simulated dataset is provided on GitHub (https://github.com/vishalmidya/MiCA-Microbial-Co-occurrence Analysis/blob/main/MiCA-vignette.md).
For improved inference, we used a matched-sampling strategy typically applied in causal inference analysis to obtain similar covariate distribution between children with or without A. muciniphila.31 The assumption is that, given the covariates, this balancing approach can potentially create “exchangeable” groups of children with or without A. muciniphila such that they are hypothetically randomly assigned, and most importantly, the covariates did not confound the group assignment.32 Due to the small sample size and to prevent greater sample loss due to covariate-balancing, we used a subclass matching procedure with the propensity score as implemented in the R package “MatchIt”.82 This approach uses propensity scores based on all the covariates to classify participants into subclasses, which were then weighted to balance the influence of the covariates for participants with vs. without A. muciniphila. We used love plots of the differences in standardized means in covariates to examine the suitability of balancing.32 All the regression analyses were based on this covariate-balanced matched sample. We further adjusted all the models with the previously described covariates.