Participants
Data were obtained from the Adolescent Brain and Cognitive Development (ABCD) Study, an ongoing longitudinal investigation of brain and behavioral development in 11,878 children in the United States (U.S.). Youth were recruited primarily from schools within 21 catchment areas chosen to represent the ethnic and sociodemographic diversity of the larger U.S. population (21). Analyses were performed using de-identified phenotypic data from the baseline time point obtained from the ABCD 3.0 National Data Archive release (DOI 10.15154/1519007). Analyses were restricted to youth with available genetics data. See Supplementary Table 1 for subject characteristics.
Genotyping and imputation
Genotyping for the ABCD dataset was performed at the Rutgers University Cell and DNA Repository using the Affymetrix NIDA Smokescreen Array (RUCDR). Quality control was performed by the RUCDR and the ABCD Data Analysis and Informatics Core (DAIC), and included filtering for call signals and rates, and processing through the Ricopili pipeline (57). Genotyping data was merged with data from the 1000 Genomes project and ancestry was assigned using the k-nearest neighbors classification with the first four genetic principal components.
C4 structural alleles were imputed for the ABCD dataset using Beagle4.1 (58) with the latest multiethnic C4 imputation reference panel composed of 1,265 samples of diverse ancestry (18). C4A and C4B genetically regulated expression (GREx) was predicted using previously described expression weights(11).
Downstream analyses were restricted to samples of African (AFR), European (EUR), and Latinx ancestries as the imputation panel was composed of those individuals, and C4 imputation accuracy (based on probabilistic dosage) was lower for Southeast Asian and East Asian. Individuals with average imputed probabilistic dosage <0.7 were excluded from subsequent analyses. To further ensure the use of high-quality imputed data, analyses were restricted to individuals with the five most common C4 structural haplotypes (AL, AL-AL, AL-BL, AL-BS, BS) resulting in a final sample of 7,789 youth (see Supplementary Table 2).
For UK Biobank, chromosome 6 of the phased haplotype data (i.e. ukb_hap_chr6_v2.bgen) was used to impute C4 structural alleles. Beagle4.1, with the multiethnic C4 reference panel, was used for imputation, as was done for the ABCD dataset.
Polygenic risk prediction
Polygenic risk scores (PRS) were calculated for EUR subjects. Genotype imputation was performed on the Michigan Imputation Server (59) using the TOPMed reference panel. Quality control included filtering for genotyping rate >0.99, sample missingness <0.01, Hardy-Weinberg Equilibrium p>1×10−6, minor allele frequency >1%, and imputation score >0.8. GCTA v1.93.31 (60) was used to compute the genetic relationship matrix; a relatedness cutoff of 0.05 was applied. Genetic variants in the major histocompatibility complex (MHC) locus were excluded prior to PRS calculation. PRS were generated using summary statistics from the largest available genome-wide association study of schizophrenia (10) (40,675 cases, 64,643 controls) in PRS-CS (61), applying the LD reference panel from the 1000 Genomes Project phase 3 samples and the recommended global shrinkage parameter for highly polygenic traits (phi=1e-2). Analyses assessing the interaction between C4 expression and PRS were restricted to youth with common C4 structural haplotypes (AL, AL-AL, AL-BL, AL-BS, BS), resulting in a final sample of 3,730 EUR youth (female N = 1,737; male N = 1,993) and 7,715,663 genetic variants.
Neuroimaging data acquisition and preprocessing
Details of the ABCD magnetic resonance imaging (MRI) protocol have been described previously (62). Briefly, imaging data were collected on either a Siemens Prisma, Phillips, or GE 750 3T scanner using 32 channel head or 64 channel head/neck coils. T1-weighted scan parameters were as follows: Siemens - matrix size 256x256, 176 slices, FOV 256x256, resolution (mm) 1x1x1, TR 2,500ms, TE 2.88ms, flip angle 8°, total scan time 7:12; Phillips - matrix size 256x256, 225 slices, FOV 256x240, resolution (mm) 1x1x1, TR 6.31ms, TE 2.9ms, flip angle 8°, total scan time 5:38; GE - matrix size 256x256, 208 slices, FOV 256x256, resolution (mm) 1x1x1, TR 2,500ms, TE 2ms, flip angle 8°, total scan time 6:09.
Tabulated data representing T1-weighted measurements of subcortical volume (VOL), cortical thickness (CT), and surface area (SA) as derived from the Desikan parcellation atlas(32) in FreeSurfer v5.3 (http://surfer.nmr.mgh.harvard.edu/) were downloaded from the NDA. ABCD recommended imaging inclusion criteria were followed to ensure inclusion of high-quality MRI scans, this excluded participants with serious MR findings and participants whose T1-weighted images or FreeSurfer parcellations failed to pass ABCD DAIC quality control. Regional MRI metrics were averaged across hemispheres.
Phenotypic data
Psychotic experiences were assessed with the Prodromal Questionnaire Brief Version (22) (PQ-B) and the Kiddie Schedule for Affective Disorders and Schizophrenia (23) (KSADS-5). The PQ-B is a 21-item child-report questionnaire designed to measure the presence and severity of psychotic-like experiences (PLEs) in childhood. PQ-B symptom scores (PQ-Bsym) were calculated as the total number of endorsed PLEs; severity score (PQ-Bsev) was calculated as the child-reported level of distress (range 1-5) across all endorsed items. In line with previous research indicating that unusual thought content and suspiciousness are most predictive of psychosis risk(24), a binary score (PPbin) was developed by defining prodromal psychosis “cases” as youth who endorsed experiencing at least 3 PLEs reflecting suspicious or unusual thought content (i.e., PQ-B items 1, 4, 5, 7, 8, 11, 12, 13-18), significant distress related to these experiences (i.e., PQ-B distress scores > 6 across items 1, 4, 5, 7, 8, 11, 12, 13-18), and whose parents indicated the child experienced hallucinations, delusions, and associated psychotic symptoms on the KSADS-5 assessment; youth not meeting these criteria served as the “control” group. Cognitive and psychiatric variables from the reported C4 phenome-wide association study (PheWAS) can be found in Supplementary Table 3.
Rates and severity of childhood PLEs were non-normally distributed, with the majority of youth reporting zero or few PLE symptoms (PQ-Bsym; M = 2.55, SD = 3.52) and endorsing low-levels of PLE severity (PQ-Bsev; M = 6.09, SD = 10.39; Figure 1B).
Statistical analyses
Generalized linear models with gamma family and log link were run using the glm package in R version 3.6.3 to test the association between C4 GREx and continuous measures of psychotic experiences (i.e., PQ-Bsym, PQ-Bsev), as these variables demonstrated a right-skewed distribution. The association between C4 GREx and PPbin was tested via logistic regression with binomial family using the glm package in R. Fixed effects included predicted C4A and C4B GREx, age, socioeconomic status (SES; taken as the average of parent education and income at the baseline time point), four genetic principal components, sex, and site ID. Statistical models run in related individuals using family ID as a random effect failed to converge; thus, analyses were restricted to one randomly selected subject from each family.
Models testing multivariable phenome-wide associations were also restricted to one individual per family. For normally distributed variables, linear models were fitted, while for non-normally distributed variables, generalized linear models with gamma distribution and log link were fitted. C4A and C4B GREx, age, SES, four genetic principal components, sex, and site ID were specified as fixed effects. Benjamini-Hochberg False Discovery Rate (FDR) correction was used to account for the number of variables tested at each timepoint (i.e., baseline, 1-year follow-up, 2-year follow-up. The significance of covariates of interest was assessed using the likelihood ratio test.
To test the association between C4 GREx and neuroimaging indices of brain volume, cortical thickness, and surface area, linear mixed models were run via the lme4 package in R. C4A and C4B GREx, age, four genetic principal components, SES, and sex were entered as fixed effects; MRI device serial number and family ID were entered as random effects. To control for global brain effects, whole brain volume, mean cortical thickness, or mean surface area were also included as covariates for respective models. The significance of covariates of interest was assessed using the likelihood ratio test, and FDR correction was used to account for the number of variables tested within each imaging measure (i.e., VOL, CT, SA).
Sex-differences in the effects of C4 GREx on brain and behavioral phenotypes were tested by including an interaction term between C4A and C4B GREx and sex in statistical models. Follow-up analyses were performed in each sex separately.
The total number of participants in each analysis (i.e., those with complete PQ-Bsym, PQ-Bsev, PPbin, phenotype, and/or brain-imaging data) are provided in Supplementary Tables.
Power analyses were conducted in G*Power (63) to determine the range of detectable effects with the current sample size. With an estimated N of 6,500, alpha of 0.05, 1 tested predictor, and 11 covariates, we have 95% power to detect effects (Cohen’s f2) as small as 0.002, equivalent to a Cohen’s d of 0.004. As previous research in adults suggests that we can expect to see a significant association between C4A GRExand cognitive measures at Cohen's f2 = 0.056 (20), power calculations indicate that our study is sufficiently powered to detect small but statistically significant effects, equivalent to a Cohen’s d of 0.004.
Plots were generated using R packages ggpubr, ggplot, and ggesg. PheWAS results were plotted in R by modifying publicly available code from Dr. Yoonjung Joo (https://rpubs.com/helloyjjoo/pheWAS_connectome).
Replication in the UK Biobank
The UK Biobank resource was accessed through application number 38029. Brain imaging data was acquired for 18,361 unrelated individuals with European ancestry. Haplotypes at the C4 locus were imputed as described for the ABCD dataset. After further restricting to individuals with the five most common C4 haplotypes (AL, AL-AL, AL-BL, AL-BS, BS), a total of 16,147 individuals remained in the final analyses. A linear model was then fitted between entorhinal cortex surface area and C4A GREx, using C4B GREx, sex, age at scan, average cortex surface area, location of head in scanner (Data fields 25756 to 25759), first four genotype PCs and scan sites as covariates. A null model was also fitted with only covariates but not C4A GREx. Statistical significance of the association was evaluated by performing a likelihood ratio test. The analysis was performed in females (N = 8,357) and males (N = 7,790) separately, without using sex as a covariate.
Validation of expression weights for C4 genes in PsychENCODE
Previously harmonized genotype array and frontal cortex RNA-seq data from PsychENCODE (19,64) consisted of uniformly processed data from six studies: BipSeq, LIBD_szControl, CMC_HBCC, CommonMind, BrainGVEX, and UCLA-ASD. We imputed C4 alleles in six studies from PsychENCODE separately in the same manner as the ABCD dataset using Beagle4.1 with the latest multiethnic C4 imputation reference panel. This analysis consisted of 340 AFR, 863 EUR, and 50 Latinx samples. C4 expression was then predicted using previously described expression weights(11), which was associated with observed C4 gene expression (Supplementary Fig. 1). Observed C4 gene expression was strongly associated with corresponding predicted gene expression, except for C4B gene expression in AFR samples (P > 0.05).