Sample
Four different datasets (CIBERSAM, PsyCourse, GAIN, and nonGAIN) were obtained and combined to perform a GWAS meta-analysis comprising 4 740 patients of European ancestry (Table 1). In all datasets, subjects met the criteria for SCZ, schizoaffective disorder, schizophreniform disorder, delusional disorder, brief psychotic disorder, or psychotic disorder not otherwise specified, in the Diagnostic and Statistical Manual of Mental Disorders version IV (DSM-IV).
Table 1
Description of samples used in the study.
Dataset
|
Source
|
N
|
% Females
|
Mean AAO (SD)
|
Genotyping chip
|
CIBERSAM
|
Seven groups from the Biomedical Research Network in Mental Health (CIBERSAM)
|
1,704
|
32.98
|
25.32 (8.84)
|
Illumina Infinium PsychArray
|
PsyCourse
|
The PsyCourse study
|
499
|
38.88
|
26.26 (9.55)
|
Illumina Infinium PsychArray
|
GAIN
|
The Genome-Wide Association Study of Schizophrenia (dbGaP repository study accession: phs000021.v3.p2)
|
1,280
|
29.92
|
21.11 (6.77)
|
Affymetrix Genome-Wide Human SNP Array 6.0
|
nonGAIN
|
The Molecular Genetics of Schizophrenia—nonGAIN Sample (MGS_nonGAIN, dbGaP repository study accession: phs000167.v1.p1)
|
1,224
|
31.54
|
21.77 (7.24)
|
Affymetrix Genome-Wide Human SNP Array 6.0
|
Participants in the CIBERSAM (Biomedical Research Network in Mental Health) dataset were recruited from psychiatric in-patient units at seven different hospitals in Spain 29. Participation was approved by the ethical committees at the hospitals involved in the recruitment. Finally, samples were genotyped using the Illumina Infinium PsychArray at the Broad Institute as part of the wave 3 meta-analysis GWAS of SCZ of the Psychiatric Genomics Consortium (PGC-SCZ wave 3)30.
The PsyCourse samples were part of a multi-site German/Austrian longitudinal study (www.psycourse.de) that was conducted between 1 January 2012 and 31 December 2019. The study collected deep phenotypic, neuropsychologic, and omics data from patients with brief psychotic disorder, major depressive disorder (MDD), bipolar disorder (BD), SCZ, schizoaffective disorder, and healthy individuals. Adult participants were referred by the clinical staff or identified by querying patient registries. Study protocols were reviewed and approved by the ethics committees of the Medical Centers and Faculties involved in the recruitment, in accordance with the Declaration of Helsinki. All participants provided written informed consent. The phenotype information was gathered using the v4.1 version of the PsyCourse data release. These samples were genotyped using the Illumina Infinium PsychArray 31.
GAIN and nonGAIN datasets were both obtained from the dbGaP repository (accession numbers phs000021.v3.p2 and phs000167.v1.p1, respectively for each dataset), and genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0 as described elsewhere 32.
Age at onset
In the CIBERSAM dataset, AAO was defined as the onset of the first psychotic symptoms. The patient (and/or family members) and the psychiatrist defined when the first psychological symptoms appeared. In cases where this information was not available, we used the first psychiatric contact due to a psychotic episode. In the PsyCourse dataset, AAO was collected as both the age at first outpatient and inpatient treatment. Subjects with information available for either of these data were included, and when both data were available, the earliest age was used. For the GAIN and nonGAIN datasets, AAO was defined as the most likely AAO of psychotic symptoms consistent with the onset of SCZ. A consensus diagnostician (PI or senior research clinician delegate) reviewed the diagnostic ratings made independently by two research diagnosticians (one of which could be the consensus diagnostician as well) and assigned a final diagnosis and AAO if the ratings were in agreement. The Kolmogorov-Smirnov test was used to determine pair-wise differences in AAO between datasets.
GWAS and functional analyses
Quality control (QC) was conducted for each dataset separately (CIBERSAM, PsyCourse, GAIN and nonGAIN) using PLINK 1.9 33, according to standard procedures for GWAS 34. Briefly, genetic variants with missingness rate > 2%, minor allele frequency (MAF) < 5%, Hardy-Weinberg equilibrium (HWE) P-value < 1e-06, and those belonging to non-autosomal chromosomes were excluded from downstream analyses. Ambiguous and multiallelic variants were also removed. Subjects with a missingness rate > 2%, increased or decreased heterozygosity rates (defined as ± 3 standard deviations away from the sample mean), and relatedness > 12.5% (PI_HAT > 0.125) were excluded. Sex was imputed based on X chromosome heterozygosity/homozygosity rates, before removing sex chromosomes. Principal component analyses (PCA) were conducted using SMARTPCA from EIGENSOFT 6.1.4 35. To keep only subjects with European ancestry, we did not use those individuals who were beyond ± 3 standard deviations from the mean of the first two principal components (PCs) of the European cluster of the 1000 Genomes Project 36 Phase I. We also removed four individuals who clustered in the Finnish subgroup of the European cluster. Before genotype imputation, a PCA was performed again on the remaining subjects and the top 10 PCs were kept for further analysis.
Genotype imputation was conducted for each resulting dataset independently using the TOPMed reference panel 37. Imputed datasets were filtered according to an imputation quality score (r-squared) < 0.9 and converted to binary files using PLINK’s --vcf flag. Then, a post-imputation QC was conducted for each dataset separately, using PLINK. Only single-nucleotide polymorphisms (SNPs) were kept for further analyses. In addition, ambiguous and multiallelic SNPs, as well as SNPs with MAF < 1%, and an HWE P-value < 1e-06 were excluded. The resulting datasets were lifted over to genome build 19 using the UCSC liftOverPlink tool 38. The CIBERSAM dataset included: 1 704 subjects and 4 962 031 SNPs; PsyCourse: 499 subjects and 5 338 835 SNPs; GAIN: 1,278 subjects and 6 042 664 SNPs; and nonGAIN: 1 259 subjects and 6 074 765 SNPs. A GWAS was performed by linear regression for each dataset in PLINK, using normalized AAO as outcome and sex and the top 10 PCs as covariates. A meta-analysis was then conducted using the tool METAL 39 applying an inverse variance strategy. As a result we obtained information on 4 740 subjects and 6 540 522 SNPs. As an alternative method, individual-level imputed genotype data for each of the four separate datasets was merged and a GWAS was conducted in parallel for comparison purposes. From here on, this approach is called the merged approach (Supplementary Note).
Genomic loci showing suggestive associations with AAO (P-value < 1e-05) were identified and explored using the FUMA software 40. Each genomic risk locus was represented by the top lead SNP which had the minimum P-value in the locus. Lead SNPs were defined as associated SNPs (P-value < 1e-05) that were independent of each other at LD r2 < 0.1. Independent significant SNPs were defined as SNPs with a P-value < 1e-05, and independent of each other at a linkage disequilibrium (LD) threshold r2 < 0.6. The genomic risk loci were mapped to protein-coding genes within a 10 kb window based on ANNOVAR 41 annotations. Finally, pathway enrichment analyses were conducted using KOBAS-i 42. In all the analyses, a 5% False Discovery Rate (FDR) was considered for multiple testing correction.
SNP-based heritability
The proportion of phenotypic variance explained by SNPs was estimated using two different methods. First, SNP-based heritability was estimated with the Linkage Disequilibrium Score Regression (LDSC) method 43. To reduce the standard error given our relatively small sample size, the intercept was constrained to 1 after testing that it was not significantly higher than 1 44. The LDSC intercept has been widely employed to distinguish between inflation due to confounding factors (such as population stratification and cryptic relatedness) and inflation due to polygenicity. Deviation of the intercept from 1 is indicative of residual confounding, thus observing an intercept not significantly higher than 1 can be interpreted as there being minimal confounding bias 43. Second, we also estimated heritability using individual-level genotypes with the Genome-based Restricted Maximum Likelihood (GREML) approach implemented in the Genome-wide Complex Trait Analysis (GCTA) tool 45, adjusting for sex, the dataset and the top 10 PCs.
Cross-trait polygenic risk score
PRS analyses were conducted using the PRS-cs software 46 between ten mental phenotypes and AAO. Specifically, we obtained summary statistics data on seven psychiatric disorders downloaded from the Psychiatric Genomics Consortium (https://www.med.unc.edu/pgc/) 47, including SCZ, BD, MDD, attention-deficit/hyperactivity disorder (ADHD), autism spectrum disorder (ASD), obsessive-compulsive disorder (OCD), and cannabis use disorder (CUD). In addition, we obtained GWAS data on three conditions previously associated with AAO: neuroticism 48, educational attainment (EA) 11, and childhood maltreatment 49 (Table S1). The downloaded summary statistics were filtered to remove ambiguous, multiallelic and duplicated SNPs, and SNPs with an imputation score less than 0.9 (if the information was provided). The summary statistics of the mental phenotypes were used as base data, and the individual-level genotype dataset was used as target data. We calculated the PRS for each mental phenotype using the “auto” mode (the shrinkage parameter phi was determined from the data with a Bayesian approach). Then, the scores obtained for each individual in our dataset were regressed out against sex, age, and batch to obtain new adjusted-PRS scores. Linear regressions were performed with the normalized AAO values as outcome and the adjusted-PRSs as independent variables to evaluate the association of each PRS with AAO. Finally, p-values were corrected for multiple correction using FDR.
Copy number variation analysis
CNV analyses were conducted using signal intensity data from those individuals in the four cohorts from whom we obtained the intensity files (N = 4 630). The raw CNVs were obtained using PennCNV 50,51. Briefly, quality control of CNV calls was based on a sample-level criterion that examined the relationship between the standard deviation of the logarithm R Ratio (LRR_SD) and the number of CNV calls (NumCNV). At the end of the process, adjacent calls were merged together into one single call. Thresholds were carefully chosen to include as many subjects as possible but reduce false positives. Thus, subjects with LRR_SD > 0.35, BAF > 0.01, WF > 0.05, and NumCNV > 150 were not included. In parallel, 12 CNVs previously described as significantly associated with SCZ (SCZ-CNV) were obtained from the literature 52. Then, the BEDTools intersect 53 was used to look for overlaps between the described CNVs and our data. Briefly, we selected as CNV carriers only those individuals for whom at least 90% of the SCZ-CNV overlapped with a detected CNV (-f 0.9) and/or with at least 80% of reciprocal overlap (-r -f 0.8). To determine whether the presence of CNVs could be influencing AAO, Wilcoxon and Kruskal-Wallis tests were performed in R4.1.2 54, using non-normalized AAO values.