4.1 Cohorts
This project used genotype and phenotype data of participants (n=70,620) who self-identified as either Non-Hispanic White (NHW, n=50,180), African American (AA, n=8,563), Asian (n=4,742), or Hispanic (HIS, n=2,292) from the Alzheimer’s Disease Genetics Consortium (ADGC) as well as participants from the Knight Alzheimer’s Disease Research Center (Knight-ADRC)(n=4,843). The ADGC collects data from multiple genotyping rounds from several studies. These include the Adult Changes in Thought (ACT) study, the National Institute on Aging (NIA) Alzheimer’s disease centers (ADC), the Alzheimer’s Disease Neuroimaging Initiative (ADNI), The Predictor of Cognitive Decline Among Normal Individuals study (BIOCARD), the Chicago Health and Aging Project (CHAP), the Children’s Hospital of Philadelphia (CHOP), the Einstein Aging Study (EAS), Glaxo Smith Kline (GSK), the Indianapolis Ibadan Dementia Study (Indianapolis), Johns Hopkins University (JHU), JPN2012, Mayo Clinic Jacksonville (MAYO) and Rochester (RMAYO), The Multi institutional Research of Alzheimer Genetic Epidemiology (MIRAGE) study, the Netherlands Brain Bank (NBB), the National Institute of Aging initiative for Late-Onset Alzheimer’s disease and the National Centralized repository for Alzheimer’s Disease and Related dementias (NIALOAD-NCRAD), the Oregon Health & Science University (OHSU), the Rush University Religious Orders Study/ Memory and Aging Project (ROSMAP), the Texas Alzheimer’s Research and Care Consortium (TARCC), the Translational Genomics Research Institute (TGEN), the University of Miami/Vanderbilt University/Mount Sinai School of Medicine (UMVUMSSM), the University of Pittsburgh (UPITT), and the Washington Heights-Hamilton Heights-Inwood Columbia Aging Project (WHIICAP).
Participants from the Knight-ADRC were evaluated by Clinical Core Personnel at Washington University. Cases (CA) were selected on the following bases: a diagnosis of dementia of the Alzheimer’s type, determined by criteria equivalent to the National Institute of Neurological and Communication Disorders and Stroke-Alzheimer’s Disease and Related Disorders Association for probable AD. Severity was evaluated using the Clinical Dementia Rating (CDR). Controls (CO) were assessed by the same criteria and given a nondemented (CDR = 0) diagnosis. Written consent was obtained from all participants.
Phenotypic and covariate data from each study was harmonized and merged. For this study, we selected CA with an age of onset (AAO) on the earlier spectrum of AD: AAO 70 or younger. Conversely, CO were selected as cognitively healthy participants who are older than 70 yo at last assessment. Case-Control status for all ADGC participants were clinically defined following ADRC criteria and clinical dementia rating (CDR, must be ≥ 0.5 for cases) guidelines. To create more homogenous groups among participants, we performed principal component analysis (PCA) for all participants. PCA was performed using Plink v1.970 with only very high quality variants (genotyping rate ≥ 99%, MAF ≥ 0.01, and HWE P>1×10-06). Participants were included into analyses for NHW, AA or Asian based on genetic similarity to common genetic ancestries used by HAPMAP. NHW bounds were defined by five standard deviations beyond the means of the first two principal components for trios of Utah residents of northern and western European ancestry (CEU). This was done similarly for AA with respect to Yoruba adult-parent-trios from Ibadan, Nigeria and for Asian with respect to unrelated Japanese individuals from Tokyo, Japan (JPT; Figure S1A). Strict bounds were used to define participants as NHW and Asian since those populations tend to be relatively homogenous. AA and HIS, which are generally more admixed, extend up to the border of NHW and between the borders of NHW and Asian, respectively (Figure S1B). PCA was then performed on each ancestry separately to identify and remove outliers. Finally, to control for cryptic relatedness, we performed Identity by descent (IBD) analysis. Unrelated participants were selected for analysis (Figure S1C) based on pi-hat < 0.198. Within the related pairs, the individual with highest call rate was kept for analysis. After all QC steps, 27,004 (NHW 6,282 CA, 13,386 CO; AA 782 CA, 3,663 CO; Asian 375 CA, 838 CO; HIS 280 CA, 270 CO) participants remained for analysis. Due to low sample size and lack of statistical power, no analyses were performed using the HIS dataset.
4.2 Genotype QC
DNA was genotyped on various arrays, mapped to GRCh38 human genome reference, and imputed using the TOPMed imputation server. The following preparation and QC steps were run on the downloaded genetic data; I) the Variant Call Format (vcf) files were converted to plink using PLINK v1.90b6.26. II) the chromosomal plink files of each study were merged for their respective ancestry. III) variants with R2 ≤ 0.371 and variants and participants with genotyping rate (GR) < 98% were removed. IV) variants which were not in Hardy-Weinberg equilibrium (HWE, P<10-06) were removed from autosomal chromosomes. Autosomal and Sex Chromosome data were then merged back into a single plink file. V) Finally, study-specific plink files for each ancestry were merged into a single, ancestry-specific plink file. Genetic data for the Knight-ADRC participants were generated and processed by the Cruchaga Lab at Washington University in St. Louis (https://cruchagalab.wustl.edu/)with identical QC filters as described in Deming et al.,32, including a minimum GR 98% and removal of variants from autosomal chromosomes which are not in HWE (P<10-06). After performing QC steps for each ancestry’s phenotype and covariate, we applied a final genotyping rate filter of 90% to maximize the number of high-quality variants as well as minor allele frequency (MAF) filters unique to each dataset based on a minor allele count (MAC) of 5 (NHW MAF ≥ 0.02%; AA MAF ≥ 0.1%, Asian MAF ≥ 0.2%).
4.3. Statistical Analysis
Single variant association analysis was carried out for NHW, AA, and Asian ancestries using plink v2.3. We used a MAC cutoff of five (MAFNHW=0.02%, MAFAA=0.1%, MAFAsian=0.2%), so the number of variants included in each analysis was 12,725,244 for NHW, 19,508,138 for AA and 9,351,864 for Asian. Sex, genotyping array and the first ten principal components (PC) were included as covariates for NHW and AA analyses. Only sex and the first 10 PCs were used for Asian because over 80% of subjects were from a single cohort and genotyped on a single array (Table S2). Age was excluded as a covariate since sample selection is based on age. Additionally, because plink files were merged based on PCA-based ancestry selection, we used a strict HWE filter of P > 1×10-30 in the analysis. For each single variant analysis, significance was set at the standard genome-wide significance threshold of 5×10-08. Following single variant analysis, meta-analysis was performed with a random-effects model using Plink v1.9 “--meta-analysis” function. Variants from meta-analysis used the same significance threshold as single variant analysis. Following initial single variant analysis, stepwise conditional loci was performed at genome-wide significant loci. In stepwise conditional analysis, the lead variant for each locus was included in the model and analysis was re-run on the locus (1MB flanks on the top hit) to identify independent signals. If an independent signal was identified, the new lead variant was included in the model as well and the steps were repeated until there was no longer any significant signal in the locus.
4.4 Annotation and Gene-based analysis
Summary statistics from single variant analysis were annotated using the FUMA72 online API after LiftOver73 to hg19 format. Multi-marker Analysis of GenoMic Annotation (MAGMA) gene-based analysis27, which is integrated into the FUMA web application, was also run using single variant analysis summary statistics for each ancestry as well as meta-analysis summary statistics.
4.5 Investigating the shared genetic architecture of EOAD and LOAD
Methods to investigate overlap between EOAD and LOAD include, cross-checking the sentinel hit in GWAS loci, correlation between LOAD PRS and EOAD status, genetic covariance using Linkage Disequilibrium SCore (LDSC)74,75 regression local covariance using SUPERGNOVA76, and finally, colocalization.
4.5.1 Cross-check of lead SNP in GWAS loci
To cross-check sentinel hits in GWAS loci, we directly compared the P values of the sentinel variant from our EOAD GWAS with the same variant in LOAD summary statistics from the most recent LOAD GWASs of NHW, AA, and Japanese participants.
4.5.2 Correlation of EOAD status and LOAD PRS
Polygenic risk scores (PRS) were calculated using PRSice277 with LOAD summary statistics as the base GWAS and EOAD plink files for the same ancestry as the target. Risk scores were generated for two different models, one including the APOE region (chr19; 44,000,009-47,999,435 bp) and one excluding it. PRSice2 was run using LOAD SNPs at the following p-value thresholds; 5×10-08, 5×10-05, 5×10-02, and 5×10-01 in each model. The clumping p, R2, and kb thresholds were set at 1, 0.1, and 250, respectively. We also included a phenotype file containing EOAD status in the PRSice input and used the software to perform logistic regression (EOAD status ~ LOAD PRS) for each P threshold in each APOE model.
4.5.3 Local genetic covariance
Local covariance was calculated using Super GeNetic cOVariance Analyzer (SUPERGNOVA)76. LOAD and EOAD summary statistics were prepared by using the munge.py program from LD SCore (LDSC). hg19 formatted summary statistics were used for these steps as they were already prepared including the rsID for each variant as well as this format matched the bfiles and partitioned .bed files provided by the SUPERGNOVA tutorial also in hg19 format.
4.5.4 EOAD and LOAD colocalization
Colocalization between EOAD and LOAD was performed using the coloc.abf function in the Coloc R package. Loci of interest for comparing the two datasets were identified by taking 1MB upstream and downstream of the sentinel SNP in each EOAD GWAS locus. Within each locus, the set of overlapping variants between the two datasets were used to run colocalization. Loci were only considered to colocalize if hypothesis four (variant is causal for both phenotypes) had a posterior probability greater than or equal to 0.8 (PP.H4 ≥ 0.8).
4.6 Gene Prioritization
To identify likely causal genes within GWAS loci, we employed a summation of scores from a weighted presence-absence matrix for all genes within 1MB flanks of the sentinel variant in GWAS loci. The presence-absence matrix is filled based on evidence of overlap with molecular quantitative loci (QTL), if a gene is significant in gene-based analysis, if a gene is the nearest gene to a significant variant, and if a significant variant is a non-synonymous exonic variant for a gene. Overlap with molecular QTLs was assessed for protein QTLs (pQTL) and expression QTLs (eQTL) in three categories: QTLs for a given gene colocalize with EOAD at a GWAS locus, a given gene has a shared locus with EOAD, or the lead variant in a GWAS locus is a genome-wide significant QTL for a gene. A shared locus is defined as any genome-wide significant QTL within 1MB flanks of the lead variant for a GWAS locus. The specific scoring framework is described by supplementary Table S7.
4.6.1 Generating and accessing QTL data
Cerebrospinal fluid (CSF) pQTL data was generated using CSF levels of 7,584 unique aptamers (6,179 unique proteins) measured on the SOMAscan7k platform for 4,223 participants from six dementia relevant cohorts30. pQTL summary statistics used for post-GWAS analyses is generated by meta-analysis of the discovery (n=1,912) and replication (n=1,195) datasets. Cis-pQTLs were variant-aptamer associations within 1MB in either direction of the target protein-coding gene’s hg38 coordinate-based transcription start site. eQTL data were downloaded from GTEx Portal v8 for all Brain tissues, accessed on June 29, 2022. Additional cis-eQTL data was downloaded from the most recent MetaBrain analysis from the MetaBrain website (https://www.metabrain.nl) on downloaded February 24, 2023. CSF metabQTL data was generated using metabolite levels measure by Metabolon for 1,224+1,087 (discovery and replication) participants from five dementia relevant cohorts, while brain metabQTLs are generated using parietal, prefrontal cortex, and temporal cortex post-mortem brain biopsies from 1,172 participants from four dementia-relevant cohorts54. CSF metabQTLs analysis is run separately in the discovery and replication sets, then meta-analyzed for the final summary statistics. Further information on pQTL methods can be found in the source material Western et al.,30 and metabQTL methods can be found in source material from Wang et al.,54.
4.6.2 Colocalization with molecular QTLs
Colocalization with molecular QTLs was performed identically to as described in methods section 4.5.4. All variants within 1MB flanks of the sentinel SNP in each GWAS locus were subset in EOAD summary statistics. QTL data was subset similarly per tissue, per gene. Overlapping variants between the two datasets were used to run colocalization. pQTLs were only considered to colocalize if hypothesis four (variant is causal for both phenotypes) had a posterior probability greater than or equal to 0.8 (PP.H4 ≥ 0.8).
4.6.3 Overlap with molecular QTLs
Overlap with molecular QTLs was queried as described in Methods section 4.6. For each GWAS locus, we determine bounds by taking 1MB upstream and downstream flanks to the lead SNP in the locus. We then subset the molecular QTLs for each gene for each tissue to those loci and query for any genome-wide significant SNPs.
4.6.4 Annotation and gene-based analysis
Annotation and gene-based analyses were performed for each single ancestry and meta-analysis using FUMA software as described in Methods section 4.4. For the purposes of gene prioritization, two annotation items were analyzed. First, a gene was scored if it was the nearest gene to a genome-wide significant variant in the locus. Second, a gene was scored if a genome-wide significant variant in the locus was a non-synonymous exonic (protein altering) variant. For MAGMA gene-based analysis, a gene was scored simply if was significant in gene-based analysis for any ancestry or meta-analysis after multiple testing correction (Bonferroni adjusted P = 0.05 / n genes). The specific cutoff for each ancestry varied, but the number of genes was ~19,000 for each, leading to an approximate cutoff of 2.63×10-06.
4.7 Biological inference of prioritized genes
Prioritized genes were assessed in four main categories: Primary brain cell type, differential expression in AD brain tissues, association with related phenotypes, biological pathways, and protein-protein interaction. In each we used online methods, protein-based analysis, or pathway analysis, which will be further described below.
4.7.1 Primary Brain cell type
Cell-type specificity analysis was performed by Western et al.,30 at the NeuroGenomics and Informatic center at Washington University in St. Louis. The methods are described in full in their publication. As an overview, gene expression data for human astrocytes, neurons, oligodendrocytes, microglia/macrophages, and endothelial cells were downloaded and used to determine a primary cell-type for all SOMAscan 7K panel proteins. Gene expression was averaged across participants for each cell type, then summed for a total expression level for each gene across cell types. Following this, the percentage to which each cell type contributed to the total expression was calculated. A gene was reported to have a specific cell-type if the highest contributing cell type was 1.5× higher than the second highest contributing cell type. For example, if for Gene X, the highest contributing cell type, Cell-type A, was 45% of the total expression, and the next highest cell type, Cell-type B, was 30% or less of the total expression, then Gene X would be Cell-type A specific. Protein in the SOMAscan7K platform were matched to their Entrez gene symbol using the SOMAlogic provided documentation and the ratio-based strategy above was used to determine cell-type specificity for all proteins. EOAD prioritized genes were subset from this to determine cell-type specificity. If cell-type specificity was unable to be determined by this method, the human proteome atlas (https://www.proteinatlas.org/) was also queried to determine if any cell type specificity information was available for a given gene.
4.7.2 Differential Expression in AD brain tissues
The Agora knowledge portal is a database powered by AMP-AD research which hosts evidence for AD relevant genes. The Agora gene comparison tool allows for query and simultaneous comparison of a set of genes’ differential expression as measured by RNA or protein across nine brain regions: anterior cingulate cortex, cerebellum, dorsolateral prefrontal cortex, frontal pole, inferior frontal gyrus, posterior cingulate cortex, parahippocampal gyrus, superior temporal gyrus, and temporal cortex. The tool shows graphically whether any genes in the queried set are significantly upregulated or downregulated in AD diagnosed participants compared to controls. Additionally, the output also provides a genetic score, multi-omics score, and a risk score which sums the multi-omics and genetic score to allow for a numerical assessment of how much a given gene contributes to AD. Agora and AMP-AD methods are further described here https://help.adknowledgeportal.org/apd/AD-Risk-Scores-Data-and-Methods.2826043399.html.
4.7.3 Association with related phenotypes
We used online resources GeneCards28,29 (https://www.genecards.org/), the Human Protein Atlas (https://www.proteinatlas.org) and the GWAS catalog55 (https://www.ebi.ac.uk/gwas/) to search for our prioritized genes and understand what other phenotypes they are known to affect. Particularly we looked to if they were associated with or known to affect other neurological, dementia and neurodegenerative, cancer, and other health risk factors (e.g., BMI, cardiovascular risk factors, education attainment, etc.).
4.7.4 Biological pathways
Pathway analyses were performed with the enrichGO78, enrichDO79 DisGeNet and enrichKEGG functions from the R package ClusterProfiler v4.0980 as well as the Gene2Func tool from FUMA72 and enrichment analyses from STRING36. In each case, the input set was all threshold passing genes (Total prioritization score ≥ 4 points) except those from the APOE region, and all genes were included as the background.
4.7.5 Protein-protein interactions
Protein-protein interactions were empirically investigated using STRING (https://string-db.org/) database to see if our prioritized genes (excluding the APOE region) were enriched for any interactions. Additionally, we performed proteome-wide association (PWAS), which integrates GWAS summary statistics and pQTL data to resolve causal protein in GWAS loci. PWAS was performed with FUSION81 transcriptome-wide association study software. Variant weights were calculated for each protein and for each association separately for those aptamers with at least one study-wide pQTL. Protein levels were estimated based on these weights and correlated with EOAD using NHW summary statistic on chromosomes where genome wide significant EOAD GWAS loci were identified. Finally, we performed colocalization between EOAD and trans-pQTLs to see if there was potential evidence of co-regulation of genes in EOAD loci.