Genome-wide association studies (GWAS) have burst onto the genetic landscape over the last 15 years. Many traits and common diseases have been studied in hundreds of thousands of individuals, identifying genetic loci associated with height, cancer, coronary artery disease, and many other traits1–3. However, translation of the associations identified by GWAS is often challenging, as many relevant loci fall in gene-rich or intergenic regions. These associations make interpretation and assignment of loci to specific genes difficult.
Recently, large databases analyzing the genetic regulation of biological molecules, including the GTEx consortium, eQTLGen, and MetaBrain4–6, have described genetic loci affecting mRNA levels. These resources have been integrated with disease GWAS analyses to prioritize functional genes at previously uncharacterized loci7. However, analyses focusing on mRNA miss disease-relevant biology. First, many of the studies focus solely on genetic variants that are in close proximity to the gene encoding the mRNA, preventing the identification of cross-genome regulatory effects. Second, the correlation between levels of mRNA and their encoded proteins is weak8,9, while proteins are typically the molecule that most-directly act on disease. As a result of this low correlation, the overlap between expression (mRNA) quantitative trait loci (QTL) and protein QTL is also low, suggesting genetic regulation unique to proteins10.
Large studies have investigated genetic association of protein levels, but they have overwhelmingly focused on plasma11–15; previous work has shown that plasma proteogenomics shares little overlap with the brain16. Small studies have analyzed the proteogenomic signature of cerebrospinal fluid (CSF)16–20. Targeting specific CSF proteins has proven successful at elucidating causal genes at some of the disease GWAS loci, including an association for the TREM2 protein at the MS4A locus providing a potential mechanism about how MS4A modifies AD risk7. However, these studies are limited by comparatively small sample sizes or number of proteins analyzed.
Alzheimer’s disease (AD) is the most common highly heritable neurodegenerative disease, with heritability estimates in the range of 60 to 80%21. GWAS have identified over 75 genetic loci associated with AD22. However, the functional genes driving the associations for most of these loci are still unknown, or how these GWAS genes interact between them in specific pathways and disease mechanisms.
Here, we present the largest-to-date investigation of the genomic signature of the human CSF proteome, in both number of proteins and sample size. We analyze the genetic regulation of 7,028 unique proteins using CSF from 3,107 healthy and neurologically impaired individuals. We classify proteogenomic hotspots that regulate multiple proteins, then integrate our results with a large-scale AD risk GWAS to identify novel causal and druggable proteins relevant to AD. Finally, we use our AD-associated proteins to build a prediction model that excels at classifying AD biomarker status.
Unique genetic architecture of CSF proteomics
We performed the proteogenomic analysis of CSF by integrating proteomics (aptamer-based assay: SOMAscan 7k23,24; Supp Fig. 1; Supp. Tables S1&S2) and genomic data in a total of 3,107 unrelated Non-Hispanic White (NHW) individuals (Supp Fig. 2). We utilized a three-stage study design (Fig. 1): 1) a discovery stage including 1,912 CSF samples and genomic data from unrelated NHW individuals from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), Parkinson’s Progression Marker Initiative (PPMI), and Ace Alzheimer Center Barcelona (FACE) with 7,559 aptamers passing stringent QC, corresponding to 6,334 unique proteins (Fig. 2A, top; Supp. Figure 3; see Methods for proteomic data generation and QC); 2) a replication stage including 1,195 CSF samples and genomic data from unrelated NHW individuals from the Knight-ADRC Memory and Aging Project (Knight-ADRC), the Dominantly-Inherited Alzheimer’s Network (DIAN), and Barcelona-1 with 7,028 aptamers passing QC, corresponding to 6,177 unique proteins (Fig. 2A, Supp. Figure 4, Supp. Table S1&S2); and 3) a fixed-effect meta-analysis of the aptamers measured in both discovery and replication (Table 1). For pre-QC descriptions of the proteomics samples and samples removed by each step, see Supp. Tables S3&S4. In the meta-analysis, we considered protein-variant associations (pQTL) to be significant if they had Pdisc < 0.005, Prep < 0.05, a consistent direction of effect in discovery and replication, and study-wide significant p-value in the meta-analysis (P < 5×10− 8 for cis variants and P < 3.45×10− 11 for trans variants; see materials and methods).
Table 1
Demographics of post-QC samples used in discovery and replication.
Cohort
|
# Proteomics Samples
|
Avg. Age
|
% Male
|
%APOE4+
|
#CO
|
#AD
|
#ADAD
|
#ADRD
|
#PD
|
Discovery
|
|
|
|
|
|
|
|
|
|
ADNI
|
689
|
73.7 (SD 7.5)
|
58.2
|
50.2
|
149
|
521
|
0
|
19
|
0
|
PPMI
|
785
|
61.8 (SD 9.4)
|
55.1
|
NA
|
157
|
0
|
0
|
1
|
627
|
FACE
|
438
|
71.9 (SD 8.3)
|
41.1
|
35.6
|
128
|
238
|
0
|
72
|
0
|
Replication
|
|
|
|
|
|
|
|
|
|
DIAN
|
193
|
38.6 (SD 10.7)
|
48.4
|
27.6
|
73
|
3
|
116
|
1
|
0
|
MAP
|
805
|
71.4 (SD 8.7)
|
46.7
|
39
|
565
|
176
|
2
|
59
|
3
|
Barcelona-1
|
197
|
68.8 (SD 7.5)
|
52.3
|
NA
|
4
|
63
|
0
|
129
|
1
|
Total
|
3107
|
|
|
|
1076
|
1001
|
118
|
281
|
631
|
ADNI: Alzheimer’s disease Neuroimaging Initiative; PPMI: Parkinson’s Progression Markers Initiative; FACE: Ace Alzheimer Center Barcelona; DIAN: Dominantly-Inherited Alzheimer’s Network; MAP: Knight ADRC Memory and Aging Project |
%APOE4+: Percentage of individuals who are carriers of at least one C allele at rs429358. |
#CO: Number of cognitively normal individuals |
#AD: Number of individuals affected with late-onset Alzheimer’s disease, as determined by clinical status |
#ADAD: Number of individuals affected with early-onset Alzheimer’s disease |
#ADRD: Number of individuals affected with non-AD dementia (including frontotemporal dementia, lewy body dementia, etc.) |
#PD: Number of individuals affected with Parkinson’s disease |
In the meta-analysis, we identified 2,316 index pQTL associations for 1,961 protein aptamers (1,786 unique proteins; Fig. 2B, Supp. Table S5). Of those, 1,247 (53.8%) were cis (+/- 1MB from TSS) and 1,069 (46.2%) were trans-pQTL. Correlation of effect sizes for the study-wide significant variants was extremely high (R = 0.98, P < 2.2×10− 16, Supp. Figure 5, Supp. Table S5), confirming that both cohorts are contributing to the associations observed and highlighting the robustness of these pQTL associations. We further replicated our findings using a third independent cohort with SOMAscan proteomics measurements and genomic data (N = 183 from the Stanford Iqbal Farrukh and Asad Jamal Alzheimer’s Disease Research Center). We observed a strong index pQTL variant effect size correlation between the meta-analysis and external replication (1,676 variants tested) when comparing the SOMAscan results (Nrep = 183, R2 = 0.92, P < 2.2×10− 16, Supp. Figure 6, Supp. Table S5). For pathway enrichment analysis results for all proteins with a pQTL, see Supp. Figure 9 and Supp. Table S6.
Because a large portion of our dataset (2,031, 65.4%, Table 1) is derived from individuals with Alzheimer’s disease, we sought to determine if our associations were consistent across healthy and affected individuals. We classified individuals into AD-relevant biomarker groups using the amyloid/tau/neurodegeneration (A/T/N) framework25 (see methods). We performed association analysis for our index pQTL variants stratifying by AD biomarker positive (A+T+; N = 775) vs negative (A−T−; N = 889). When comparing effect sizes of the index pQTL variants, we observed a strong correlation between A−T− vs A+T+ (R = 0.98, P < 2.2×10− 16; Fig. 2E, Supp. Table S5), meta-analysis vs A+T+ (R = 0.98, P < 2.2×10− 16, Supp. Figure 7A), and meta-analysis vs A−T− (R = 0.99, P < 2.2×10− 16, Supp. Figure 7B), indicating that the pQTL identified in our meta-analyses are consistent across disease states. However, we did observe some associations that were biased toward either diseased or healthy individuals (see Extended Results).
GWAS loci can often have independent significant variants in the same locus26. To identify independent signals in the same locus, we performed conditional analyses using GCTA-COJO27 (Supp. Figure 5, Supp. Table S7). In total, we identified 3,373 conditionally independent associations (see Extended Results, Supp. Table S7). Most aptamers (1,137, 58.0%) had a single independent association, but 824 aptamers had at least two, 337 had at least three, 144 had at least four, and one protein (GSTM1) had 11 independent associations at its cis locus. This suggests that many proteins have complex mechanisms regulating their CSF levels. These analyses were performed at a study-wide level, so it is possible that we excluded many additional independent associations below our study-wide significance threshold. We also used VEP to annotate each conditionally independent association.
We identified all variants in linkage disequilibrium (LD, R2 > 0.8) with each conditionally independent pQTL variant and performed variant annotation on the associations using the Ensembl Variant Effect Predictor (VEP)28, determining the most severe consequence corresponding to each variant set (R2 > 0.8, Supp. Table S8&S9). The largest proportion of the variants were intronic (N = 1,156, 34.3%), followed by upstream/downstream of nearby genes (691, 20.5%), missense (799, 23.7%), or intergenic (324, 9.6%, Supp. Figure 8, Supp. Table S7). In order to determine if the pQTL were enriched in any specific annotation category, we compared our proportions to those from VEP annotation of 3,373 random variants and 1,000 permutations. We observed a significant enrichment in multiple categories of protein-altering variants, including missense, splice donor, and protein-truncating variants (see Extended Results, Supp. Figure 8D, Supp. Table S10), but not intronic, intergenic, or non-coding. While intronic variants were the most prevalent pQTL annotation overall, the percentage annotated as intronic (34.3%) was much lower than the randomly selected variants (51.7% intronic on average). The independent variants were also enriched for protein-altering variants overall (P < 0.001), indicating that pQTL are enriched for coding variants in comparison with other variant types.
Common variants in both cis and trans had lower effect sizes on average than rare variants (Rcis= -0.49, Rtrans = -0.58, Pcis < 2.2×10− 16, Ptrans < 2.2×10− 16, Fig. 2C), consistent with previous results13,16. We also confirmed previous evidence13,16 that cis pQTL variants tend to have lower effect sizes as they get farther from the protein coding gene (Fig. 2D; Supp. Figure 5).
We and others have reported limited overlap of pQTL associations across tissues and with other molecular QTL12,16. We analyzed the overlap of the CSF pQTL with plasma and brain using two previously-published studies: first, summary statistics from a plasma pQTL analysis of over 35,000 individuals and approximately 5,000 proteins (SOMAscan5k; Supp. Table S11)11. In total, 4,735 aptamers overlapped between our meta-analysis and the large plasma pQTL study, covering 1,682 (72.6%, out of 2,316) of our CSF pQTL associations. Of the 1,682 possible shared signals with the plasma study11 (N = 35,559), 1,131 (67.2%, PP.H4 < 0.8) did not colocalize, representing novel and CSF-specific pQTLs (Fig. 2F). We identified a significantly higher ratio of CSF-specific trans associations compared to cis (548/734 trans vs 582/948 cis, P = 4.3×10− 9), consistent with previous work16. We also identified 634 additional novel index pQTL (27.4% of the 2,316 total CSF associations, 299 cis and 335 trans) that correspond to proteins that were not analyzed in the plasma study. A total of 1,765 CSF pQTL were not observed in the large plasma study.
Second, we compared to our previous brain (N = 380) and plasma (N = 529) pQTL (SOMAscan1.3k platform; Supp. Tables S12 & S13)16. For the brain dataset, 1,002 aptamers overlapped, corresponding to 465 associations. For the in-house plasma dataset, 862 aptamers overlapped, corresponding to 389 associations. We performed Bayesian colocalization for each pQTL association that had the corresponding aptamer present in one of the three datasets. We observed shared associations (posterior probability of one functional variant, PP.H4 ≥ 0.8) for just 32.8% (551/1,682) of the CSF pQTL with the larger plasma dataset11, 16.5% (64/389) with our previous plasma study, and 7.1% (33/465) with our previous brain study (Figs. 2F&G). The low overlap with brain may be partially driven by lower power to detect associations in the brain study (N = 380). In total we identified 1,720 novel index pQTL (74.3% of all CSF pQTL, 848 cis and 872 trans) that were either only significant in CSF (CSF-specific, 1,153) or were associated with proteins only measured in our analysis (true novel, 567). When analyzing all conditionally independent associations 2,448 were novel to CSF, of which 1,652 were considered CSF-specific and 796 were truly novel.
We next compared our cis-pQTL associations to multiple eQTL datasets from whole blood (GTEx4, N = 670; eQTLGen5, N = 31,684), neurologically-relevant tissues (GTEx cortex4, N = 205; MetaBrain6: cortex, N = 2,970; cerebellum, N = 492; basal ganglia, N = 208; hippocampus, N = 168; spinal cord, N = 108; and microglia29, N = 90). Of the 1,247 cis-pQTL associations, 29.2% colocalized (PP.H4 ≥ 0.8) with MetaBrain cortex eQTL, 22.4% with MetaBrain cerebellum, 16.6% with GTEx whole blood, 14.1% with eQTLGen whole blood, 9.5% with GTEx cortex, 7.5% with MetaBrain basal ganglia, 6.3% with MetaBrain hippocampus, 5.1% with MetaBrain spinal cord, and 4.7% with microglia (Fig. 2G, Supp. Tables S14-S22). We observed the largest number of shared associations with cortex and cerebellum even with relatively small sample sizes. In total, 567 (45.5%) cis CSF pQTL associations do not colocalize with any eQTL. Of the 680 cis signals that overlap between eQTL and pQTL, 83.1% (565) colocalize with at least one neurologically relevant tissue and 55.9% (380) are unique to these neurological tissues. Only 16.9% (115) are uniquely shared between CSF and whole blood. In total, 391 (31.4%) of our CSF cis-pQTL associations do not colocalize with any pQTL or eQTL dataset analyzed.
Consistent with previous research16, we observed a high degree of both tissue and molecule specificity for our CSF pQTL, with over 50% of the associations (cis and trans) not overlapping with any other dataset or tissue. When analyzing cis associations specifically, we observed the highest overlap with plasma pQTLs (366/1,247, 29.4% shared, Fig. 2G, Supp. Table S23), suggesting that regulation of proteins across tissues is more likely to be shared than that of eQTL vs pQTL even in the same tissues. This supports the notion that proteins have additional regulatory mechanisms beyond gene expression. When comparing our results to eQTL, we observed the highest overlap with the cortex and cerebellum (364 and 279 shared associations, 29.2% and 22.4%, Fig. 2G, Supp. Table S23), indicating CSF is a better proxy to brain than blood. These results support the fact that proteins are regulated at different levels compared to RNA that could include post-translational mechanisms (phosphorylation, glycosylation, cleavage, binding to receptors among others). These processes can contribute to localization of proteins, level of excretion from cells, and changes in protein conformation, all factors that can affect levels in the CSF.
Pleiotropic regions regulate neurologically relevant pathways
Pleiotropic regions of the genome may be vital to regulation of multiple factors of biologically relevant pathways, but are often removed or not investigated in pQTL studies. To identify genomic regions that regulated multiple proteins, we grouped the index pQTL variants from each pQTL association through linkage disequilibrium, using a threshold of R2 ≥ 0.1. We identified 194 regions that regulated levels of at least two proteins (271 regulating at least two aptamers, Fig. 2B, top; Supp. Figure 5; Supp. Table S24). Of these, 27 regions were associated with at least five proteins (37 with at least five aptamers), seven regions regulated at least ten, and three regions regulated more than 50.
Here, we prioritized the three most pleiotropic regions (chr3q28, chr6p22.2-21.32, and chr19q13.32; other regions are discussed in the extended results). Due to the complexity of the HLA region on chr6p22.2-21.32, we grouped all variants located in this locus together regardless of linkage disequilibrium information (see Methods). For these three regions, we first determined the genomic localization of the genes encoding each regulated protein. We then performed pathway and cell-type enrichment analysis to identify the cellular context of the regulated proteins. Finally, we performed a pheWAS using the GWAS Catalog30 to determine other traits regulated by the index pQTL variants in each region.
The chr3q28 pleiotropic region is intergenic (located between GMNC and OSTN) and consists of eleven index variants corresponding to one LD block (R2 > 0.5, Supp. Figure 20F; two blocks when using R2 = 0.8, Supp. Figure 10G) associated with 130 unique proteins (136 aptamers; Fig. 3A, Supp. Table S24), all of them trans. No pQTLs was observed in this region in plasma11, suggesting this is a CSF-specific pQTL hotspot. Proteins regulated by this region included five members of the syntaxin family (STX2, 3, 7, 8, and 12) involved in synapse function31 and five ephrin family members (EPHA7, EFNB1-3, EPHB2) involved in neural development and memory32,33. For each of the proteins regulated in the region, we analyzed brain-relevant cell-type expression data34 (Supp. Table S25) to determine cell specificity and observed a significant enrichment of neuron-specificity (P = 0.001, FC = 1.588, Fig. 3B, Supp. Figure 11, Supp. Table S26), suggesting functional relevance of this region for brain-related traits. In addition, we observed 234 enriched pathways associated with the 130 proteins. The 26 most-significant pathways were dedicated exclusively to neuronal and cell surface pathways (Fig. 3C, Supp. Figure 10, Supp. Table S27). These included neuron projection development (GO: 0031175, 32/123 analyzed proteins, P = 3.99×10− 9), cell junction (GO: 0030054, 44/127, P = 1.22×10− 8), axon development (GO: 0061564, 20/123, P = 4.96×10− 7), and SNAP receptor activity (GO: 0048812, 7/126, P = 5.93×10− 7). A PheWAS of the index variants in this region identified twelve traits associated with index pQTL variants, eleven of which were measures of brain morphology, volume, or surface area (Fig. 3D, Supp. Table S28). These include highly significant associations for vertex-wise sulcal depth35 (P = 1×10− 128), white matter mean diffusivity36 (P = 1×10− 19), and cortical surface area37 (P = 1×10− 17). Importantly, we also observed an association between rs9877502 and CSF levels of phosphorylated tau (P = 1×10− 36), a pathological hallmark for AD38 and a key biomarker. Additionally, OSTN, flanking the 3’ end of this region, is highly expressed in human neurons39 and regulates dendritic growth in the human brain40, suggesting regulation of OSTN may be driving the associations in the region. The highly neuronal-specific nature of the associated proteins, the link between OSTN and dendritic growth, and the numerous associations with brain morphology in this region highlight its importance for neurological development.
The chr6p22.2-21.32 pleiotropic region spanned about 8MB in the HLA region and was flanked by HIST1H2AA to RPL12P1 based on literature-defined boundaries41. This region corresponds to 156 associations for 140 index pQTL variants in 65 LD blocks (R2 > 0.5) associated with 68 unique proteins (74 aptamers; Fig. 3E, Supp. Table S24). The same region in plasma contained 1,757 associations11, significantly more than CSF (9.72% of all plasma associations vs 6.74% of all CSF associations; P = 4.35×10− 6). In CSF, we identified 32 cis associations for 28 aptamers and 23 unique proteins and 124 trans associations for 60 aptamers and 56 unique proteins. These included multiple complement components such as complement C2 (top index pQTL SNP P = 8.32×10− 32), complement factor B (top index pQTL SNP P = 1.86×10− 31), and C4a anaphylatoxin (top index pQTL SNP P = 1.86×10− 21). The proteins with pQTL in this region were enriched in microglia and macrophages (P = 0.03, FC = 1.574, Fig. 3F, Supp. Figure 11, Supp. Table S26). Pathway analysis identified 64 total enriched pathways among proteins regulated by variants in this region (Fig. 3G, Supp. Figure 12, Supp. Table S29). These were highly immune-specific, including regulation of the immune response (GO: 0050776, 23/64 analyzed proteins, P = 2.98×10− 10), leukocyte mediated immunity (GO: 0002443, 14/64, P = 1.38×10− 7), and antigen processing and presentation (KEGG: hsa04612, 6/39, P = 7.03×10− 6). Through a pheWAS of the index pQTL variants in this region, we identified 160 associations with traits and diseases (Fig. 3H, Supp. Table S30). These included autoimmune disorders like asthma42 (P = 8×10− 14) and myositis43 (P = 3×10− 49) and neurodegenerative disorders including Parkinson’s disease44 (P = 4×10− 17) and AD45 (P = 3×10− 14). The shared associations in this region with both immunological and neurodegenerative traits adds evidence to the known relationship between these two processes46.
In CSF, the region associated with the largest number of proteins corresponds to eleven index pQTL variants in three LD blocks (R2 > 0.5, Supp. Figure 13D; seven LD blocks with R2 = 0.8; Supp. Figure 13E) on chr19q13.32 in the APOE gene region. Variants in this region are associated with the levels of 331 proteins (337 aptamers; Fig. 3I, Supp. Table S24), of which three (two isoforms of APOE and APOC2) are in cis. This locus is known to regulate the levels of many proteins in plasma and CSF11,12,16. We observed significantly more associations located in this region in CSF than in plasma11 (14.6% of all CSF pQTL vs 0.67% of all plasma pQTL, P < 2.2×10− 16). APOE is the strongest genetic risk factor for AD47 and has been associated with at least eighteen other diseases, including heart disease and high cholesterol48. We observe associations in this region for known AD biomarkers, including four members of the 14-3-3 protein family (YWHAB, YWHAE, YWHAG, and YWHAZ), and neurofilament light and heavy chains (NEFL and NEFH)49,50. Calcineurin (PPP3R1), associated with both phosphorylated tau levels and rate of decline in AD51, is also genetically regulated by this region. Interestingly, although we observed an enrichment in neurons of proteins regulated by variants in this region (P = 7.93×10− 6, FC = 1.53, Fig. 3J, Supp. Figure 11, Supp. Table S26), APOE itself was specific to astrocytes (72.4% of expression in astrocytes, Supp. Table S25). This suggests a potential cell-to-cell communication between astrocytes and neurons. Pathway analysis results support this hypothesis (Fig. 3K, Supp. Figure 13), as we observe enrichment for seventeen pathways including glutamatergic synapse (KEGG: hsa04724, P = 5.1×10− 4), assembly and cell surface presentation of NMDA receptors (Reactome: R-HSA-9609736, P = 1.66×10− 4), and axon development (GO: 0061564, P = 6.58×10− 5). We also observe pathways enriched for apoptotic function, suggesting a potential role of APOE52–54 in cell death. Alternatively, this finding could be tagging neuronal death and neuronal protein release, which would be expected driven to individuals with neurological disease, as and be associated with APOE4. In order to determine if this is the case, we analyzed these variants in individuals that are cognitively normal and biomarker negative (AT−) and compared them to those who are biomarker positive (AT+). We found that for these 337 aptamer-SNPs pairs, 299 showed no significant difference in effect size between the AT− and AT+ analyses, suggesting that this finding is not just an artifact of disease status. APOE variants regulate the levels of these proteins in both healthy and diseases individuals. Pathways related with the apoptotic function include activation of BH3-only proteins (Reactome: R-HSA-114452, P = 5.67×10− 5) and apoptosis itself (Reactome: R-HSA-109581, P = 2.4×10− 4, Supp. Table S31). Our PheWAS of the three independent signals in this locus identified 1,232 associated traits, highlighting the highly pleiotropic nature of this region and its involvement in several diseases and risk factors. These traits were largely involved in aging, cardiovascular, metabolic, and neurological traits, including lifespan, coronary artery disease, familial combined hyperlipidemia, and AD (Fig. 3L, Supp. Table S32).
Overall, we observed three highly pleiotropic regions that regulated the levels of many CSF proteins located on chromosomes 3q28, 6p22.2-21.32, and 19q13.32. The chromosome 6 and 19 regions, centered around the HLA and APOE loci, respectively, were also observed in plasma, although our analyses indicate that the APOE region show a higher degree of pleiotropy in CSF than plasma11 (P < 2.2×10− 16) and the opposite for chr 6 (P = 4.35×10− 6). In CSF both the chr3q28 and 19q13.32 regions were enriched for neuronal proteins and traits and were more pleiotropic in CSF than plasma, suggesting even pQTL hotspots show differences across tissues. While these regions are often overlooked in QTL analyses, we highlight that they represent master regulatory regions that are vital to the understanding of disease and important biological processes, and that to fully understand protein-protein interactions and proteins that are part of the same pathway it is instrumental to further explore these pleiotropic regions.
Novel proteins associated with Alzheimer’s disease
Approaches integrating QTL with disease GWAS have been successful at resolving GWAS loci to prioritize causal genes. We sought to build upon this in the context of AD by utilizing three independent and complementary approaches: estimation and correlation of protein levels in AD using shared genetic etiology through proteome-wide association study (PWAS)55,56, prioritization of putative causal proteins for AD through Mendelian randomization (MR)57, and identification of shared genetic variants associated with both protein level and AD through colocalization (COLOC)58,59.
We performed a PWAS using the FUSION framework55. We first calculated variant weights for each protein and for each association separately for those aptamers with at least one study-wide pQTL. We then obtained the genetically estimated protein levels based on these weights and correlated them with disease status using AD GWAS summary statistics (Supp. Figure 14). Pathway enrichment analysis results and directionality for all PWAS-significant aptamers are available in Supp. Figures 14&15 and Supp. Table S33&S34. Because all proteins with associations in the APOE region were associated with AD risk (Supp. Figure 16), we removed APOE-associated proteins and recalculated pathway enrichment (Supp. Figure 17, Supp. Table S35). Finally, all proteins associated with AD through a cis locus are detailed in Supp. Figure 18 and through a trans locus (besides the APOE and HLA regions) in Supp. Figure 19. After excluding pleiotropic regions (those associated with five or more aptamers), 115 associations for 107 aptamers and 97 proteins were significant in the PWAS analyses, including 81 that were driven by cis-pQTL and 34 by trans (Supp. Table S33).
We next performed MR57 to identify proteins causal for AD, using the pQTL summary statistics and the Bellenguez et al. AD GWAS summary statistics22. For each aptamer, we selected significant cis and trans instrument variables through LD pruning on all variants with P < 5×10− 8, after removing pleiotropic regions. Our analyses identified 37 proteins (40 aptamers) as putative causal for AD (Fig. 4A, Supp. Table S36&S37) based on MR analyses (FDR-corrected P < 0.05) with confirmed directionality based on the Steiger test60 (Supp. Table S38). In addition, 33 of these 40 MR-significant aptamers were also FDR-significant in the PWAS analysis (Fig. 4A). Cis-specific MR for those with applicable variants confirmed all but one association (APOBEC2, Supp. Figure 21).
Finally, we performed COLOC58,59 between each of our 2,316 significant pQTL associations (cis and trans) and the AD GWAS. After exclusion of pleiotropic regions, we identified 28 proteins (33 aptamers) that colocalized with AD risk (Fig. 4A, Supp. Tables S39 & S40). After filtering of the colocalization results based on pleiotropy, 32 aptamers overlapped between PWAS and COLOC.
Of all proteins prioritized by PWAS, COLOC and MR, 15 (17 aptamers, eleven in cis, three in trans, and one in both cis and trans) were shared between all three of them and 42 proteins (48 aptamers, 31 cis, 17 trans) were significant between at least two of the methods (Figs. 4A-D; Supp. Table S41). Through cell-type enrichment analysis, we found that these 42 proteins were overrepresented in microglia/macrophages (N = 10, P = 0.008, FC = 2.01) and mature astrocytes (N = 8, P = 0.03, FC = 1.76, Fig. 4E, Supp. Figure 11, Supp. Table S26). We also performed pathway and gene set enrichment analyses and observed overrepresentation of the proteins in pathways relating to neurodegenerative traits, including late onset AD (DisGeNet:C0494463; P = 1.32×10− 6) and brain atrophy (DisGeNet:C4551584; P = 4.55×10− 6). Consistent with the cell type enrichment, these proteins were also present in immune pathways, including regulation of the immune response (GO:0050776; P = 1.58×10− 5) and virus receptor activity (GO:0001618; P = 6.46×10− 4, Fig. 4F, Supp. Figure 20, Supp. Table S42).
Proteins supported by multiple methods
These (n = 42) proteins associated with AD based on our analyses (PWAS, COLOC and/or MR) 1) provide additional evidence of functionality for some of the nominated AD gene loci, 2) nominate alternative genes in those loci, and 3) identify novel genes implicated in AD that have not been reported before.
Our findings support the nominated functional gene for ten of the GWAS-identified22 loci: APOE, ACE, CR1, CTSH, EGFR, GRN, IL34, SHARPIN, TMEM106B, and TREM2. These include putatively protective effects (as identified through PWAS and MR) for ACE in cis at 17q23.3 (PWAS Z=-9.04, MR β=-0.206), TREM2 in cis at 6p21.1 (PWAS Z=-14.5, MR β=-0.236), TREM2 in trans at 11q12.2 (MS4A locus, PWAS Z=-10.03, MR β=-0.236), and GRN in cis at 17q21.31 (PWAS Z=-7.05, MR β=-0.22). We also identified proposed proteins associated with increased risk, including CTSH in cis at 15q25.1 (PWAS Z = 4.81, MR β = 0.041) and SHARPIN in cis at 8q24.3 (PWAS Z = 6.98, MR β = 0.246). These results offer more evidence for established candidate genes like TREM2 and CR1, while also reaffirming the genes prioritized at newly-identified loci (SHARPIN, EGFR, CTSH, among others).
We identified alternative candidate proteins at four AD risk loci. First, the association at 1q32.2 has been associated with CR1 for years. While we observed an association between CR1 and AD through PWAS and COLOC, we also observed the same positive association with CR2 protein (index pQTL variant rs679515) in both PWAS and MR (PWAS Z = 11.85, PWAS P = 1.94×10− 32, MR P = 3.39×10− 13, COLOC PP.H4 = 0.996, Supp. Figure 22). Most studies have nominated CR1 as the functional AD gene in this region61–63, but our analyses also nominate CR2 as a potential additional factor affecting this locus. The similarities in structure and function between CR1 and CR264 make it plausible that both may play a role in neuroinflammation known to be implicated in AD pathogenesis65. At the 7q22.1 AD locus, the proposed genes were ZCWPW1 and NYAP1. However, the causal gene has been unclear, with PILRA and PILRB also nominated66. Our analyses indicate that PILRA can be also a functional gene in this locus. Cis levels of PILRA (index pQTL variant rs1859788) were negatively associated with AD (PWAS Z = -8.532, PWAS P = 1.44×10− 17, MR P = 4.06×10− 11, PP.H4 = 0.995, Supp. Figure 22). This supports previous research suggesting altered ligand binding with PILRA offers a protective effect on AD67, with the index pQTL variant identified being described as the likely causal allele. The index variant has also been fine-mapped as the likely causal variant at the locus in another GWAS study68. Additionally, at the 16p11.2 AD locus, we nominate PRSS8 as the candidate gene instead of KAT8. Cis-regulated levels of PRSS8 (index pQTL variant rs1978485) were associated with increasing risk of AD (PWAS Z = 4.01, PWAS P = 5.95×10− 5, MR P = 2.43×10− 6). Using COLOC-SuSiE59, we identified three credible sets of causal variants affecting PRSS8 levels, with only one colocalizing with AD risk (PP.H4 = 0.977, Supp. Figure 22), highlighting the complexity involved in GWAS locus resolution. This protein was also identified as being associated with AD in plasma12, albeit with an opposite direction of effect. This directionality difference between tissues has been reported for other proteins for AD in previous analyses10. Finally, at 19q13.32, we identify NECTIN2 (index pQTL variant rs3810143) as an additional candidate gene near APOE. CSF NECTIN2 levels were associated with lower risk of AD (PWAS Z = -12, MR β = -0.685). The protein serves as a receptor for multiple Herpes Simplex Virus (HSV) forms69, which have repeatedly been linked to AD70. Recent work from our group has also identified an association in the NECTIN2 region for CSF TREM2 levels (in press), supporting its relevance in AD biology. Through integration of AD GWAS with CSF pQTL, we identify proteins that differ from those prioritized previously and may constitute new targets.
While the power of AD GWAS studies is constantly increasing, undiscovered associations that may be found by including larger samples size still exist. In support of this, we also identified twelve proteins (APOBEC2, C1S, CA12, CABLES2, CD33, CD72, CLN5, FCGR3B, HDGF, PTPA, SIGLEC9, and SIRPA, all cis) whose associated loci do not reach genome-wide significance in GWAS studies, but are associated with AD in our analyses. Of these proteins, three were supported by all three analyses (PWAS, COLOC, and MR): FCGR3B, CLN5 and SIRPA. On chromosome 1, higher genetically estimated CSF levels of FCGR3B (cis association, index variant rs4379692) were associated with lower risk of AD (PWAS Z=-4.45, P = 8.42×10− 6, MR β=-0.0195, P = 9.64×10− 8) at a locus that reaches suggestive significance for AD (PP.H4 = 0.937). Low FCGR3B copy number has been associated with autoimmune disease71, supporting an important role in immune function also associated with AD. We observed a negative association between cis-regulated levels of CLN5 and AD (chr13:77000848:A:G, PWAS Z=-4.21, PWAS P = 2.59×10− 5, MR P = 3.29×10− 5, PP.H4 = 0.946, Supp. Fig. S21). Variants in CLN5 cause a form of neuronal ceroid lipofuscinosis, a lysosomal storage disorder72. CLN5 defects have also been linked to retromer dysfunction, which has previously been identified in AD73,74. On chromosome 20, cis-regulated levels of SIRPA were positively associated with AD risk (rs17855616, PWAS Z = 4.41, PWAS P = 1.02×10− 5, MR P = 1.37×10− 5, PP.H4 = 0.959). Loss of SIRPA in microglia has been associated with increased synapse loss75 and its binding partner, CD47, is upregulated in synapses in AD and may act through shielding of synapses from microglia pruning76. These proteins highlight the need for continued growth of AD GWAS to larger sample sizes. For discussion of the other proteins, see the Extended Results.
In addition, we also identified 17 proteins contributing to AD through at least two of PWAS, MR, and/or COLOC whose associations are driven by trans-pQTL including TMEM132C, LRP6, CRADD, and STX17 among others (Table S41, Fig. 4C). These findings suggest novel protein-protein interactions that have not been identified before, similar to the case of TREM2 levels regulated by trans-pQTL in the MS4A locus7 that we have replicated here. First, the BIN1 locus is one of the regions with the strongest association with AD22. The causal link between BIN1 and AD is unclear, but it has been connected to increased risk through interaction with tau77. We found that levels of TMEM132C were associated with the BIN1 locus that colocalized with the AD risk signal (PP.H4 = 0.999, Supp. Table S39). CSF levels of TMEM132C regulated by this locus were associated with increased risk of AD (PWAS Z = 20.1, PP.H4 = 0.999, Supp. Fig. S21). While little is known about TMEM132C specifically, its family members have been linked to panic disorders78 and insomnia79, suggesting an important brain function potentially through interaction with the actin cytoskeleton80. This protein also has a pQTL at the APOE locus discussed above. We also observed trans-pQTL for both STX17 and CRADD at 11q24.3 near ADAMTS8 (index pQTL variant rs3740888). Mutations in CRADD cause intellectual disability81 and the protein contributes to apoptosis of neurons82, potentially mediating a protective effect on AD through increased apoptosis of diseased neurons. Like TMEM132C, CRADD is also associated with the APOE locus discussed above. STX17 is a member of the SNARE complex and is vital for the fusion of autophagosomes with lysosomes83 and interruption of this function is associated with axonal dystrophy, a feature of AD84. Both proteins were associated with decreased risk of AD at this locus (STX17 PWAS Z=-3.84 MR β=-0.062; CRADD PWAS Z=-3.53, MR β=-0.102). These analyses suggest that ADAMTS8, STX17 and CRADD are part of the same pathway, potentially contributing to disease through autophagosomal-lysosomal pathways. Additionally, we observed an association between CSF levels of LRP6 and intergenic variants in ADAM10 at 15q21.3 (index pQTL variant rs1427281). ADAM10 contributes to cleavage of amyloid precursor protein into neuroprotective components85 and overexpression of it in a mouse model of AD was found to reduce amyloid plaque burden86 and lessen memory deficits87. Levels of LRP6 regulated by this locus were associated with increased risk of AD (PWAS Z = 5.81, MR β = 0.189). LRP6 has been found to function in Wnt signaling vital to synapse function and variants in LRP6 have been associated with synapse degeneration in AD88. Variants at this locus may contribute to AD by increasing secretion of LRP6 to CSF and thereby decreasing the cellular levels of the protein. The integration of trans-pQTL with AD performed here identifies novel protein-protein interactions that help to explain the disrupted pathways observed in AD. All other proteins prioritized through trans associations are discussed in the Extended Results.
Consistent with our cell-type results that showed enrichment of our 42 proteins in microglia/macrophages (N = 10, P = 0.008, FC = 2.01) and mature astrocytes (N = 8, P = 0.03, FC = 1.76, Fig. 4D, Supp. Figure 11, Supp. Table S26), these AD-associated proteins were enriched in immune function (GO:Regulation of immune response, P = 1.58×10− 5, Fig. 4E, Supp. Table S42). Immune-relevant proteins include TREM2 (cis association at 6p21.1 and trans at 11q12.2), which is well-known for its role in microglia’s response to neurodegeneration89. It is targeted for processing to the soluble form by ADAM1090, whose genetic locus was found to also regulate LRP6 levels above. TREM2 complexes with DAP12, which regulates activation of microglia through its immunoreceptor tyrosine-based activation motif (ITAM)91. IL34, also identified here, interacts with CSF1R on the surface of microglia leading to microglial activation92. Several of our identified proteins, including PILRA93, CD3394, and SIGLEC995, also contain ITAM or immunoreceptor tyrosine inhibitory motif (ITIM) domains. ITAM-containing proteins in coordination with ITIM-containing proteins regulate processes including microglial phagocytosis and apoptosis96. We also identified other proteins involved in immune response. Variants in SHARPIN were associated with AD through reduction of the NF-κB-mediated immune response97. C1S and C4BPA are both involved in the complement system98,99.
Another relevant pathway for these proteins involves lysosomal dysfunction which is known to be implicated in AD pathogenesis100. Our analyses not only support this process but also extend its ties to AD by adding novel causal proteins linked to the lysosome (GO:Lysosomal lumen acidification including CLN5, GRN, TMEM106B, P = 1.23×10− 5, Fig. 4E, Supp. Table S42). As stated above, mutations in CLN5 contribute to a form of neuronal ceroid lipofuscinosis72. Neurons deficient in CLN5 protein showed impaired lysosomal activity and movement, suggesting the vitality of CLN5 to lysosome function72. In AD, autophagosomes showed an inability to fuse with lysosomes, causing accumulation in neurites101. In addition, STX17, associated in trans near ADAMTS8, is vital to the normal function of this process and depletion of STX17 leads to autophagosome accumulation83,102. We also identified an increased risk for AD associated with CTSH (Fig. 4B, Supp. Table S33). This is consistent with a previous work, wherein increased mRNA levels of CTSH were observed in AD patients and microglia with CTSH knockouts showed increased Aβ phagocytosis103. Cathepsins are known to function in proteolysis of lysosomal proteins104. Interestingly, CST8, a protein we associated with decreased risk of AD, is part of a protein family that inhibits the activity of cathepsins, offering a potential mechanism for its protective effect105. We also identified two proteins, GRN and TMEM106B, that are typically associated with frontotemporal dementia (FTD)106,107 but were also identified in the latest AD risk GWAS22. Both proteins are implicated in lysosomal function108,109. We prioritized two pathways through which AD pathology may develop. First, we identified numerous proteins involved in immune function, prioritizing those involved in ITAM/ITIM signaling. Second, we characterized potential functions of lysosomal proteins through interruption of autophagosome-lysosome fusion.
Finally, we searched DrugBank110 to identify potential therapeutic agents targeting the identified AD-associated proteins that may be repurposed. Of the 42 prioritized proteins, 15 had at least one reported molecule in DrugBank (Fig. 4C). FCGR3B is targeted by cetuximab (DrugBank accession number DB00002), a chemotherapy agent. Cetuximab has been associated with decreased risk of AD111, making it a potential drug-repurposing target (Supp. Table S5). Glutamic acid (DB00142) targets SLC25A18 and is heavily involved in excitatory signaling in neurons112. SLC25A18’s function as a mitochondrial glutamate transporter may be abnormal in AD113. CASQ1 is a ligand for calcium (DB11093), and studies have suggested that targeting calcium receptors may be beneficial in AD114. Finally, CA12 is targeted by carbonic anhydrase inhibitors like acetazolamide (DB00819), which was shown to reduce Aβ-induced mitochondrial toxicity115. All associated drugs can be seen in Supp. Table S5. Through our analyses, we identify numerous causal and druggable proteins for AD. These drugs, or others developed to target these proteins, may offer promising targets for AD treatment in the future.
Identified proteins successfully predict AD biomarker status
The identification of accurate tools to prioritize individuals at risk of AD is a vital focus for the field. We sought to compare the predictive power of the prioritized proteins to classify individuals who were positive for two established AD biomarkers: CSF amyloid beta-42 and CSF phosphorylated tau-181 protein. We developed a proteomic risk score based on the transcriptional risk score (TRS) approach116(Fig. 5A). Briefly, we started with the normalized aptamer levels (z-score) for all proteins associated with AD through PWAS (N = 456) in each individual. Using the TRS framework, we harmonized the aptamer levels so that higher proteomic risk score would always increase AD risk. The average per-individual PWAS protein z-score was significantly associated with amyloid/tau status (P for A-T- vs A + T + = 9.04×10− 116, Fig. 5B). We then split our dataset into training and testing datasets (training: ADNI, FACE; testing: MAP, Barcelona-1, consistent with our discovery and replication for the pQTL mapping). We used LASSO regression to select significant predictors (N = 162, aptamers with non-zero weights are denoted in Supp. Table S33) in the training dataset. We then used the calculated weights (designated ProtRS) to predict amyloid/tau (AT) positivity in the testing dataset, and additional independent dataset (Stanford Dataset), as well as in age-stratified and APOE genotype-stratified analyses.
The ProtRS showed highly accurate prediction of AT status in both the training (NA−T− = 355, NA+T+ = 471, AUC = 1, Supp. Fig. S23) and in the testing datasets (NA−T− = 429, NA+T+ = 261, AUC = 0.968, 90% CI = 0.955–0.981, Fig. 5C, Supp. Tables S44 & S45). In order to further validate this prediction model, we analyzed a third independent cohort (Stanford ADRC, NA−T− = 58, NA+T+ = 21). These samples were quantified using the SOMAscan5k assay, so around half of the proteins in the model were missing (288/456 measured, 106/162 significant predictors measured). Even with potentially losing some effectiveness due to missing measurements, we still observed high predictive ability in the independent dataset (AUC = 0.958, CI = 0.920–0.996, Fig. 5D, Supp. Table S44).
Given the prioritized proteins were identified through genetics, we also compared the ProtRS to a model based on a polygenic risk score calculated using the AD risk GWAS22 (including the APOE region) with age and sex as well as a model combining APOE ε2 and ε4 genotype with age and sex. Both genetics-focused models showed AUC similar to other previous studies for PRS (AUCPRS = 0.759, CIPRS = 0.729–0.790; AUCAPOE = 0.784, CIAPOE = 0.748–0.820, Fig. 5C, Supp. Table S44) but the ProtRS showed significantly increased AUC compared to both models (ProtRS vs PRS P < 2.2×10− 16, ProtRS vs APOE P = 8.56×10− 15, Supp. Table S45). These results suggest a highly accurate and repeatable prediction model for AT status. We find that proteins prioritized through genetics significantly improve upon variant-based prediction models, supporting the use of genetics to prioritize potential disease biomarkers.
As protein levels change with time while genotype stays static, we wanted to determine if the ProtRS could predict AD risk at different ages. We stratified our samples by age bracket and compared our ProtRS to PRS (including sex; Fig. 5E, Supp. Tables S44 & S45). Our ProtRS model showed improved predictive ability compared to PRS for individuals in all age brackets (45–60, 61–65, 66–70, 71–75, 76–80, and 81+; P ≤ 5.80×10− 9, ΔAUC ≥ 17.2%, Fig. 5F). The ProtRS was especially accurate at predicting status in younger individuals compared to PRS, exhibiting an AUC of 0.981 in the 45–60 bracket (CI 0.955-1.00) compared to only 0.702 when using PRS (CI = 0.620–0.785, P = 5.80×10− 9, ΔAUC = 0.279). This highly significant improvement also persisted for individuals aged 61–65 (P = 7.54×10− 12, ΔAUC = 0.255).
PRSs for AD are heavily dependent on the inclusion of APOE ε2 and ε4 genotype, with knowledge of those genotypes alone being more predictive than a pruned model containing the other genome-wide variants117. We stratified our samples by APOE genotype to analyze the effectiveness of the ProtRS to predict AT status compared to PRS excluding the APOE region. The PRS model also included age and sex as covariates. The ProtRS outperformed the PRS with a significantly higher AUC across all genotype categories (ΔAUC ≥ 0.189, P ≤ 5.70×10− 9, Figs. 5G & 5H). While both methods performed well for individuals with the APOE ε4 allele, the PRS decreased in effectiveness in APOE ε33 or ε2 carriers while the ProtRS was consistent across genotypes. When predicting AT status in ε2 carriers, the ProtRS had an AUC of 0.999 (CI 0.998-1.00) while the PRS showed an AUC of 0.690 (CI 0.603–0.778, P = 5.70x10− 9, ΔAUC = 0.309, Supp. Tables S44, S45). The ProtRS also had an AUC of 0.997 for predicting A/T status in ε33 carriers (CI 0.995–0.999), significantly better than the PRS with an AUC of 0.0.748 (CI 0.714–0.782, P < 2.20×10− 16).
Across ages and APOE genotypes, the ProtRS developed based on AD-associated proteins is highly accurate and consistently performs better than PRS at predicting amyloid/tau positivity. This is consistent with previous results, where risk scores based on molecules predict disease more effectively than genetic variants alone116,118. Our results were consistent across training, testing, and external replication datasets, supporting our model as being highly replicable. We also showed high accuracy in all age brackets and APOE genotypes, while the PRS was less accurate in younger individuals and those without the APOE ε4 allele.