Profiling and analysis of human brain proteome and transcriptome
To identify genetic variants influencing protein expression and understand its link to psychiatric disorders, we generated deep proteomic data of the frontal cortex from a total of 288 human post-mortem brain samples (Supplementary Table 1), including 211 normal, 29 BP, and 48 SCZ samples. These samples were analyzed by extensive fractionation (two-dimensional LC) and high-resolution, accurate mass, tandem mass spectrometry (LC/LC-MS/MS) to generate a deep human brain proteome (Fig. 2a). We identified and quantified a total of 19,272 proteins (14,221 genes) in at least one sample at the protein FDR < 1% (Fig. 2b) using 29 batches of 11-plexes tandem mass tags (TMT) experiments. Of these, 11,672 proteins (8,321 genes) were detected across all 288 samples (Fig. 2b; Supplementary Table 2). The vast majority of proteins (78.07%; 15,045/19,272) were detected in more than 25 batches (Fig. 2c). To our best knowledge, this is the deepest human brain proteome data to date available for studying the genetic regulation of protein expression. After extensive quality controls (QC) 32, we focused on 268 samples of high quality for the subsequent proteome-wide genetic regulation analysis. In addition, we also performed RNA sequencing of the frontal cortex from 416 human brain samples (Supplementary Table 1), and high-density genotyping using whole-genome sequencing for all samples. After stringent QC and sample identity verification 33 and normalization of gene expression quantifications and genotype imputation, we detected a total of 17,160 expressed genes (Supplementary Table 3) and ~8.1 million single nucleotide polymorphisms (SNPs) with minor allele frequency (MAF) > 5% as genotypes.
We next compared our proteomic to transcriptomic data genetated from matched 264 samples. The quantified proteins cover ~70% of the range of mRNA expression detected by RNA-seq, indicating a deep coverage of our proteomic data (Fig. 2d). Those undetected proteins whose corresponding RNAs also had low expression signals are likely due to either not being translated into proteins in the brain or the concentration under the detection limit by the mass spectrometer. A moderate positive correlation between expression levels of mRNAs and proteins (r = 0.44) was observed (Fig. 2e), which is consistent with previous findings 24,25,27. A subset of proteins showed high variability in expression (Fig. 2f), which were found to be mainly enriched in functional terms related to extracellular matrix (ECM) organization, blood microparticle, plasma lipoprotein particle, and integrin binding (Extended Data Fig. 1a), whereas lowly variable proteins were enriched in terms associated with the housekeeping functions, such as proteasome complex, regulation of mRNA stability, and regulation of cell cycle (Extended Data Fig. 1b).
Human brain proteome and transcriptome reveal the genetic architecture of expression regulation
To characterize genetic variants influencing the expression level of genes and proteins, we performed both proteome-wide and transcriptome-wide association analyses. To increase statistical power and reduce false positives 34,35, we removed variously measured and unmeasured confounding factors, such as experimental and technical batch effects (Extended Data Fig. 2a). By using the probabilistic estimation of expression residuals (PEER) program 36, we captured 99% of the hidden variance in proteomic data with 13 controlling factors (Extended Data Fig. 2b). Further correlation analysis indicated that the effect of various covariates on protein expression variation has been well-controlled (Extended Data Fig. 2c, d).
We first performed a genome-wide linkage analysis (Extended Data Fig. 3) of the expression of 11,672 proteins using the QTLtools program, identifying 788 cis-acting (or local acting) genomic loci (i.e., cis-pQTLs; within ±1 Mb from the transcriptional start site for each tested gene) that modulate 883 protein expression (i.e., pGenes) at the genome-wide FDR < 5% using the Storey q value method (Fig. 3a; Supplementary Table 4a). Note that we will use the term FDR and “q value” interchangeably for the linkage analysis hereafter. Most of the significant cis-pQTLs (10.8%; n = 180) are associated with more than one protein as opposed to 3.1% of the cis-eQTLs, suggesting that cis-pQTLs tend to be more pleiotropic than cis-eQTLs. We observed that 316 and 567 cis-pQTLs involved positive and negative effects on protein expression (q value < 5%; see methods), respectively. Of these, 44 cis-pQTLs have relatively large effect size (β > 0.5), including 14 and 30 cis-pQTLs with positive and negative effect size, respectively (Fig. 3b). In addition, we found that a trend for alleles with low frequency has a stronger effect on cis-pQTLs (Fig. 3c). We also detected 256 trans-acting (or distal-acting) loci that regulate 511 proteins at the genome-wide FDR < 5% (Supplementary Table 4b; p value < 2.33 × 10-11; Bonferroni Correction), of which 19 (3.9%) loci harbor both cis- and trans-pQTLs.
We next performed genome-wide linkage analysis for gene expression (i.e., eQTL) of 416 frontal cortex samples that include nearly all of the samples (264/268; Supplementary Table 1) used for proteome-wide analysis. Similarly, by removing hidden confounding factors for 17,160 expressed genes using the PEER program, we identified 9,333 significant cis-eQTLs that involve 9,960 genes (i.e., eGenes) at the genome-wide FDR < 5% (Extended Data Fig. 5a; Supplementary Table 5a), which include 569 cis-eQTLs with large effect size (β > 0.5; Extended Data Fig. 5b). We also found that expression level of a total of 271 genes is regulated by 259 trans-eQTLs (Supplementary Table 5b). Positional enrichment analysis showed that 56.28% (5,605/9,960) of a significant cis-eQTLs cluster within 10 kb of the transcription starting site (TSS) of its target genes (Extended Data Fig. 5c). In contrast, we found that 15.13% of cis-pQTLs are in exonic regions compared to 5.43% in cis-eQTLs (Fig. 3d). Interestingly, the vast majority (75.63%) of exonic cis-pQTLs are non-synonymous variants compared to 33.64% of non-synonymous exonic cis-eQTLs. The observation indicates that coding variants have a unique and great impact on protein expression, different from the finding that cis-eQTLs tend to be found in the TSS region observed in this study and previous GWAS and eQTL studies 37,38. The observation that cis-pQTLs are enriched in coding variants is also consistent with recent published proteome-wide association studies in AD 39 and in lymphoblastoid cell lines (LCLs) 40.
One notable example of cis-pQTLs is C14orf159 (UniProt ID: Q7Z3D6) with a mapping nominal p value of 1.57 × 10-72 (q value = 5.20 × 10-43; Fig. 3e), the second most significant cis-pQTL. Compared to the alternative allele (T/T), the reference homozygous allele (C/C) increases the protein expression by 4.59-fold (Fig. 3e, inset). C14orf159 is a mitochondrial matrix protein, involved in mitochondrial metabolism. Although C14orf159 has not been directly implicated in any psychiatric disorders, the protein-protein interaction (PPI) network of C14orf159 indicates that it could be associated with SCZ as it links to five proteins whose functions have been associated with SCZ, including PSMB4, GLS, GLS2, LRRN4CL, and C8orf82 (Fig. 3f). Among these five proteins, three of them were found to be a cis-eGene, including C15orf40 (p value = 1.51 × 10-11, q value = 1.94 × 10-6), GLS2 (p value = 3.96 × 10-6, q value = 4.09 × 10-3, and OPLAH (p value = 6.56 × 10-5, q value = 3.57 × 10-2). For example, GLS2 is an enzyme that catalyzes the hydrolysis of glutamine into glutamate and ammonia, which is highly expressed in the brain. GLS2 was also found to be hypermethylated in SCZ male samples 41 and its expression is altered in the pathology of SCZ 42.
Among 256 trans-pQTLs identified in this study, 11 were found to modulate the expression of more than five proteins (Extended Data Fig. 4). For example, a trans-QTL (rs77546871) in WW domain-containing oxidoreductase (WWOX) protein regulates the expression of 27 downstream proteins (Extended Data Fig. 4; blue lines). WWOX is also a significant cis-pGene (p value = 9.19 × 10-10, q value = 1.57 × 10-72). WWOX has been implicated in signaling pathways regulating the central nervous system (CNS) development and neural differentiation 43, and dysfunction of this gene has been found to result in reduced GABA-ergic inhibitory interneuron numbers in mice 44. GWA studies have also identified WWOX as a risk gene for common neurodegenerative conditions, such as SCZ 45, AD 46, and autism 47.
As expected, cis-eQTLs with larger effect sizes are enriched for risk genes 48. We found that 11 out of 569 eGenes with large effect size (β > 0.5) were associated with SCZ (Extended Data Fig. 5b). For example, the lead cis-eQTL for GSTM5 (Glutathione S-Transferase Mu 5) was strongly associated with both gene expression (p value = 2.78 × 10-74, q value = 3.01 × 10-48) and protein expression (p value = 8.63 × 10-28, q value = 3.52 ×10-21) (Extended Data Fig. 5d, e). GSTM5 is a member of the antioxidant glutathione S-transferase family, playing an important role in protective mechanisms against oxidative stress in SCZ pathogenesis 49. Genetic polymorphisms of several members of the glutathione S-transferase family, such as GSTM1, GSTT1, GSTP1, and GSTA1, have been associated with the pathophysiology of SCZ 50.
Extensive colocalization between cis-pQTLs and cis-eQTLs
To investigate whether the association signals regulating gene and protein expression levels are driven by the same genetic variant, we performed colocalization analysis for 883 cis-pQTLs and 9,960 cis-eQTLs using the coloc program 51. The colocalization analysis estimates five posterior probabilities (PP0, PP1, PP2, PP3, and PP4) (see methods). We identified 724 pairs of significant cis-pQTLs and cis-eQTLs within 1 Mb region (upstream or downstream), harboring 664 unique genomic loci. Among these, we found 465 pairs of colocalized QTLs (i.e., 423 genomic loci) with evidence of the same cis-pQTL and cis-eQTL signals (PP4 > 0.50; Fig. 4a; Supplementary Table 6a). Over-representation analysis found a significant enrichment (Fisher exact test; p-value = 5.8 × 10-3) of colocalized cis-QTL signals. These colocalized pGenes are enriched for pathways associated with psychiatric disorders, such as carbon and diverse amino acid metabolisms, and peroxisome (Extended Data Fig. 6a),
The vast majority (79.57%; 370/465) of co-localized signals have the same direction of effect, with a significant positive correlation of effect size (r = 0.40; p value = 2.2 × 10-16; Fig. 4b). This observation largely supports the fact that an increase in effect size of cis-eQTLs is indicative of an increase that of cis-pQTLs. In contrast, only a moderate correlation of expression levels between colocalized proteins and genes (r = 0.21; p value = 7.1 × 10-4) was observed (Fig. 4c). Furthermore, we found that the effect size of colocalized cis-pQTLs is smaller than that of its corresponding cis-eQTLs (Extended Data Fig. 6b,c), in agreement with the previous observation in AD 39. In addition, we found 83 cis-pQTLs showing a different cis-eQTL signal within the cis window (PP3 > 0.5; not colocalization), and 30 QTLs being either pQTL-specific or eQTL-specific (PP1 > 0.5 or PP2 > 0.5). As an example, we illustrate serine racemase (SRR) protein to show the colocalization of cis-eQTL and cis-pQTL signals, which had a large posterior probability of colocalization (PP4 = 0.99). SRR was identified as a significant cis-pQTL (p value = 2.64 × 10-19, q value = 3.71 × 10-13), and a significant cis-eQTL (p value = 4.88 × 10-87, q value = 5.85 ×10-75) (Fig. 4d-f), supported by a significant positive correlation of expression at gene and protein levels (Fig. 4g). SRR is a highly expressed protein in brain acting as an endogenous ligand of N-methyl d-aspartate (NMDA) receptors. Disruption of the SRR protein was shown to reduce the function of NMDA receptors and is associated with the susceptibility to SCZ 52.
Mediation analysis reveals mediation of protein regulation
To explore whether the protein expression is dependently regulated by its cis-pQTL via the corresponding mRNA transcription (Fig. 5a), we performed a conditional mapping for the protein expression using the corresponding gene expression as a co-variate (Extended Data Fig. 7a). For 411 colocalized pGenes, we found that expression level of more than half of the proteins (262/411; 63.75%) is regulated by transcription (Fig. 5b; Supplementary Table 7), suggesting that these protein regulation were largely regulated through transcriptional mechanisms. The remaining 149 transcription-independent pGenes might be mediated through other unknown genetic mechanisms, such as post-transcriptional processes, translation mechanisms, or protein degradation.
This transcription-dependent protein regulation was supported by a modest correlation (r = 0.34) between transcripts and proteins (Fig. 5c), which is significantly higher (p value = 2.2 × 10-16) than those transcription-independent pGenes (r = 0.14). However, the effect size of transcription-mediated pGenes was significantly lower as compared to that of transcription-independent pGenes (Fig. 5d; 0.11 vs 0.18; p value = 6.5 × 10-6), suggesting direct genetic effects on protein abundance tend to be stronger than the mediation effects. As expected, the vast majority of transcription-dependent pQTLs are found to be in the genomic regulatory regions, whereas transcription-independent pQTLs are enriched in exonic regions (Extended Data Fig. 7b).
To illustrate the transcription-dependent regulation, we highlighted an example of transient receptor potential cation channel subfamily V member 2 (TRPV2), an ion channel protein. TRPV2 showed a signficant cis-pQTL (p value = 2.93 × 10-45, q value = 2.11 × 10-27), but the signal was abolished after conditioning on gene expression as a co-variate. TRPV2 level at gene level is also regulated by a significant cis-eQTL (p value = 3.70 × 10-97, q value = 2.69 × 10-58) (Fig. 5e). The expression levels of transcript and protein are highly correlated (Fig. 5e; inset). For the transcription-independent regulation, a significant cis-pQTL was found to regulate DHTKD1 protein abundance without being regulated by its transcription (Fig. 5f). As expected, there is a lack of correlation between protein and transcript abundance (Fig. 5f; inset). Interestingly, DHTKD1 protein was found to be a transcription-mediated regulation in mouse liver 53.
Causal contribution of cis-pGenes and cis-eGenes to psychiatric disorders
We next sought to identify genomic loci that associate psychiatric disorders through genetic effects on gene and protein expression. We evaluated 313 SCZ genomic loci identified by a meta-analysis of data from a recent publication by the Psychiatric Genomics Consortium (PGC) 54. We used the summary-based Mendelian randomization (SMR) analysis coupled with the heterogeneity independent instruments (HEIDI) test 55. We identified 4 pGenes that passed both the HEIDI heterogeneity test (PHEIDI > 0.05) and the SMR significance threshold of PSMR < 6.3 × 10−5 (0.05/790; p value = 0.05 corrected by the total number of pGenes) (Fig. 6b; Supplementary Table 8a). We also detected 19 eGenes that passed the HEIDI heterogeneity test (PHEIDI > 0.05) and the significance threshold of PSMR < 5.3 × 10−6 (0.05/9,495; corrected by the total number of eGenes) (Fig. 6b; Supplementary Table 8b). Among these pGenes and eGenes with significant cis-pQTL and cis-eQTL, 2 pGenes and 13 eGenes were also prioritized for the SCZ GWAS loci 54. Note that SMR analysis cannot distinguish causality from pleiotropy. In addition, one protein (BTN2A1) and 21 genes showed PHEIDI < 0.05 from the HEIDI test, suggesting that expression and GWAS are likely to be driven by different variants in the same linkage disequilibrium block.
An example of the causality effect of pGene on SCZ is DARS2, a mitochondrial aspartyl-tRNA synthetase. The SMR analysis detected a significant association between DARS2 protein expression and SCZ (PSMR = 1.66 × 10−5 and PHEIDI = 0.16). DARS2 is a significant pGene (p value = 5.88 × 10-21, q value = 4.93 × 10-15) (Fig. 6c), which is a highly expressed in the brain and has been identified as the strongest causal gene of SCZ in an independent GWAS 17. As an example of a significant causal association between eGenes and SCZ, CUL9 (cullin-9) exhibited a significant association between gene expression and SCZ, with PSMR = 1.78 × 10−7 and PHEIDI = 0.11 (Fig. 6d). CUL9 is a parkin-like ubiquitin ligase that has been prioritized as a candidate gene for an SCZ GWAS locus 54.
Integrative analysis prioritizes proteins for psychiatric disorders
Previous studies have shown that molecular endophenotypes with a significant QTL (e.g., eQTLs, methylation QTLs (mQTLs), and pQTLs) tend to influence complex diseases 38, and they can be harnessed to prioritize risk genes in GWAS 56. Although it is currently difficult to pinpoint a causal gene for GWAS loci, the prioritized genes/proteins would be plausible candidates to unravel biological mechanisms underlying the GWAS associations. In this study, we attempt to establish a framework to systematically prioritize risk genes for 294 significant GWAS loci (p value < 5 × 10−8) and 290 suggestive loci (5 × 10−8 < p value < 1 × 10−6) with small effect size 53,55.
We sought to combine multiple data sets to prioritize genes/proteins for GWAS loci using order statistics (Fig. 7a). Five data sets were included for the prioritization, including pGenes ranked by cis-pQTL p values, eGenes ranked by cis-eQTL p values, co-localization between cis-pQTLs and cis-eQTLs ranked by PP4 values, and disease relevance score with SCZ by the GeneCards database, and connectivity score ranked by the number of downstream SCZ risk genes in protein-protein interaction (PPI) network (see Methods; Extended Data Fig. 8a). To derive the PPI network connectivity score, we first extracted high-confidence PPI with a score ≥ 700 (mean score: 295, range: 150–999) and kept those nodes with cis-pGenes or cis-eGenes (Extended Data Fig. 8b; Supplementary Table10), yielding an SCZ network with 2,011 nodes and 3,118 protein-protein interactions (Extended Data Fig. 8b). We used order statistics to generate a final ranking score, followed by identifying candidate genes for GWAS loci. To further assess the cell-type-specific differential expression of these ranked proteins, we also downloaded single-cell transcriptomic data generated from 48 postmortem human prefrontal cortex samples, including 24 schizophrenia cases and 24 controls 57 and mapped differential expression events and expression abundance of 20 cell types to our ranked proteins (Supplementary Table 10). The final ranking result showed that the top-ranked 60 proteins include 8 candidate genes from the 313 significant PGC SCZ phase 3 GWAS loci (Fig. 7b; Extended Data Fig. 9; Supplementary Table 9) and 2 additional candidate genes for other SCZ GWAS loci (Supplementary Table 10). Single-cell transcriptomic data 57 support that 43 out of the top 60 proteins are schizophrenia differentially expressed genes (SZTR, defined as “schizophrenia transcriptional resilience”).
A major challenge in GWAS is unable to detect loci with small effects due to low statistical power 58. We next leveraged our prioritized proteins to identify candidate risk genes for 311 suggestive GWAS loci with a smaller effect size. We found 2 out of the top 60 ranked genes in suggestive GWAS loci (Fig. 7b), For example, PPP2R4 is prioritized as a candidate gene for a GWAS locus rs6478858. This is also supported by evidence that PPP2R4 is functionally associated with SCZ risk genes that include MAPT, PPP2R2A, FOXO3, AKT3, RERE, RSMO6, PSMA4, and MAD1L1 (Fig. 7c). PPP2R4 showed a colocalized significant cis-pQTL (p value = 2.70 × 10−15, q value = 1.58 × 10−8) and cis-eQTL (p value = 8.92 × 10−16, q value = 2.31 × 10−12) (Fig. 7d). The reference homozygous allele (G/G) increases protein and gene expression levels by 1.16-fold and 1.14-fold compared to the homozygous alternative allele (A/A), respectively (Fig. 7e, f). Single-cell transcriptomic data showed that PPP2R4 decreases the expression level in microglia and is differentially expressed in AD compared to control samples (Extended Data Fig. 9). These results suggest that our protein prioritization is a powerful strategy for identifying candidate genes in GWAS loci with small effects.