Comparison of DNA methylomes with transcriptomes reveals tissue-specific signatures unique to methylation
To investigate the contribution of tissue type to the similarity of DNAm-profiled samples, we reduced the high-dimensional methylome dataset to two dimensions, which revealed sample similarity by tissue of origin (Fig. 1a). We identified robust methylation-derived tissue clusters (Methods) that reflect shared yet distinct patterns compared to transcriptome-derived clusters (Fig. 1b). Observed patterns suggest that the biological processes that drive methylome similarity across tissues differ, in part, from corresponding transcriptome ones, and that blood, despite being the most widely studied biospecimen type for DNAm, is not a suitable proxy of solid tissue DNAm patterns.
Local correlation between DNA methylation and expression can be tissue-specific or shared across tissues
To characterize the interdependence between DNAm and expression, we mapped DNAm associations with local gene expression (eQTMs) across tissue types. The number of CpGs correlated with at least one gene (eCpGs) ranged from 1,542 in testis to 4,568 in ovary (Supplementary Table 2). We assessed replication of eQTMs in a muscle-derived external dataset 19 by estimating the true positive rate (π1) of eQTMs identified in each GTEx tissue type. We observed a high replication rate (cross-tissue average π1 = 0.75), confirming the reproducibility of the eQTMs mapped herein despite limited samples available. Similarly to transcriptome- and eQTL-derived patterns observed in previous GTEx analyses 3,4,9, eQTMs tend to be either tissue-specific or shared across most tissue types (Supplementary Figure 1).
To investigate how accurately the CpG-gene assignments provided by Illumina reflect the eQTM results observed in GTEx data, we contrasted our eQTM findings with the EPIC array CpG annotation file provided by Illumina. We observed that only 45% (2,641/5,898) of detected eCpGs match an Illumina CpG-gene association (Supplementary Note). Thus, our eQTM predictions can be utilized to assign CpGs to gene(s) with which they are biologically linked.
The association of DNA methylation with expression of nearby genes differs by regulatory elements
To functionally characterize the relationship between CpGs and genes, we identified enrichments (Fisher’s exact test FDR < 0.01) of eCpGs in gene regulatory elements compared to CpGs uncorrelated with gene expression. Promoters and proximal enhancers are strongly enriched (OR = 4.55 and 4.17, respectively) for eCpGs, as well as insulators (OR = 2.32) and distal enhancers (OR = 1.74). Overall, 54% (3,188/5,898) of eCpGs overlap with gene regulatory elements. Using logistic regression, we examined additional factors for evidence of contribution to eQTM presence (Supplementary Table 2), which revealed distinctive signatures by regulatory element class (Fig. 1c). Proximity of CpG to gene transcription start site (TSS) increases the likelihood of the eQTM association ubiquitously across regulatory elements, but high CpG methylation and corresponding negative correlation with gene expression are only predictive of eQTMs linked to promoters and proximal enhancers. Distal enhancer and insulator eQTMs are enriched for low-methylated eCpGs, and low gene expression appears to be predictive of insulator-linked eQTMs exclusively.
Together, these results suggest that DNAm in CpG sites is associated with transcription of cis-genes through their regulatory elements, as described 27. However, they also indicate heterogeneity of the biological mechanisms driving eQTMs, that depend on the class of regulatory elements involved, compatible with patterns observed in blood cells 23,28.
Genetic regulation of DNA methylation in cis exhibits tissue specificity
To characterize the genetic regulation of DNAm across tissues, we mapped genetic variants that affect DNAm levels of proximal CpG sites in cis (mQTLs). We first analyzed each tissue separately by fitting a linear model that accounts for technical and biological factors related to DNAm variability, and identified significant (FDR < 0.05) mQTL CpG sites in cis (mCpGs). Subsequently, we modeled mQTL effects jointly across tissues to achieve increased QTL-mapping power by leveraging tissue-shared effects 29. Additionally, we mapped secondary cis-QTL signals using a stepwise regression procedure 30. To be able to compare mQTL to cis-eQTL patterns, we employed an analogous QTL-mapping approach to identify eQTLs.
We detected a total of 286,152 mCpGs, ranging from 108,844 in testis to 206,802 in lung (Fig. 2a), of which 45,543/286,152 (16%) have secondary signals in at least one tissue. For blood and muscle tissues, we quantified the mQTL replication rate in external datasets and observed a high replication rate (cross-tissue average π1 = 0.92). For a particular CpG, genetic regulation of DNAm tends to be either highly tissue-specific or highly shared across tissue types (Supplementary Figure 2), similarly to the observed pattern reported for eQTLs and splicing QTLs (sQTLs) 4. On average, 37% of mCpGs observed in a single tissue are also present in all the remaining tissues (Fig. 2b), but only 5% of the identified mCpGs were detected as mCpGs exclusively in a single tissue. Compared to eQTLs, mQTLs appear to be significantly (Wilcoxon rank-sum test P < 0.05) more shared across tissues, as observed in blood cells 23.
Functional mechanisms that drive genetic regulation of DNA methylation differ from gene expression
To characterize differential molecular mechanisms of mQTLs relative to eQTLs, we integrated annotation of CpG islands (CGIs), genomic functional elements, and chromatin states (Supplementary Table 3) and performed within- and meta-tissue enrichment analyses (Methods). We observe that eQTLs are more strongly enriched in open chromatin sites than mQTLs (Supplementary Figure 3). While both eQTLs and mQTLs are enriched in gene regulatory regions (Fig. 2c), only eQTLs are enriched in CGIs and gene transcripts, particularly in splicing and untranslated exon regions (UTR), as previously described 4. Conversely, mQTLs are depleted from CGIs and genes, as previously observed in blood 20,25, but are strongly enriched in distal enhancers. Compatible patterns are observed when analyzing tissue-specific extended chromatin state predictions matching mQTL tissue source (Supplementary Figure 3). DNAm QTLs also show enrichment in putative insulators (Fig. 2c). To further characterize additional mQTL-associated transcription factors (TFs), and to distinguish between mQTL and eQTL distinctive TF binding sites (TFBS) associations, we integrated empirical annotation of TFBSs corresponding to 339 TFs. Considering corresponding TFBSs, we identified 126 TFs significantly enriched (FDR < 0.01, Odds Ratio (OR) > 1) in eQTLs or mQTLs (Supplementary Table 3). We observed remarkably different TFBS enrichment profiles for mQTLs compared to eQTLs. The eQTL enrichments with the smallest p-values across tissues correspond to TFs involved in basal transcription, e.g. RNA Polymerase II genes. Conversely, mQTLs are enriched, among other TFBSs, in binding sites of steroid receptors, e.g. ESR1 and NR2F2, and other proteins known to be involved in 3D organization of the genome (Supplementary Table 3, Fig. 2d).
Altogether, these results suggest that mQTLs and eQTLs largely diverge in their underlying biological mechanisms, driven by mostly distinct sets of TFs. While eQTLs result from variants altering gene body and regulatory elements, mQTLs result in part from variants altering non-genic, distal regulatory elements, elements bound by proteins involved in chromatin spatial conformation and long-range interactions, including insulators. The location of mQTLs with respect to open chromatin and actively transcribed regions indicates that, compared to eQTLs, mQTLs are more likely to reside in proximal regulatory regions that appear inactive in the mQTL-mapping context analyzed herein.
Genetic co-regulation of DNA methylation and gene expression is not pervasive and exhibits heterogeneity across regulatory elements
Given their divergent genomic enrichment profiles, it is expected that mQTLs and eQTLs are driven, at least in part, by different causal variants. To quantify the extent of eQTL-mQTL pairs (e/mQTL) that share a predicted causal variant, we performed e/mQTL colocalization, and observed that the proportion of detectable mQTLs that show clear colocalization with a detectable eQTL is moderate (Fig. 2e), as only 21% of mQTL loci are suggestively colocalized (PP4 > 0.5) with at least one eQTL. Despite limitations in accurately estimating this fraction, our results indicate that a considerable fraction of mQTLs do not show clear associations with local gene expression in the context analyzed herein.
Among e/mQTL colocalized variants, the direction of the effect on methylation and expression is significantly (exact binomial test P < 2.2e-16) more often (53%) in the opposite direction, as previously observed 26. However, we observe significant (test of equal proportions P < 2.2e-16) differences between regulatory regions; the proportion of mCpGs corresponding to opposite e/mQTL effects tends to be larger in eGene-matching promoters and proximal enhancers (61%) than in distal enhancers (52%) and minority (39%) in insulators. These observations are in line with the view that hypomethylation in proximal gene regulatory regions is associated with active transcription.
Genetic regulation of DNA methylation is characterized by molecular regulatory pleiotropy
In order to characterize the molecular pleiotropic nature of mQTLs, we quantified the number of mCpGs and eGenes involved in loci harboring mQTL-eQTL colocalizations (Methods). Overall, we observe pervasive pleiotropy; the majority (78%) of colocalized eQTLs-mQTLs impact multiple mCpGs, and a considerable minority (28%) impact multiple eGenes (Supplementary Figure 4). The largest pleiotropic set, identified in ovary, is led by variant rs6433571 and involves 114 mCpGs and 8 eGenes in the HOXD gene cluster region (Fig. 3a), associated with epithelial ovarian cancer 31. This pleiotropic effect is not driven by ovary-specific gene expression (Fig. 3b) but by ovary-specific genetic regulation of DNAm and expression (Fig. 3, c and d). By means of QTL-GWAS colocalization, we identified 112/114 mCpGs and 7/8 eGenes as significantly (PP4 > 0.5) colocalized with ovarian cancer risk (Supplementary Table 4), including the HOXD1 and HOXD3 genes which have a suspected role in genetic risk of ovarian cancer 32,33, as well as less characterized genes, e.g. non-coding gene HAGLR. Together, these findings provide an illustrative example of how genetic variants can drastically modify the DNAm landscape in a long-range and tissue-specific manner, altering gene expression and impacting disease risk.
Genetic regulation of DNA methylation impacts complex trait associations extensively
Prior studies 17–20 have provided evidence that mQTLs can be associated with human phenotypes. To evaluate the impact of mQTLs on traits in a systematic manner, and compare their effects to those of eQTLs, we integrated QTLs with genome-wide association study (GWAS) summary statistics of 87 GWASs, 83 of which had at least one QTL-overlapping GWAS hit (P < 5e-08), and were therefore tested for colocalization with a robust multi-method approach (Supplementary Figure 5).
Across all GWASs, tissues, and QTL types, we identified a total of N = 12,922 significant (RCP > 0.3 and PP4 > 0.3) QTL-GWAS colocalizations - named simply ‘colocalizations’ (Supplementary Table 5). We observed that mQTL colocalizations were more abundant than eQTL colocalizations for almost all (91%) of GWASs (Fig. 4). The overlap between eQTL- and mQTL-GWAS colocalizations is moderate, with 27% (749/2,734) of GWAS hits colocalizing with both QTL types (e/mQTL-shared), 55% of hits colocalizing with at least one mQTL but with no eQTLs (mQTL-specific), and 18% of hits colocalizing with at least one eQTL but with no mQTLs (eQTL-specific).
These results highlight the importance of integrating different types of -omics data from different tissue sources, and considering secondary QTL signals, to maximize the expectation of identifying molecular links to inheritable traits.
Genetic regulation of DNA methylation facilitates the fine-mapping of trait-associated causal variants and characterization of regulatory mechanisms
Among mQTL-specific colocalizations, we identified an ovary-specific mQTL association (rs2853669-cg07380026, P = 6.7e-13) colocalized (PP4 = 0.84) with a breast cancer GWAS signal in the TERT locus (Fig. 5a). TERT expression is mostly undetectable in adult tissues but high in tumors, and the locus harbors multiple independent variants associated with several types of cancer risk 34. Another example of a mQTL-specific colocalization is an ovary-specific, secondary mQTL association (rs7161194-cg05029961, P = 8.0e-15) that colocalized (PP4 = 0.98) with a body mass index GWAS signal in the microRNA-rich MEG9 locus (Fig. 5b), for which colocalization could not be identified considering the primary mQTL signal. The mVariant rs7161194 affects cis-microRNAs’ expression 35.
For 19% (144/749) of the colocalized e/mQTL-GWAS shared loci, the mQTL-GWAS association shows greater colocalization probability than the eQTL-GWAS association in at least one tissue and/or independent QTL colocalization. We observe cases where the additional resolution to define potentially causal variants brought by mQTLs is small due to almost perfect linkage disequilibrium (LD) between lead mQTL and eQTL variants, as in the breast cancer linked NTN4 locus (Fig. 5c). However, in multiple instances the lead e/mQTL variants are in moderate or low LD, and the mQTL mirrors the GWAS association substantially more optimally than eQTL does, as in the hypertension-associated MYO9B locus (Fig. 5d). We also observe cases where the GWAS locus harbors association signals compatible with the existence of multiple independent causal variants, where GWAS colocalization with eQTLs and mQTLs contributes to differentiate and characterize these independent causal variants by their distinct colocalization patterns, as in the EFEMP2 locus linked to asthma (Fig. 5e). Together, these results provide evidence that integration of e/mQTL-GWAS colocalization signals can aid the fine-mapping of the causal variant(s) and better characterize the molecular mechanisms underlying complex traits, as shown 2,36,37.
Trait-linked methylation quantitative trait loci exhibit molecular regulatory pleiotropy and enrichment in trait-relevant tissues
Identification of QTL associations with traits in relevant tissue(s) can provide insights into their underlying genetic and molecular mechanisms 6. By analyzing the observed proportions of mQTL-GWAS colocalizations per tissue, we identified 18/65 traits with a disproportionate (test of equal proportions FDR < 0.05) amount of colocalizations in at least one tissue (Fig. 6a). Overall, the tissue with the largest proportion of colocalizations per trait matched the prior given current biological knowledge. For instance, blood clot and cell count traits were enriched in colocalizations derived from whole blood mQTLs, and breast cancer was nominally (Fisher’s exact test P = 0.02) enriched in breast-derived mQTLs. For traits where the observed tissue link is less obvious, the observed enrichment can be artifactual or it could point to an uncharacterized role of a specific tissue in the trait’s biology. Together, these results suggest that mQTLs are informative of complex traits’ relevant tissues and can thereby aid the characterization of trait etiology, as observed for eQTLs 5–7 .
It is expected that many QTLs that impact traits do so by exhibiting molecular regulatory pleiotropy, i.e. altering multiple molecules and/or molecular phenotypes. It has been shown that eQTLs where the lead eVariant regulates multiple eGenes are more likely to yield a trait association than eQTLs that regulate a single eGene 4. However, the effect that regulatory pleiotropy plays in mQTL-trait associations has not been extensively characterized. Here, we observe that mQTL-GWAS colocalizations are enriched in mVariants regulating multiple, as opposed to single, mCpGs (OR = 2.65, Fisher’s exact test P < 2.2e-16). Among trait-linked mCpGs that colocalize with at least one eGene, we observed enrichment (OR = 1.40, P = 6.3e-08) for trait colocalizations involving multiple mCpGs and eGenes (Tier 4 in Supplementary Figure 3), as in HOXD locus (Fig. 3a). These results indicate that mQTLs that exhibit regulatory pleiotropy have increased chances to impact a complex trait.
Trait-linked genetically-regulated methylated loci exhibit decreased methylation and are preferentially located in regulatory regions
To better understand the DNAm changes that contribute to mQTL impactfulness on traits, we characterized DNAm levels of trait-linked mCpGs and their overlap with open chromatin regions, as well as DNAm gains and losses attributable to increased disease-risk alleles. Compared to mCpGs without identifiable trait links, we observe an enrichment (Fisher’s exact test P < 2.2e-16, OR = 1.50) of trait-linked mCpGs in in open chromatin regions; with which 53% (1,791/3,381) of trait-linked mCpGs overlap, with e/mQTL-shared GWAS-colocalized loci exhibiting a stronger enrichment (OR = 1.96) compared to mQTL-specific loci (OR = 1.36). Trait-linked mCpGs tend to have lower DNAm levels compared to trait-agnostic mCpGs (Wilcoxon rank-sum test P < 0.05), whereas trait-linked eGenes tend to be highly expressed (Supplementary Figure 6). To characterize the pathogenicity of genetically-regulated DNAm changes, we examined the association of disease-risk alleles with the direction of mQTL effects for a subset of 31/87 disease GWAS traits, and we did not observe a global, significant (FDR < 0.05) association of disease-risk alleles with increased or decreased DNAm, either across or within traits (Methods). These results suggest that genetically-regulated DNAm loci that play a role in trait etiology correspond to lowly-methylated CpG sites in active chromatin regions, and that both gains and losses of DNAm can be pathogenic depending on the context.
Typically, variants contributing to the genetic basis of a trait are thought to act by affecting gene regulation. Concordantly, we observe an enrichment (Fisher’s exact test P < 0.05) of trait-linked mCpGs in gene regulatory elements (OR = 1.63, P < 2.2e-16), compared to mCpGs without an identified trait link; 71% (2,390/3,381) of trait-linked mCpGs fall into this category. However, mCpGs corresponding to mQTL-specific colocalizations show a divergent profile compared to e/mQTL-shared colocalizations (Fig. 6b). While e/mQTL-shared GWAS-colocalized mCpGs are depleted in distal enhancers (OR = 0.68) and enriched in gene body element regions (OR = 1.54 to 2.22), promoters (OR = 2.19) and proximal enhancers (OR = 2.12), mQTL-specific GWAS-colocalized mCpGs are enriched in both proximal (OR = 1.36) and distal (OR = 1.39) enhancers, but not in promoters. Our results suggest that distal regulatory elements play an important role in DNAm impact on genotype-phenotype associations.
Integration of trait-linked methylated loci with functional maps enables the identification of trait-associated candidate genes
To identify genes that could mediate the effect of mQTLs on human traits, we integrated mQTL-specific trait-linked mCpGs with curated promoter- and enhancer-gene target predictions 38,39 and eQTM associations generated herein (Methods). We identified 68% (1,307/1,911) of mCpGs as functionally linked to gene(s) (Supplementary Table 6). Among highly supported (by ≥ 3 mCpGs) cases, we identify both well-known gene-trait associations, such as APOB with cholesterol, TERT with cancer, and ABO with blood traits. We also identify lesser characterized trait-linked genes, such as RUNX1 with asthma and TMEM72 with red blood cell counts, with biological evidence compatible with predicted trait links (Supplementary Note). Our results suggest that integrating the mQTL-trait maps generated herein with additional functional genomic maps can enable the identification of trait-linked candidate genes.