Mapping genetic variants to regulatory elements and target genes. In this study, we compare two distinct approaches to link genetic variants highly associated with disease, with a p-value of ≤5⋅10− 8 according to the GWAS data, to regulatory elements and their target genes (Fig. 1). The first approach, schematically shown in Fig. 1A, employs the unidimensional (1D) genome structure to identify those enhancers and TF binding sites located in the linear proximity up and downstream of a SNP. In this example, a TF binding site (green shape) is found downstream from a SNP (red star) and an enhancer (orange rectangle) is found upstream. For prostate cancer, the search distance is set to 5 kbp on both sides of the SNP in order to create a SNP-centered window whose size is the same as the resolution of the Hi-C data (10 kbp). Since the resolution of the Hi-C data for breast cancer used in this study is 5 kbp, we search for regulatory elements located 2.5 kbp down and 2.5 kbp upstream of a SNP. Both regulatory elements shown in Fig. 1A affect the expression of their target genes, either indirectly through TF (blue teardrop) binding (genes G1-3, purple ovals) or directly (genes G4-6, yellow ovals). The search for regulatory elements in the linear proximity from 808 SNPs highly associated with breast cancer identified 12 TF binding sites affecting 8 genes for 59 SNPs and 51 enhancers affecting 33 genes for 50 SNPs. A similar search conducted for 13,447 SNPs highly associated with prostate cancer resulted in 247 TF binding sites affecting 180 genes for 986 SNPs and 3,851 enhancers affecting 613 genes for 7,641 SNPs.
The second approach maps SNPs highly associated with cancer at a p-value of ≤5⋅10− 8 to regulatory elements located in the spatial proximity according to the 3D genome structure. Here, we utilize highly confident intra- and inter-chromosomal contacts obtained from the Hi-C data with a q-value of ≤0.05. Specifically, for each SNP, we collected those DNA fragments containing at least one regulatory element and forming physical contacts with that SNP. Next, we selected one fragment with the lowest q-value for a contact; in case of multiple fragments having the same lowest q-value for contacts, the longest-range fragment from the SNP was picked. As shown in Fig. 2B, a DNA fragment containing a disease-associated SNP (red star) physically interacts with another fragment through a highly confident long-range contact. In this example, the interacting fragment contains a binding site (green shape) for a TF (blue teardrop) and an active enhancer (orange rectangle). Just as in the first approach utilizing the 1D linear genome, these elements regulate the expression of their target genes, G1-3 (purple ovals) and G4-6 (yellow ovals), respectively.
Following this procedure, we identified 19,240 chromatin fragments forming highly confident contacts with 808 SNPs associated with breast cancer. Selecting only one long-range chromatin contact per SNP with the lowest q-value resulted in 702 fragments containing regulatory elements. These elements include 239 enhancers having 147 target genes in contact with 702 SNPs and 83 binding sites for TFs having 70 target genes in contact with 459 SNPs. Similar to breast cancer, selecting one long-range chromatin contact per SNP with the lowest q-value from 1,952,907 chromatin fragments forming contacts with 13,447 SNPs associated with prostate cancer resulted in 13,429 contacts. Among these interactions, 13,410 contacts are between 13,410 SNPs and 3,585 enhancers with 747 target genes, and 1,387 contacts are between 1,387 SNPs and 324 binding sites for TFs with 190 target genes.
Disease association of enhancers connected to genetic variants in cancer. In order to measure the relevance of those regulatory elements affected by SNPs to a disease, a series of disease association (DA) scores were computed. For each high-risk SNP, we calculated the median DA score for mapped enhancers and TFs as well as the median DA score for target genes whose expression is regulated by these elements. The number of SNPs along with quantile and inter-quantile range (IQR) values are reported in Table 1 (enhancers) and Table 2 (TFs). The distribution of DA scores for active enhancers is presented in Fig. 2. Figure 2A shows that the median (2nd quantile) DA score of 4.80 for enhancers linked to highly associated SNPs in breast cancer in the 3D genome structure is higher compared to 2.91 in the 1D linear genome (Table 1). Similar to breast cancer, Fig. 2B and Table 1 show that the median DA score for enhancers linked to SNPs highly associated with prostate cancer is higher in 3D (4.81) than 1D (4.14). To further corroborate these results, we computed DA scores for target genes whose expression is affected by enhancers linked to genetic variants in cancer. Figure 3 and Table 1 show that the median DA scores are systematically higher in 3D compared to 1D in breast cancer (Fig. 3A, 2.4 for 1D and 5.0 for 3D) and in prostate cancer (Fig. 3B, 2.4 for 1D and 3.2 for 3D). In addition to higher DA scores, IQRs are typically smaller in the 3D genome structure compared to 1D (Table 1), for instance, the IQR for the enhancer DA score is 0.40 in 3D and 0.61 in 1D for breast cancer, and 0.43 in 3D and 0.83 in 1D for prostate cancer.
Table 1
Disease association (DA) statistics for enhancers linked to significant SNPs in breast and prostate cancer. Statistics for enhancers identified with 1D and 3D approaches include the number of SNPs used in the analysis, quantiles, and the inter-quantile range (IQR). For each cancer type, DA scores for enhancers and their target genes are reported.
Statistic
|
Breast cancer
|
Prostate cancer
|
Enhancer DA score
|
DA score for target gene
|
Enhancer DA score
|
DA score for target gene
|
1D
|
3D
|
1D
|
3D
|
1D
|
3D
|
1D
|
3D
|
# of SNPs
|
50
|
702
|
50
|
662
|
7,641
|
13,410
|
7,213
|
12,010
|
1st quantile
|
2.69
|
4.61
|
2.1
|
4.3
|
3.74
|
4.59
|
1.9
|
2.6
|
2nd quantile
|
2.91
|
4.80
|
2.4
|
5.0
|
4.14
|
4.81
|
2.4
|
3.2
|
3rd quantile
|
3.30
|
5.01
|
2.6
|
5.2
|
4.57
|
5.02
|
3.4
|
3.9
|
IQR
|
0.61
|
0.40
|
0.5
|
0.9
|
0.83
|
0.43
|
1.5
|
1.3
|
Table 2
Disease association (DA) statistics for transcription factors (TFs) linked to significant SNPs in breast and prostate cancer. Statistics for enhancers identified with 1D and 3D approaches include the number of SNPs used in the analysis, quantiles, and the inter-quantile range (IQR). For each cancer type, DA scores for enhancers and their target genes are reported.
Statistic
|
Breast cancer
|
Prostate cancer
|
TF DA score
|
DA score for target gene
|
Fraction of DA-TFsa
|
DA score for target gene
|
1D
|
3D
|
1D
|
3D
|
1D
|
3D
|
1D
|
3D
|
# of SNPs
|
59
|
459
|
42
|
210
|
986
|
1,387
|
759
|
1,387
|
1st quantile
|
2.0
|
10.0
|
2.5
|
4.3
|
0.31
|
0.65
|
1.4
|
3.0
|
2nd quantile
|
3.0
|
12.0
|
2.7
|
4.7
|
0.33
|
0.67
|
1.9
|
3.2
|
3rd quantile
|
4.0
|
12.0
|
3.0
|
4.9
|
0.34
|
0.69
|
3.0
|
3.7
|
IQR
|
2.0
|
2.0
|
0.5
|
0.6
|
0.03
|
0.04
|
1.6
|
0.7
|
a Fraction of disease-associated TFs within a set of all TFs linked to significant SNPs.
Disease association of transcription factors connected to genetic variants in cancer. Next, we analyze the disease association of TFs linked to SNPs highly associated with cancer (Fig. 4) and their target genes (Fig. 5) with statistics reported in Table 2. The distribution of DA scores for TFs mapped to breast cancer is presented in Fig. 4A. Here, the median DA score of 12.0 in the 3D genome structure is higher than 3.0 in 1D. In addition, Fig. 5A shows that the median DA score of TF target genes is also higher in 3D (4.7) compared to 1D (2.7). In the absence of numerical DA scores for TFs linked to SNPs highly associated with prostate cancer, we conducted the analysis using the fraction of disease-associated TF (Fig. 4B). On average, about two-thirds of TF mapped to SNPs in 3D are disease-associated, whereas this fraction is only one-third in 1D. Further, Fig. 5B shows that the median DA score of the corresponding target genes is higher in 3D (3.2) than in 1D (1.9). In contrast to active enhancers, IQRs for TFs are similar between 1D and 3D, except for the distribution of DA scores for TF target genes in prostate cancer (1.6 and 0.7, respectively, Table 2).
Examples of gene regulation through chromatin interactions in breast cancer. We present several case studies in order to illustrate the significance of the 3D genome structure in linking genetic variation to gene regulation in breast cancer (Fig. 6). The locations of genomic elements discussed in this section and their association with breast cancer are provided in Supplementary Tables S1-S3. The first example is mitogen-activated protein kinase kinase kinase 1 (MAP3K1), a serine/threonine kinase known to play an important role in different functions of the cell [58, 59]. MAP3K1 can be activated by different stimuli, such as cytokines and growth factors, that constitute a complex signaling network controlling a diverse array of cellular functions [60]. In addition to numerous studies focused on somatic mutations in MAP3K1 [61–63], GWAS revealed associations between SNPs, including rs7714232 and rs16886272 regulating the expression of MAP3K1, and breast cancer [64, 65]. Further, multiple transcription factors, such as ER-α, FOXA1 and GATA3, were shown to upregulate the expression of MAP3K1 through long range chromatin interactions [64]. Figure 6A shows that rs7714232 at position Chr5:56,011,357 is in contact with a chromatin fragment containing an active enhancer 119861, and rs16886272 at position Chr5:56,067,434 is in contact with a fragment containing a putative binding site for transcription factor GATA3 predicted with a p-value of 2.2⋅10− 5. Enhancer 119861 is associated with breast cancer at a p-value of 2.4⋅10− 5 and GATA3 has a high disease association score of 5.9. MAP3K1, which itself has a high disease association score of 5.3, is a target gene for both regulatory elements. Thus, rs7714232 and rs16886272 may indirectly affect the expression of MAP3K1 through physical interactions with an active enhancer and a transcription factor binding site.
Fibroblast growth factor receptor 2 (FGFR2) belonging to the receptor tyrosine kinase family mediates the cellular signaling and plays important roles in the developmental induction, cell growth and differentiation, and cell fate [66–68]. Several studies reported the association between mutations affecting FGFR2 and breast cancer [69, 70]. For example, multiple SNPs located in the second intron of FGFR2 cause the increased expression of FGFR2 linked to cancer progression [71]. Another study reported an association between the expression of FGFR2 and the number of breast tumor initiating cells [72]. GWAS data revealed the association between FGFR2 genetic variants and the risk of breast cancer [69, 73], for instance, rs4752575 was shown to alter the expression of FGFR2 leading to the increased susceptibility to breast cancer [74]. Figure 6B shows that rs4752575 located at position Chr10:123,407,187 physically interacts with a chromatin fragment containing multiple transcription factor binding sites, including a putative binding site for forkhead box protein A1 (FOXA1) predicted with a p-value of 4.5⋅10− 4. FOXA1 is highly associated with breast cancer with a score of 5.8 and was identified as one of the master regulators of FGFR2 [75]. According to these data, we propose a new model explaining the high association of rs4752575 with breast cancer at a p-value of 5.5⋅10− 9. Specifically, rs4752575 may dysregulate the expression of FGFR2 through the chromatin interaction with the binding site for FOXA1, a master regulator of the FGFR2 gene.
Chromodomain Y-like protein 2 (CDYL2) is a putative epigenetic factor belonging to a family of proteins characterized by the presence of N-terminal chromodomain that binds methylated histones H3K9 and H3K27 [76–78]. CDYL2 has been identified as either a tumor suppressor or oncogene depending on the cancer type [79, 80]. Moreover, CDYL2 was reported to be overexpressed in breast cancer supporting its role in disease progression [81]. The transcript variants of CDYL2 were found to be differently associated with breast cancer, suggesting a new therapeutic strategy targeting specific CDYL2 isoforms [82]. Several genetic variants related to CDYL2 have been identified by GWAS to be associated with breast cancer progression and development, including rs13329835 found in the intergenic region of CDYL2 gene [83, 84]. Another variant, rs9940301, at position Chr16:80,641,906 is strongly associated with breast cancer progression in women of African ancestry with a p-value of 2.0⋅10− 9 [85]. According to the Hi-C data (Fig. 6C), rs9940301 is in contact with a chromatin fragment containing three putative enhancers, 2317260, 2317262, and 2317263, all associated with breast cancer with a p-value of 9.0⋅10− 6. These enhancers activate the expression of the CDYL2 gene suggesting a new association mechanism between rs9940301 and breast cancer through physical interactions with enhancers regulating CDYL2 gene expression.
Examples of gene regulation through chromatin interactions in prostate cancer. Figure 7 presents selected cases demonstrating how the genetic variation in prostate cancer can be linked to gene regulation by analyzing the 3D genome structure. The locations of genomic elements discussed in this section and their association with prostate cancer are provided in Supplementary Tables S1-S3. Androgen receptor (AR) is a master regulator belonging to the nuclear receptor superfamily [86]. It acts as a transcription factor binding a specific ligand molecule to control the expression of targeted genes [87]. Prostate function primarily depends on the androgen signaling axis through the regulation of AR target genes [88]. AR is highly associated with prostate cancer with a score of 8.0, which is consistent with observations that it is often overexpressed in prostate cancer [89] and mutations in the AR gene are present in a large population of castration-resistant prostate cancer patients [90, 91]. The GWAS data revealed that numerous SNPs near the AR locus are associated with prostate cancer [92, 93]. For instance, rs6152 is located in the first exon of the AR gene and it is associated with a susceptibility to prostate cancer at a p-value of 1.5⋅10− 12 [94, 95]. We also found that rs6152 at position ChrX:66,765,627 forms a contact with a chromatin fragment containing an active enhancer 2765787 associated with prostate cancer at a p-value of 1.6⋅10− 2 and affecting the expression of AR (Fig. 7A). Thus, the physical interaction between rs6152 and an enhancer may play a role in the regulation of AR gene expression during prostate cancer progression.
Octamer-binding transcription factor 4 (OCT4), a member of the POU domain-containing family of transcription factors, is expressed in embryonic and adult stem cells [96]. Although six different pseudogenes identified for OCT4 are not expressed, these elements are believed to play a role in the regulation of OCT4 expression [97, 98]. Interestingly, two of these OCT4 pseudogenes, POU5F1P5 and POU5F1B, were found to be transcribed in cancer cells [99]. POU5F1B was shown to be overexpressed in gastric cancer and its knockdown confirmed a role for POU5F1B in the promotion of tumor cell growth [100]. Further, the methylation level near the POU5F1B gene [101] and the genetic variation around that region [102] were found to be associated with the prostate cancer risk. For instance, rs6983267 was reported to be in linkage disequilibrium with the open-reading frame of the POU5F1B gene among people of European ancestry and associated with the expression of POU5F1B in prostate of white subjects [103]. Located at position Chr8:128,413,305, rs6983267 is associated with prostate cancer at a p-value of 2.8⋅10− 141. Figure 7B shows that it is also in contact with a chromatin fragment containing multiple putative transcription factor binding sites including a confidently predicted binding site for AR with a p-value of 3.5⋅10− 4, which regulates the expression of POU5F1B. According to these findings, rs6983267 may play a role in regulating PO5F1B expression by affecting the binding of AR to its binding site.
EH domain-binding protein 1 (EHBP1) gene encodes Eps15 homology domain binding protein playing a role in endocytic trafficking [104]. Recently, GWAS reported the association of a genetic variant rs721048, located within one of the introns of the EHBP1 gene, and the susceptibility to prostate cancer [28, 105] with a p-value of 5.0⋅10− 22. Interestingly, Fig. 7C shows that rs721048 at position Chr2:63,131,731 forms a contact with a chromatin fragment containing an active enhancer 406774 associated with prostate cancer at a p-value of 1.5⋅10− 4 that regulates the expression of EHBP1. Further, enhancer 406774 also regulates the expression of orthodenticle homeobox 1 (OTX1), a transcription factor playing a critical role in multiple developmental processes, such as the neuronal differentiation [106]. Several studies reported the hypermethylation of the OTX1 gene promoter region in non-small lung cancer [107, 108] and an altered OTX1 expression in medulloblastoma and other brain tumors [109]. It is important to note that the expression of OTX1 is also associated with prostate cancer risk [110]. Further, OTX1 is one of several transcription factors involved in tumor-specific enhancer networks and it was found to be linked to active enhancers in prostate adenocarcinoma [111]. Our data suggest that rs721048 may be associated with prostate cancer through the disruption of the mechanism of action of certain tumor-specific enhancers causing the dysregulation of the expression of OTX1 and EHBP1 genes.
Mapping genetic variants to topologically associating domains. Next, TADs were identified from the Hi-C data and all regulatory elements and SNPs were mapped to these domains as shown in Fig. 8. Based on the presence of SNPs highly associated with cancer, the resulting TADs are divided into two groups, TADs containing no SNPs (control, Fig. 8A) and TADs containing at least one SNP with a p-value of ≤5⋅10− 8 according to the GWAS data (SNP-rich, Fig. 8B). Specifically, we identified the total of 21,157 TADs from the Hi-C data for breast cancer, including 30 TADs enriched with disease-associated SNPs. Among 30 SNP-rich TADs, 26 also contain enhancers (477 in total) and 17 contain TF binding sites (36 in total). The control dataset for breast cancer comprises 16,473 TADs containing 259,780 enhancers and 10,463 TADs containing 23,890 binding sites for TFs. In addition, the total of 17,435 TADs were detected from the Hi-C data for prostate cancer, including 304 TADs enriched with disease-associated SNPs; 291 of these SNP-rich TADs contain enhancers (7,940 in total) and 241 contain TF binding sites (686 in total). The control dataset for prostate cancer comprises 13,587 TADs containing 250,789 enhancers and 10,070 TADs containing 22,562 binding sites for TFs.
We first take a glance at selected genetic variants highly associated with breast (4 SNPs) and prostate (3 SNPs) cancers discussed above in order to determine whether gene regulatory elements located in their spatial proximity according to the Hi-C data reside in the same TAD. Interestingly, this holds true in almost all cases. Both variants rs7714232 and rs16886272 associated with breast cancer, enhancer 119861, and a binding site for transcription factor GATA3 are located in TAD 6450 whose boundaries are Chr5:56,010,000 – Chr5:56,140,000. Further, variant rs9940301 along with all three enhancers, 2317260, 2317262, and 2317263, reside in the same TAD 17447 (Chr16:80,630,000 – Chr16:80,950,000). Variant rs4752575 and FOXA1 binding site belong to neighboring TADs 12498 (Chr10:123,360,000 – Chr10:123,460,000) and 12497 (Chr10:123,330,000 – Chr10:123,360,000), respectively. In prostate cancer, TAD 1830 (Chr2:63,010,000 – Chr2:63,160,000) contains both rs721048 and enhancer 406774, TAD 8618 (Chr8:128,410,000 – Chr8:128,490,000) contains both rs6983267 and AR binding site, and TAD 16953 (ChrX:66,530,000 – ChrX:67,270,000) contains both rs6152 and enhancer 2765787. On that account, we expect TADs enriched in disease-associated SNPs to also contain those gene regulatory elements having higher disease association compared to control TADs, which is investigated in the following section.
Association of SNPs and regulatory elements with cancer in the context of TADs. In order to verify the assertion that those TADs carrying genetic variants highly associated with cancer also contain gene regulatory elements whose disease association is high, we first calculated DA scores for enhancers located within control and SNP-rich TADs. The distribution of DA scores is shown in Fig. 9 with the corresponding statistics reported in Table 3. Compared to a median DA score of 4.75 for control TADs in breast cancer, SNP-rich TADs contain enhancers with a higher median DA score of 5.66 (Fig. 9A). A median DA score of 5.66 for enhancers located in SNP-rich TADs is higher than a value of 4.69 for those belonging to control TADs in prostate cancer as well (Fig. 9B). Similar to enhancers, Fig. 10 shows the distribution of DA scores computed for TFs residing in SNP-rich and control TADs with the corresponding statistics reported in Table 3. In breast cancer, the median DA score of 11.5 for TFs located in control TADs is lower than a value of 17.0 for those present in SNP-rich TADs, whereas in prostate cancer, the fraction of disease-associated TFs in SNP-rich TADs is twice as high as in control TADs (Fig. 10B). Further, Table 3 shows that IQRs for enhancers and TFs are very similar between SNP-rich and control TADs in both cancers. These results demonstrate that disease-associated genetic variants and gene regulatory elements indeed tend co-localize within certain TADs identified based on the 3D genome structure.
Table 3
Disease association (DA) statistics for enhancers and transcription factors (TFs) within TADs in breast and prostate cancer. TADs are divided into two groups, containing no SNPs with a significant association to disease (control) and those containing at least one SNP with a significant disease association (SNP-rich). Statistics include the number of TADs used in the analysis, quantiles, and the inter-quantile range (IQR).
Statistic
|
Breast cancer
|
Prostate cancer
|
Enhancer DA score
|
TF DA score
|
Enhancer DA score
|
Fraction of DA-TFsa
|
Control
|
SNP-rich
|
Control
|
SNP-rich
|
Control
|
SNP-rich
|
Control
|
SNP-rich
|
# of TADs
|
16,473
|
26
|
10,463
|
17
|
13,587
|
291
|
10,070
|
241
|
1st quantile
|
4.55
|
5.42
|
10.5
|
17.0
|
4.50
|
5.42
|
0.32
|
0.66
|
2nd quantile
|
4.75
|
5.66
|
11.5
|
17.0
|
4.69
|
5.66
|
0.33
|
0.67
|
3rd quantile
|
4.90
|
5.80
|
12.5
|
18.0
|
4.87
|
5.77
|
0.35
|
0.69
|
IQR
|
0.35
|
0.38
|
2.0
|
1.0
|
0.37
|
0.35
|
0.03
|
0.03
|
a Fraction of disease-associated TFs within TADs.