To have a comprehensive list of ncRNAs, we used a combined list of ncRNAs provided by the FANTOMCAT [24] and Ensembl [25] consortia (see method section). This includes ncRNAs from different types inclusive of pseudogene (22.9%), lncRNA intergenic (21.4%), long intergenic ncRNAs (5.6%), lncRNA divergent (13.4%), antisense (3.3%), lncRNA sense intronic (6%), miRNA (5%), misc RNA (3%), lncRNA antisense (4.8%). A full list of ncRNAs provided in Supplementary figure S1a. Somatic mutations from 19 cancer types were downloaded from ICGC [3], including 1,855 breast cancer samples containing 17,163,482 single point somatic mutations and 10,460 samples with other cancers containing 67,752,271 somatic point mutations inside the ncRNA regions.
Background model to identify significant non-coding RNAs in breast cancer
To identify the ncRNAs that are significantly mutated in breast cancer samples, we counted the number of samples with somatic mutations in ncRNA from 1,855 breast cancer samples and 10,460 samples with other cancers. We then calculated a P-value for each ncRNA using Fisher’s exact test (see method section). To identify significantly mutated ncRNAs, we calculated P-values for 1,000,000 random permutations of breast/non-breast labels for each ncRNA and estimated the probability that an association emerges by chance (see method section). This provides a threshold for each ncRNA separately (Figure 2). As a result, we identified 929 ncRNAs (99% confidence interval – see method section) that significantly mutated in breast cancer samples (Supplementary table S1). Looking into our candidate ncRNAs revealed that 27.2% of them are lncRNAs intergenic, 18.1% pseudogene, 3.8% long intergenic non-coding RNAs (lincRNA), 20.8% lncRNA divergent, 2.5% antisense, and 6.6% lncRNA sense intronic (Supplementary figure S1b).
Enrichment of regulatory features in the BC-associated ncRNAs
To determine if the somatic mutations are enriched in ncRNAs with regulatory function, we first examined the enrichment of HMEC-related chromatin states provided by the ENCODE consortium within our significant ncRNAs. As Figure 3a shows, both ENCODE promoters and enhancers have been significantly enriched within our candidate ncRNA genes (3.23 and 2.26 times with P-values 4.12e-29 and 5.28e-32 enrichment for promoters and enhancers, respectively), suggesting breast cancer de novo somatic mutations are enriched in ncRNAs with enhancer and/or promoter like functions.
We also investigated these enrichments in the 2nd and 3rd sets of ncRNAs (each set contains 929 ncRNAs) with the best mutational P-values in breast cancer and the last set of ncRNAs with worse mutational P-values (see method section). As Figure 3a shows, there is the same trend of enrichment (but at much lower level) for ChromHMM predicted promoters and enhancers in the 2nd and 3rd sets of most mutated ncRNAs in breast cancer (Figure 3a). Interestingly, there is no such trend for the last set of ncRNAs (those ncRNAs with no de novo mutation in breast cancer samples), supporting our hypothesis that de novo somatic mutations are enriched in enhancer-like ncRNAs. We provided an annotated list of candidate ncRNAs with ChromHMM in Supplementary table S2.
The FANTOM5 consortium has released lists of transcribed human promoters and enhancers and tissue-specific transcribed enhancers of humans using CAGE (Cap Analysis of Gene Expression [26]) to study cell-type-specific enhancers. Therefore, we investigated the enrichment of FANTOM5 promoters and enhancers that overlap with the significant ncRNAs identified in this study. Figure 3b shows that both FANTOM5 promoters and enhancers are enriched in the candidate ncRNAs (1.66 and 1.76 times (P-value 3.59e-34 and 3.68e-27) enrichment for FANTOM5 promoters and enhancers, respectively). In other words, 52.4% of candidate ncRNAs overlapped with FANTOM5 promoters, and 36.4% of them overlapped with FANTOM5 enhancers; however, only 34.9% and 23.5% of all ncRNAs overlapped with FANTOM5 promoters and enhancers, respectively. Performing the same analysis on FANTOM5 mammary-specific enhancers demonstrated that the proportion of candidate ncRNAs that overlap with differentially expressed enhancers in the mammary epithelial cell is 3.84 times (P-value 4.03e-05) more than the genome-wide expectation (Figure 3b). There is also the same trend for 2nd and 3rd sets of ncRNAs with the best mutational P-values in breast cancer.
An annotated list of candidate ncRNAs with FANTOM5 annotations is provided in Supplementary table S3. For example, the pseudogene NKAPP1 is differentially expressed in ABL1/ABL2 knockdown (shAA) breast cancer-associated cell lines [27] and downregulated in breast cancer [28]. It is also a biomarker associated with breast cancer prognosis [29, 30]. Our analysis demonstrated NKAPP1 as one of the most significant ncRNAs with a P-value of 3.43e-06. This non-coding gene also overlapped with both FANTOM5 enhancer and ChromHMM predicted enhancer and promoter. CATG00000062386 is another ncRNA gene that is significantly mutated in breast cancer samples. This FANTOMCAT specific ncRNA overlapped with FANTOM5 and ChromHMM enhancers and FANTOM5 mammary epithelial cell differentially expressed enhancers, indicating a potential enhancer role for this ncRNA in breast cancer.
Histone modifications H3K27ac and H3K4me1, CTCF binding sites, and DNase hypersensitive sites are significantly enriched for BC-associated ncRNAs
We next investigated the enrichment of HMEC-specific chromatin histone active marks (e.g., H3K27ac and H3K4me1), CTCF binding sites, and DNase hypersensitive sites in the candidate ncRNAs identified in this study. These active chromatin marks are involved in many processes, including transcriptional regulation that regulates gene expression [31]. Our investigation of histone active marks demonstrated that both histone marks H3K27ac and H3K4me1 are significantly enriched 2.1 times (P-value 3.65e-57) and 1.6 times (1.90e-42) for H3K27ac and H3K4me1, respectively, in the candidate ncRNAs (Figure 4a). As these histone marks act as transcriptional activation and are found in both enhancer and promoter regions, our results suggest that our candidate list of ncRNAs is important for the transcriptional process in breast tissue. We performed the same analysis on histone marks H3K27me3, which is involved in the repression of transcription. Interestingly, we did not see significant enrichment (1.15 times with P-value 5.14e-02) for histone H3K27me3 within our BC-associated ncRNAs (Figure 4a).
Transcription factors CTCF function as a transcriptional activator, repressor, insulator, or pausing transcription. In addition to CTCF sites, DNase hypersensitive sites (DHS) also have key roles in gene regulation as regulatory element markers [32]. Both CTCF and DNase are functionally related to transcriptional activity and are necessary to regulate chromatin structure. Here, we choose three CTCF ChIP-seq experiments (HMEC+Broad+CTCF, HMEC+Broad+EZH2 and HMEC+UW+CTCF) and three DHS ChIP-seq experiments (MCF-7, MCF-7+Hypoxia_LacAcid, T-47D), all related to the breast cancer. We calculated the enrichment for the four sets of ncRNAs, including significant, second, and third sets of ncRNAs with the best mutational P-values and the last set of ncRNAs with the lowest mutational P-values. Our assessment of CTCF binding and DNase hypersensitive sites also demonstrated that both CTCF (2.45 times on average with P-value 7.42e-35) and DNase (1.8 times on average with P-value 7.23e-50) are significantly enriched for our candidate ncRNA genes (Figure 4b). Having such significant enrichment for CTCF binding sites and DNase accessible sites is strong evidence that our significant set is not chosen randomly and is related to the gene regulation process. There is also the same trend for the 2nd and 3rd sets of ncRNAs (Figure 4b). However, the first set's enrichments are much higher than the 2nd and 3rd sets of most frequently mutated ncRNAs. As Figure 4b shows, there is no enrichment for the last set of ncRNAs, suggesting that these ncRNAs may not be involved in the transcriptional regulation.
For example, LINC00535 is an antisense non-coding gene that is known to be associated with breast cancer [33]. LINC00894 is another non-coding gene that is the most downregulated lncRNA in MCF-7/TamR cells [34]. These ncRNAs are significantly mutated in breast cancer samples with P-values 4.21e-3 and 1.53e-11, respectively. LINC00152 is another example of ncRNAs with a substantial role in enhancing breast cancer, which causes inactivation of the BRCA1/PTEN by DNA methyltransferases as tumorigenesis, mainly in triple-negative breast cancer (TNBC) [35]. All these ncRNAs overlapped with ENCODE predicted enhancers, histone 27 acetylation, CTCF binding sites, and DNase hypersensitive sites, suggesting a potential transcriptional regulatory role for these ncRNAs. An annotated list of candidate ncRNAs with these features is provided in Supplementary tables S4 and S5.
BC-associated GWAS SNPs are significantly enriched in the candidate non-coding RNAs
Multiple genome-wide association studies identified disease-associated genes and their respective pathways, which provided a comprehensive understanding of the disease's etiology. It has been reported that more than 93% of disease-associated variations found by GWAS are located in the non-coding regulatory regions of genomes [36], suggesting non-coding regulatory regions are relevant to disease and genetic mutations in gene regulatory regions is a significant mechanical contributor to diseases. To examine the enrichment of BC-associated GWAS SNPs in the candidate ncRNAs, we extracted the BC-associated SNPs from a pooled list of two GWAS datasets from the EBI GWAS Catalog [37] and GWASdb v2 from the Wang Lab [38] (for more details see the method section). As Figure 5a shows, BC-associated GWAS SNPs are significantly enriched (P-value 2.3e-27) in the candidate ncRNAs. This enrichment is much higher for those ncRNAs that contain more than 4 GWAS SNPs (> 10 times enrichment). Interestingly, when we performed the enrichment analysis for ncRNAs with more than 5 GWAS SNPs, only the candidate list of ncRNAs shows enrichment for GWAS SNPs (> 20 times), and there is no enrichment in the 2nd and 3rd sets of ncRNAs (Figure 5a). Performing the same analysis on lung cancer-associated GWAS SNPs did not show such enrichment for our candidate ncRNAs (Figure 5a), indicating the candidate ncRNAs are relevant to breast cancer. For example, candidate ncRNAs RP11-353N4.6 are known to carry breast cancer associated GWAS SNPs [39]. More details on the annotated list of ncRNAs with BC-related GWAS SNPs can be found in Supplementary table S6.
BC-associated non-coding RNAs have a significantly higher fraction of eQTL polymorphisms
Expression quantitative trait loci are genomic locations that influence gene expression in disease-related tissue and are important for understanding its mechanisms. To assess the level of eQTL polymorphisms in the candidate non-coding genes, we calculated the enrichment of breast mammary tissue eQTL polymorphisms downloaded from GTEx consortia [40] in the candidate ncRNAs. Our analysis revealed that breast mammary tissue eQTL polymorphisms are significantly enriched (1.77 times with P-value 4.11e-20) in the candidate ncRNAs. Interestingly, this enrichment is much higher for those ncRNAs that contain more than three eQTL polymorphisms (> 2x enrichment), and such an increase has not been seen in the 2nd and 3rd sets of ncRNAs (Figure 5b and Supplementary table S6). As a control, we calculated the enrichment of lung-specific eQTL polymorphisms in the candidate ncRNAs. As Figure 5b shows, the enrichment of lung-specific eQTL polymorphisms in the breast-associated ncRNAs is much lower than the enrichment observed for mammary tissue eQTL polymorphisms (Figure 5b). For example, lncRNA RP11‐37B2.1 and RP11-426C22.5 are two of our candidate ncRNAs with significant P-values 7.91e-4 and 6.72e-06, respectively. These ncRNAs encompass 232 and 111 eQTL polymorphisms, respectively. lncRNA RP11‐37B2.1 influences the risk of tuberculosis and the possible correlation with adverse drug reactions (ADRs) from tuberculosis treatment [41], and RP11-426C22.5 is downregulated in SW1990/GZ Cells [42]. Both RP11‐37B2.1 and RP11-426C22.5 overlapped with histone active mark H3K27ac, ChromHMM potent enhancer, and DNase hypersensitive sites, suggesting a potential transcription regulation role of these ncRNAs.
Chromosome conformation capture data shows a potential regulatory role for BC-associated non-coding RNAs
High-throughput chromosome conformation capture (Hi-C) based assays have been used to successfully identify regulatory regions and targets of disease-associated variations [43, 44]. To gain a further understanding of the genes that our candidate ncRNAs are interacting with, we analyzed two publicly available Hi-C datasets from HMEC obtained from the Rao et al. study [45]. Here, we used MHiC [46] and MaxHiC [47] to analyze Hi-C raw data and identify statistically significant interactions, respectively. We identified 188,982 statistically significant interactions (P-value < 0.01 and read-count ≥ 10 – see method section) in the Hi-C library 1. For 6,187 of the interactions (%3.3), one side of the interaction overlapped with at least one candidate ncRNA and another side of the interaction overlapped with protein-coding genes (promoter regions of coding genes – see method section; Supplementary table S7). Repeating this analysis on the second Hi-C library also identified 318,034 statistically significant interactions. For 9,879 of the interactions (3.1%), one end of the interactions overlapped with the candidate ncRNAs and another end overlapped with the promoter region of protein-coding genes (Supplementary table S7). We identified 1,167 common significant interactions between the two libraries where one side of the interaction overlaps with at least one candidate ncRNA (Supplementary table S8) and another side with protein-coding genes. In other words, for 226 ncRNAs out of a pool of 929 candidate ncRNAs (24%), there was at least one significant interaction in both the libraries (Supplementary table S8); this is significantly higher (1.74 times; P-value 4.61e-11) than all ncRNAs that overlapped with one side of the interactions in both libraries (14%).
Interestingly, for 757 significant ncRNAs (%82), we identified at least one interaction in either Hi-C library 1 or library 2, resulting in 21,564 interactions. For 19,674 out of 21,564 interactions (Supplementary table S9), one end of the interaction that encompasses candidate ncRNAs overlapped with either ENCODE HMM predicted enhancer or active histone mark H3K27ac (both presented in HMEC). This observation suggests a potential enhancer role for these ncRNAs; In many cases, another end of the interactions overlap with protein-coding genes, including cancer-associated genes (Supplementary table S9). We provided a prioritized list of candidate ncRNAs that interacted with cancer-associated protein-coding genes in both the Hi-C libraries and overlapped with ENCODE HMM predicted enhancers and active histone mark H3K27ac (Table 1). For example, MYC is a BC-associated protein-coding gene, acting as a transcription factor. In common with three other transcription factors (POU5F1, SOX2, and KLF4), it can induce epigenetic reprogramming of somatic cells to an embryonic pluripotent state [48]. We found significant Hi-C interactions between MYC and two of our candidate ncRNAs CASC8 and PVT1 in both Hi-C libraries. PVT1 is a known enhancer for MYC [49], however, CASC8 has not yet been identified as a putative enhancer for MYC. There are ~200kb genomic distances between the CASC8 and transcription start site (TSS) of MYC in which there is no protein-coding gene that overlaps with CASC8. There are strong signals of histone active marks and ENCODE predicted enhancers, and most importantly, FANTOM5 breast differentially expressed enhancer overlapped with CASC8, suggesting a potential enhancer region in CASC8 (Figure 6). Interestingly, there is no somatic mutation or BC-associated GWAS SNPs overlap with MYC, however, CASC8 is significantly mutated in breast cancer samples, and most importantly, it encompasses ten BC-associated GWAS SNPs. Altogether, this evidence may indicate a putative enhancer role of CASC8 for MYC.
Another example is non-coding RNA CATG00000061359 is a FANTOMCAT-specific intergenic lncRNA that is significantly mutated in breast cancer samples (P-value 5.32e-4). There is a significant Hi-C interaction between CATG00000061359 and gene GTPBP8 (previously associated with breast cancer [50]) in both the libraries. We also found a breast tissue-associated eQTL polymorphism in this lncRNA that influences the expression of GTPBP8 in breast tissue. Interestingly, both HMEC specific H3K27ac and HMM predicted enhancer overlap with this lncRNA. This may suggest that variation in CATG00000061359 may influence the expression of GTPBP8in breast cancer. We have provided an annotated list of candidate ncRNAs with Hi-C interactions in Supplementary table S9.