Overall DNAm pattern in cancers
Since there were a higher number of primary cancers than normal solid tissues in the TCGA database assessed on the Illumina Infinium Human Methylation 450 BeadChip array, we selected 10 cancer cohorts that had more than 20 matched normal solid tissue samples [BLCA, BRCA, HNSC, KIRC, KIRP, LIHC, LUAD, LUSC, THCA, and UCEC], resulting in 4501 primary cancers and 598 normal solid tissues.
Next, we projected all 5,099 samples using the T-distributed stochastic neighbor embedding (t-SNE) method to reveal the overall DNAm pattern. The primary cancers were significantly more clustered with the corresponding normal samples from the same tissues, irrespective of cancer cohorts, confirming the distinct pattern of cell-of-origin in cancer (Fig 1A). Next, we screened gene promoters across 22 chromosomes to reveal DNAm variables. The hypermethylated CGI were frequently located at gene promoters and presented a state of instability in primary cancers, which was usually associated with suppression of tumor suppressor genes and incomplete differentiation, directly linking DNAm changes to oncogenic transformation (Fig 1B). We also plotted the average DNAm level in the gene region. Consistent with previous findings, in normal solid tissues, the gene-body regions exhibited higher methylation levels compared with 5' upstream regions. It supported the hypothesis that genomic regions involved in active transcription were hypomethylated to facilitate the accessibility of transcription factors (TFs). The methylation level peaked at the first exon near the transcription start site (TSS) and the transcription termination site (TTS), and then there was a sharp decrease in TTS (Fig 1C). The distribution of methylation levels was very similar across different cancer cohorts. For the 10 cancer cohorts, although we did not observe a significant decrease in methylation levels of the gene region (i.e., 5'UTR, coding region, 3'UTR), the distribution of methylation levels was generally similar but in a different shape to that in normal solid tissues. Specifically, the peaks at the first exon near TSS or TTS were almost absent in cancers, suggesting the transcriptional “barriers” were significantly weakened. Conversely, the methylation level in the 5' upstream regions was elevated in cancers (Fig 1D). We hypothesized that it constituted a mechanism of blocking post-transcriptional regulation in primary cancer. Similar to the well-known genomic instability of cancer, we found that the patterns of methylation levels were not that highly accordant with each other as observed in the normal solid tissues, suggesting methylation heterogeneity in different cancers.
Patterns of methylation instability in cancers
Genomic instability is a hallmark of cancer; it includes a variety of DNA alterations, including point mutations, indels, structural variations, or chromosome rearrangements. Since the previously observed methylation heterogeneity could contribute to genomic instability through the methylation-induced somatic mutations; therefore, we investigated if the DNAm of cancers also displayed patterns of instability.
For considering the 10 cancer cohorts jointly, we first computed the coefficient of variation (CV) across all primary cancers and normal solid tissues, respectively, for a single CpG for each cancer cohort, followed by a paired t-test (P-value < 0.05 for 10 cancer cohorts) to access the statistical significance between the two CVs. We found that the CV values of primary cancers were significantly higher than that of normal solid tissues (Fig 2A). As expected, there were variations in the overall DNAm levels by Levene's test per CpG per the cancer cohort, and this result further confirmed that DNAm presented different variations in different cancer cohorts (Fig 2B). For each gene set, we used the same method as above and obtained similar results (Figs 2C-D). To visualize the relative methylation variables in primary cancers compared with normal solid tissues, we plotted the Z-score normalized log2FC for each CpG in the form of a histogram. The observed distribution was significantly biased toward high or low values to the expected standard normal distribution (Kolmogorov–Smirnov test, P-value < 0.05 for all 10 cancer cohorts), indicating substantial methylation variations in all cancer cohorts (Fig 2E for BLCA and S1 Fig for other nine cancer cohorts). Next, we respectively fitted the Gaussian distribution in means of each CpG of primary cancers and means of each CpG of normal solid tissues to visualize the relative methylation instability in primary cancers. We found that the variance of means of each CpG of primary cancers was significantly higher than that of normal solid tissues (Fig 2F for BLCA and S2 Fig for other nine cancer cohorts). Similarly, most primary cancers exhibited high numbers of both hyper- and hypomethylated whole-genome CpGs (Fig 2G for BLCA and S3 Fig for other nine cancer cohorts). We referred to these primary cancers, characterized by DNA methylation instability (DMI), to distinguish them from cases where only hyper- or hypomethylation occurred. We scored the extent of DMI for each primary cancer by combining the percentages of hyper- and hypomethylated loci and found that most of the individual primary cancers were located in more than 25 % of hyper- and hypomethylated loci, which concluded that there existed a different level of instability in most individual primary cancers.
Patterns of common methylation in cancers
Using the Wilcoxon rank-sum test for all 10 cancer cohorts, we tested the DMCs between primary cancers and normal solid tissues, with the threshold set to |log2FC| ³ 1 and P-value< 0.05. The DMCs ranged from 1,860 in THCA to 28,587 in UCEC. Among the DMCs, we found that 138 CpGs were shared by all 10 cancer cohorts with the same DNAm tendency (hypo- or hypermethylation) (Fig 3A and S1 Table). When the tissue specificity was removed, primary cancers were more clustered and significantly separated from corresponding normal samples from the same tissue, confirming the presence of a common pattern in different primary cancers (Fig 3B). Similarly, most of the 138 CpGs preferred hypermethylation in primary cancers and hypomethylation in normal solid tissues, and when the tissue specificity was removed, different primary cancers exhibited significant commonality (Fig 3C). MAGMA was used for functional enrichment analysis on 138 CpGs, and we found that nuclear chromosome segregation was the most enriched category in GO analysis (Fig 3D) and dorsoventral axis formation was the most enriched categories in KEGG analysis (Fig 3E).
Compared with scattered DMCs, excessive DMCs within a short distance were found to be more biologically relevant. Thus, we scanned the genome for the hypo-DMCs in DMC-enriched regions. Instead of using a fixed window size, we selected top DMCs as seeds and extended the region in both directions until the score of the region, which was defined as average -log(p), was a preset threshold, i.e., 95% quantile of the genome. We performed 1000 simulations by shuffling the P-value of each DMC in the whole genome to identify the significance. We observed significantly more DMRs in the real genome compared to simulations (Empirical distribution, P-value < 0.001) (Fig 4A and in S4 Fig for other nine cancer cohorts). It indicated that in most cancer cohorts, the DMR could be explained by excessive hypomethylation in cancers. To further validate the clustering tendency, we performed a Run test for randomness. A run is defined as a series of consecutive DMCs that show a higher or lower methylation level in primary cancers. As expected, we observed significantly fewer runs than that from a random process, confirming continuous hypomethylation in cancers across the genome (S2 Table).
For DMRs of six different cancer cohorts (BLCA, HNSC, LIHC, LUAD, LUSC, and UCEC) that ensured enough sample cohorts and acquired a larger intersection (Fig 4B), we found eight common DMRs in cancer (S3 Table). MAGMA was used to annotate the hallmark, GO, and KEGG in eight common DMRs, and the hallmark enrichment indicated that the formation of angiogenesis and coagulation was common in cancer progression, and these were also associated with DNA repair and fatty acid metabolism (Fig 4C). Functional enrichment analysis performed on eight common DMRs found motor activity as the most enriched category in GO analysis (Fig 4D) and found cardiac muscle contraction as the most enriched category in KEGG analysis (Fig 4E).
Patterns of differential methylation in cancers
Next, we investigated the differences in methylations in each cancer cohort and explored the feasibility of automated cancer cohort discrimination with cancer-specific DMCs. For real-world applications, too many predictors would be cost-inefficient and might cause overfitting. To avoid that, the RF algorithm was used to produce a thinned set of DMCs by iteratively removing less discriminative DMCs until reaching the top 10 DMCs in each cancer cohort (S4 Table). Using the derived set of 10 most informative DMCs, we achieved an accuracy, sensitivity, and specificity of 99% by the RF classifier for separating primary cancers and normal solid tissues for all 10 cancer cohorts. Unsupervised hierarchical clustering of the thinned set of DMCs revealed distinct methylation profiles between primary cancers and corresponding normal solid tissues in the top 10 DMCs (Fig 5A for BLCA and S5 Fig for the other nine cancer cohorts). The data was also used for visualization of the t-SNE projection (Fig 5B for BLCA and S6 Fig for the other nine cancer cohorts), which agreed with the results of heatmaps.
We further optimized the final model using binary logistic regression with backward elimination for DMCs selection using the reference cohorts. The top 10 DMCs models in each cancer cohort were subsequently applied to the independent validation cohort for evaluating the diagnostic performance. The risk-scores for individual patient in each of cancer cohorts were calculated as follows (for example BLCA: Logit (P) = - (2.67 × cg01279413) - (1.93 × cg01425409) - (1.80 × cg03304660) - (1.05 × cg04181701) - (1.43 × cg06804091) - (1.29 × cg07709358) - (1.56 × cg08313382) - (1.16 × cg11912215) - (2.43 × cg13357249) - (3.18 × cg16489193) + 7.25. S5 Table shows the other cancer cohorts.)
Next, we explored whether the DMCs in the thinned set could also be used to simultaneously distinguish 32 different cancer cohorts. Again, the RF classifier was used to perform the multi-classification. We included 99 DMCs (10 DMCs from 10 cancer cohorts and removed a duplicate cg07274716) in the classification model, with 8423 primary cancers from 32 cancer cohorts. An accuracy of 97.9% was achieved, which was 3.8% higher compared with the classification model using a full DMC set (94.1%). The classification result was also used for visualization in t-SNE projection in 3D space, which clearly showed that primary cancers could be clustered using 99 DMCs (Fig 5D). Figure 5E is 180 degrees horizontal rotation of Fig 5D. Moreover, the RF was used for iterative computation of 99 DMCs in 32 cancer cohorts and acquired more important 10 DMCs, using the top 10 DMCs with 91.2% accuracy. For each of the top 10 DMCs, they all significantly affected the survival time during 4,000 days. For example, cg13324103 showed hypermethylation in primary cancers (Fig 5A) and hypermethylation corresponded to lower survival rates in corresponding cancer cohorts (Fig 5F for cg13324103 and S7 Fig for the other nine DMCs).
A significant number of primary cancers exhibited a high frequency of both relative hyper- and hypomethylation in 99 DMCs (Fig 5A for BLCA and S2 Fig for other nine cancer cohorts). We scored the extent of DMI in one of 99 DMCs by combining the percentage of DNAm loci offsetting median > 0.5 or < 0.5, which revealed that 99 DMCs could classify 32 primary cancers, but each DMC still exhibited high instability (Fig 5C). We also used CV divided by the median as a measure of stability for 99 DMCs and found that higher stability (< median) in 99 DMCs exhibited significantly longer survival than the instability patients (Fig 5G for BLCA and S8 Fig for other nine cancer cohorts).