Overall Co-Methylation Patterns
We determine co-methylated CG pairs to be those with a correlation coefficient greater than or equal to 0.8. This cutoff is used in a previous publication [11]. There are C2272990= 37,261,633,555 possible pairs in both tumor and normal data, of which 298,194,565 in tumor and 794,262,245 in normal are co-methylated (i.e., highly correlated based on the 0.8 cutoff value). These give proportions of 0.80% and 2.13% in tumor and normal respectively, such that normal samples have roughly 2.7 times the high correlations in tumor samples, see Table 1.
Table 1
CG pairs with a high correlation level.
| Total CG Sites | Total CG Pairs | Pairs with |Correlation| ≥ 0.8 | Proportion with |Correlation| ≥ 0.8 |
Normal | 272990 | 37261633555 | 794262245 | 2.13% |
Tumor | 272990 | 37261633555 | 298194565 | 0.80% |
For all C2272990 possible pairs, Table 2 shows the six-number summary of the number of CG sites that each CG site is highly correlated with. The distributions are extremely skewed to the right, with most sites highly correlated with a few CG sites and a small number of sites correlated with a lot of CG sites. Note that the numbers for normal samples are generally higher than the numbers for tumor samples. Furthermore, Fig. 1A shows that the tumor dataset has more CG sites with a number of correlations under 100 compared to the normal dataset. The normal dataset has more in the 101 to 20,000 range. After 30,000, the tumor dataset drops to 0 (as shown in Table 2, the maximum in the tumor dataset is 27,996 correlations), so the normal dataset has more correlations in that range. A bar graph with more groups can be seen in the Supplemental Fig. 1 in the Additional File 1.
Table 2
Summary for the number of CG sites each CG site is highly correlated with.
| Min | Q1 | Median | Mean | Q3 | Max |
Normal | 0 | 0 | 84 | 5819 | 4375 | 42787 |
Tumor | 0 | 0 | 3 | 2185 | 50 | 27996 |
Q1 and Q3 mean 25th and 75th percentiles respectively. |
Figure 1B is a scatterplot that compares the number of high correlations each specific CG site has in the tumor data and the normal data. The blue line is for scale only; it has a slope of 1. As we see, there is a dark line along the x-axis, which corresponds to a large number of CG sites having few highly correlated partners in tumor but many such partners in normal. This pattern also exists along the y-axis, though to a much lesser extent. In addition, there is a large clump in the top right-hand corner, indicating that many CG sites have high correlations with a lot of sites in both normal and tumor data. We also note that the plot has a surprisingly well-defined border; the number of points appears to drop significantly at around x = 42,000 and y = 27,000, which are the rough maximum number of CG sites a specific CG site can highly correlate with in normal and tumor respectively.
Co-Methylation Signs and Chromosome Patterns
We further examine the overall correlation patterns based on if the two CG sites in each pair are on the same or different chromosomes and if the correlation is positive or negative. The overall statistics are in Table 3. We see that the vast majority of CG pairs are on different chromosomes (about 94% for both tumor and normal) and are positively correlated (93.4% for normal and 97.19% for tumor). Note that there are two striking patterns. First, although the percentages in normal and tumor are similar, the number of pairs can be dramatically different. For example, for the pairs on the same chromosome, there are 45.2 million in the normal dataset and 17.5 million in the tumor dataset. That is, the number of normal pairs is about 2.5 times the number of tumor pairs. Second, for the pairs with negative correlations, normal samples have more negative pairs than tumor samples (6.6% for normal vs. 2.81% for tumor), and the two-proportion test p-value is < 2.2e-16, which shows that tumor and normal datasets are significantly different. For the pairs with positive correlations, tumor samples seem to have a relatively larger percentage (93.4% for normal and 97.19% for tumor), but the total number of positive pairs is much less than the number in the normal data (741.9 million for normal and 289.8 million for tumor).
Table 3
Co-methylated CG pairs on the same/different chromosomes and with positive/negative correlations.
| Total Pairs | Pairs on Same Chr. | Pairs on Diff. Chr. | Negative Pairs | Positive Pairs |
Normal | 794,262,245 | 45,249,327 (5.70%) | 749,012,918 (94.30%) | 52,400,697 (6.60%) | 741,861,548 (93.40%) |
Tumor | 298,194,565 | 17,490,793 (5.87%) | 280,703,772 (94.13%) | 8,391,729 (2.81%) | 289,802,836 (97.19%) |
For the highly correlated pairs that are on the same chromosome, we examine the distribution of the distances between the CG sites in each pair in terms of base pairs (see Fig. 2). Due to the fact that there are almost three times as many total normal pairs as there are tumor pairs, every interval of ten million base pairs has more normal pairs than tumor pairs. However, the distribution of distances between co-methylating CG sites is relatively similar. We find an inverse association between co-methylation and genomic distance; that is, co-methylation is more likely to occur between CG pairs that are located close to each other.
A further analysis shows that the median distance among all pairs on the same chromosome is 39,617,206 for tumor and 41,737,711 for normal, which is about a 2 million base difference (see Table 4). We perform t-tests on the distances between highly correlated CG sites for each chromosome to compare tumor samples with normal samples. Test results show that there is a significant difference (with extremely small p-values close to 0) for all chromosomes, though this finding is likely due to very large distances and a large number of sites (see Supplemental Table 1 in the Additional File 1). We therefore also examine distances between pairs for each single chromosome using side by side boxplots for tumor and normal data. The vast majority of chromosomes exhibit distance patterns similar to that of the overall data. That is, although the mean or median differences are relatively large, the boxplots do not clearly illustrate this difference because there is a large range as seen in the whole genome data. However, ChrX exhibits a significantly different pattern, where the median distance between tumor CG pairs is much smaller than the median distance between normal CG pairs, meaning that the tumor CG pairs are concentrated more closely together than the normal CG pairs. The overall shape of the histograms of distances between correlated CG site pairs on the same chromosome appears to be mostly the same between normal and tumor data, with only the ChrX distances being notably different. Later, we further study the co-methylation patterns on ChrX separately, and the results are summarized in the subsection titled “Chromosome X.”
Table 4
Summary of the distances between co-methylated CG pairs on the same chromosome.
| Min | Q1 | Median | Mean | Q3 | Max |
Tumor | 2 | 14111897 | 39617206 | 54455914 | 80313554 | 248046750 |
Normal | 2 | 15498757 | 41737711 | 57065415 | 84514828 | 248094980 |
We also plot the number of CG pairs on the same or different chromosomes, see the top two plots of Supplemental Fig. 2 in the Additional File 1. In these figures, the dots appear to form lines, suggesting that a lot of the CG sites have the same ratio of same to different chromosome correlations. We also see that the slope of the lines is much less than one, which is expected as the ratio of same to different chromosome correlations is very low. In order to see the pattern clearly, we plot log2((diff + 1)/(same + 1)) as shown in Fig. 3A, where “diff” means the number of highly correlated CG partners on different chromosomes and “same” means the number of CG partners on the same chromosome. For the tumor dataset, we can see that the median value is at 0, while in the normal dataset, the median is greater than 0. This indicates that in the tumor dataset, the number of same and different chromosome pairs are more likely to be equal to each other, while in the normal dataset, there tends to be more CG pairs that are on different chromosomes. The tumor dataset also has a noticeably larger number of outliers than the normal dataset, indicating more heterogeneity within the tumor dataset.
As for the positive/negative correlation patterns, there appear to be a lot of CG sites with a high number of negative correlations or positive correlations in either normal or tumor, but not both, as shown in Supplemental Fig. 2 (in the Additional File 1) bottom plots’ horizontal and vertical axes. In the normal graph, there is also a fairly large clump of points with a significant number of both negative and positive correlations. In addition, we also plot log2(number of positive correlations + 1) and log2(number of negative correlations + 1) for both the tumor and normal data in Fig. 3B and 3C. For the negative correlations, the tumor dataset has a much larger number of outliers, which again indicates more heterogeneity within the tumor dataset. When looking at the positive correlations, there are also more outliers in the tumor dataset than in the normal dataset. In the positive correlation boxplots, the median of the normal dataset is also noticeably higher than the median of the tumor dataset. This difference suggests that the CG sites in the normal dataset tend to form more positive correlations than the sites in the tumor dataset.
We find that the percentage of negative correlations is significantly different between normal (6.6%) and tumor (2.8%) as shown in Table 3. We then identify the following CG sites: A. CG sites that only have negative correlations with other CG sites; B. CG sites that have more negative than positive correlations with other sites. We identify these two lists of CG sites in normal and tumor data separately and compare them; see the two top Venn diagrams in Fig. 4, i.e., A and B. Figure 4A shows that there are 621 CG sites in the normal dataset that only have negative correlations, and there are 665 CG sites in the tumor dataset that only have negative correlations. There are 10 CG sites out of > 600 CG sites that are overlapped between the tumor and normal. This finding tells us that different sets of CG sites in the tumor and normal datasets play certain roles by negatively correlating with other CG sites or genes. Figure 4B shows that there are 4,391 CG sites in the normal and 7,363 CG sites in the tumor that have more negative correlations than positive correlations. There are 1,612 CG sites that overlap. Unlike the data shown in Fig. 4A, there seems to be a larger difference between the normal and tumor dataset when looking at the number of CG sites that have more negative than positive correlations.
We also compare positive correlations between the normal (93.40%) and tumor (97.19%) datasets. Figure 4C shows that there are 112,895 CG sites in the tumor dataset that only have positive correlations and that there are 76,202 CG sites in the normal dataset that only have positive correlations. There are 39,368 CG sites that are overlapped between the tumor and normal data, and these overlapping CG sites make up 51.66% of the CG sites that only have positive correlations in the normal dataset and 34.87% of the CG sites that only have positive correlations in the tumor dataset. This is unlike the data shown in Fig. 4A, which shows a much smaller percentage of overlapping CG sites that only have negative correlations. Figure 4D shows that there are 173,075 CG sites in the tumor dataset that have more positive than negative correlations, and there are 194,129 CG sites in the normal dataset that have more positive than negative correlations. There are 135,948 overlapping CG sites.
Highly Changing CG Pairs
To further investigate the relationship between the tumor and normal samples, we examine how the correlation coefficient of each CG pair changes by comparing the normal with the tumor samples. Following the example of a previous publication [17], we split the correlation coefficient values, all between − 1 and 1, into 8 intervals: [-1, -0.75), [-0.75, -0.50), [-0.50, -0.25), [-0.25 ,0), [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1). The number of CG pairs that fall within certain intervals in the tumor and normal datasets are recorded in Table 5.
The sum of all the cells in the table is 272,990 x 272,899 / 2 = 37,249,349,005. The percentages refer to the number of pairs in the cell out of the total number of pairs in the row. For example, the 8.53% in the top-left-most cell means that 9,047,541 out of 106,055,622 CG pairs (i.e., 8.53%) have a correlation within [-1, -0.75) in both tumor and normal datasets. Most of the CG pair correlation values do not change much between the normal and the tumor dataset, especially when the correlation value in the normal dataset is in the range [-0.5, 0.5]. The largest CG percentages for each row, those being 20%-50%, are either found to have normal and tumor correlation values that are within the same range or values that are separated by 1–2 intervals. The percentages that are > 20% are in bold.
Table 5
The number of CG pairs whose correlation coefficients fall within certain intervals.
| Tumor [-1, -0.75) | Tumor [-0.75, -0.50) | Tumor [-0.50, -0.25) | Tumor [-0.25, 0) | Tumor [0, 0.25) | Tumor [0.25, 0.50) | Tumor [0.50, 0.75) | Tumor [0.75, 1) |
Normal [-1, -0.75) | 9047541 (8.530940%) | 13685670 (12.904238%) | 21792226 (20.547922%) | 34659212 (32.680221%) | 22030844 (20.772915%) | 4588346 (4.326358%) | 249385 (0.235145%) | 2398 (0.002261%) |
Normal [-0.75, -0.50) | 4532553 (0.483943%) | 43143743 (4.606482%) | 139239351 (14.866663%) | 343557340 (36.681809%) | 325165891 (34.718144%) | 76117608 (8.127120%) | 4773392 (0.509658%) | 57889 (0.006181%) |
Normal [-0.50, -0.25) | 950340 (0.026118%) | 30082357 (0.826755%) | 321883396 (8.846337%) | 1289397068 (35.436564%) | 1551479465 (42.639387%) | 419567332 (11.530990%) | 24952455 (0.685770%) | 293958 (0.008079%) |
Normal [-0.25, 0) | 392148 (0.004785%) | 17342879 (0.211636%) | 385387947 (4.702910%) | 2635151893 (32.156900%) | 3855226940 (47.045542%) | 1221099246 (14.901140%) | 79112772 (0.965417%) | 956285 (0.011670%) |
Normal [0, 0.25) | 198005 (0.001766%) | 11092196 (0.098906%) | 312990583 (2.790851%) | 2905279677 (25.905581%) | 5662488508 (50.490855%) | 2156699767 (19.230700%) | 163940452 (1.461812%) | 2190125 (0.019529%) |
Normal [0.25, 0.50) | 70324 (0.000822%) | 4876128 (0.056974%) | 150373427 (1.756992%) | 1709044540 (19.968805%) | 4212522101 (49.219918%) | 2235224229 (26.116789%) | 240956009 (2.815376%) | 5505226 (0.064324%) |
Normal [0.50, 0.75) | 16019 (0.000457%) | 1236027 (0.035233%) | 30003186 (0.855234%) | 451563452 (12.871708%) | 1493350206 (42.567591%) | 1177953830 (33.577292%) | 304867905 (8.690187%) | 49195266 (1.402299%) |
Normal [0.75, 1) | 1171 (0.000106%) | 101571 (0.009200%) | 2299244 (0.208250%) | 36610029 (3.315896%) | 223487199 (20.242003%) | 303425075 (27.482251%) | 212705866 (19.265501%) | 325446342 (29.476793%) |
There are 3,569 highly changing CG pairs that cross 7 intervals. Among these CG pairs, 2,398 of them change from very high negative correlations (i.e., [-1.0, -0.75)) in the normal data to very high positive correlations (i.e., [0.75, 1.0]) in the tumor data, and 1,171 CG pairs change from very high positive correlations ([0.75, 1.0]) in the normal data to very high negative correlations ([-1.0, -0.75)) in the tumor data. Further examination of these 3,569 CG pairs reveals that there are 1,443 unique CG sites that are involved in the negative to positive changes and 822 unique CG sites that are involved in the positive to negative changes. The union of these two sets has only 1,880 unique CG sites, so there is a considerable overlap; that is, 385 CG sites are involved in the changes in both directions. We then separate the 1,880 unique CG sites involved in the highly changing CG pairs into three different groups as shown in Table 6. 385 CG sites that are involved in both the positive to negative and negative to positive changes form the "both.direction" group. 437 CG sites that are only involved in the positive to negative changes form the "uniq.pos2neg" group. Lastly, 1085 CG sites that are only involved in the negative to positive changes form the "uniq.neg2pos" group. We will study the DM patterns of these CG sites in the next section.
Table 6
CG sites involved in co-methylation changes between normal and tumors and their DM status.
| #CG Sites | #DM | %DM |
both.direction | 385 | 50 | 12.99% |
uniq.pos2neg | 437 | 47 | 10.76% |
uniq.neg2pos | 1058 | 101 | 9.55% |
All | 272990 | 23361 | 8.56% |
Differential Methylation
Next, we investigate whether there is any relationship between co-methylation and differential methylation. In particular, we study how many of the CG sites sorted into the three groups mentioned previously are DM, meaning there is a significant difference between their normal and tumor methylation levels. We perform paired t-tests on the 53 methylation levels between normal and tumor for all 272,990 CG sites. To be considered a DM site, the p-value of the t-test must be < 0.05, and the absolute mean difference value must be > 0.2. Information on the number of CG sites involved in the co-methylation changes and the number of DM CG sites can be found in the last two columns of Table 6. Among all the highly-changing CG sites, 9.55%-12.99% are DM sites. These percentages are larger than the overall DM rate, which is 8.56%. The CG sites highly changed in the both.direction group have a very large difference compared with the average DM rate (12.99% vs. 8.56%). We also look at the number of DM CG sites by chromosome (See Supplemental Fig. 3 and Supplemental Table 2 in the Additional File 1). For Chr1 to Chr22, the percentages of DM CG sites fall within the 5.91%-11.75% range. ChrX, however, only has 2.22% DM sites, marking another way in which ChrX differs from the other chromosomes.
In addition to examining the CG sites whose correlations switch from positive to negative and vice versa, we also study the relationship between DM CG sites and the number of CG sites they are highly correlated with, see Supplemental Table 3 in the Additional File 1 and Fig. 5. Supplemental Table 3 shows that among all the 272,990 CG sites, 23,361 (8.56%) are DM. We see that being DM is somewhat associated with the pattern of the number of high correlations. For example, for Tumor (0.2) (that is, the group with 0.2 as the mean difference cutoff value), if being DM is independent from the number of correlations, we would expect each category in the Tumor (0.2) row to be about 8.56% of their respective categories in the Tumor (all) row. However, some categories are much lower than expected—e.g., 5k-9999 and 10k+.
To view the patterns in Supplemental Table 3 (in the Additional File 1) clearly, we plot the data presented in this table in Fig. 5. For each of the different mean difference estimate values (0.2, 0.3, 0.4), we plot the number of normal and tumor correlations for each of the intervals in Supplemental Table 3. The number of tumor correlations seems to spike in the 1–99 interval for all plots, and the number drastically drops for the intervals beyond 1–99. This change is not that apparent in the normal data. When comparing the tumor data with the normal data in the 1–99 interval, we also see the tumor data and normal data have the largest percentage difference. For example, in the 1–99 category, that is 62.26% (tumor) vs. 37.51% (normal) when using DM sites selected without a cutoff, as shown in Supplemental Table 3 in the Additional File 1.
Locations of CG sites
The locations of CG sites are important. Next, we conduct analysis on the locations of different sets of CG sites. There are six location categories: Open_Sea, Island, N_Shelf, N_Shore, S_Shelf, and S_Shore. These correspond to not being associated with a CpG island (i.e., Open_Sea), being on a CpG island, being on a north shelf, being on a north shore, being on a south shelf, and being on a south shore of a CpG island respectively. As for the locations of the original 476,947 CG sites and our selected 272,990 sites, their distributions are shown in Table 7 columns 2 and 3. As we see, the largest category is Open_Sea (about 35%), followed by CpG island (about 30%). North and South regions are around the same, with shore being about 2.5 times the value of shelf. We also conduct this analysis on the highly changing CG site datasets (both.direction, uniq.pos2neg, uniq.neg2pos) as shown in Table 7 columns 4–6. The majority of highly changing sites are mainly in the Open_Sea (40.3% − 55%) and Islands (11.6% -26.1%). The Chi-square test shows that there is a significant association between the types of CG sites (both.direction, uniq.neg2pos, uniq.pos2neg) and the location of these CG sites. For example, when comparing the uniq.neg2pos and uniq.pos2neg groups, we can see that the locations of these two groups are different. The uniq.neg2pos has a larger percentage of CG sites in the Open_Sea than the uniq.pos2neg group (55% vs. 40.3%); it has much smaller percentage of CG sites in the Island than the uniq.pos2neg group (11.6% vs. 26.1%). The Chi-squared test of comparing these groups gives a test statistic of 59.17, with a degree freedom = 5 and a p-value = 1.804e-11.
Table 7
The locations of different CG sets.
| All | Filtered | both.direction | uniq.neg2pos | uniq.pos2neg | Normal Top 100 | Tumor Top 100 |
Open_Sea | 170901 (35.8%) | 97551 (35.7%) | 183 (47.5%) | 582 (55.0%) | 176 (40.3%) | 4 (4%) | 2 (2%) |
Island | 148332 (31.1%) | 81789 (30.0%) | 76 (19.7%) | 123 (11.6%) | 114 (26.1%) | 60 (60%) | 70 (70%) |
N_Shelf | 23109 (4.8%) | 11906 (4.4%) | 17 (4.4%) | 51 (4.8%) | 17 (3.9%) | 0 (0%) | 0 (0%) |
N_Shore | 58427 (12.3%) | 36421 (13.3%) | 43 (11.2%) | 124 (11.7%) | 59 (13.5%) | 16 (16%) | 15 (15%) |
S_Shelf | 22788 (4.8%) | 11667 (4.3%) | 14 (3.6%) | 54 (5.1%) | 13 (3.0%) | 0 (0%) | 1 (1%) |
S_Shore | 53390 (11.2%) | 33656 (12.3%) | 52 (13.5%) | 124 (11.7%) | 58 (13.3%) | 20 (20%) | 12 (12%) |
Total | 476947 (100%) | 272990 (100%) | 385 (100%) | 1058 (100%) | 437 (100%) | 100 (100%) | 100 (100%) |
The first column is the location. The second column is for all 476947 CG sites. The third column is for the 272990 CG sites selected for our co-methylation analysis. The fourth to sixth column are for three types of highly changing sites. The last two columns are for super-connector CG sites that are highly co-methylated with other sites.
In addition, we also obtain the top 100 super-connector CG sites in both normal and tumor datasets and analyze their locations as shown in Table 7 columns 7 and 8. The majority of these super-connector sites are on islands (60% for normal, 70% for tumor) or shores (36% for normal, 27% for tumor). We then compare the distribution of the top 100 super-connector CG sites’ locations in each dataset with the overall distribution of all 272,990 CG sites. Since some of the cells have 0 CG sites, we compare specific cells with two-sample tests for equality of proportions. For example, doing this for the Open_Sea category between the overall data and normal top 100 super-connector gives a p-value of 7.17e-11 (35.7% vs. 4.0%), and for the Island category, it gives a p-value of 1.14e-10 (30% vs. 60.0%). Furthermore, the locations of super-connector CG sites are also very different from the locations of highly changing sites. For example, as for the Open_Sea region or category, 40.3% − 55% of the highly changing sites are there, but only 2%-4% of the super-connector sites are there. As for the Island region, only 11.6%-26.1% of the highly changing sites are there, but 60%-70% of the super-connector sites are there. In summary, highly changing sites and the super-connector sites have significantly different locations from each other and from the locations of the CG sites in the whole Illumina 450K dataset as well.
Induced Network Modules
We study the relationship of the genes associated with CG sites listed in Table 7 using the software ConsensusPathDB (CPDB) [18–20]. We will first discuss the both.direction.DM, uniq.pos2neg.DM, and uniq.neg2pos.DM sets, which are the sets that consist of 50, 47, and 101 CG sites, respectively. The 50 sites in both.direction.DM are mapped to 45 distinct gene symbols, which are then plugged into CPDB, resulting in the network graph in Fig. 6. Note, the legend in the bottom of this figure is also for other CPDB figures (Fig. 7–10). To avoid redundancy and save space, we do not include this legend in the other figures.
In Fig. 6, the RPTOR, CSF1R, and AGO2 genes are of particular interest. They are hub genes with the most interactions. RPTOR (Regulatory Associated Protein Of MTOR Complex 1) is a protein associated with tuberous sclerosis 1 and tuberous sclerosis. CSF1R is associated with leukoencephalopathy, hereditary diffuse, with spheroids and brain abnormalities, neurodegeneration, and dysosteosclerosis. AGO2 is associated with Lessel-Kreienkamp Syndrome and colorectal cancer. The 47 sites in the uniq.pos2neg.DM set are mapped to 47 gene symbols, see Fig. 7. In this figure, CUX1 and KRT8 are hub genes. Aberrant expression of KRT8 is associated with multiple tumor progression and metastasis; so has CUX1 [21]. Finally, the uniq.neg2pos.DM set consists of 101 CG sites that are mapped to 107 gene symbols, see Fig. 8. In this figure, MCM7, HDAC4, BIRC5, BIRC6, G3BP1, and CUX1 appear to have the most connections. These particular genes are interesting. BIRC6 is involved in regulating apoptosis and p53. Both BIRC5 and BIRC6 have been linked to cancer in other publications [22–28]. MCM7 has been linked to prostate cancer in particular and esophageal squamous cell carcinomas [29, 30]. HDAC4 is also related to cancer [31], as is G3BP1 [32]. As seen in the Fig. 6 image legend, most interactions are protein-protein interactions, with some gene interaction submodules, such as CSF1R, FOXP1, PAX5, ETS, and MYB in Fig. 6. Most entities are protein (cyan), with a few genes (indigo), protein complexes (greenish blue), and RNAs (orange).
We next examine those CG sites that are co-methylated with a very large number of other CG sites. We extract the top 100 CG sites that are co-methylated with the most other CG sites in the tumor data as well as the 100 CG sites that are co-methylated with the most other CG sites in the normal data. There is no overlap between these two sets of CG sites. We then find the genes associated with these sets of CG sites separately and perform the induced network module analysis on these two gene lists, see Figs. 9 and 10.
In the tumor “top100” list (Fig. 9), TAF1 and HNF4A are the main hub genes. In the normal “top100” list (Fig. 10), TAF1, MAX, and MYC genes are the most significant hub genes. Therefore, TAF1 is a hub gene in both the normal and tumor lists, while HNF4A, MAX, and MYC are not. TAF1 (TATA-Box Binding Protein Associated Factor 1) is associated with X-linked intellectual disabilities and X-linked dystonia [33]. TAF1 also phosphorylates p53 on Thr55 [34]. HNF4A (Hepatocyte Nuclear Factor 4 Alpha) is associated with Type 1 Maturity-Onset Diabetes of the Young [35]. HNF4A is also a potential marker for distinguishing between primary gastric cancer and metastatic breast cancer [36]. MYC is a proto-oncogene involved in cell cycle progression and apoptosis; amplification of MYC is observed in numerous human cancers, including breast cancer [37–40]. MAX is the associated factor X of MYC (MAX and MYC together form a protein complex that is a transcriptional activator) and is associated with pheochromocytoma [33]. Note that all of these hub genes are all proteins and also are all intermediate nodes, meaning they are added by the CPDB and are not part of the original input gene lists. It is likely that there are certain genetic or epigenetic changes on such hub genes, and they affect the genes in our co-methylation lists. The majority of their connections are gene regulatory interactions (see light blue lines).
In addition to looking at the highly changing CG sites and highly correlated CG sites, we also investigate the 419 differentially methylated sites in the tumor and normal dataset separately. These 419 sites are selected using p-value < 0.05 and mean difference > 0.4. Among these 419 DM sites, we then focus on the CG sites highly correlated with at least 1 other CG site that are only in either normal or tumor data. There are 109 and 29 such CG sites in normal and tumor data respectively. We then conduct the network analysis using the CPDB; see Supplemental Fig. 4 for normal and Supplement Fig. 5 for tumor in the Additional File 1. These two figures show that there are far more connected genes present in the normal dataset, while the genes associated with the tumor dataset do not show many connections. In the normal dataset, the genes PCDHGA5, PCDHGB4, NKX2-1, SKI, RUNX1, sabp4_human, and RARA seem to be the major hub genes. PCDHGA5 is a protein coding gene associated with Wolf-Hirschhorn Syndrome, which is caused by the deletion of a region of chromosome 4. In regards to endometrial cancer, it is also identified as a deregulated gene with different methylation patterns [41]. PCDHGB4 is also a protein coding gene that is identified as a potential passenger gene in a study related to endometrial cancer, and novel mutations of the gene are only found in tumor samples [42]. NKX2-1 is found to be inversely associated with p53 and KRAS mutations [43]. SKI is found to be a negative prognostic marker in the early stages of colorectal cancer [44]. RUNX1 is thought to have a role in breast cancer and endometrial cancer, and reduced levels of it creates an environment which supports tumor growth [45]. When overexpressed, RARA is found to be associated with worse survival rates in colorectal cancer patients [46]. The tumor dataset does not have any notable hub genes. In the normal dataset, the majority of the interactions are protein interactions, which are represented by the orange lines. In the tumor dataset, all of the interactions are protein interactions. Additionally, most of the hub genes exist in the network as proteins.
Chromosome X
In the previous section, when we study the overall co-methylation pattern regarding the distance between highly correlated CG sites, we find that ChrX has an outstanding pattern when comparing normal with tumor (see Fig. 11A). That is, the median distance between tumor pairs is much smaller than the median distance between normal pairs, meaning that the tumor pairs are concentrated more closely together than the normal pairs. Due to this finding, we further examine the highly correlated pairs located on ChrX. The ChrX tumor dataset has a much greater percentage of co-methylated sites located very close together than the ChrX normal dataset: 44.7% of tumor pairs are located within 10 million bp of each other compared to only 17.3% of normal pairs, see the horizontal line in Fig. 11A. This is a statistically significant difference with the two-proportion test p-value < 2.2e-16. Furthermore, as shown in Table 8, while the maximum absolute distances for the ChrX normal pairs and tumor pairs are very similar, the tumor pairs are concentrated at very close distances to each other, so the median distance between highly correlating sites is about three times larger in the normal data than in the tumor data—that is, 46,352,309 bp (for normal) vs. 15,134,665 bp (for tumor). Note that the ChrX pattern shown in Fig. 11B is very different from the overall pattern of the whole genome as shown in Fig. 2.
Table 8
Summary of absolute distances between co-methylated CG pairs on ChrX.
| Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
ChrX Tumor | 2 | 1382456 | 15134665 | 27195461.17 | 37411823 | 151495608 |
ChrX Normal | 2 | 17983484.25 | 46352309 | 52772051.89 | 85721278.75 | 152125211 |
For each CG site on ChrX, we also examine the number of CG sites that this site is highly correlated with in the tumor and normal data separately, see Fig. 12C. This figure shows that over 50% of ChrX sites do not highly co-methylate with any other CG sites, with the normal data having a slightly higher proportion of such sites than tumor. This proportion is much lower in the full dataset, with around 35% for tumor and an even smaller proportion, about 26%, for normal, as shown in Fig. 1A. However, in both the ChrX data and full data, we observe that there is a considerably higher proportion of tumor sites than normal sites that are highly correlated with 1 to 100 other CG sites, and a higher proportion of normal sites than tumor sites that are highly correlated with more than 100 other sites. Note, the exception to this observation is in the 20,001–30,000 range as illustrated in Fig. 1, with more tumor CG sites than normal CG sites falling into this category, though this is not the case for the ChrX-only data. In summary, we find the co-methylation pattern in ChrX is very different from the autosomes or the whole genome.
Table 9
Comparing three co-methylation studies.
| 2013 study by Akulenko and Helms | 2020 study by Mallona et al. | Our study |
Tissue/Cancer | Breast cancer | Colon cancer | Breast cancer |
Data type | Illumina 27K (27500) | Illumina 450K (485577) | Illumina 450K (485577) |
CGs or genes used | 13133 genes | ~ 300000 CG sites | 272990 CG sites |
Analysis unit | Gene | CG (or probe) | CG (or probe) |
Sample size | 317 tumor and 27 adjacent normal | (1) 90 tumor and 90 adjacent normal (2) 256 tumor and 38 adjacent normal | 53 tumor and 53 adjacent normal |
Tumor vs. normal | Combined | Separated & compared | Separated & compared |
Relationship with genomic distance | Yes | Yes | Yes |
Gene or CG pairs on the same chromosome | 74/187 gene pairs | No specific result | 45.2 million CG pairs for normal; 17.5 million CG pairs for tumor |
Negative correlation | No | Yes | Yes |
Number of high correlation partners | No | Yes | Yes |
ChrX co-methylation | No | No | Yes |
Highly changing sites | No | No | Yes |
Relationship with DM sites | No | No | Yes |
Comparing with other related studies
Next, we compare our study with the two most relevant co-methylation studies. One is for breast cancer [11] and another is for colon cancer [47], see Table 9. The first several rows of this table show how the three studies are similar or different regarding tissue/cancer type, data size, analysis unit, tumor/normal sample used and so on. One key similarity of these three studies is that we all analyze the relationship between co-methylation and genetic distance. We all find a weak negative correlation between the co-methylation and genomic distance for CG pairs on the same chromosome. As for the tumor and normal samples, Akulenko and Helms do not separate them in their analysis, so no comparison is done [11]. Mallona et al. compare tumor and normal data and find that they have different co-methylation patterns although their analysis and our study are conducted from different perspectives and with different approaches [47]. In addition, we conduct analyses in five additional aspects that Akulenko and Helms do not do: negative correlation, number of high correlation partners (or co-methylation degree), ChrX co-methylation, highly changing sites, and relationships with DM sites. Furthermore, Akulenko and Helms report 74 out of 187 co-methylated gene pairs on the same chromosome [11]; we report 45.2 million (i.e., 5.7%) CG pairs for normal and 17.5 million (i.e., 5.87%) CG pairs for tumor on the same chromosome. Finally, Akulenko and Helms’ study is based on the Illumina 27K data, which is about 10 times less than the data we used. Therefore, our findings show a bigger and clearer picture of co-methylation patterns as more CG sites are used.