Delineating ubiquitously methylation resistant CG sites in blood and normal tissues
Tumor cfDNA is mixed with a background of normal and blood cfDNA in plasma in varying and unpredictable amounts[20]. HCC DNA has both hypomethylated and hypermethylated regions that differentiate it from healthy DNA[21]. We reasoned that an ideal cancer marker would be a CG position that is ubiquitously unmethylated in normal tissues and blood but is methylated exclusively in tumors, making it categorically different from any normal cfDNA in blood. DNA methylation profiles could be heterogeneous across individuals; we therefore examined whether we could identify CG positions in publicly available DNA methylation arrays that are uniformly unmethylated across all individuals and across 17 different somatic tissues. We first generated a list of 47,981 CG positions that were hypomethylated in every single individual (beta ≤0.1 and median <0.02) among 234 individuals in 17 different somatic tissues using Illumina 450K array data in GSE42752, GSE52955, GSE53051, GSE60185, GSE63704, GSE65821, GSE69852, GSE85464 and GSE85566. We then generated a list of 68,260 CG positions that were unmethylated in blood DNA in each of 312 individuals in GSE40279. We overlapped the two lists to obtain a list of CGs that are unmethylated in every single individual in both blood and 17 somatic tissues. To increase the robustness of the list and to exclude sites with residual variation in methylation across individuals derived from sex or age differences, we overlapped this list with a list of 60,379 CGs that were unmethylated in blood DNA in all 656 individuals, males and females aged 19 to 101 years (GSE40279). This overlap resulted in a final list of 28,775 CGs that are unmethylated across all individuals in multiple tissues, at different ages and in both sexes.
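The filtering and overlap steps described in this paragraph can be illustrated with a minimal Python sketch. The function name, the toy beta values and the CG identifiers are hypothetical, and applying the same thresholds (beta ≤0.1 in every individual, median <0.02) to the blood data is an assumption, since the text states these thresholds explicitly only for the somatic tissue list.

```python
from statistics import median

def unmethylated_sites(beta, max_beta=0.1, max_median=0.02):
    """Return the CG IDs whose beta value is <= max_beta in every sample
    and whose median beta across samples is < max_median."""
    return {
        cg for cg, values in beta.items()
        if max(values) <= max_beta and median(values) < max_median
    }

# Hypothetical toy data: one list of per-individual beta values per CG.
tissue_beta = {
    "cg_a": [0.01, 0.015, 0.01],
    "cg_b": [0.01, 0.30, 0.02],   # methylated in one individual: excluded
    "cg_c": [0.01, 0.01, 0.019],
}
blood_beta = {
    "cg_a": [0.005, 0.01, 0.012],
    "cg_b": [0.01, 0.01, 0.01],
    "cg_c": [0.05, 0.06, 0.09],   # median is not below 0.02: excluded
}

# Overlap of the per-dataset lists (set intersection).
resistant = unmethylated_sites(tissue_beta) & unmethylated_sites(blood_beta)
print(sorted(resistant))
```

Requiring the maximum across individuals to stay below the cutoff, rather than the mean, is what makes a site count as unmethylated in every single individual; further datasets are added by intersecting with additional sets.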
The list of ubiquitously unmethylated sites was highly enriched for CG islands (10xe-814, hypergeometric test), transcription start sites (TSS200; 7.7xe-317), 1st exon (3xe-68), 5’UTR (3.8xe-27), and Phantom high-CG-density promoters (5.22-fold enrichment, 5.6xe-395) but depleted for the north and south shores of CG islands (3xe-28 and 3xe-20), enhancers (4.47-fold depletion, 6.8xe-145), 3’UTR (13-fold depletion, 3.8xe-36) and low-CG-density Phantom promoters (2.67-fold depletion, 6.3xe-5). Thus, our list includes a highly selective group of CGs located in CG-rich promoters that are uniformly “methylation resistant” across tissues and individuals.
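As an illustration of how an enrichment of this kind can be tested, the following sketch computes a one-sided hypergeometric tail probability and a fold enrichment. The counts are toy values, not the array data, and the function names are hypothetical; the exact test configuration used for the reported values is not given in the text.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n sites are drawn without replacement from N sites,
    of which K carry the annotation (e.g. 'CG island')."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def fold_enrichment(k, N, K, n):
    """Observed annotated fraction in the list over the background fraction."""
    return (k / n) / (K / N)

# Toy numbers: 20 sites on the array, 5 annotated; a list of 4 sites
# contains 3 annotated ones.
p_value = hypergeom_upper_tail(k=3, N=20, K=5, n=4)
fold = fold_enrichment(k=3, N=20, K=5, n=4)
print(p_value, fold)
```

Depletion is tested with the lower tail, P(X <= k), in the same way. At array scale the probabilities become far smaller than a double-precision float can represent, so a log-space implementation such as `scipy.stats.hypergeom.logsf` would be a more practical choice than this exact-integer sketch.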
Discovery of 4 CG sites that classify HCC samples from healthy blood and other tissues; “HCC-detect” markers
We then tested whether any of these ubiquitously “methylation-resistant” CGs become methylated in cancer. We used a dataset of Illumina 450K DNA methylation profiles from 66 HCC samples from all stages (GSE54503) and 77 control non-HCC liver samples (fibrosis and cirrhosis) (GSE61258) to generate a list of sites that show the highest methylation differential between HCC and control liver, limiting our analysis to the 28,775 methylation-resistant CGs that we shortlisted. Remarkably, many of these ubiquitously methylation-resistant CGs were methylated in HCC samples: 286 CG positions were methylated more than 20% in at least 50% of the HCC samples. A list of the top 20 CG sites with an average difference in DNA methylation between HCC samples and non-HCC liver of above 0.2 (heatmap Fig. 1A) was further reduced by penalized regression to 4 CG sites: cg02012576 in an intergenic region associated with the Checkpoint With Forkhead And Ring Finger Domain (CHFR) gene, cg03768777 at the 1st exon of the Vasohibin 2 (VASH2) gene, cg05739190 at the 1st exon of the Cyclin-J (CCNJ) gene and cg24804544 at the body of the Glutamate Receptor, Ionotropic, Delta 2 (Grid2) Interacting Protein 1 (GRID2IP) gene. A weighted polygenic methylation score for HCC was computed by a multivariable linear regression equation based on methylation values for these 4 CG positions in the training data (Table S1). The polygenic score significantly differentiated HCC and control samples (Fig. 1B). A Receiver Operating Characteristic (ROC) curve analysis performed on the calculated polygenic scores for the HCC and control samples shows an area under the curve (AUC) of 0.9910 (Fig. 1C). Our training cohort included HCC samples from all stages with the goal of broad detection of cancer notwithstanding stage. We termed the 4-CG marker set “HCC-detect”.
We validated “HCC-detect” using 450K DNA methylation data for 739 HCC samples and 116 normal adjacent tissue (NAT) samples (heatmap Fig. 1D). We calculated the “HCC-detect” score in 450K DNA methylation array data for healthy blood (n=968), healthy liver tissue (n=15), other liver disease (n=158), other healthy tissues (n=234), other normal adjacent tissues (n=721) and cancers from 31 other tissues (n=8753), for a total of n=11704 including the HCC and HCC NAT samples (Fig. 1E) (Table 1 and Table S2). The “HCC-detect” polygenic score significantly differentiates HCC from all other groups as determined by one-way ANOVA after correction for multiple comparisons (F=793, p<0.0001, DF 11696; p<0.0001 for all comparisons) (Fig. 1E). ROC curve analysis yields an AUC of 0.99 when HCC samples (n=739) are compared to healthy blood (n=968) (sensitivity of 97% and specificity of 96%) (Fig. 1F); an AUC of 0.97 when HCC (n=739) is compared to all healthy and NAT tissues including liver (n=2212) (specificity of 95% and sensitivity of 87%) (Fig. 1G); an AUC of 0.95 when HCC samples are compared to 234 DNA methylation samples from healthy tissues (specificity of 95% and sensitivity of 85%); an AUC of 0.92 when HCC (n=739) is compared to NAT of HCC (n=116) (specificity of 94% and sensitivity of 95%); an AUC of 0.966 when HCC is compared to healthy liver tissue (specificity of 100% and sensitivity of 88%); and an AUC of 0.87 when HCC is compared to 8753 samples from 31 different types of cancer (Table S2) (specificity of 90% and sensitivity of 64%). The “HCC-detect” methylation score detects early-stage HCC samples as well as late-stage HCC (Fig. 1H). These results validate that “HCC-detect” differentiates HCC samples at all stages from healthy tissues.
Similar AUC values were obtained when equal weight was given to each CG in the detect score, assuming that methylation at any of the 4 CGs is sufficient to classify a sample as HCC, though certain CGs are methylated (>20%) in a higher fraction of HCC samples than others (59% for VASH2, 57% for CHFR, 50% for GRID2IP and 44% for CCNJ) (Table S4).
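A minimal sketch of a linear methylation score over the 4 CGs and of the AUC computed from such scores is given below. The beta values are hypothetical toy data, the equal weights mirror the equal-weight variant described above, and the regression weights of Table S1 are not reproduced here. The AUC is computed through its rank interpretation: the probability that a randomly chosen HCC sample scores higher than a randomly chosen control, with ties counted as one half.

```python
def methylation_score(betas, weights, intercept=0.0):
    """Linear score: intercept + sum(weight_i * beta_i) over the 4 CGs."""
    return intercept + sum(w * b for w, b in zip(weights, betas))

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive sample
    has the higher score; tied pairs count as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Equal weights, as in the equal-weight variant of the score.
EQUAL_WEIGHTS = [1.0, 1.0, 1.0, 1.0]

# Hypothetical beta values in the order CHFR, VASH2, CCNJ, GRID2IP.
hcc = [[0.6, 0.5, 0.4, 0.3], [0.1, 0.7, 0.0, 0.2], [0.05, 0.02, 0.03, 0.01]]
controls = [[0.01, 0.02, 0.01, 0.0], [0.03, 0.05, 0.02, 0.04]]

hcc_scores = [methylation_score(b, EQUAL_WEIGHTS) for b in hcc]
control_scores = [methylation_score(b, EQUAL_WEIGHTS) for b in controls]
toy_auc = auc(hcc_scores, control_scores)
print(round(toy_auc, 3))
```

The pairwise loop is quadratic in the number of samples; for cohorts of the size analyzed here, a rank-based routine such as `sklearn.metrics.roc_auc_score` would be the usual choice.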
Discovery of a single CG site whose methylation state classifies HCC samples correctly from tumors of different cell-type origins; “HCC-spec” marker
Several of the previously published early cancer detection DNA methylation biomarkers were not tested across different cancers. Thus, these markers might detect several different kinds of cancers as well as HCC. The “HCC-detect” score developed for HCC preferentially detects HCC amongst 31 cancers in TCGA (Fig. 2A); however, it detects other cancers as well, reducing the specificity and sensitivity of differentiating HCC from other tumors (specificity 90% and sensitivity 66%). To discover a set of markers that distinguishes tumors originating in the liver from other tumors, we used a training cohort of 240 randomly selected DNA methylation samples from TCGA representing 16 different cancers, 10 HCC samples and 10 healthy blood samples. In this case, we did not limit our search to the 28,775 methylation-resistant CGs, in order not to miss liver-specific methylated CGs.
We calculated, for each CG, the difference between the average methylation across the HCC samples and the average across all other cancers, together with its t-statistic. We shortlisted 7 CGs using a very strict threshold (delta >0.5 and adjusted p value after Bonferroni correction for multiple testing of Q<10^-20) (heatmap in Fig. 2B). A multivariable linear regression with the 7 CGs as covariates revealed that cg14126493 at the body of the F12 gene has the largest effect. A weighted methylation score for F12 computed by a linear regression equation (Table S3) classified all HCC samples correctly within a mixture of 240 samples in the training cohort. ROC curve analysis (HCC versus all other samples) revealed an AUC of 0.9973 with sensitivity of 99% and specificity of 100% (Fig. 2C). We designated cg14126493 as the “HCC-spec” marker.
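The per-CG screening step can be sketched as follows. The use of an unequal-variance (Welch) t-statistic is an assumption, since the text does not state which t-test was applied; the beta values are toy data, and the number of tests (roughly 485,000 probes on the 450K array) is an approximate figure used only for illustration.

```python
from math import sqrt
from statistics import mean, variance

def delta_and_t(hcc, other):
    """Mean methylation difference (HCC minus other) and its Welch
    t-statistic (unequal variances; an assumption of this sketch)."""
    delta = mean(hcc) - mean(other)
    se = sqrt(variance(hcc) / len(hcc) + variance(other) / len(other))
    return delta, delta / se

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p_value * n_tests)

def passes_filter(delta, q_value, min_delta=0.5, max_q=1e-20):
    """Thresholds taken from the text: delta > 0.5 and Q < 10^-20."""
    return delta > min_delta and q_value < max_q

# Hypothetical beta values for one CG.
hcc_beta = [0.80, 0.90, 0.85, 0.95]
other_beta = [0.10, 0.05, 0.15, 0.10]
delta, t_stat = delta_and_t(hcc_beta, other_beta)

# Example adjustment of a raw p value for ~485,000 tests (approximate).
q = bonferroni(1e-26, 485_000)
print(delta, t_stat, passes_filter(delta, q))
```

Converting the t-statistic into a p value requires the t distribution, which is not in the Python standard library; `scipy.stats.ttest_ind` with `equal_var=False` returns both the statistic and the p value.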
We then validated “HCC-spec” as a classifier of HCC DNA within a mixture of other tumors in a set of 11,692 samples that included 31 different cancers, non-malignant tissues and HCC samples (Table S2). “HCC-spec” is a liver-specific marker. It differentiates HCC from 31 other cancers (scatter plot in Fig. 2D). The “HCC-spec” score is significantly different between HCC and healthy blood (p<0.0001), healthy tissues (p<0.0001), normal adjacent tissue (NAT) samples from 31 cancers (p<0.0001), and 31 cancers (p<0.0001), but is not significantly different between HCC and healthy liver tissue, liver disease or HCC NAT (nonparametric Kruskal–Wallis one-way analysis of variance with Dunn’s multiple comparisons test) (Fig. 2E). The “HCC-spec” score accurately classified HCC samples (n=739) versus all other cancers (n=8754) with an AUC of 0.988 (99% specificity and 97% sensitivity) (Fig. 2F). The AUC for HCC versus normal blood (n=968) is similar, 0.981 (100% specificity and 97% sensitivity) (Fig. 2G), and the AUC for classifying HCC from healthy tissues (n=234) is 1 (100% specificity and 100% sensitivity). Remarkably, the DNA methylation level of a single CG site is sufficient to classify DNA as derived from liver tissue and not from other tissues or cancers. The “HCC-spec” score, however, does not accurately distinguish HCC DNA from other non-malignant liver DNA, as it is a liver-specific rather than a cancer-specific marker (Fig. 2E). The AUC against healthy liver tissue is 0.86 with specificity of 100% and sensitivity of 73%, and the AUC against liver disease tissues is 0.84 with specificity of 95% and sensitivity of 71%. However, by combining the “HCC-detect” score, which accurately differentiates HCC from other liver DNA, and the “HCC-spec” score, which differentiates HCC from other cancers (“HCC-detect” + “HCC-spec” = combined methylation score), we are able to accurately detect HCC DNA in a mixture of samples that included 31 other cancers, normal tissue and liver tissues.
The AUC for HCC against all other tissues combined, including 31 cancers and liver tissues, is 0.9862 with a specificity of 94% and sensitivity of 95% (Fig. 2G). At the threshold calculated by this ROC (a combined score of 0.87) the specificity against blood is 100%, against the 31 other cancers is 95%, against normal tissue is 100% and against NAT from other cancers is 98.9%. However, at this threshold other liver tissues and liver disease DNA will be detected as well, at a rate of 50%. To establish a threshold that differentiates HCC from other liver diseases we performed an ROC analysis with HCC and other liver disease; the AUC is 0.937 at a sensitivity of 87% and specificity of 95% (using a higher threshold of a combined “HCC-detect” and “HCC-spec” score of 1.1).
We compared “HCC-spec” and “HCC-detect” DNA methylation markers (Fig. 3A) to two other extremely promising sets of HCC biomarkers that were recently described [22] [19] (Fig. 3B,C) using DNA methylation values for the respective CGs in Illumina 450K arrays from the 11701 samples described above. The heat maps presented in Fig. 3 show that although previously published markers display dramatic differences in methylation between HCC and HCC-NAT samples as previously reported, there is a high background of DNA methylation across other cancers and normal tissues. The combined “HCC-detect” and “HCC-spec” markers delineated here show a categorical differentiation between high methylation in HCC and extremely low methylation in other tissues and most cancers. While two of the “HCC-detect” markers are methylated to different extent in several cancers (Fig. 3B, C), F12 is exquisitely methylated in HCC and liver disease samples but not in other cancers, normal tissues or blood (Fig. 3A).
We tested whether the 5 differentially methylated CGs discovered by analyzing DNA methylation data in Illumina 450K arrays are also differentially methylated in other publicly available methylation data derived by a different method. Wen et al. [23] examined cfDNA, tumor tissue and NAT-HCC, as well as plasma from liver cirrhosis and normal patients, by bisulfite conversion followed by methylated CpG tandem amplification and sequencing, which enriches for methylated CGs (n=191) (GSE63775). We examined the count of methylated reads in genomic regions containing each of the 5 CGs of the “HCC-detect” and “HCC-spec” markers in this data set (57 tissue samples and 94 plasma samples). The 5 genes were significantly differentially methylated in all HCC tissue samples compared to HCC-NAT and in HCC plasma samples compared to plasma from cirrhosis and normal livers (Fig. 4A, B, D, E insets), with the exception of GRID2IP, which showed a difference in methylation that did not reach significance in plasma because of the low number of reads in the serum sample (Fig. 4C inset); nevertheless, it was significantly differentially methylated in HCC tissue samples (Fig. 4C). These data confirmed in an independent data set that the “HCC-detect” and “HCC-spec” CGs are methylated in HCC and in plasma cfDNA in HCC patients.
Validation of HCC-detect and HCC-spec DNA methylation in a clinical study examining plasma cfDNA in 398 people (ClinicalTrials.gov ID: NCT03483922)
We recruited 402 people from Dhaka city in Bangladesh, including healthy controls, chronic hepatitis B patients and patients at different stages of HCC from stage 0 to stage D according to the EASL–EORTC Clinical Practice Guidelines (clinical data summary in Table 2). In contrast to DNA methylation measured in a tumor biopsy, where a significant fraction of the DNA is derived from the tumor (as in the TCGA methylation database), tumor DNA in plasma is mixed with DNA from other potential sources, and the extent of this dilution is unknown. The level of methylation of cfDNA is therefore a function of both the state of methylation of DNA in the tumor and the unknown and stochastic mixture with other DNA, and it is anticipated to be lower than what we derived from tumor DNA methylation data. However, if the methylation profile of the tumor DNA is categorically distinct from the methylation profile of other potential sources of DNA in plasma, as anticipated by the analyses above (Fig. 1-3), we expected that it would be detectable even against a high background. We used bisulfite mapping combined with next generation sequencing, which provides DNA methylation profiles at single DNA molecule resolution.
We developed a multiplexed targeted-amplification, next generation sequencing bisulfite mapping assay that measures the state of methylation of regions spanning 100 to 200 bp around the “HCC-spec” and “HCC-detect” CGs in up to 200 people in parallel. We first examined whether the 5 CG positions in the genes that we selected are differentially methylated in plasma cfDNA derived from HCC patients in comparison to plasma from healthy people or from people with chronic hepatitis B (n=402; 4 samples with fewer than 100 sequencing reads were removed from the analysis, leaving 398 informative samples). The analysis included 46 healthy controls, 49 chronic hepatitis B patients, and 302 patients with HCC at different stages (Stage 0, n=2; Stage A, n=34; Stage B, n=86; Stage C, n=106; Stage D, n=76). All “HCC-spec” and “HCC-detect” CGs were significantly more methylated in plasma from HCC patients than in plasma from healthy controls and chronic hepatitis B patients (one-way ANOVA with correction for multiple comparisons), with the exception of CCNJ, which was significantly more methylated in HCC patients than in chronic hepatitis B patients but was only nominally significant in the HCC versus control comparison. While median methylation in the control and chronic hepatitis B groups was slightly above 0%, methylation of up to 50 to 80% was noted in the HCC samples (scatter plots in Fig. 5A). Hypermethylation was noted even at early HCC stages (Fig. S1A), similar to the results obtained in the TCGA HCC dataset (Fig. 1H). These data confirmed that the “HCC-detect” and “HCC-spec” CGs selected using TCGA tumor methylation data are differentially methylated in HCC patients’ plasma.
Targeted sequencing captures the methylation state of several CGs in the proximity of the CGs that were selected using the TCGA data. We noted that in all 5 regions, hypermethylation was not limited to the 5 CGs selected in the 450K arrays and that there was a high correspondence in methylation levels between the CGs included in the “HCC-detect” and “HCC-spec” sets and proximal CGs (heatmap Fig. S1B). To evaluate the consistency of the methylation state across the regions, we computed the median methylation in the amplified region for each of the 5 genes (heatmap Fig. 5B). We used the median rather than the average to exclude situations where a high average is driven by spuriously high methylation of a single CG. Median values of percentage methylation from 0 to 100 were normalized (log2) and an “HCC-detect” M score was computed from the sum of the normalized medians of the CHFR, VASH2, CCNJ and GRID2IP regions, giving equal weight to each region. Similarly, the “HCC-spec” M score, which measures HCC specificity versus other cancers, was computed from the median methylation of the F12 region. Both the “HCC-detect” and “HCC-spec” scores significantly differentiated the HCC group from either the healthy control or chronic hepatitis B groups (one-way non-parametric ANOVA with correction for multiple comparisons) (Fig. 5C, 5E), but there was no significant difference between the chronic hepatitis B (CHB) and the healthy control groups. Importantly, both the “HCC-detect” M scores and the “HCC-spec” M scores were significantly different from the control and chronic hepatitis B groups at early and late HCC stages (Fig. 5D, F); there was no significant difference between HCC stages (one-way non-parametric ANOVA with Dunnett correction for multiple comparisons).
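The M score computation can be sketched in a few lines of Python. The pseudocount of 1 added before the log2 transform is an assumption of this sketch (the text does not say how a median of 0% is handled), and the per-CG methylation percentages below are hypothetical.

```python
from math import log2
from statistics import median

DETECT_REGIONS = ("CHFR", "VASH2", "CCNJ", "GRID2IP")

def region_m(cg_percent, pseudocount=1.0):
    """log2-normalized median methylation (0-100%) across the CGs of one
    amplified region. The pseudocount avoids log2(0) and is an assumption."""
    return log2(median(cg_percent) + pseudocount)

def hcc_detect_m(sample):
    """Sum of the normalized medians of the four regions, equally weighted."""
    return sum(region_m(sample[region]) for region in DETECT_REGIONS)

def hcc_spec_m(sample):
    """Normalized median of the F12 region."""
    return region_m(sample["F12"])

# Hypothetical per-CG methylation percentages for one plasma sample.
sample = {
    "CHFR": [0, 3, 7],
    "VASH2": [1, 1, 90],      # one spuriously high CG barely moves the median
    "CCNJ": [0, 0, 0],
    "GRID2IP": [15, 20, 10],
    "F12": [63, 50, 70],
}
print(hcc_detect_m(sample), hcc_spec_m(sample))
```

The VASH2 entry shows the rationale for the median: its mean would be about 31%, whereas the median stays at 1%.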
To examine the biomarker quality of the “HCC-detect” M score we analyzed its receiver operating characteristic. The AUC for the “HCC-detect” score (302 HCC patients and 46 controls) was 0.93, with a specificity of 91% and a sensitivity of 89% (Fig. 6A).
We used logistic regression to model the “HCC-detect” M score as a predictor of the probability of HCC (Fig. 6C) and computed a predicted probability for each person using the logistic regression equation (scatter plot Fig. 6D). The HCC samples cluster around a probability of 1; a few CHB samples are also assigned a probability of 1, while the median predicted probability of the healthy and CHB samples is around 0.5. The AUC for the “HCC-spec” M score is 0.89 with specificity of 91% and sensitivity of 72% (Fig. 6D). We computed the logistic regression equation for the “HCC-spec” M scores (Fig. 6E) and the predicted probability for each person (Fig. 6F).
We then computed a combined probability score for cancer detection and HCC specification. We computed the ROC curve of the combined probability of the “HCC-detect” and “HCC-spec” scores for 46 healthy controls and 302 HCC patients (Fig. 7A). The calculated AUC is 0.9432, the specificity is 95% and the sensitivity is 85%. A perfect combined score is 2, which indicates a predicted probability of 1 for cancer and a predicted probability of 1 that the cancer is HCC. The median score for each of the HCC stages, including early stages, approaches 2 and is significantly different from that of the healthy and chronic hepatitis B groups, which are not different from each other (non-parametric ANOVA with Dunnett correction for multiple comparisons) (Fig. 7B and a scatter plot for all individual samples in Fig. 7C). We calculated a threshold sum probability from the ROC curve (1.337) and used it to classify the samples as either HCC (1) or no HCC (0) (scatter plot Fig. 7D). This threshold accurately classifies 95% of the control samples, 75% of the Stage A samples, 84% of the Stage B samples, 82% of the Stage C samples and 94% of the Stage D samples (a heatmap presenting the classification for each of the 398 samples, HCC in red and no HCC in blue, is shown in Fig. 7E). Using this threshold, 12% of the chronic hepatitis B (CHB) samples are classified as HCC compared with 5% of controls. The higher fraction of chronic hepatitis B samples classified as HCC compared to healthy controls might reflect the increased risk of conversion of chronic hepatitis B to HCC.
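The combined score and its thresholding can be sketched as follows. The logistic regression coefficients are hypothetical placeholders (the fitted equations appear only in the figures), the M scores are toy values, and treating a score equal to the threshold as HCC is an assumption; only the threshold of 1.337 and the 0 to 2 range of the combined score come from the text.

```python
from math import exp

THRESHOLD = 1.337  # threshold sum probability reported in the text

def predicted_probability(m_score, intercept, slope):
    """Logistic model: P(HCC) = 1 / (1 + exp(-(intercept + slope * M)))."""
    return 1.0 / (1.0 + exp(-(intercept + slope * m_score)))

def combined_score(detect_m, spec_m, detect_coef, spec_coef):
    """Sum of the two predicted probabilities; ranges from 0 to 2."""
    return (predicted_probability(detect_m, *detect_coef)
            + predicted_probability(spec_m, *spec_coef))

def classify(score, threshold=THRESHOLD):
    """1 = HCC, 0 = no HCC (using >= at the threshold is an assumption)."""
    return 1 if score >= threshold else 0

# Hypothetical (intercept, slope) pairs, NOT the fitted coefficients.
DETECT_COEF = (-2.0, 1.0)
SPEC_COEF = (-3.0, 1.0)

high = combined_score(8.0, 6.0, DETECT_COEF, SPEC_COEF)  # HCC-like M scores
low = combined_score(0.0, 0.0, DETECT_COEF, SPEC_COEF)   # control-like M scores
print(round(high, 3), classify(high), round(low, 3), classify(low))
```

Summing the two probabilities means that a sample needs strong evidence on both axes, cancer detection and liver specificity, to exceed a threshold above 1.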