Identification of differentially methylated sites in LUAD
We initially performed differential expression analysis to select DEGs between LUAD and normal lung tissues in TCGA-LUAD dataset. With cut-off criteria of FDR < 0.05 and |log2FC| >2.0, a total of 960 DEGs were identified, including 653 up-regulated DEGs and 307 down-regulated DEGs (Fig. 1A). We then selected the methylation sites which were differentially methylated between LUAD and normal lung tissues and significantly negative correlated with the expression of corresponding DEGs. We thought that such methylation sites could influence the gene expression and further participate in tumor progression. The results showed that a total of 1362 DMSs corresponding to 471 DEGs were identified, including 752 hypermethylation sites and 610 hypomethylation sites (Fig. 1B).
Functional enrichment of DMGs
To further investigate the biological processes which the DMSs might be involved in, we performed GO annotation and KEGG pathway enrichment using DAVID database for the 471 corresponding DEGs. The DEGs were significantly enriched in many cancer-related pathways. The top significant terms emerging from the gene oncology enrichment analysis were shown in Fig. 2A. For instance, the most significant GO term, cell division, has been reported in multiple articles related to the progression and metastasis of cancer [15–17]. We also found that DEGs were significantly enriched in angiogenesis, which is a core hallmark of advanced cancers, especially in LUAD [18–20]. Besides, other significant GO terms, such as regulation of cell cycle and regulation of small GTPase mediated signal transduction were also related to cancer progression and chemoresistance reported in many studies [21, 22]. As shown in Fig. 2B, KEGG pathway enrichment analysis found twelve significantly enriched pathways related to cancer progression, such as PI3K-Akt signaling pathway [23, 24], ECM-receptor interaction [25, 26] and p53 signaling pathway [27, 28]. The results indicated that these DEGs played key roles in multiple cancer-related pathways, and further indicated that the DMSs might be involved in LUAD progression by regulating the corresponding gene expression.
Construction of PPI Network
Using STRING database, a PPI network was constructed to further explore the interactions between the 471 DEmRNAs. After removing unconnected nodes, the PPI network of DEGs is consisted of 188 nodes and 888 edges when combined score > 0.7 was set as the cutoff criterion (Fig. 3A). Furthermore, the top 10 hub genes, including cyclin dependent kinase 1 (CDK1), cyclin A2 (CCNA2), cyclin B1 (CCNB1), cell division cycle 20 (CDC20), cell division cycle associated 8 (CDCA8), aurora kinase B (AURKB), assembly factor for spindle microtubules (ASPM), PDZ binding kinase (PBK), ribonucleotide reductase regulatory subunit M2 (RRM2) and centromere protein F (CENPF), were identified using the cytoHubba plugin for Cytoscape, with a higher degree of connectivity (Fig. 3B). Most of ten genes had been reported to be closely related to tumorigenesis and progression of LUAD.
Establishment of the DMSs-based prognostic signature
Performing the univariate Cox regression analysis, we identified DMSs with potential prognostic value in TCGA-LUAD cohort. Details of the clinical characteristics are presented in Supplementary Table S1. We found that 59 DMSs were significantly associated with overall survival, including 47 hypermethylation sites and 12 hypomethylation sites. The list of 59 DMSs is showed in Supplementary Table S2. Thus, these methylation sites were defined as prognosis-related DMSs to construct the prognostic signature. We used the glmnet package in R to perform LASSO regression analysis in TCGA-LUAD cohort. We obtained the optimal value of the parameter λ, which controlled the degree of LASSO regression complexity, and selected the significant variables through multiple cross-validation. We found that the parameter λ reached the optimal value, when the number of variables was three. Therefore, combining the regression coefficients of three DMSs under the optimal λ value, we constructed a three-DMSs risk score model to guide the prognosis of LUAD patients. The general information of the three DMSs is displayed in Table 2. The risk score formula was created as follows: Risk score = (1.0003*methylation level of cg21339084) + (0.1484*methylation level of cg07400091) + (-0.2536*methylation level of cg23843180). Calculating the risk score for each patient in TCGA-LUAD cohort, patients were classified into a high-risk or a low-risk group based on the median risk score. We found that the three-DMSs signature significantly stratified patients in terms of overall survival (log-rank p = 1.9E-04; Fig. 4A). Patients with high risk score had significantly shorter OS than those with low risk score. The mortality rate was 34.0% (71/209) in the high-risk group, significantly higher than 14.4% (30/208) in the low-risk group (p < 0.001, Fisher exact test; Fig. 4B). The risk score distribution, survival status, and methylation profile of the three prognostic DMSs are shown in Fig. 4C. As shown in Table 3, multivariate Cox regression analysis suggested that the three-DMSs signature was an independent prognostic factor, after adjusting for age, gender and stage (HR = 2.29, 95%CI: 1.47–3.57, P = 2.36E-04). Furthermore, noticing the patients with lymph node metastasis status, we found that patients in the high-risk group had a higher lymph node metastasis rate than those in the low-risk group (26.2% vs. 15.8%, p = 0.018, Fisher exact test; Fig. 4B). From the three DMSs, two were associated with high risk (cg21339084 and cg07400091; HR > 1) and one appeared to be protective (cg23843180; HR < 1). The methylation level of the three prognostic DMSs was detected and the differences between high- and low-risk groups were compared. We found that patients with high-risk scores tended to hypermethylation at risky sites, whereas patients in the low-risk group tended to hypomethylation at protective sites (Fig. 5A-5C).
Table 2
General information of the three DMSs
ProbeID | Gene | chrom | chromStart | chromEnd | coefficient |
cg21339084 | LIMS2 | chr2 | 128422432 | 128422434 | 1.0003 |
cg07400091 | S1PR1 | chr1 | 101704472 | 101704474 | 0.1484 |
cg23843180 | NGEF | chr2 | 233852838 | 233852840 | -0.2536 |
Table 3
Univariate and multivariate Cox regression analysis in TCGA-LUAD and GSE56044 cohorts
Variables | Univariate analysis | Multivariate analysis |
HR (95% CI) | P | HR (95% CI) | P |
TCGA-LUAD cohort |
Age |
< = 60/>60 | 0.96(0.63–1.47) | 0.852 | 1.19(0.77–1.85) | 0.437 |
Gender |
Male/Female | 0.94(0.64–1.40) | 0.774 | 0.99(0.66–1.49) | 0.964 |
Stage |
I + II/III + IV | 2.74(1.82–4.13) | 1.30e−06 | 2.79(1.85–4.21) | 1.06e−06 |
Risk score |
Low/High | 2.22(1.45–3.42) | 2.77e−04 | 2.29(1.47–3.57) | 2.36e−04 |
GSE56044 cohort |
Age |
< = 60/>60 | 2.83(1.26–6.35) | 0.012 | 2.88(1.26–6.59) | 0.012 |
Gender |
Male/Female | 1.10(0.63–1.93) | 0.737 | 0.79(0.44–1.43) | 0.438 |
Risk score |
Low/High | 2.15(1.20–3.85) | 0.010 | 2.16(1.19–3.91) | 0.011 |
Prognostic validation of the three-DMSs signature
An independent cohort (GSE56044), containing 82 LUAD patients with both methylation data and clinical information, was used to validate the prognosis performance of the three-DMSs signature. Similarly, we calculated the risk score for each patient using the three-DMSs signature, after which patients were classified into a high-risk (n = 41) or a low-risk (n = 41) group based on the median risk score. We found that patients in high-risk group had a shorter survival time than those in low-risk group (HR = 2.15, 95% CI: 1.20–3.85, log-rank p = 0.008, Fig. 6A). Furthermore, we calculated the mortality rate in each risk group. The result showed that the mortality rate in high-risk group was 32% higher than that in low-risk group (p = 0.006, Fisher exact test; Fig. 6B). The risk score distribution, survival status, and expression profile of the three prognostic DMSs are shown in Fig. 6C. As biased stage information, the stage variable is excluded when performed multivariate Cox regression analysis. In accordance with the result of training set, the multivariate Cox regression analysis confirmed that the three-DMSs signature was significantly correlated with overall survival as an independent prognostic factor (HR = 2.16, 95% CI: 1.19–3.91, P = 0.011, Table 3).
Construction of three-DMSs signature-based nomogram
Multivariate Cox analysis indicated that three variables (age, stage, and three-DMSs risk score) were independent risk factors for OS. Thus, A nomogram predicting 3- and 5-years OS was constructed based on the multivariate analysis data. As shown in Fig. 7, the total points for a patient can be obtained by adding the points from each independent prognostic factor listed in the nomogram. C-indexes for the nomogram were 0.71 (95%CI: 0.58–0.85) and 0.70 (95%CI: 0.52–0.88) in TCGA-LUAD and GSE56044 cohorts, respectively. The calibration plots for the probabilities of 3 and 5-year OS indicated no apparent departure from the ideal line, showing good agreement between the nomogram-predicted OS and actual OS of LUAD patients in both the training and validation cohorts (Fig. 8). Such results indicated that the three-DMSs signature-based nomogram could provide insight into regarding survival prediction and serve as a clinically available guide for personalized treatment of LUAD patients.