Identification of DEGs in LUAD
A flowchart for our study was shown in Figure 1. The lung adenocarcinoma mRNA sequencing dataset was downloaded from the TCGA database. A total of 3581 DEGs were obtained according to the criterion: |logFC| > 1, FDR < 0.05, including 2386 upregulated genes and 1195 downregulated genes. List, Heatmap, and volcanic map of the DEGs were shown in the supplementary document: Additional file 1, Additional file 2, Additional file 3.
Establishment of a four-gene panel as a prognostic indicator
Univariate Cox regression analysis was performed for identifying the DEGs associated with OS using the “survival” package of R language. Of the 3,581 DEGs, 523 genes were identified as being associated with OS for LUAD patients (p<0.01, Additional file 4). Then, lasso regression analysis was implemented to further obtain a stable set of genes (Additional file 5). Seven of the 294 mRNAs significantly associated with OS were screened out via this analysis (ANLN, C1QTNF6, DKK1, ERO1A, GNG7, LDHA, MELTF). At last, a four-gene panel as a prognostic indicator was obtained via multivariate Cox regression analysis. The forest map of Cox regression analysis was shown in Figure 2. The four genes screened were dickkopf WNT signaling pathway inhibitor 1(DDK1), G protein subunit gamma 7(GNG7), lactate dehydrogenase A (LDHA), melanotransferrin (MELTF, also known as MTF1(metal regulatory transcription factor 1)). Among them, DKK1, LDHA and MELTF are high-expressed in tumor tissues compared with tissue adjacent to carcinoma, but GNG7 is low-expressed. The heat map of differential expression was shown in Figure 3. The risk score = (0.38606 * ExpressionDKK1) + (-0.77458 * ExpressionGNG7) + (1.95469 * ExpressionLDHA) + (0.83740 * ExpressionMELTF). Each patient from the TCGA-LUAD database was awarded a risk score based on the Cox regression model composed of the four genes. Then, these patients were divided into a low-risk and high-risk group according to a cut off value obtained by the R package “survminer" and “survival”. The predictive performance of the four-gene panel for OS was estimated using the time-dependent ROC curve by R package “timeROC” and “survival”. The Area under the ROC curve (AUC) of this four-gene panel as a prognostic indicator was 0.740 and was superior to other clinical indicators used for prognostic classification. The OS in the high- and low-risk groups were compared using the Kaplan–Meier survival curve via the R package “survival”. The results indicated that high-risk group had a worse prognosis compared with the low-risk group (Figure 4A).
Validation of the four-gene panel as a prognostic indicator
To further validate the accuracy of the four-gene panel as a prognostic indicator, we computed the risk score of each patient in GSE42127 using the same model. Consistent with previous results, a significantly worse OS was observed in the high-risk group compared with the low-risk group. ROC curve showed that the AUC for OS was 0.752, indicating a better predictive performance compared with other clinical indicators used for prognostic classification (Figure 4B).
Independent prognostic value of the four-gene panel
To determine whether the four-gene panel is an independent prognostic indicator for LUAD patients, 344 patients from the TCGA-LUAD database with complete clinical information including age, gender, and TNM stage were included for further analysis. Multivariate Cox regression analysis suggested that only the risk score calculated from the four-gene panel was independent prognostic factors for OS (Figure 5A, 5B). The consistent result was obtained in the patients from GSE42127 (Figure 5C, 5D).
Establishment of predictive nomogram
We established a nomogram to predict 1-year, 3-year, and 5-year OS in 460 patients with complete clinical information from the TCGA-LUAD database using five factors including risk score, age, sex, pharmaceutical, and pathologic stage (Figure 6A). The C-index for the nomogram model was 0.710 (95% CI 0.624–0.796). Calibration curve showed that the nomogram had the superior prediction efficiency (Figure 6B). These results indicated that the nomogram might be to serve as a prognostic model used for clinical management of LUAD patients.
The genetic alteration, expression and survival analysis of the four genes
Among the 230 patients included in cBioPortal database, 21 (9%) shown genetic alterations in the four genes. Missense mutation, amplification and deep deletion were common genetic alteration (Figure 7A). We further validated the expression of the four genes using the lung adenocarcinoma dataset (GSE75037 [15]) from the GEO database, and the results were consistent with the analysis of the TCGA-LUAD dataset (Figure 7B). Kaplan-Meier survival curves indicated that high expression of DKK1, LDHA, and MELTF and low expression of GNG7 were associated with a poor OS for LUAD (Figure 7C).
Predictive value of the four-gene panel for patients with EGFR, KRAS and TP53 mutation
To explore the predictive value of the four-gene panel for patients with EGFR, KRAS and TP53 mutation, we performed a combined analysis of gene mutation and transcriptome data. A waterfall map of mutations in TP53, KRAS, EGFR and the four genes was shown in Figure 8. The results showed that mRNA expression differences of GNG7 and MTF1 (MELTF) were only observed in TP53 mutant and wild-type patients (Figure 9). We further analyzed the predictive value of the four-gene panel for OS in LUAD patients with EGFR, KRAS and TP53 mutant, respectively. The results showed that the four-gene panel we established was still good predictors for OS in LUAD patients with EGFR, KRAS and TP53 mutations. In addition, ROC curve analysis also showed good predictive performance (Figure 10).
Gene Set Enrichment Analyses
GSEA analysis was used to identify signaling pathways enriched in low and high expression of the four genes, respectively. The results revealed that genes involved in cell cycle, ubiquitin mediated proteolysis, RNA degradation, aminoacyl tRNA biosynthesis, DNA replication, proteasome, small cell lung cancer, and P53 signaling pathway were enriched in GNG7 low expression patients. In the high-expressed group of DKK1, KEGG pathways including cell cycle, RNA degradation, spliceosome, proteasome, DNA replication, P53 signaling pathway and so on were enriched. In the high-expressed group of LDHA, KEGG pathways including proteasome, RNA degradation, spliceosome, DNA replication, RNA polymerase, cell cycle, P53 signaling pathway and so on were enriched. In the high-expressed group of MELTF, the enriched KEGG pathways were mainly focused on bladder cancer, proteasome, DNA replication, base excision repair, pyrimidine metabolism. These results suggested that the absence of GNG7 expression and the increase of DKK1, LDHA and MELTF expression may be significantly related to the metabolism of genetic material, especially in the regulation of cell cycle pathway (Figure 11).
DNA methylation level and mRNA expression of GNG7
We further explored the relationship between DNA methylation level and mRNA expression of GNG7 using MethHC (A database of DNA Methylation and gene expression in Human Cancer) database. The results showed that the DNA methylation levels of GNG7 were significantly higher in 18 kinds of cancerous tissues than adjacent noncancerous tissues (Figure 12). Furthermore, methylation level of the promoter and CpG Island region was negatively correlated with mRNA expression of GNG7 (Figure 13).