3.1 Flow chart of this study
The detailed workflow for the construction of the risk models and downstream analysis is shown in Figure 1. We first used the TCGA database to extract MRGs from the COAD and READ data sets and then observed the differential expression of MRGs at the gene expression level. Specific prognostic risk models for colon and rectal cancers were constructed using the MRG set data, and the predictive ability of each risk model was tested through ROC analysis. The risk model of colon cancer was validated in the GEO database. Finally, Kaplan–Meier analysis and expression level detection of the genes in the risk models were performed, and the mutation information of each gene in COAD and READ in the models was examined.
3.2 Differential expression and functional annotation of MRGs in COAD
From the TCGA database, we first downloaded mRNA expression data and the clinical information of 398 COAD tissue samples and 39 nontumor tissues (Table 1). A total of 772 MRGs were detected in this database, and 196 differentially expressed MRGs were obtained. The expression patterns of the differentially expressed MRGs in COAD and nontumor tissues are shown in the volcanic and heat maps (Figure 2A and B). Of the 196 differentially expressed genes, 98 genes were down-regulated, and 98 genes were up-regulated in the tumor tissues. To observe metabolic differences between colon cancer tissues and normal tissues, we used the KEGG pathway to conduct a functional enrichment analysis of the differentially expressed MRGs. We found that the up-regulated genes are mainly involved in the following metabolic pathways: purine metabolism, amino acid biosynthesis, carbon metabolism, and cysteine and methionine metabolism (Figure 2C). Meanwhile, the down-regulated genes are mainly involved in fatty acid degradation, starch and sucrose metabolism, retinol metabolism, and galactose metabolism (Figure 2D). We also downloaded the mRNA expression data and corresponding clinical information of 84 READ tissue samples and two nontumor tissues (Table 1). Owing to the small number of normal rectal tissue samples, we were unable to obtain effective differential gene data.
3.3 Construction and verification of the COAD risk model
To investigate the relationship between MRGs and colon cancer prognosis, we constructed a prognostic risk model. Initially, univariate regression analysis was performed on 13 genes that are significantly associated with prognosis, including two low-risk MRGs (GSR and SUCLG2P2) and 11 high-risk MRGs (CPT1C, PLCB2, PLCG2, PLA2G2D, PTGDS, GAMT, ENPP2, SULG2P2, PIK3CD, PIP4K2B, GPX3, and MAT1A; Figure 3A). Then, we used Lasso regression and multivariate regression to generate a final prognosis model with eight MRGs, namely, CPT1C, PLCB2, PLA2G2D, GAMT, ENPP2, PIP4K2B, GPX3, and GSR (Figure 3B and C). The specific information of these genes, including full name, coefficient, and relevant pathway, is shown in Table 2. The patients with colon cancer in the TCGA database were divided into low- and high-risk groups, and Kaplan–Meier survival analysis was performed. The results showed that patients with high risk scores had significantly worse overall survival than those with low risk scores (Figure 3D). Figure 3E shows the association between survival time and risk score. Survival time decreased significantly with increasing risk score. Time-dependent ROC analysis indicated that this prognosis model can accurately predict the OS for patients with colon cancer (Figure 3F). The heatmap developed to show the gene expression profiles indicated that the high-risk genes (CPT1C, PLCB2, PLA2G2D, GAMT, ENPP2, PIP4K2B, and GPX3) were always expressed in the high-risk patients, whereas GSR was often expressed in the low-risk patients (Figure 3G).
To verify the validity of the prediction model, we used the colon cancer data of GSE17538 from the Gene Expression Omnibus (GEO) as validation data. Consistent with the results obtained from TCGA database, our finding showed the prediction model can provide reliable prognoses for the patients in GSE17538. The patients with high risk scores had considerably poor overall survival (Figure 4A). ROC curve analysis showed the sensitivity and specificity of the risk score for the OS when combined with clinical characteristics (Figure 4B). The association between survival time and risk score indicated that the survival time decreased with the significant increase in risk score (Figure 4C).
3.4 Construction of the READ risk model
Using the approach used in colon cancer analysis, we performed univariate regression analysis to construct a prognostic model for rectal cancer. Seven MRGs were used (Figure 5A). A final prognosis model with six MRGs (TDO2, PKLR, GAMT, EARS2, ACO1, and WAS) was constructed using Lasso regression and multivariate Cox regression (Figure 5B and C). Specific information on the genes, including full name, coefficient, and relevant pathways, is shown in Table 4. Meanwhile, the results of Kaplan–Meier survival analysis showed that patients with low risk scores had significantly better overall survival than those with high risk scores, suggesting that the established prognostic model is available (Figure 5D). Survival time increased when the risk score decreased significantly (Figure 5E), and ROC curve analysis showed the sensitivity and specificity of the risk score for OS when combined with clinical characteristics (Figure 5F). Furthermore, the heatmap showed that the high-risk genes (TDO2, PKLR, and GAMT) were always expressed in the high-risk patients (Figure 6G). Given the small number of rectal cancer cases in the GEO database, we were unable to use these data in validating the risk model.
3.5 Prognostic risk models of COAD and READ were independently related to OS
To further validate this prediction model, we performed univariate and multivariate regression analyses and used TCGA and GSE17538 in exploring the correlations of clinical parameters and risk score with overall survival in patients with colon cancer. The TCGA results of univariate Cox regression showed that pathological stage; T, N, and M stages; and risk score were all significantly correlated with overall survival (all P<0.001; Figure 6A). Multivariate Cox regression showed that T stage and risk score were correlated with overall survival in patients with colon cancer (P<0.05; Figure 6B). The GSE17538 results showed that not only univariate Cox regression analysis but also multivariate regression analysis indicated that the prognostic risk model of COAD was independently related to overall survival (P<0.05; Figure 6C and D). As to READ, the results of univariate regression and multivariate Cox regression analysis indicated that the obtained risk model was independently related to overall survival (P<0.05; Figure 6E and F).
3.6 Comprehensive analysis of all the genes in the prognostic models of COAD and READ
According to the previous results, the COAD risk model has eight genes, and the READ risk model has six. To further evaluate the prognostic value of the above genes, we used the GEPIA database in performing Kaplan–Meier analysis. We found that high expression of CPT1C, PLCB2, or GPX3 indicates significantly poor prognosis, whereas the high expression of GSR indicates excellent prognosis in patients with colon cancer in the GEPIA database (Figure 7A17). Although PLA2G2D, GAMT, ENPP2, and PIP4K2B cannot significantly predict prognosis, patients with high expression level of PLA2G2D, GAMT, ENPP2, or PIP4K2B always had poor prognosis (Figure 7A). In READ, high TDO2 and GAMT expression levels were correlated with poor prognosis, whereas high EARS2 and ACO1 expression levels predicted good prognosis (Figure 7B). Overall, the results of Kaplan–Meier analysis were generally consistent with the univariable Cox analysis results.
Meanwhile, we analyzed the mRNA expression levels of the genes in relative tumor and normal tissues with the GEPIA database. As shown in Figure 8A, in the COAD risk model, the gene expression levels of PLA2G2D, ENPP2, and GPX3 in colon cancer tissues were significantly lower than corresponding normal tissues, but changes in the other genes were not obvious. As to the READ risk model, only GAMT had significantly lower expression in rectal cancer tissues than in corresponding normal tissues (Figure 8B).
Furthermore, we investigated genetic alterations of genes in the prognosis models with the data of Gene Set Cancer Analysis Lite (http://bioinfo.life.hust.edu.cn/web/GSCALite/). The results showed that the genes in the COAD model were changed in all the 38 queried samples (100%; Figure 9A), and the genes in the READ model were changed in all the five queried samples (100%; Figure 9B). In COAD, genetic alterations included single nucleotide polymorphism (SNP), delete mutation, and insert mutation, whereas the genetic alterations in READ just included SNP.
3.7 Involved metabolic pathways of the genes in the risk models analyzed by GSEA
To understand the metabolic characteristics of colon cancer and rectal cancer tissues, we used the GSEA approach in analyzing major differences in metabolic pathways between high- and low-risk patients. In colon cancer, inositol phosphate metabolism, tyrosine metabolism, arachidonic acid metabolism, glycerophospholipid metabolism, and ether lipid metabolism were activated in high-risk tumors, whereas fatty acid metabolism, butanoate metabolism, amino sugar and nucleotide sugar metabolism, propanoate metabolism, and glycolysis gluconeogenesis were inhibited in low-risk tumors (Figure 10A). In rectal cancer, glycerophospholipid metabolism, arachidonic acid metabolism, alpha linoleic acid metabolism, linoleic acid metabolism, and glutathione metabolism were activated in high-risk tumors, whereas the TCA cycle, propanoate metabolism, one carbon pool by folate, lysine, valine leucine and isoleucine degradation were inhibited in low-risk tumors (Figure 10B).
3.8 Expression of GSR and ENPP2 in colon and rectal cancers detected through immunohistochemistry staining
To test the validity of the prognostic models, we detected the protein expression levels of GSR and ENPP2 through immunohistochemistry staining in clinical colorectal cancer specimens and analyzed the relationship between the expression levels and patient prognosis. Immunohistochemical studies have demonstrated that GSR is expressed in the cytoplasm and nuclei of cancer cells (Figure 11A) and ENPP2 is mainly expressed in the cytoplasm of cancer cells (Figure 11H). Kaplan–Meier survival analysis showed that the high expression levels of GSR in the cytoplasm is significantly correlated with the poor prognosis of patients with colon cancer (Figure 11B) but not with the prognosis of patients with rectal cancer (Figure 11E). No significant correlation was found between the expression levels of GSR in nucleus and prognoses of patients with colon cancer (Figure 11C) and rectal cancer (Figure 11F). Meanwhile, the high expression of GSR the cytoplasm and nucleus could not reminder the prognoses of patients of colon cancer (Figure 11D) or rectal cancer (Figure 11G). In addition, patients with colon cancer and high ENPP2 expression levels in cancer tissues always had poor survival (Figure 11I), and no significant correlation between ENPP2 expression and patient prognosis was found in rectal cancer (Figure 11J).