Differentially expressed prognostic MGs
Through the online TCGA database, we obtained the mRNA expression of 446 ccRCC tissues and 63 normal kidney tissues. We combined the RNA sequencing results of 39 other ccRCC patients in the GSE29609 dataset with TCGA cohort.
After correcting and standardizing the data, we obtained the differentially expressed gene profile between the ccRCC group and the normal group through the “Limma” package. Then, we extracted the information on the differential expression of 944 metabolic genes. The filtration conditions (|log2fold-change|>1, FDR < 0.05) were set. Next, 479 differentially expressed MGs were identified, including 196 up-regulated genes and 283 down-regulated genes (Fig. 1). Based on the obtained differentially expressed MGs, we carried out the univariate Cox regression analysis to identify MGs related to the ccRCC patient's overall survival (OS). Finally, a total of 187 genes were considered to be prognostic MGs.
Construction a prognostic model index based on MGs
Due to the lack of clinical data such as pathology stage and histological grade in the GEO database, we use the patients in the TCGA database to divide them into training cohort and testing cohort. Finally, the patients in the TCGA database and the patients in the GEO database were merged together as the entire cohort.
After excluding patients with a follow-up survival time of less than 90 days (n = 15) and excluding patients with incomplete clinical information (n = 11), we divided the remaining TCGA-ccRCC patients into two groups with similar composition ratios (296 patients for training cohort, 124 patients for testing cohort). Table 1 shows the clinical characteristics of patients included in the study.
Immediately afterwards, the lasso regression and the multivariate Cox regression analysis were performed in the training cohort to target key risk genes (Fig. 2). Finally, we identified 4 optimal risk genes. Among these four risk genes, P4HA3 was considered as predictors of poor prognosis. The higher the expression of P4HA3, the worse the prognosis of patients. Three other genes, ETNK2, PAFAH2 and ALAD, were protective factors. According to the results of multivariate Cox regression analysis, we obtained the risk coefficient of each differentially expressed MGs and then constructed a prognostic model index to predict the prognosis of patients with ccRCC. The 4 prognostic MGs related PI formula was as follows: (P4HA3 expression) * (0.07090771) + (ETNK2 expression) * (-0.0497429) + (PAFAH2 expression) * (-0.1753559) + (ALAD expression) * (-0.0880467).
Evaluation of the prognostic model index based on MGs in training cohort and testing cohort
After obtaining the prognostic model index based on MGs for predicting the prognosis of ccRCC patients, we took a series of measures to evaluate the model.
Firstly, we searched The Human Protein Atlas (HPA) database (https://www.proteinatlas.org/) for risk genes and found that at the protein level, the expression of these risk genes was consistent with the results of the mRNA differential analysis we obtained (Fig. 3). This indicated that the model we built is to some extent credible.
Then, we created Kaplan-Meier curves based on the log-rank test to visualize the prognostic value of our established prognostic model in training cohort and in testing cohort (Fig. 4A, Fig. 5A). From the survival curves, we could see that whether in the training cohort or in the testing cohort, patients in the high-risk group have worse prognosis than those in the low-risk group (HR = 1.1, 95%CI = 1.1–1.2, p < 0.001; HR = 1.1, 95%CI = 1.1–1.2, p < 0.001). Figure 4B and Fig. 5B respectively show the time-dependent ROC curves of riskScore in predicting the prognosis of training cohort patients and the time-dependent ROC curves of riskScore in predicting the prognosis of testing cohort patients. In the training cohort, the AUC of the prognostic model at 1 year, 3 years and 5 years were 0.709, 0.719 and 0.708 respectively. In the testing cohort, the AUC of the prognostic model at 1 year, 3 years and 5 years were 0.781, 0.769 and 0.703 respectively. Figure 4C and Fig. 5C show the result of risk classification of patients in training cohort and in testing cohort according to riskScore respectively. From Fig. 4D and Fig. 5D we found that as the risk score increases, the number of dead patients increases. The expression patterns of the risk genes in the high-risk group and the low-risk group were shown in the form of a heat map (Fig. 4E, Fig. 5E), from which we found that whether in the training cohort or in the testing cohort, P4HA3 was up-regulated in the high-risk group, down-regulated in the low-risk group. The patterns of ETNK2, PAFAH2 and ALAD were the opposite.
Evaluation of the prognostic model index based on MGs in entire cohort
FIGURE 6 shows the preliminary validation results of the performance of the prognostic model in all patients. A Kaplan-Meier curve based on the log-rank test and the ROC curve of multiple prognostic indicators were created to visualize the prognostic value of our established prognostic model in all patients (Fig. 6A-D). It was worth mentioning that the predictive ability of the prognostic model we established was better than the predictive ability of the pathology stage. Figure 6E shows the result of risk classification of patients according to riskScore. From Fig. 6F we could found that as the risk score increases, the number of deaths increases. The expression patterns of the risk genes in the high-risk group and the low-risk group were shown in the form of a heat map (Fig. 6G), from which we found that P4HA3 was up-regulated in the high-risk group, down-regulated in the low-risk group. The patterns of ETNK2, PAFAH2 and ALAD were the opposite.
Evaluation of the prognostic model index based on MGs in ArrayExpress cohort
In order to verify whether our model was reliable, we evaluated the prognostic value of risk score in the external cohort from ArrayExpress database (E-MTAB-1980). The external cohort contained 101 patients with ccRCC. Similarly, we calculated the risk score of each patient based on riskScore. Then we divided the patients into a high-risk group and a low-risk group according to the cutoff value we obtained in the training cohort. A Kaplan-Meier curve based on the log-rank test and the time-depedent ROC curve were created to visualize the prognostic value of our established prognostic model in external cohort (Fig. 7A and Fig. 7B). The areas under the ROC (AUC) values of PI were 0.763 for 1-Year-OS, 0.808 for 3-Year-OS and 0.752 for 5-Year-OS. In addition, we further investigated whether each risk gene is related to the prognosis of ccRCC. Figure 7C-F show that risk genes are significantly related to prognosis. Among them, the higher the expression of P4HA3, the worse the prognosis of patients. ETNK2, PAFAH2 and ALAD all show the role of protective prognostic factors, which is consistent with the conclusions we have obtained before.
Clinical correlation analysis
Based on the information of all ccRCC patients from TCGA database we obtained, the correlation analysis between risk factors in the prognostic model (riskScore and each component gene) and clinical characteristics such as age, gender, pathology stage, histological grade, TMN was performed. Figure 8 showed the results. We found that there was a significant correlation between the riskScore and gender, pathology stage, histological grade, T, M (Fig. 8A).
Independent prognostic factor evaluation and GSEA enrichment analysis
To further evaluate whether our model could be used as an independent prognostic factor, we included some key clinical characteristics containing age, gender, pathology stage, histological grade, TMN and riskScore as independent variables. By means of univariate and multivariate Cox regression analysis, our established PI (riskScore) remained significant (both P < 0.001, Table 2). At the same time, the results of multivariate Cox regression analysis also showed that histological grade and M could be used as independent prognostic indicators (both P < 0.05). Looking only at the results of multivariate Cox regression analysis, we could find that riskScore, histological grade and M were related to prognosis (p < 0.001, p = 0.041, p = 0.014 respectively).
In addition, in order to further explore the possible mechanisms that caused different outcomes in the high-risk group and the low-risk group, we performed Gene-set enrichment analysis (GSEA) on the gene expression profiles of the two groups of patients. Figure 9A plots enriched pathways in the high-risk group, while Fig. 9B plots enriched pathways in the low-risk group. The results of GSEA suggested that most of the differentially expressed genes in the low-risk group were genes related to metabolic pathways.
Nomogram development and validation
Finally, to better predict the 1-year OS, 3-year OS and 5-year OS of ccRCC patients, we constructed a new Nomogram based on the results of the multivariate Cox regression analysis of independent prognostic factors (Fig. 10A). Figure 10B-D show the Calibration curves of the nomogram for the probability of OS at 1, 3 and 5 year. The C-index of the nomogram for OS prediction was 0.763 (95%CI = 0.701–0.825), while the C-index of riskScore for OS prediction was 0.722(95%CI = 0.659–0.785).
Establishment of a ceRNA network regulating risk genes
For the lncRNA expression profile and microRNA expression profile of ccRCC patients obtained from the TCGA database, we used the “Limma” package to analyze the differentially expressed lncRNAs and microRNAs between patients in the high-risk group and patients in the low-risk group. We obtained 535 differentially expressed lncRNAs (425 up-regulated and 110 downregulated) and 176 differentially expressed microRNAs (114 up-regulated and 62 down-regulated).
Then, we used miRcode online software to predict possible matching pairs between differential lncRNAs and differential microRNAs. Finally, 26 of 176 differentially expressed miRNAs and 340 of 535 differentially expressed lncRNAs were successfully matched, and there was a correlation between them. According to the 26 target miRNAs and 4 risk genes, 23 microRNA–mRNA interactions were found in the three different databases of Targetscan, miRTarBase and miRDB, 159 lncRNA-microRNA interactions were also identified. In general, a lncRNA-microRNA-mRNA ceRNA network was established to further explore the possible mechanism for the difference in the expression levels of key risk genes between the two groups (Fig. 11).