Screening for glycolysis gene sets with significant differences between tumor tissues and adjacent normal tissues. As demonstrated in Fig. 1, the panoramic flow chart illustrated our comprehensive analysis process. 12 solid tumors with complete clinical information and gene expression profiles were reflected in the present study, consisted BLCA, BRCA, COAD, HNSC, KIRC, KIRP, LIHC, LUAD, LUSC, OV, PRAD, and THCA. All the above data were derived from TCGA and have undergone normalization before the implementation of GSEA. GSEA was performed to carry out analysis with 9 gene sets associated with the glycolysis process. We attempt to phase whether these gene set variants produced significant differences between tumors and their adjacent noncancerous tissues. At least one gene set with an FDR value less than 0.05 was selected for subsequent studies. As presented in Fig. 2, we have identified at least one significant gene set in BLCA, BRCA, HNSC, LIHC, LUAD, and LUSC. In COAD, KIRC, KIRP, OV, PRAD, and THCA, the FDR values of 9 gene sets were revealed to be greater than 0.05. The distribution of the NES value and FDR q-value value of each gene set in the GSEA analysis was shown in the Fig. S1a. A more detailed description of the GSEA results was provided in Fig. S2. Finally, 6 solid tumors (BLCA, BRAC, HNSC, LIHC, LUAD, and LUSC) and corresponding core genes (CORE ENRICHMENT: YES) were used for further analysis (Table S1).
To further investigate whether these core genes participated in the glycolysis process, we performed GO and KEGG pathway analysis by using the R package ClusterProfiler. Further dissection of these genes and pathways indicated enrichment for glucose metabolisms, such as pyruvate metabolic process, pyruvate biosynthetic process, a glycolytic process in GO (Figure S1b-d), and Glycolysis/Gluconeogenesis in KEGG (Figure S1e). These results indicate that these core genes have a profound effect on glucose metabolism, especially glycolysis.
Construction and validation of the prognostic glycolysis associated-gene signature. Core genes were assessed for correlation with OS through univariate regression analysis and future applied in multivariate Cox-PH regression model with a stepwise procedure to identify those important variables. We identified statistically significant gene signatures (GRGPI) in BLCA, BRAC, HNSC, LIHC, LUAD, and LUSC, the detailed results were shown in Table S2. These results highlight the power of GRGPIs to identify patients with adverse outcomes who would be classified as high risk according to these glycolysis gene-related classifiers. We further investigated the area under the time-dependent ROC curves (AUC) values for each cancer type. The highest AUC was observed in LIHC compared with BLCA, BRAC, HNSC, LUAD, and LUSC at 0.5 (0.852), 1(0.840), 2(0.871), 3(0.830), and 5-year (0.756) in Fig. S3. More specifically, in a univariate Cox regression analysis, 92 glycolysis-related genes with significant correlations with overall survival were regarded as significant (Fig. 3a) in LIHC. To avoid overfitting and unnecessary complexity, the independent prognostic factors were restricted to those variables that contributed most toward the final model coefficients based on the AIC and the model χ2 score. The selected features were incorporated into a least absolute shrinkage and selection operator (LASSO) regression model to penalize for model complexity overfitting. 8 genes (AURKA, DEPDC1, CDK1, CENPA, HMMR, KIF20A, PFKFB4, and STMN1) remained with their individually nonzero LASSO coefficients (Fig. 3b and 3c). Multivariate analysis of these variables contributed to their virtual statistical weighting, determining their impact on prognostic risk, using Cox proportional hazard regression. Finally, the risk score of 8 gene signatures was established as follows: Risk score = (0. 1224 * expression of AURKA) + (0.0534 * expression of CDK1) + (0.0920 * expression of CENPA) + (0.1323 * expression of DEPDC1) + (0.1140 * expression of HMMR) + (0.2425 * expression of KIF20A) + (0.1562 * expression of PFKFB4) + (0.0911 * expression of STMN1). Patients with high or low risk were clustered based on the median risk score of the TCGA discovery cohort. The distribution of risk scores, survival status, and gene expression landscapes of patients varied significantly within the two subgroups as shown in Fig. 4a. Kaplan-Meier survival analysis disclosed that the survival of the low-risk group was significantly longer than the high-risk group (Fig. 4b, P < 0.001). The cumulative event probability curve shows that HCC patients in the high-risk group have a significantly higher probability of cumulative events during the entire follow-up period than in low-risk patients (Fig. 4c, P < 0.001). We applied the classifier to assess whether the 8-mRNA panel can predict an individual or a specific HCC recurrence. The TCGA dataset containing recurrence events and recurrence time was used as an internal training cohort (TCGA training cohort). Our prognostic evaluations of survival analysis for 8 gene signatures were based on TCGA recurrence-free survival (RFS) outcomes. The distribution of risk score, survival status, and gene expression patterns of patients was demonstrated in Fig. 4d. Patients with low-risk scores also had longer RFS time than patients with high-risk scores (Fig. 4e, P < 0.001). The cumulative event occurrence curve revealed a significant cumulative risk (HR) of HCC patients in the high-risk group. (Fig. 4f)The analysis results show that this 8-gene signature can be used as a prognostic indicator for the outcome and recurrence of HCC patients. Subsequently, we performed another two independent analyses on the datasets from GEO and ICGC datasets. Consistently, as described earlier, in two independent validation sets, the 8-gene model sharply divided two risk subgroups (Fig. 5a and 5d). Not surprisingly, the survival analysis and cumulative risk curve indicated that the high-risk group had a shorter OS and higher cumulative risk (Fig. 5b-c and 5e-f). We have observed the robust prognostic value of the classifier in 3 independent cohorts.
Independent predictive value of the 8-mRNA signature. We constructed risk scores and developed predictive models to predict OS and RFS. To verify the assignments of sub-categories, we also performed t-SNE to constraint the dimensionality of the features. The T-SEN analysis revealed that the two risk subgroups are scattered in two discrete directions (Fig. 6a-d). In the TCGA discovery cohort, TCGA training cohort, GEO validation cohort, and ICGC validation cohort, we performed the time-dependent ROC curves analysis to estimate the prognostic accuracy of the 8 glycolysis-related signatures at 0.5-, 1-, 2-, 3- and 5-years, for the prediction of OS and RFS, respectively. These results confirmed the high prognostic accuracy of the 8-signature model (Fig. 6e-h).
The explanation for many clinical situations one can identify some standard variables that have previously been demonstrated to have prognostic value and are generally measured for most patients having the particular diagnosis. As mentioned earlier, we then explored tumor-related clinicopathological variables associated with our classifier based on TCGA (Fig. 7a, Table S3), GEO (Fig. 7b, Table S4), and ICGC (Fig. 7c, Table S5) cohorts. Patient clinicopathologic characteristics were listed in Table 1. The results of Pearson's Chi-Square test showed that there were several significant correlations between clinicopathological characteristics and HCC risk subtypes in three independent cohorts. More specifically, for the TCGA discovery cohort, T classification (P = 0.0032), Stage (P < 0.001 ), Grade (P = 0.0175), Family Cancer History (P = 0.0359), AFP level (P = 0.0147), Cancer Status (P < 0.001), Recurrence Event (P < 0.001) and patient Status (P < 0.001). In the TCGA training set, there was a significant correlation between T classification (P = 0.0095) and Stage (P = 0.0033) between HCC subgroups (Table S3). Similarly, in the GEO and ICGC cohorts, some important clinicopathological characteristics also have significant correlations between subgroups. More detailed results are shown in Table S4 and Table S5. To prove the independence of GRGPI, a Cox proportional hazard regression analysis was performed in the TCGA, GEO, and ICGC cohorts. The adjustment results of clinical variables suggested that risk score remained an independent prognostic factor, confirming its robust predictive ability for OS (HR = 1.267, P < 0.001) and RFS (HR = 1.027, P < 0.001) of HCC patients (Fig. 8a and b). In the Cox regression model, some clinical-pathological factors (Cancer Status (P = 0.017), Hepatitis Virus Infection (P = 0.041), and Child-Pugh Score (P = 0.011) for OS; Cancer Status (P < 0.001), Hepatitis Virus Infection (P = 0.017), and BMI (P = 0.009) for RFS) were also considered as independent poor prognostic factors, and could prove valuable for risk stratification in pathological subgroups in Fig. 8a and Fig. 8b. In addition, KM survival analysis revealed that disease-specific survival rates are significantly different in some pathological subgroups, such as T classification (T1-2 vs T3-4, P < 0.001), Stage (Stage I-II vs III-IV, P < 0.001), Cancer Status (Tumor Free vs With Tumor, P = 0.011), Hepatitis Virus Infection (HVI Negative vs Positive, P = 0.008), Child-Pugh Score (A/B vs C, P < 0.001) and AFP ( < = 200 vs > 200, P = 0.002) for TCGA OS (Figure S4). These results were consistent with those obtained using a univariate Cox regression for OS with adjustments for the prognostic factors (Fig. 8a, Table S3). In addition, the index can independently predict the OS of GEO (HR (95% CI) = 2.430 (2.054–2.874), P < 0.001) (Fig. 8c) and ICGC cohorts (HR (95% CI) = 1.108 (1.069–1.149), P < 0.001) (Fig. 8d). All in all, these results strongly suggested that GRGPI was an independent prognostic factor for HCC patients.
Subsequently, further stratified analysis was also performed to investigate the independence of the model within the same subgroups of clinicopathological features. Taking advantage of the clinicopathological parameters, we divided the TCGA discovery cohort into subgroups based on their clinical-pathological features, such as Gender (Male/Female), Age (≤ 65/>65), Grade (G1-2/G3-4), Stage (Stage I-II/III-IV), T classification (T1-2/3–4), Tumor Status (Tumor Free/With Tumor), etc. As revealed in Fig. S5 and Fig. S6, regardless of stratification, 8-mRNA signatures still can make a distinction from high-risk patients. Finally, similar results in the GEO (Fig. S7-S9) and ICGC (Fig. S10-11) cohorts were also statistically significant.
To assess the sensitivity and specificity of the OS/RFS prognostic model and other clinical pathology variables, we performed a multiple ROC curve analysis. To estimate its performance, we applied the 8-mRNA model to the TCGA discovery cohort, TCGA training cohort, GEO, and ICGC cohort, and compared its prediction quality by evaluating the area under the ROC curve. Following the multivariate Cox regression and AUC analysis, the prognostic model remained a moderate and independent prognostic indicator (TCGA discovery cohort: AUC = 0.860; TCGA training cohort: AUC = 0.801; GEO validation cohort: AUC = 0.834; ICGC validation cohort: AUC = 0.843; Fig. 8e-h). It has demonstrated that the risk score model outperformed the other clinical pathology variables for the prognostic evaluation of HCC patients. These results indicated that GRGPI signatures have a predominately higher favorable value than other parameters in predicting the OS and RFS of HCC patients.
Development and verification of a personalized Nomogram. To provide clinicians with a portable quantitative table to predict the prognosis of liver cancer patients, a nomogram integrating GRGPI and clinicopathological characteristics was constructed in TCGA, GEO, and ICGC cohorts. In the TCGA cohort, the risk score contributed the largest risk point, compared with other clinicopathological characteristics, then followed by T classification, Hepatitis virus infection, Child-Pugh Score and Stage, etc (Fig. 9a). As shown in Fig. 9b and 9c, a total of 371 patients were reclassified in the new Nomogram model for OS NRI (net reclassification index) = 0.415. ROC analysis revealed the accuracy of the Nomo model, which is a good predictor of patient survival, with an AUC value of 0.873 (Fig. 9d). In the decision curve analysis, the novel nomogram showed more net benefit across the range of decision threshold probabilities than the Risk score model and integrated clinicpathology model (Fig. 9e). The calibration curves showed a stable agreement between the prediction by the nomogram and the actual observation for 1-, 2-, and 3-year OS in Fig. 9f. This novel Nomogram model that integrated GRGPI and clinicopathological features maintained good agreement between the predicted and observed survival probabilities in the GEO (AUC = 0.854) and ICGC (AUC = 0.863) cohorts (Fig S12 and Figure S13).
The landscape of immune infiltration in HCC risk subgroups. Due to the significant differences between subtypes, immune infiltration was studied to characterize their immunological characteristics. The CIBERSORT algorithm was used to calculate the abundance of 22 immune-related cell types and displayed in the heatmap and box plot in the TCGA (Fig. 10a-b), GEO (Fig. 10c-d), and ICGC (Fig. 10e-f) cohorts, respectively. Consistently, the frequency of CD8 + T cells in the low-risk group was significantly higher, whereas the proportion of M2 macrophages were rather higher in the high-risk group, in 3 independent cohorts. After calculation of tumor immune infiltration levels of each patient, we found high CD8 + T cell levels could predict better survival, while high levels of M2 cells often indicated worse OS and RFS in HCC tissues (Fig. 10g-k).
WGCNA and GSEA analysis.
We performed WGCNA and GSEA analysis to identify gene expression patterns between different subgroups. Here, based on average clustering, we did not detect outlier samples (Fig. 11a). The soft threshold β was set at 7 to determine a scale-free network (Fig. 11b). Subsequently, the genes were assigned to 16 modules, among which, the gray modules included genes that cannot be clustered (Fig. 11c). The two gene modules most correlated with high- (pink, yellow) and low-risk (greenyellow, turquoise) groups were presented in Fig. 11d. We performed GO and KEGG analysis to identify the potential biological significance of related TOP2 modules in different subgroups (Fig. 11e-h). Besides, we also performed a GSEA analysis based on the overall TCGA-LIHC expression profiles. The terms existing in both WGCNA and GSEA analysis results were shown in Fig. 11i-l. Specifically, some terms related to cell cycle transition, chromatin separation, DNA replication, and DNA helicase activity were significantly enriched in a high-risk group.
Further verification of the 8-gene signature. To further determine the reliability of the observed gene signatures, we conducted additional verification at the transcriptome level and protein level. We evaluated the expression of 8 gene signatures, based on the TCGA, GEO, and ICGC databases. Inspection of the results revealed a general trend that these 8 gene signatures are dysregulated in HCC tumor tissues, as is demonstrated in Fig. 12a-c. Furthermore, we also examined the expression levels of 8 gene signatures at the protein level using immunohistochemistry (IHC), based on the Human Protein Altas. IHC confirmed the upregulation of protein in HCC tissue. Moderate or high staining intensity of these 8 proteins in HCC tissues contrasted sharply with the low intensity or lack of staining in normal tissues (Fig. 12d).
Table 1. The clinicopathological characteristics of the HCC patients enrolled in the TCGA, GEO, and ICGC cohorts.
Variables
|
TCGA Cohort (n=371)
|
ICGC Cohort (n=232)
|
GEO Cohort (n=221)
|
Age
|
|
|
|
<=65
|
233
|
90
|
200
|
>65
|
138
|
142
|
21
|
Gender
|
|
|
|
Female
|
121
|
61
|
30
|
Male
|
250
|
171
|
191
|
T Classification
|
|
|
|
T1
|
183
|
N/A
|
N/A
|
T2
|
95
|
N/A
|
N/A
|
T3
|
80
|
N/A
|
N/A
|
T4
|
13
|
N/A
|
N/A
|
Stage Classification
|
|
|
|
Stage I
|
179
|
36
|
93
|
Stage II
|
93
|
106
|
78
|
Stage III
|
92
|
71
|
50
|
Stage IV
|
7
|
19
|
0
|
Grade
|
|
|
|
G1
|
57
|
22
|
N/A
|
G2
|
178
|
142
|
N/A
|
G3
|
124
|
59
|
N/A
|
G4
|
12
|
9
|
N/A
|
BCLC Stage
|
|
|
|
Zero
|
N/A
|
N/A
|
21
|
A
|
N/A
|
N/A
|
149
|
B
|
N/A
|
N/A
|
22
|
C
|
N/A
|
N/A
|
29
|
Status
|
|
|
|
Alive
|
241
|
189
|
131
|
Dead
|
130
|
43
|
90
|
Recurrence
|
|
|
|
No
|
191
|
N/A
|
100
|
Yes
|
180
|
N/A
|
121
|
Cancer Status
|
|
|
|
Tumor Free
|
250
|
N/A
|
N/A
|
With Tumor
|
121
|
N/A
|
N/A
|
Family Cancer History
|
|
|
|
No
|
251
|
152
|
N/A
|
Yes
|
120
|
80
|
N/A
|
Prior Malignancy
|
|
|
|
No
|
N/A
|
202
|
N/A
|
Yes
|
N/A
|
30
|
N/A
|
Multi Nodular
|
|
|
|
No
|
N/A
|
N/A
|
176
|
Yes
|
N/A
|
N/A
|
45
|
Cirrhosis
|
|
|
|
No
|
N/A
|
N/A
|
18
|
Yes
|
N/A
|
N/A
|
203
|
Fibrosis Grade
|
|
|
|
No Fibrosis
|
127
|
N/A
|
N/A
|
Incomplete Cirrhosis
|
15
|
N/A
|
N/A
|
Established Cirrhosis
|
111
|
N/A
|
N/A
|
Fibrous Speta
|
42
|
N/A
|
N/A
|
Portal Fibrosis
|
76
|
N/A
|
N/A
|
Hepatitis Virus Infection
|
|
|
|
None Risk
|
195
|
N/A
|
0
|
HBV
|
61
|
N/A
|
221
|
HCV
|
18
|
N/A
|
0
|
HCV&HBV
|
97
|
N/A
|
0
|
Child−Pugh Score
|
|
|
|
A
|
198
|
N/A
|
97
|
B
|
99
|
N/A
|
75
|
C
|
74
|
N/A
|
49
|
BMI
|
|
|
|
<=24
|
174
|
N/A
|
N/A
|
>24
|
197
|
N/A
|
N/A
|
AFP level
|
|
|
|
Low
|
190
|
N/A
|
121
|
High
|
181
|
N/A
|
100
|
ALT level
|
|
|
|
<=50
|
N/A
|
N/A
|
130
|
>50
|
N/A
|
N/A
|
91
|
Tumor Size
|
|
|
|
<=5
|
N/A
|
N/A
|
140
|
>5
|
N/A
|
N/A
|
81
|
Abbreviation: BCLC Stage: Barcelona Clinic Liver Cancer Stage; AFP: alpha fetoprotein; ALT: Alanine aminotransferase.