Prognostic DEIGs Screening
The overall process of the study is shown in Fig. 1. We first analyzed the differences between 374 tumor samples and 50 normal samples in TCGA. According to adjust P < 0.01 and absolute value of logFC > 2, 2874 differential genes were screened out (Table S1), including 362 down-regulated genes and 2512 up-regulated genes. Visualization using heat and volcano maps (Figs. 2A,B). Then, 154 differential expression immune genes(DEIGs) were obtained through the intersection of immune-related genes (Fig. 2C). Univariate Cox proportional risk analysis was performed on DEIGs, and 28 DEIGs related to prognosis were obtained (Fig. 2D, Table S2).
Prognostic DEIGs Functional Enrichment and Genetic Alterations
GO and KEGG enrichment analysis was performed on 28 prognostic DEIGs. GO enrichment results showed that prognostic DEIGs were mainly enriched in Epithelial cell proliferation, gland development, regulation of cysteine-type endopeptidase activity, regulation of cysteine-type endopeptidase activity involved in apoptotic process, regulation of endopeptidase activity. (Fig. 3A). KEGG results showed that prognostic DEIGs were enriched in Bladder cancer, Cytokine-cytokine receptor interaction, IL-17 signaling pathway, Melanoma, Neuroactive ligand-receptor interaction (Fig. 3B).
We also studied the copy number variation of 28 prognostic genes and summarized the CNV variation frequency of 28 prognostic DEIGs in TCGA-LIHC (Fig. 3C). Visualized the position of CNV variation on chromosomes (Fig. 3D).
Construction and Prognostic Value of IRSS
In order to calculate the risk score, patients with no survival information and survival time less than 30 days were excluded. Finally, 342 HCC patients were included in the study, and the total sample set (Sum) was randomly divided into the train set and the test set, with 171 patients in each group. Chi-square test was used to determine that there was no statistically significant difference in clinical characteristics among each group (Table 1). LASSO regression analysis was performed on 28 prognostic DEIGs in a training set containing 171 patients, and the model fitted best when the penalty index was 10 (Fig. 4A,B). Then, 7 prognostic genes were obtained through multivariate Cox regression analysis :GAL, NR0B1, MAPT, CCR3, GLP1R, BIRC5 and IL-17B(Table S3). Combined with the corresponding regression coefficients, the final IRSS is established:
IRSS=(GAL exp * 0.359 )+(NR0B1 exp * 0.454)+ (MAPT exp*0.495)+(CCR3 exp * 0.641)+(GLP1R exp * 0.306)+(BIRC5 exp * 0.428)+(IL17B exp * 2.100)
The mRNA expression differences of seven genes in the train group were represented by heat maps (Fig. 4C). The risk score for each patient was calculated based on IRSS, and the sample was divided into high-risk and low-risk groups using the median. In the train set, the probability of OS was lower in the high-risk group than in the low-risk group (P < 0.05) (Fig. 4D). As the risk score increased, the survival time decreased and the number of patients in the state of death increased gradually (Fig. 4E). The accuracy of the model in predicting OS of HCC patients was evaluated by ROC curve, and the AUC values at 1, 2 and 3 years were 0.845, 0.823 and 0.808, respectively (Fig. 4F).The mRNA expression differences of 7 prognostic DEIGs in tumor and normal tissues in the TCGA data set (Figure S1).
Table 1
Chi-square test results of the TCGA training set, the test set and the ICGC cohort
Characteristics | | TCGA | | ICGC n = 229 | P-value |
Train (set1) n = 171 | Test (set2) n = 171 | Sum (set3) n = 342 |
Gender | | | | | |
Male | 125(73.1%) | 110(64.3%) | 233(68.1%) | 168(73.4%) | 0.103 |
Female | 46(26.9%) | 61(35.7%) | 109(31.9%) | 61(26.6) | |
Age | | | | | |
<=65 | 108(63.2%) | 109(63.2%) | 216(63.2%) | 88(38.4%) | 0.911 |
> 65 | 63(36.8%) | 62(36.8%) | 126(36.8%) | 141(61.6%) | |
Status | | | | | |
Alive | 113(66.1%) | 110(64.3%) | 223(65.2%) | 189(82.5%) | 0.733 |
Dead | 58(33.9%) | 61(35.7%) | 119(34.8%) | 40(17.5%) | |
Validation of The Risk Score with TCGA and ICGC Datasets
In the test set, according to the risk score calculated by IRSS, the samples were divided into high and low risk groups using the median value. The analysis found that there was a significant difference in survival probability between the high and low risk groups (P < 0.05), and the high risk group had a lower survival rate (Fig. 5A). The AUC values assessed by ROC curve at 1, 2 and 3 years were 0.748, 0.740 and 0.684, respectively(Fig. 5B). The mRNA expression differences of prognostic genes in the test set were represented by heat maps (Figure S2A).
In TCGA sum set, the total sample is also divided into high and low risk groups according to IRSS calculated risk score. Analysis showed that there was a significant difference in overall survival between the two groups (P < 0.05), and the survival rate was lower in the high-risk group (Fig. 5C). The AUC values of 1, 2 and 3 years were 0.798, 0.756 and 0.706, respectively (Fig. 5D). The mRNA expression differences of prognostic genes in TCGA sets were represented by heat maps (Figure S2B).
According to IRSS, ICGC data were divided into high risk group and low risk group, and the survival rate difference between the two groups was also statistically significant (P < 0.05)(Fig. 5E). The AUC values of 1, 2 and 3 years were 0.757, 0.759 and 0.772, respectively (Fig. 5F). After multiple validation, we find that the model has high robustness and accuracy. The mRNA expression differences of prognostic genes in the ICGC dataset were represented by heat maps (Figure S2C). Meanwhile, in order to further verify the reliability of the model, we compared it with four published prediction models of HCC(15–18), and the results showed that our model had higher accuracy and reliability (Fig. 6).
Independent Prognostic Analysis
To verify the reliability of the risk factors, independent prognostic analyses were performed. In univariate Cox analysis, risk score, Stage and T Stage were significantly correlated with OS (P < 0.05)(Fig. 7A). In multivariate Cox analysis, only risk score was confirmed as an independent predictor of OS (Fig. 7B). The above results again demonstrate the stability of the IRSS established by us. In order to further evaluate individual patients, Nomograms were used to simplify the statistical prediction model to comprehensively predict the prognosis of HCC patients by calculating the scores of clinical data and risk scores (Fig. 7C,D).
In addition, we demonstrated the correlation between clinical features and risk in heat map (Fig. 7E). Boxplot was used to show the clinical indicators and risk scores and the differences between the high and low risk groups. There were no significant differences in age, gender, N stage and M stage in the high and low risk groups (P > 0.05). Stage, Grade and T Stage were later with the increase of risk score (Figure S3).
Genomic Alterations Analyses
To determine whether risk score levels were associated with specific genomic traits, CNV and somatic mutation analyses were performed using the TCGA data set. According to the risk score levels, TP53(40%), CTNNB1(21%), TTN(21%) and MUC16(20%) had the highest mutation frequency in the high-risk group (Fig. 8A,B). In the low-risk group, CTNNB1(27%), TTN(26%), MUC16 and TP53(14%) were more frequent, and the mutation rate of TP53 was significantly higher in the high-risk group (Fig. 8C,D). TP53 is a well-known tumor suppressor gene, which is usually associated with poor prognosis(19). Therefore, we also conducted survival analysis on TP53 mutation data, and the results showed that the survival rate of TP53 mutation group was significantly lower (P < 0.05) (Fig. 8E).
Correlation Analysis between Tumor Microenvironment and Stem Cells
Analysis of tumor microenvironment suggested that the expression level of tumor microsatellite instability(MSI) was significantly different in the high-risk group, and higher in the high-risk group (Fig. 9A). In order to explore the relationship between MSI and survival of HCC, we divided the samples into the high MSI expression group and the low MSI expression group according to the expression level of MSI. Survival analysis of the two groups showed that the high MSI expression group had a lower survival rate (P < 0.05)(Fig. 9B), which further confirmed the low survival rate of the high-risk group. There was no significant difference in tumor mutation burden(TMB) between high and low risk groups (Fig. 9C).
In recent years, tumor stem cells have been considered as the root cause of tumor occurrence, metastasis and recurrence. We analyzed the association between risk score and tumor stem cells and found that there was a significant correlation between the two. The higher the risk score, the higher the score of tumor stem cells (P < 0.05)(Fig. 9D). It can be inferred that the higher the risk score, the lower the degree of tumor differentiation.
Immunotherapy Analysis
The infiltration of immune cells and stromal cells in HCC tissues was analyzed according to the risk score groups, and the results showed that there was significant difference in stromal cell score between high-risk and low-risk groups (P < 0.05), but no significant difference in immune cells (P > 0.05)(Fig. 10A). Comparison of TIDE scores showed that there were fewer dysfunction and immune rejection T cells in the high-risk group for HCC (Fig. 10B). In order to explore the relationship between the efficacy of immune checkpoint inhibitors and risk score, the expression of PD-1 and CTLA-4 in the high-risk group was significantly higher than that in the low-risk group (P < 0.05) by analyzing the expression differences of immune checkpoint between the high-risk group and the low-risk group (Fig. 10C,D). There was no significant difference in PD-L1 between the two groups (P > 0.05)(Fig. 10E).
Consensus Clustering of Seven Prognostic Genes
Consensus clustering of the seven prognostic DEIGs identified two clusters of HCC in the TCGA and CGGA datasets with distinct clinical outcomes, clinical features and pathological features(Fig. 11A,B). According to expression similarity, k = 2 was selected with clustering stability rising from k = 2 to 10 in the TCGA and ICGC datasets. A contingency table showed consistency between clustered groups and risk groups in both TCGA and ICGC datasets(Figure S4). In the TCGA and ICGC datasets, the survival difference between the two clusters was significant(Fig. 11C,D). Between groups, PCA distribution was clearly separated in the TCGA and CGGA datasets(Figure S5A,B). The relationship between each subtype and risk score and prognosis is shown in Figure S5C,D.
Differential Analysis of Immune Cell Composition
Firstly, the content of immune cells in each sample in the TCGA dataset was analyzed and shown in a histogram (Fig. 12A). By analyzing these seven immune genes and immune function, we found that risk score was correlated with immune cell regulation, and in the high-risk group, B cells naive, T cells CD4 memory resting, NK cells activated, monocyte, Macrophages M1, Macrophages M2, Mast cells resting significantly increased (P < 0.05), B cells memory, T cells CD4 memory activated, T cells follicular helper, T cells regulatory(Tregs), M0 of Macrophages was significantly decreased (P < 0.05) (Fig. 12B). The correlation between immune cells and 7 prognostic genes is shown in Figure S6. In addition, we used heat maps to show differences in immune cell content between the high and low risk groups (Fig. 12C).