Identification of key modules using WGCNA
To find the key modules that had the highest associations with clinical features of HCC, we performed WGCNA on TCGA-LIHC dataset that included clinical data of HCC samples, such as age, metastasis (TNM) stage, sex, node, tumour, and genes downloaded from the GSEA database. We finalized 9 modules with the soft threshold power of 6 (no scale, R2 = 0.85), the cutting height of 0.15, and the minimum number of modules of 50 (Fig. 1A-C) (Fig. 1A-B); non-clustered genes were presented in gray. In the heat map of module-feature correlation, brown and green modules were found to have the highest correlation with clinical features. The brown and green modules included a total of 1534 genes, as shown in Figs. 1E-F. With MM > 0.7 and GS > 0.5, 389 key genes were screened from the brown and green modules (Fig. 1).
Verification of pivot genes in TCGA-LIHC dataset
Totally 389 genes were subjected to UCR analysis for selecting 40 genes associated with the patient's prognosis, and 40 key genes screened were adopted to fit the LASSO Cox-PH model. With the lambda parameter (0.027) obtained during 1000 cross-validations, a combination of 12 genes was acquired. These 12 genes were summarized: AEBP1, ATP2A3, CD3E, CD4, CMTM7, EFEMP1, GMIP, HLA. DPB1, HLA. DRB1, LAMB1, LPCAT4, and PLAT; we then optimized the model through stepwise regression analysis and finally selected 6 genes (ATP2A3, CMTM7, EFEMP1, GMIP, HLA. DPB1, and LAMB1) for establishing predictive models and verifying their prognostic value and their correlation with clinical features (6 genes from LASSO Cox-PH model) (Fig. 2).
Establishment of prognostic scoring models
We calculated the RS for every sample in TCGA cohort. Subsequently, samples in the training cohort were grouped (high-risk and low-risk groups) in the light of their median RS, and the high-risk group in the training cohort had shorter OS than the low-risk one (P < 0.001), with area under the curve (AUC) values of 0.737, 0.671, and 0.753 for 1-, 3-, and 5-year OS, respectively (Fig. 3).
Validation of prognostic scoring models
The RS of each patient in the test group and the external validation group was calculated through the model. The training group and external validation group were grouped (high-risk and low-risk groups), each with a median RS as the cut-off value. The K-M curve was applied for obtaining and comparing the OS between two risk groups. The prediction accuracy of the model was evaluated by the ROC curve analysis, of which OS of the high-risk group in GSE116174 dataset was short (P = 0.0086), and AUCs of 0.77, 0.59, and 0.61 were obtained for 1-, 3-, and 5-year OS. The OS of GSE14520 dataset was short (P = 0.00013), with an AUC of 0.74 for 1-year OS, 0.70 for 3-year OS, and 0.53 for 5-year OS. The OS of ICGC-JP was short (P = 0.00092), with an AUCs of 0.70, and 0.71 for 1-year OS, 3-year OS, and 5-year OS (Fig. 4–6), The 6-gene model constructed based on the TCGA dataset has a good prediction effect in the external data validation set, but due to the particularity of the validation set, individual genes cannot be detected in it.
Differences between clinical phenotypes of different RSs
The clinical phenotypes of tumour data in TCGA database were analyzed, and the differences between the clinical phenotypes of different RSs were calculated, and we it was found that as the tumour level elevated, the RS also increased, and the number of deaths also increased, indicating that the model could indicate the patients’ clinical prognosis (Fig. 7). Due to the small sample size of stage IV patients in stage T and stage in the TCGA database, only 4 cases, this affected the results.
Clinical data analysis
The K-M survival curve was adopted for assessing OS of the risk subgroup in TCGA-LIHC cohort. The predictive sensitivity and specificity of gene traits were evaluated through ROC curve analysis. Through UCR and MCR analyses, whether clinical features and RSs are independent risk factors were determined. A line graph was plotted for forecasting 1-, 3-, and 5-year survival via the "rms" R package. Calibration plots, consistency indices (C exponents), and ROC curves were adopted for evaluating the performance of the model (Fig. 8).
GSVA enrichment analysis
The background set c2.cp.kegg.v7.5.1.symbols.gmt was obtained through the GSEA official website; pathway analysis was conducted with by "gsva" R package, and the Wilcoxon test was conducted for calculating the differences between the high- and low-risk groups, finding the relevant pathways with differences expressed in the high- and low-risk groups and heat map (Fig. 9–10).
Immune infiltration analysis of RSs
The CIBERSORT online tool was adopted for predicting the scores of immune cells in 22 samples. Differences between high- and low-RSs were compared by the Wilcoxon test, and ESTIMATE was adopted for determining the immune infiltration degree. According to the results, the low-risk group showed a higher immune infiltration degree. The TIDE tool was utilized for predicting the TIDE score of the sample, and it was revealed that a higher TIDE score indicates a greater possibility of immune escape (Fig. 11).
Validation of immunotherapy dataset
Using IMcigor210 immunotherapy cohort data, the risk model was established using 6 key genes and it was validated (Fig. 12). According the results, the low-risk group had a higher risk group in the immunotherapy sample. In addition, the KM curve was plotted, and P < 0.05 suggested a notable difference (Fig. 12).