3.1 Acquisition of the gene expression and clinical data
We downloaded the TCGA-LIHC dataset from The Cancer Genome Atlas (http://portal.gdc.cancer.gov/). The TCGA-LIHC dataset included 334 samples, and all samples included data regarding the RFS time and censoring status. The GSE76427 dataset was downloaded from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/gov/). The GSE76427 dataset included 115 samples from HCC patients, but 7 patients had missing information regarding the RFS time and censoring status. Thus, 108 samples were included in this study. The median RFS times in the TCGA and GSE76427 series were 390 and 252 days, respectively, and the two datasets contained clinical information, such as sex, age, and the TNM stage.
3.2 Genes associated with RFS
We used the “survfit” function in the “survival” package and found 1331 genes associated with RFS. Then, to explore the genetic biological implications, we analyzed the 1331 genes through Gene Ontology (GO) functional and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. As shown in Figure 1, in the KEGG analysis, we found that these genes are enriched in signaling pathways, such as the cell cycle, homologous recombination, DNA replication, the Fanconi anemia pathway, complement and coagulation cascades, and the T cell receptor signaling pathway.
3.3 Construction of the prognostic model in TCGA-LIHC
Then, “rbsurv” was used to identify seven genes to construct the risk score system. The seven genes included in the system were TTK protein kinase (TTK), chromosome 16 open reading frame 54 (C16orf54), phosphoribosyl pyrophosphate amido transferase (PPAT), CD3e molecule associated protein (CD3EAP), solute carrier organic anion transporter family member 2A1 (SLCO2A1), acetyl-CoA acetyltransferase 1 (ACAT1), and growth-arrest specific 2 like 3 (GAS2L3) (Table 1).
The risk score was calculated with the following formula: risk score= (-0.038)*expression of TTK+(-0.357)*expression of C16orf54+0.634*expression of PPAT+0.221*expression of CD3EAP+(-0.076)*expression of SLCO2A1+(-0.184)*expression of ACAT1+0.277*expression of GAS2L3.
In total, 334 patients were divided into two groups (134 high-risk patients and 200 low-risk patients) using a cut-off of 4.97976 for the risk score. Furthermore, the survival curve revealed that the RFS in the high-risk group was significantly poorer than that in the low-risk group (p<0.0001; Figure 2).
3.4 Validation of the prognostic model in GSE76427
We validated the risk score system in the GSE76427 cohort. In total, 108 patients were divided into two groups (45 high-risk patients and 63 low-risk patients) using a cut-off of 3.4144 for the risk score. Furthermore, the survival curve revealed that the RFS in the high-risk group was significantly poorer than that in the low-risk group (p=0.011; Figure 3). In summary, these results indicate that the prognostic model has moderate sensitivity and specificity.
3.5 Association between the prognostic model and the clinical characteristics of the patients
While assessing the correlation between the seven-gene signature and the clinical characteristics of the HCC patients, we found that a high risk score was significantly correlated with the TNM stage (p<0.001), grade (p=0.001), and AFP (p=0.014) but was not significantly associated with the sex, age, BMI, or Child-Pugh score of the patients with HCC (Table 2). In GSE76427, the results showed that the 7-gene signature was not significantly associated with sex, age, BCLC (Barcelona Clinic Liver Cancer) or the TNM stage (Table 3).
3.6 Independent prognostic role of the prognostic gene signature
Moreover, the results of the multivariate Cox regression analysis showed that the TNM stage (HR=1.680, p<0.001) and our prognostic model (HR=3.607, p<0.001) were both independent factors of RFS among the 334 TCGA-LIHC patients. However, among the 108 patients in the GSE76427 cohort, the TNM stage was not an independent prognostic factor for RFS [24]. The prognostic model (HR=2.407, p=0.014) was also an independent factor for RFS (Figure 4). In addition, we performed univariate and
3.7 Comparison of the TNM stage model and BCLC model
To compare the accuracy of the prognostic model and the TNM model, we calculated the AUCs of the 12-month, 15-month, and 18-month RFS. In the TCGA-LIHC dataset, the prognostic model’s AUCs of the 12-month, 15-month, and 18-month RFS were 0.7768, 0.7934, and 0.7529, and the TNM model’s AUCs of the 12-month, 15-month, and 18-month RFS were 0.6884, 0.7026, and 0.6721, respectively (Figure 5). In the GSE76427 dataset, the prognostic model’s AUCs of the 12-month, 15-month, and 18-month RFS were 0.6159, 0.6118, and 0.6217, and the TNM model’s AUCs of the 12-month, 15-month, and 18-month RFS were 0.6122, 0.6009, and 0.5762, respectively. In addition, the BCLC model’s AUCs of the 12-month, 15-month, and 18-month RFS were 0.5669, 0.5627, and 0.5684, respectively (Table 5). Overall, our prognostic model showed a benefit in predicting the RFS, which might help doctors with targeted treatment (Figure 6).
3.8 Development of the calibration curve
We calculated the C-index and drew calibration curves for the 12-, 15- and 18-month survival predictions to evaluate the calibration in the TCGA-LIHC dataset and the GSE76427 dataset. The C-index of the TCGA-LIHC dataset and GSE76427 dataset was 0.717 and 0.647, respectively, as shown in Figures 7 and 8.
3.9 External validation in an online database
The representative protein expression levels of SLCO2A1, PPAT, GAS2L3, CD3EAP, and ACAT1 were explored in the Human Protein Profiles. Then, we explored the TTK, C16orf54, PPAT, CD3EAP, SLCO2A1, ACAT1, and GAS2L3 genes in the CBioPortal for cancer genomics. TTK exhibited the most frequent genetic alterations (3%), and deep deletion was the most frequent alteration. The second most altered gene was CD3EAP (1.3%), and the most frequent alterations were amplification mutations (Figure 9). The expression levels of the seven genes in different cancers are shown in Figure 10. In summary, the aberrant expression of these seven genes may explain some of the abnormal expression of these genes.