Different pathological grades of HCC in the TCGA cohort
We obtained the expression and clinical data of 367 hepatocellular carcinomas, 3 fibrolamellar carcinomas, 7 hepatobiliary mixed carcinomas, and 50 adjacent normal controls from the TCGA-LIHC project (Fig. 1A). We first investigated the association between the histological grades of HCC (Fig. 1B, G1, G2, G3, and G4) and clinical outcomes of liver cancer. As shown in Fig. 1C, there was no statistically significant difference in overall survival (OS) and disease-free survival (DFS) between different histological grades (P value for Log-rank analysis > 0.05). Figure 1D specifically shows the clinical pathological stages (stage I, II, III-IV) and TNM staging of HCC cases. As expected, stage III-IV or T4 patients had the worst prognosis, whereas stage I or T1 patients had a better prognosis (Fig. 1E, P < 0.001). Therefore, our study focused on HCC cases.
We analyzed the correlation between different pathological stage (stage I, II, III-IV) and clinical indicators. As shown in Figure S1A-F, the total bilirubin, albumin, fetoprotein, and platelet count indicators, but not creatine and protherombin time, showed a statistically significant association with the different HCC pathological grades. In addition, we did not observe a correlation between HCC pathological grades and other factors, including age, height, weight, race, cluster, and gender (Figure S1G-L). Our study aims at analyzing the gene expression and mutational profiles associated with different clinical pathological grades of HCC in the TCGA cohort.
Differential gene screening
First, we attempted to screen the genes that exhibit the increment or decrement trend in the normal, stage I, stage II, and stage III-IV groups. A range of differential genes between three comparison groups, including Tumor vs. Normal, stage II vs. stage III, stage III+IV vs. stage II, were identified, using the “EdgeR” package. Figure S2A presents the volcano plots for the above three sets. Then, we performed the intersection analysis of the up-regulated and down-regulated genes. As shown in Figure S2B-C, twelve up-regulated genes were screened, but no down-regulated genes were obtained. The twelve up-regulated genes did not show the protein interaction relationship (Figure S2D) and mainly exist in stage III+IV, but not with a high proportion (Figure S2D). Figure S2E shows the full name information of these genes.
Next, based on the GEPIA online database, we analyzed the expression levels of these genes in normal and tumor, and different pathological grades of HCC. As shown in Figure S3A-B, except for the CRTAC1 gene, other genes were highly expressed in the tumor group, compared with the normal control group. However, only the gene expressions of DUCX2, IQCA1, PCSK1, HOXB9, KCNH2, and NPTX1 were statistically correlated with the stage I-IV distribution. The results of the overall survival and disease-free survival analysis further indicated that the high expression levels of CUZD1 and IQCA1 were associated with poor prognosis of HCC (Figure S3C).
Copy number variation analysis
We performed the somatic copy number variation (CNV) profile and identified a total of 16,644 genes with CNV from the TCGA HCC dataset. And the Circos 2D track plot for the CNV distribution in the chromosomes was shown in Figure 2A. Then, we utilized the Kolmogorov-Smirnov test to analyze the correlation between CNV and gene expression and obtained a series of genes. After the GO and KEGG analysis, we found that most of these genes were involved in the cell division or cell cycle processes, such as organelle fission, nuclear division, and spindle location (Figure 2B-E). For instance, cell cycle-associated CCNE2 gene in the Tumor, stage I, and stage II group exhibits the single deletion (sd) and single gain (sg) mutations, which are correlated with the gene expression of CCNE2 protein. However, the GADD45G expression level in HCC cases is higher than the negative controls, hinting the presence of other potential gene expression inhibition mechanisms (Figure 2G). Figure S4 shows some CNV-driven genes involved in the cell cycle pathway.
Protein-protein interaction network analysis
Based on the above identified genes, we constructed a protein-protein interaction (PPI) network using the “STRINGdb” package and “Cytoscape” software. We also performed the "Molecular Complex Detection" (MCODE) modular analysis to screen some hub genes within the PPI network. Figure 3A-B shows the two modules with the highest ratings. We further found that the expression levels of these hub genes within the two modules were statistically correlated with CNV. And the cell cycle-related genes, such as TTK, CDC20 and ASPM, were highly expressed and exhibited a significant positive correlation with CNV (Figure 3C).
Genetic mutation analysis
We downloaded the HCC-related simple nucleotide variation data from the TCGA database, and selected the top 15 genes with the most frequent mutation frequency, such as TTN, TP53, CTNNB1, MUC16, and ALB, to map the waterfall with clinical grading information. As shown in Figure S7A, the mutation types of these genes are mainly non-synonymous mutations; however, the gene mutation frequency in 286 HCC patients with mutations is not associated with the clinical pathology of HCC (stage I, II, III-IV). In addition, the high expression of CTNNB1 protein was related to the CTNNB1 mutation in overall HCC, stage I, II, III-IV groups (Figure S7B). TP53 gene mutation was associated with the reduced expression of TP53 in the overall HCC, stage I, II groups (Figure S7B). The OBSCN mutation also correlated with the low expression of OBSCN in overall HCC and specific stage I groups (Figure S7B).
We further conducted the waterfall map analysis on the above-mentioned CENPF, ASPM, MELK, TTK, GADD45G, CDC20, CCNE2, and other interesting genes, and did not observe the correlation between the low mutation frequency of these genes and pathological stages or gene expression, although mainly non-synonymous mutations as well (Figure S6). In addition, variations in the CTNNB1, TP53, TTN, and OBSCN genes were not found to be linked to the clinical prognosis in different pathological grades (Figure S7-8).
Next, we extracted the SNP data of HCC from the TCGA cohort and found that the rs121913396, rs121913400, rs121913407 SNP of CTNNB1 and rs28934571 SNP of TP53 gene were relatively high frequency (Figure S9A). There are more than ten types of SNP for CTNNB1 gene (Figure S9B). Compared with the wild type group, the CTNNB1 gene with rs121913396 and rs121913400 showed a higher expression level (Figure S9C). Nevertheless, we did not observe the positive correlation between the rs121913396, rs121913400, rs121913407 of CTNNB1 gene, and HCC clinical prognosis (Figure S10A-C). Although we did not detect the relationship between TP53 rs28934571 and gene expression (Figure S9C), the prognosis of AA and CA genotypes of TP53 rs28934571 was poorer than that of wild type (Figure S10D).
Random forest and decision tree analysis
We combined the above clinical, mutation and expression information to perform the random forest modeling analysis. Multiple dimension scale plot in Figure S11A suggested the effective classification of negative normal and overall HCC group. AUC value of ROC equals to 0.956, indicating high classification accuracy (Figure S11B). We also showed the feature vectors extracted from the classification model in Figure S11C-D, and obtained the largely contributed genes, such as ECM1, FCN2, ANGPTL6, OIT3, ADAMTS13 and LRRC14. Next, we performed the decision tree modeling analysis, based on the above genes. We first randomly selected 260 HCC cases for modeling, and then test other 125 cases, and finally found that the predicted rate of the genes was larger than 90% (Figure S11E). Meanwhile, we compared the expression of these genes in 50 HCC patients with adjacent non-tumor and found that ECM1, FCN2, ANGPTL6, OIT3 genes in overall HCC tissue showed the higher expression level than the adjacent non-tumor tissue (Figure S11F, P < 0.0001).
Subsequently, we performed random forest modeling with different pathological stages, which is primarily based on TNM information. To prove the validity of this classification method, we performed random forest and decision tree modeling without removing TNM information and found that T1 and T2 information can effectively distinguish stage I, II, III-IV with the AUC value of 0.994 in ROC (Figure 4A-C). Then, we removed the TNM information for random forest modeling and found that the classification effect was reduced (Figure 4D-E, AUC=0.675). Figure 4F-G shows the genes that contribute significantly to the classification model. We then performed the decision tree analysis with 210 cases for training and 116 cases for testing. The result showed a prediction accuracy of 56.0% (Figure 4H). Compared with 50 normal adjacent controls, VIPR1, FAM99A, and GNA14 genes were down-regulated in 50 HCC tissues (Figure 4I, P<0.0001), while CEP55, SEMA3F, and PRR11 genes were highly expressed (P <0.0001). The expression of these genes is closely related to different pathological stages (Figure 4J).
Principal component analysis
Finally, we used principal component analysis (PCA) to screen the target genes associated with different pathological grades of HCC. As shown in Figure 5A, the calculated variances of the principal component (PC) 1, 2, and 3 equaled to 9.4%, 8.1%, and 6.3%, respectively. Based on the PC1/2 (Figure 5B) and PC1/2/3 (Figure 5C), we can effectively classificate the negative normal and overall HCC groups, but not the stage I, II, III-IV groups. Figure 5D shows the top 10 genes that contributing mainly to PC1 and PC2. We analyzed the expression level of these genes between HCC tissue and adjacent normal tissue, or in different pathological stages. As shown in Figure 5E-F, compared with normal tissue, the SLC27A5, ALDH2, and DCXR genes were down-regulated (P<0.0001), while LAMTOR4 (P =0.003), SNRPA (P <0.0001) SNRPD2 (P <0.0001) genes were highly expressed, in overall HCC tissues. In addition, the expression of the SLC27A5, ADAM17, SNRPA, SNRPD2, and ALDH2 genes was associated with different pathologic stages (Figure 5E-F).
HLivH060PG02 HCC cohort analysis
After the above analyses of TCGA cohort, we obtained a series of HCC pathological grade-associated genes. We further assessed the survival prognosis value and of these genes through GEO database (data not shown), and analyzed the research status of genes through the on-line PubMed database retrieval. Thus, seven interesting genes, including GAS2L3, CUZD1, SNRPA, SNPRD2, SEMA3F, IQCA1, OIT3, were screened out. We analyzed the expression difference of these genes between the HCC tissues and corresponding adjacent normal tissues, in the Chinese HLivH060PG02 HCC cohort. Unfortunately, due to the lower amplification efficiency of the IQCA1 and OIT3, we finally analyzed the remaining five genes, namely GAS2L3, CUZD1, SNRPA, SNPRD2, and SEMA3F. Figure 6A illustrates the correlation between GAS2L3, SNRPA, SNRPD2 and the prognosis of HCC, as example. As shown in Figure 6B, compared with adjacent normal tissues, we observed a highly expressed level of GAS2L3 (P =0.036), SNRPA (P<0.001), and SNRPD2 (P=0.002) genes in HCC tissues. Moreover, these three genes in pathologic stage III showed a higher expression trend then that in stage III, but statistical significance was only detected for the GAS2L3 gene (P=0.013). Considering the small sample size, we do not rule out the correlation between SNRPA, SNRPD2 genes and pathological stage of HCC.