Identification of robust and stable DEGs
After screening and checking the detailed sample information and the matrix design of these three downloaded datasets, we included a total of 177 iCCA patients (32 from TCGA-CHOL, 77 from E-MTAB-6389, and 68 from GSE26566) with gene expression data in our study. Figure 1 shows the detailed workflow for identification, validation, and functional analyses of the intersecting DEGs. As shown in Figure 2A and C, we used R package “TCGA-biolinks” to process the raw matrix data of TCGA-CHOL for background correction and normalization. Similarly, we used the R package “oligo” to process the raw matrix data of E-MTAB-6389. As shown in Figure 2B and D, we identified one sample (JFE-058G-HTA) as an outlier in the process of sample by sample correlation after normalization; this outlier was excluded from E-MTAB-6398 in the subsequent analyses. A total of 2,696 intersecting, stable DEGs were identified between the iCCA samples and matched normal tissue samples (Figure 2E). As shown in Figure 2F, 1,470 DEGs were upregulated and 1,226 were downregulated in iCCA patients.
WGCNA and identification of hub genes
After extraction of the intersecting DEGs with expression values from the E-MTAB-6389 dataset, we conducted WGCNA to identify key genes modules significantly associated with clinical traits of iCCA patients. As shown in Figure 3A–B, in the analysis of network topology for selection of the optimal soft-threshold (power), a soft-threshold value of 3 (the lowest power for scale-free topology that fits an index of 0.9) was identified to construct a hierarchical clustering dendrogram. As shown in Figure 3C, no outliers sample was found in the sample clustering analysis. By setting the dendrogram height cut-off value to 0.25 for module merging, eight modules were generated (Figure 3D–E). From the heatmap of module-trait relationships (Figure 3F), we found that the black module was significantly correlated with alcohol drinking (correlation coefficient = 0.37, P = 9e-04) and cirrhosis of the liver (correlation coefficient = 0.32, P = 0.004). The brown (correlation coefficient = 0.25, P = 0.03) and pink (correlation coefficient = -0.28, P = 0.01) modules were significantly correlated with the OS of iCCA patients. Supplementary Figure 1A, which depicts the distribution of average GS and errors across modules associated with the Gleason score of iCCA, shows that the black module had the highest average GS among these modules. As shown in Supplementary Figure 1B–D, there were 1,307 DEGs in the black module, 232 DEGs in the brown module, and 192 DEGs in the pink module. The turquoise module contained genes that could not be placed in any module; these data were excluded from the subsequent analyses. According to the MM and GS values of each gene, we identified a total of 85 hub genes from the black and pink modules for further functional analyses.
GSEA and construction of the protein-protein interaction network
To find significant enrichment gene sets/functional pathways involved in the pathogenesis of iCCA, we conducted GSEA on the E-MTAB-6389 dataset. As shown in Figure 3G, the “hallmark-mitotic-spindle” functional pathway (NES = 1.55, NOM p-val = 0.002) was significantly upregulated in iCCA compared with normal tissue; the corresponding heatmap of this functional pathway is shown in Figure 3H. By using the STRING database (version 11.0) and Cytoscape software, we constructed protein-protein interaction network for the selected hub genes (Figure 4A).
Functional enrichment analysis
The GO enrichment analysis of cellar components showed that the top five components were collagen-containing extracellular matrix (ECM), collagen trimer, microbody lumen, microbody part and peroxisomal matrix (Supplementary Figure 2A). In terms of biological process associated with those hub genes, alpha-amino acid metabolic process was the most significantly enriched functional process (Supplementary Figure 2B). In the KEGG pathway analysis, these genes were predominantly associated with glycine, serine and threonine metabolism, retinol metabolism, cysteine and methionine metabolism, beta-alanine metabolism and primary bile acid biosynthesis (Supplementary Figure 2C). Based on the results of PPI and functional analyses, as well as a literature review, we finally selected FTCD, BTD, FER, FETUB, F13B, COL12A1, COL1A2, COL1A1and COL5A2 for survival analyses. We constructed a heatmap visualizing the expression status of these nine hub genes in 31 normal tissue and 77 iCCA tissue samples from the E-MTAB-6389 dataset. We also visualized the clinical traits (alcohol drinking, OS, cirrhosis of the liver and sample types) in these 108 samples. As shown in Figure 4B, BTD, FTCD, FETUB and F13B were downregulated in iCCA samples, while FER, COL1A1, COL1A2, COL12A1, COL5A2 were upregulated.
Establishment of a three-gene signature for predicting the prognosis of iCCA in the training cohort
The characteristics of iCCA patients from the TCGA-CHOL and E-MTAB-6389 datasets are summarized in Table 1. We randomly divided the E-MTAB-6389 dataset into the training cohort (n = 38) and validation cohort 1 (n = 39). Univariate Cox proportional hazard regression analysis was performed to identify the hub genes significantly associated with the OS of iCCA patients from the training cohort. As a result, a total of six hub genes (SULF1, LAMA4, COL12A1, FER, BTD and CTH) were found to be significantly associated with the OS of iCCA patients (P < 0.05). Subsequently, multivariate Cox regression was performed for these six hub genes as well as the clinical characteristics of iCCA patients. As shown in Supplementary Figure 3A–C, BTD, FER and COL12A1, as well as the cirrhosis of liver were identified as independent prognostic factors for iCCA patients. As shown in Figure 5A–C, BTD was identified as a favorable prognostic factor for iCCA patients, while COL12A1 and FER were unfavorable prognostic factors. Based on the results of the multivariate Cox regression analysis, we constructed a risk model for predicting the OS of iCCA patients. The risk score (RS) formula = BTD´1.927 + COL12A1´(-2.429) + FER´(-1.313). As shown in Supplementary Figure 3D, the RS was identified as an independent prognostic factor for iCCA patients. In addition, as shown in Figure 5D, iCCA patients with high RS had an unfavorable outcome compared with that of patients with low RS (P = 0.00047).
Validation of the independent prognostic value of the three-gene signature in iCCA patients from two validation cohorts
The prognostic capability of this three-gene signature (RS model) was validated in both the internal (validation cohort 1 from E-MTAB-6389) and external (validation cohort 2 from TCGA-CHOL) validation cohorts. Based on the expression value of BTD, COL12A1 and FER, the RS was calculated according to the formula described previously. In validation cohort 1, we found that iCCA patients with high RS had a poor outcome (P = 0.006) compared with that of patients with low RS (Figure 6A). As shown in Figure 6B, we also found that iCCA patients with high RS in validation cohort 2 had an unfavorable outcome compared with that of patients with low RS (P = 0.019). In addition, the results of multivariate Cox regression applied in both validation cohort 1 (Figure 6C; P = 0.038) and validation cohort 2 (Figure 6D; P = 0.015) indicated that this three-gene signature (RS model) was an independent risk factor for OS of iCCA patients.
Evaluation of the predictive capability of the three-gene signature using ROC analysis
We constructed ROC curves to assess the performance of the three-gene signature in predicting survival of iCCA patients in validation cohort 1. As shown in Figure 6E, the AUC of the three-gene signature was 70.8% (53.7%–80.9%), indicating that the three-gene signature had higher prognostic accuracy compared with that of FER (60.2% [41.2%–79.3%]), BTD (64.6% [44.7%–84.5%]), and COL12A1 (60.2% [40.5%–80.0%]) alone. Furthermore, as shown in Figure 6F, compared with the AUC values for biomarkers such as MUC1 (48.1% [28.5%–67.7%]) and MUC13 (68.0% [50.75–85.5%]), which have been identified as risk prognostic factors of iCCA patients [20, 21], this three-gene signature (70.8% [53.7%–80.9%]) had higher prognostic accuracy.
Validation of the expression status of the hub genes at the translational level
As shown in Supplementary Figure 4A–F, the immunohistochemical (IHC) staining analysis of iCCA and normal tissue samples obtained from The Human Protein Atlas database (https://www.proteinatlas.org/) showed that FER and COL12A1 were positively detected in iCCA tissue, whereas BTD was not. None of these three hub genes were detected in the bile duct cells in normal liver tissue. The expression status of these three hub genes at the protein level was consistent with that at the transcriptional level. These results demonstrate the validity of the hub genes identified in this study.
The relationship between the expression level of the hub genes and their methylation status
To clarify the potential mechanism of these three hub genes that were found to be abnormally expressed in iCCA tissue (FER and COL12A1 were upregulated, while BTD was downregulated in iCCA), we screened the MEXPRESS (http://mexpress.be) database to investigate the association of the expression level of these three hub genes with the methylation status of the corresponding promoter region. As shown in Supplementary Figures 5–7, the DNA sequences of FER, BTD and COL12A1 contained numerous methylation sites, which were negatively correlated with their expression level.
The association of the expression level of the hub gene with the degree of infiltration by tumor-infiltrating immune cells and tumor purity
We used TIMER (https://cistrome.shinyapps.io/timer/) to explore the potential associations of the expression level of these three hub genes with tumor purity, as well as the degree of infiltration by tumor-infiltrating immune cells (B cells, CD8+ T cells, CD4+ T cells, macrophages, neutrophils and dendritic cells). Unfortunately, no significant associations were found between these three hub genes and tumor purity (Supplementary Figures 8A–C). Conversely, moderate or strong positive correlations were found between FER and the degree of infiltration by CD4+ T cells, macrophages, and neutrophils (Supplementary Figures 8A). Weak or moderate associations were found between COL12A1 and the degree of infiltration by B cells, macrophages, neutrophils and dendritic cells (Supplementary Figures 8B). However, no significant associations were observed between the infiltration of tumor-infiltrating immune cells and BTD expression level (Supplementary Figures 8C). These results indicated that BTD and FER regulate the degree of infiltration by tumor-infiltrating immune cells in iCCA.