3.1 Integration and clustering of 1.TMB, MSI, and CNV information
We downloaded CNV, mutation maf file, and MSI information of COAD in TCGA database, and the corresponding clinical information. There were 379 samples with CNV information, mutation information, MSI information, and mRNA matrix information, of which 359 had complete survival information. According to the maf file, the TMB (Fig. 1A, Supplemented Table 1) of each sample is calculated first, and then three kinds of data are integrated to cluster 379 samples using the "factoextra" function package. samples can be clustered into three categories (Supplemented Table 2). PCA analysis shows that the samples can be well divided into three categories (Fig. 1B), and PCA results are visualized in three dimensions (Fig. 1C). Comparing PCA results with cluster analysis results, it is found that the classification results are consistent, which shows that our clustering is more accurate. The survival analysis among the three clusters shows that there are significant differences among the subgroups (Fig. 1D). A comparative analysis of TMB and MSI among the three types of samples shows that the TMB of Cluster3 is significantly higher than that of the other two types of samples, and the proportion of MSI-H samples in Cluster3 is also significantly higher than that of the other two types of samples (Fig. 1E-F).
Table 2
The best cutoff value of 8 prognostic gene
Gene
|
Cutpoint
|
Statistic
|
HYAL1
|
6.247927513
|
2.575825733
|
SPINK4
|
8.243173983
|
3.735550973
|
EREG
|
6.022367813
|
2.601903964
|
VWDE
|
4.247927513
|
3.885960058
|
ITLN1
|
8.592457037
|
3.467487375
|
ZBTB7C
|
9.317412614
|
2.644323788
|
KLK12
|
3.169925001
|
2.730170447
|
B3GNT6
|
6.285402219
|
3.632614081
|
3.2 Analysis of characteristic differences and immune infiltration among subgroups
The mutations among the three subgroups were analyzed, and the results were shown in Fig. 2A, Fig. 2B and Fig. 2C respectively. Immunoinfiltration analysis was carried out on the samples of three subtypes, and the display diagram of immune infiltration cells among the three types of samples is shown in Fig. 3A-C. Furthermore, the infiltration content of each immune cell in three groups of samples was compared. The results showed that compared with cluster2, cluster3 expression content of Macrophages.M0 in Cluster3 decreased, the expression content of Macrophages.M1 increased, and the expression content of Macrophages.M2 in cluster 1 decreased. Compared with cluster3, the expression of NK. Cells. activated in cluster3,cluster1 decreased, the expression of Plasma. cells increased, and the expression of t. cells. folliculous. helper decreased in cluster1. Compared with cluster 2, the expression level of NK. cells. restoring is higher in cluster1 and cluster 3, and the expression level of t. cells. CD4. memory. restoring is higher in cluster3 (Fig. 4A-D). Comparing and analyzing the expression of several important immune checkpoints in three groups of samples, the results showed that compared with cluster3, CTLA4 expression content of CTLA4, PDL1, LAG3, TIGIT, and IDO1 genes in cluster1 and cluster2 samples decreased, and the difference was statistically significant. The expression of TDO2 had no significant difference among the three groups (Fig. 5A-F).
3.3 Screening and enrichment analysis results of differential genes
According to the analysis of the difference of prognosis among the three subgroups, we found that the prognosis of patients with cluster1 and 3 was worse than that of patients with cluster2, and the difference was statistically significant; There is no significant difference in prognosis between cluster1 and cluster3 patients. Therefore, we combined the patients of cluster1 and cluster3 and analyzed the differential genes with those of cluster2. Comparing the gene expression differences between Cluster1 + Cluster3 samples and Cluster2 samples, we got 128 differentially expressed genes, including 51 up-regulated genes and 77 down-regulated genes (Fig. 5G). GO enrichment analysis is carried out on the differential genes, The results showed that the differential genes were mainly related to anion transport, Negative Regulation of Cell Proliferation, Multicellular Organic Homeostasis, Humoral Immunization Response, Organic anion transport, positive regulation of defense response, epigenetic cell differentiation and tissue homeostasis are related to biological activities (Fig. 5H-I). KEGG enrichment analysis showed that the differential genes were mainly enriched in alpha-Linolenic acid metabolism, linolenic acid metabolism, Fat digestion and absorption, Ether lipid metabolism, Arachidonic acid metabolism, Wnt signaling pathway, and etherlipid metabolism (Fig. 5J). The protein-protein interaction analysis results are shown in Fig. 5K-M. The results show that these differential genes can be divided into four interaction modules. CCK、GPR143、TAC1、NPFFR1、GNG4、MUC5A、MUC5B、MUC6、DRD2、KY6G6D and other genes are the core genes and interact most closely with other targets.
Comparing the difference in the expression of differential genes between the two groups, it was found that the expression of differential genes between Cluster1 + Cluster3 samples and Cluster2 samples was significantly different (Fig. 6A).
3.4 Construction and verification of prognosis model
Univariate TCGA regression analysis was carried out with 128 differential gene expression values as continuous variables, and the Hazard ratio(HR) of each gene was calculated. Eight genes were obtained by screening with P value < 0.05 as the threshold, and the survival analysis of these eight genes was carried out (Fig. 6B-Fig. 6I). The best cutoff value is shown in Table 2. COX regression analysis showed that the HR value of 7 of these 8 genes was less than 1, which was beneficial to the prognosis. There is one gene whose HR value is greater than 1, which is a dangerous gene and unfavorable to prognosis (Fig. 7A). Then, the selected eight genes were analyzed by LASSO regression. according to the lambda values corresponding to different gene numbers in LASSO analysis, we determined that the optimal number of genes was eight (Fig. 7B, lambda value was the smallest), and the eight genes were HYAL1, SPINK4, EREG, VWDE, ITLN1, ZBTB7C, KLK12, B3GNT6.
According to the expression level of each gene and the regression coefficient of LASSO Cox regression analysis, a risk-scoring model for predicting the survival of patients is established,
RiskScore=(-0.0725*HYAL1)+(-0.0151*SPINK4)+(-0.1859*EREG)+(0.1787*VWDE)+(-0.0226*ITLN1)+(-0.0926*ZBTB7 C)+(-0.0231*KLK12)+(-0.0000407*B3GNT6)。 We calculated the risk score of each patient and divided the samples of TCGA data set and GEO verification set into high-risk group and low-risk group according to the median. Survival analysis found that in TCGA and GEO data sets, the overall survival of high-risk colon cancer samples was worse than that of low-risk patients (Fig. 7C and Fig. 7D). It shows that the risk model can effectively predict the prognosis of colon cancer patients in both data sets. Generally speaking, the results show that the Risk Score calculated by the evaluation model constructed by HYAL1, SPINK4, EREG, VWDE, ITLN1, ZBTB7C, KLK12, and B3GNT6 can better predict the prognosis of colon cancer patients.
3.5 Risk Score is an independent prognostic marker of colon cancer
To further explore the prognostic value of Risk Score in colon cancer samples with different clinicopathological factors (including age, TNM Stage and gender), we regrouped colon cancer patients according to these factors and performed Kaplan-Meier survival analysis. It was found that among patients with Stage I + StageII and StageIII + StageIV, patients with low risk score had better prognosis than those with high risk score (Fig. 7E-F). Among patients aged > 67 and ≤ 67, patients with low risk score had better prognosis than those with high risk score (Fig. 7G-H). In female and male patients, the overall survival rate of was significantly lower in the high-risk group than in the low-risk group (Fig. 7I-J). The difference was statistically significant. These results indicate that the Risk Score can be used as an independent index to predict the prognosis of colon cancer patients.
Then, we included age, MSI classification, TNM Stage, gender and Risk Score for multivariate Cox regression analysis to determine whether the Risk Score is an independent prognostic indicator. The result is shown in Fig. 8A. We found that the Risk Score, TNM Stage and age were still significantly correlated with overall survival, and the samples with high risk score had a higher death risk and were adverse prognostic factors (HR = 2.63, 95%CI:1.83–3.8, P < 0.001).
3.6.Nomogram model can better predict the prognosis and survival of patients
The Nomogram model (Fig. 8B) was successfully constructed by using three independent prognostic factors: age, TNM Stage and Risk Score. For each patient, draw up three lines to determine the Points obtained from each factor in Nomogram. The sum of these Points is located on the "Total Points" axis, and then a line is drawn down from the “Total Points” axis to determine the survival probability of colon cancer patients for 1, 3 and 5 years. The corrected curve in the calibration chart is close to the ideal curve (a 45-degree line passing through the origin of the coordinate axis with slope 1). The result shows that the predicted result of the model is basically consistent with the actual result (Fig. 8C-E).
3.7. Immune status of colon cancer patients in high and low risk groups
CIBERSORT method and LM22 feature matrix were used to estimate the difference of immune infiltration among 22 kinds of immune cells in colon cancer patients with high and low risk groups, and the result was visualized (Fig. 9A-B). The result shows that there was a significant difference in the infiltration ratio of T cell CD4 memory resting immune cells between high and low risk groups, and its expression level was significantly higher in low risk groups which suggesting that the expression of T cell CD4 memory resting is related to the improvement of prognosis.
3.8.qPCR and immunohistochemical verification
There was no significant difference in baseline data between patients with Stage I and III (Table 3). Results of real-time quantitative PCR detection of selected 8 genes listed in Fig. 9C. The expression of HYAL1、SPINK4、EREG、ITLN1、ZBTB7C、KLK12、B3GNT6 were lower in tumor tissues of Stage III patients than in Stage I. Compared with Stage I patients, the expression of VWDE was higher in Stage III patients. Through immunohistochemical detection, we found that compared with patients with Stage I colon cancer, VWDE was expressed in higher levels in the tumor tissues of patients with Stage III colon cancer; Compared with patients with Stage I colon cancer, HYAL1、SPINK4、ITLN1、KLK12 and B3GNT6 were lower in tumor tissues of patients with Stage III colon cancer (Fig. 10). The experimental verification results were basically consistent with the previous data analysis results.
Table 3
Comparison of baseline data of patients with stage I and stage III colon cancer
Group/Character
|
age
|
male
|
female
|
Pathological type
|
stage I
|
63.5 ± 2.5
|
26
|
24
|
Adenocarcinoma
|
stage III
|
65.1 ± 1.5
|
28
|
22
|
Adenocarcinoma
|
P value
|
> 0.05
|
> 0.05
|