EMT factors screening
To identify EMT differential genes, we screen differentially expressed genes between 451 M0 samples and 85 M1 samples in the TCGA database.378 differentially expressed genes were obtained (Fig. 1, SupplementaryTable2_DEGs_M0M1_genes.txt).
A total of 1184 EMT-related genes (EMT_genes) were identified by database and literature search, and 18 among them are EMT differential genes (EMT_DE_genes). Then we uploaded the 18 EMT differential genes to the metascape website(http://www.metascap e.org/gp/index.html#/main/step1)for KEGG and GO pathway enrichment analysis. (Fig. 2, SupplementaryTable3_ DEEMT_matascape.txt). Then,the genes in these significantly enriched pathways were extracted and named EMT candidate gene set 1 (cd_EMT_geneset1). A total of 2431 genes were extracted (Supplementary Table 4_cd_EMT_geneset1.txt).
We screened the 18 EMT differential genes (EMT_DE_genes) to cluster the disease samples. When k = 3 (Fig. 3,A,B), the samples were divided into three groups. The three groups had significant differences in OS (Fig. 3C). According to the prognosis relationship, we combined two groups. The combined two groups had significant differences in OS (Fig. 3D). Finally, two kinds of samples were obtained (Supplementary Table 5 _cluster.csv)(Tekpli et al. 2019). We screen the differentially expressed genes between the two subpopulations,544 differential genes were obtained, which were named EMT candidate geneset2(cd_EMT_ geneset2, Supplementary Table 6_cd_EMT_geneset2.txt). The expres sion of these genes in the two groups of samples is shown in Fig. 4A. PCA analysis of cancer samples using these differential genes can distinguish the two types of samples to a certain extent (Fig. 4B).
WGCNA co-expression gene module analysis
Then we take and combine EMT_genes,cd_EMT_geneset1,cd_EMT_geneset2 (Fig. 5, SupplementaryTable7_WGCNA_genelist.txt),3555 genes were obtained, of which 3386 had expression data in TCGA-COADREAD. These genes were used for WGCNA. The results show that the co-expression network is scale-free, that is, the log (k) of the node with connectivity K is negatively correlated with the log (P (k)) of the node's probability, and the correlation coefficient is greater than 0.8. To ensure that the network is scale-free, we choose the optimal > = 4 (> 0.85, Fig. 6a).
Next, the expression matrix is transformed into an adjacency matrix, and then the adjacency matrix is transformed into topology matrix. Based on TOM, we use the average-linkage hierarchical clustering method to cluster genes. According to the standard of a hybrid dynamic cut tree, we set the minimum number of genes in each gene network module to 30. After using the dynamic cutting method to determine the gene module, we calculate the eigengenes of each module in turn, and then cluster the modules to combine the modules close to each other into new modules, and then cluster the modules, and merge the closer modules into new modules.
Setting height = 0.25, we get a total of 9 modules (Fig. 6B SupplementaryTable8 _module _ gene.txt). Then we calculate the gene significance value of each gene module (Fig. 6C, SupplementaryTable9_ModuleSignificance.txt).The larger the GS value is, the more relevant the module is to the phenotypic characteristics (MSI). Then the Pearson correlation coefficient (SupplementaryTable10_WGCNA_module_ trait.txt) between each module and MSI was calculated. The higher, the more important the module is (Figure.6D). The lines in Fig. 5D represent each module, the list shows the phenotypic characteristics of the sample, and the red to blue indicates the correlation. The coefficient decreases from high to low. The number in each cell indicates the correlation coefficient between the gene module and the phenotypic characteristics of the sample, and the number in brackets indicates the significance P value.
From the results of the above two analysis modules and phenotypic correlation methods, we finally selected green and yellow as the key modules. The two model genes were used to construct a protein interaction network (Fig. 7A, Supplementary Table 11_ppi.txt and Supplementary Table 11_ppi_node.txt).When the degree of freedom was more than 4, 296 core genes were screened. At the same time, according to module membership (MM) > 0.5 and gene significance (GS) > 0.1, 86 core genes were screened out from these modules (Fig. 7BC). The intersection of the two genes contained 51 genes, which were considered disease hub genes (Fig. 7D). Besides, we have also constructed a multi-factor regulatory network using these disease core genes. The results are shown in Fig. 8(SupplementaryTable12_edge.txt and Supplementary Table 12_node.txt).
Construction of a prognosis-predicting model
Univariate Cox regression analysis of the above 51 genes showed that 5 genes were significantly correlated with prognosis (pvalue < 0.05,SupplementaryTable13_cox Result. txt). Among them, there are three EMT genes (BCL2L1,CXCL9,C5AR1). Figure 9 shows the four genes most significantly associated with prognosis (Fig. 9ABC shows EMT gene, Fig. 9D shows non-EMT gene).
The results of single-factor Cox analysis of these five genes are shown in Fig. 10A. We further used multivariate Cox regression analysis to calculate the weight of these five key prognostic genes and constructed a prognostic model (SupplementaryTable14_mu lti _coef _ gene.txt)
By weighting the expression of these five genes with the regression coefficient, a risk score model was established to predict the survival of patients :
Risk score = BCL2L1 expression level * ( 0.0881) + CXCL9 expression level * (− 0.1451) + C5AR1 expression level * (0.1331) + RGS19 expression level * (0.2672) + VAV3 expression level * (− 0.1024)
We calculated the risk score of each patient and divided all samples into a high-risk group and low-risk group according to the median risk score(SupplementaryTable15_ riskscore_ group.txt).There was a significant difference in the prognosis between the two groups (Fig. 11D). Besides, we also used the model to predict the 0.5-year, 1-year, 3-year and 5-year survival of patients, and the predictive ability was good (Fig. 1E).
Model validation
To verify the model, we analyzed another two sets of public data sets (GSE14333, GSE38832).(Martinez-Romero et al. 2018)
For the first set of GSE14333, all samples were divided into a high-risk group and a low-risk group according to the median risk score (SupplementaryTable16_ GSE14333 _ riskscore_group.txt). There was a significant difference in the prognosis between the two groups (Fig. 12D). We also used the model to predict the 1-year, 3-year, 5-year, and 10-year survival of patients, with a good predictive ability (Fig. 12E).
For the second set of data GSE38832, patients were divided into a high-risk group and low-risk group according to the median cut-off, and there was also a significant difference in prognosis between the two groups (Fig. 12I, Supplementarytable 16 _ GSE38832 _ riskscore_group.txts).The model was used to predict the 1-year, 3-year, 5-year, and 10-year survival of patients, and the prediction ability was also good (Fig. 12J).
On the other hand, we also analyzed the stability of the prediction model in clinical features. The results show that in the age (Fig. 13A, B), N stage (Fig. 13C), M stage (Fig. 13D), T stage (Fig. 13E), and stage (Fig. 13FG), there are still significant differences in the prognosis between high-risk and low-risk groups in these stratified samples of clinical characteristics. Clinical information can be seen at Supplementary Table 17 _clin _infor.txt.
We also selected digestive system cancers (Esophageal cancer, Liver cancer, Gastric cancer, and Pancreatic cancer) from TCGA data, and analyzed the pan-cancer model according to the expression profile data of these cancers. In other words, the expression profile data of these cancers were used to score the risk model, and the optimal cutoff value was obtained according to the R package survminer. Then the samples were divided into high-risk and low-risk groups, and the prognosis difference (OS) between the two groups was compared. The results show that the model is effective in HCC and GC, and the samples can be divided into two subgroups with a significant difference in prognosis. Cox regression analysis results of the risk score in each cancer are shown in SupplementaryTable18.
Correlation analysis of the risk score
Furthermore, we compared the differences of risk scores in clinical feature groups (colon cancer/rectal cancer, left and right colon cancer, stage, lymphatic invasion or not, TNM stage, perineural invasion or not, venous invasion or not, age and gender). The results showed that there were significant differences in risk scores in Stage, PT, PN, and PM stage groups (Fig. 15CEFG)
At the same time, we hope to study the correlation between innate immune escape mechanisms and risk score further. The innate immune escape of tumors indicates that tumor cells directly mediate their immune escape. Some potential determinants of tumor immunogenicity were compared between high and low-risk groups, including mutation load, homologous recombination defect (HRD), neoantigen load, and level of chromosomal instability (Supplementary Table 19 _ publicinfor.txt). The results showed that the risk score was positively correlated with chromosome instability level, homologous recombination defect score and stemness index. We obtained Immune characteristics from GDC (https://gdc.cancer.gov/about-data/publications/panimmune), HRD score from TCGA_ DDR_ Data_ Resources.xlsx.(Knijnenburg et al. 2018)
Besides, we use the CIBERSORT method combined with the LM22 characteristic matrix to further estimate the difference of immune infiltration of 22 immune cell types in high-risk and low-risk groups. The results show that some immune cells have significant differences between high-risk and low-risk groups (Fig. 17, SupplementaryTable20).Then we look at the distribution of stromal score, immune score, and tumor purity among the risk groups, and find that the three have a significant difference in the high and low-risk groups (Figure.18,SupplementaryTable21_estimat e.txt).
To further explore the differences between high and low groups, we analyze the differences of somatic mutations between high and low groups. Figure.19AB shows the distribution of common mutations in CRC in two groups. Figure.19CD shows the distribution of copy number variation regions of CRC in two groups of samples. Figure 19E shows that there is no significant difference in the frequency distribution of CNV between high and low-risk groups (P = 0.72).
Comparative analysis between risk score and immunotherapy and other cancer treatments
To study the effect of immunotherapy in high and low-risk groups, we used the IMmvigor210Cor eBiologies dataset (348 samples, of which 298 samples had immunotherapy results, only 168 samples of bladder tissue without unknown response) to calculate the risk score of each sample. According to the threshold, 168 samples were divided into high and low groups. The two groups also had significant differences in prognosis (SupplementaryTable22_IMvigor 210.txt), and there were significant differences in risk scores among different treatment groups (Fig. 20).
To evaluate the prediction of risk score on chemotherapy effect, we downloaded the chemotherapy data of CRC samples. The results showed that there was a significant difference in chemotherapy effect between high and low groups, and the significance of Fisher's exact test was 0.0476 (Fig. 21A). Moreover, there were significant differences in risk scores among different groups (Fig. 21B). The risk score was used to predict the response to hormone therapy. The measure of the area under the ROC curve is 0.67(Fig. 21C).
Furthermore, we estimated the IC50 values of drugs according to the expression profile by using R packet pRRophetic(Paul et al. 2014), and compared the differences of IC50 values of these drugs between the high-risk group and low-risk group. The results showed that there were significant differences in IC50 between the high-risk group and the low-risk group (Fig. 22).
Independent prognostic factor analysis and nomogram construction of Risk Score
To test whether the prognostic model is an independent prognostic factor for colorectal cancer patients, we first used univariate Cox analysis and found that Age, PT, PN, PM, stage, and risk score were significantly correlated with OS (SupplementaryTable24_cox_ clin.txt). By combining risk score and clinical characteristics. Multivariate Cox analysis showed that age, PT, PM, PN, and Risk scores were significantly correlated with OS (Fig. 23, SupplementaryTable24 _ multicox_ clin.txt).
Finally, to provide clinicians with a quantitative method to predict the prognosis of patients with colorectal cancer, we constructed a prognostic model and nomograms of clinical risk factors, and combined with risk score and clinical characteristics, the predictive effect of the model can be improved to 0.79 (Fig. 24)