3.1 Screening of TEX-exo genes
The DEGs between normal exosome and colorectal cancer exosome samples was calculated by the “limma” algorithm, and 1016 DEGs were obtained (Fig. 1a). The expression difference of normal and tumor samples in TCGA-COADREAD was calculated by the same method, and 4352 DEGs were obtained (Fig. 1b). Taking the intersection of the DEGs of the above two resulted in 276 differential exo-genes.
The T cell exhaustion-related enrichment scores for each sample in TCGA-COADREAD are generated based on 40 T cell exhaustion-related genes with the "GSVA" algorithm, as detailed in Supplementary Table 2. Spearman correlation was calculated between the expression of 276 exo-genes and T cell exhaust-related enrichment score in TCGA-COADREAD, and 179 genes with significant correlation (p < 0.05) were retained as the TEX-exo genes for further analysis. Figure 1c shows 8 genes with the highest correlation among the genes with significant positive correlation, including DOCK2, LSP1, HCLS1, NCKAP1L, GIMAP1, ARHGEF6, GYPC and ARHGAP25.
3.2 Functional enrichment analysis
To understand the biological mechanisms related to T cell exhaustion in exosomes, KEGG and GO enrichment of TEX-exo genes were analyzed, and the results are shown in Fig. 2a. The biological process of GO enrichment showed that these TEX-exo genes were mainly distributed in the lymphocyte proliferation, mononuclear cell proliferation, leukocyte proliferation, and so forth. The celluar component of GO enrichment suggested that these TEX-exo genes were involved in the cell cortex, secrebory granule lumen, spliceosomal snRhP complex and so on. The molecular function of GO enrichment indicated that these TEX-exo genes enriched in the carbohydrate binding, MHC protein complex binding and other functions. In addition, the KEGG results demonstrated that these TEX-exo genes enriched in Hematopoietic cell lineage, Intestinal immune network for IgA production and other pathways.
To further confirm the function of proteins encoded by TEX-exo genes, we download COAD and READ level 4 protein expression data from the TCGA database (https://www.tcpaportal.org/tcpa/). Among the 179 TEX-exo genes, only the MS4A1 gene was able to correspond to its matching encoded protein, CD20. Therefore, We further compared the differences of CD20 in six different clinical feature groups (age, sex, stage, colon polyps, lymphatic invasion and venous invasion) of COAD and READ. There were no significant differences in all six groups. The results can be seen in Supplementary Fig. 1.
To gain further insight into the genetic variation of the TEX-exo genes, we also analyzed the incidence of copy number variation (CNV) and somatic mutations in the 179 TEX-exo genes. First of all, Fig. 2b showed the genes with single nucleotide variation (SNV) frequency above 3%. Among them, DOCK2 exhibited the highest mutation frequency, followed by AKAP12. Then, investigation of the frequency of CNV revealed that CNV was widespread in the 179 TEX-exo genes, for example, there were extensive amplification in IGF2 and MMP9, and extensive deletion in TCEA3 and STMN1, as shown in Fig. 2c. This systematically revealed genetic variation in TEX-exo genes, illustrating the potential pathogenic landscape of these genes in CRC.
3.3 Building the signature model
Univariate Cox regression analysis was used to evaluate the effect of TEX-exo genes on OS. Nineteen genes were found to be significant prognostic factors (p < 0.05) (Fig. 3a). Of these genes, seven (HR > 1; RBPMS2, AKAP12, SCARA3, TIMP1, HSPA1A, RBMX2 and C1orf35) were risk factors for OS and 12 (HR < 1; PPP1R16B, MS4A1, POU2AF1, BIRC3, FKBP5, DNASE1L3, CTNND1, FAM177B, SULT1B1, CEP70, UQCRFS1 and PSRC1) were protective factors for OS. Survival analysis was performed on these 19 prognostic factors in tumor samples, and the median expression level was used as the cut-off value to divide the high and low expression groups to compare the influence of expression level on prognosis. Figure 3b showed the survival curves of 8 genes with the most significant p value (p < 0.05). There are 4 genes with low expression that have a better prognosis (HR < 1; HSAP1A, RBMX2, C1orf35 and SCARA3). On the contrary, 4 genes with high expression have a superior prognosis (HR > 1; UQCRFS1, MS4A1, FAM177B and CEP70).
Given the prognostic value of the 19 genes, we would like to build a more reliable and useful predictive model from these genes to guide clinical practice in colorectal cancer patients. The expression profiles of the 19 genes above were analyzed by LASSO-Cox regression analysis, and the prognostic model was constructed. 16 signature genes were identified based on the optimal value of λ 0.0087, as shown in Fig. 3c-e. Furthermore, the regression coefficients of each signature gene were shown in Table 1. The results of univariate COX analysis of these 16 signature genes in OS are displayed in the forest plot of Fig. 3f, and all of their expressions were strongly associated with prognosis.
Table.1 Regression coefficients for 16 signature gene.
Symbol | Coef |
C1orf35 | 0.619111702 |
TIMP1 | 0.281871688 |
CTNND1 | 0.215980486 |
HSPA1A | 0.170078448 |
RBMX2 | 0.1222027 |
CEP70 | 0.088315003 |
MS4A1 | 0.03913604 |
POU2AF1 | 0.032869455 |
RBPMS2 | 0.004424399 |
AKAP12 | -0.010104709 |
DNASE1L3 | -0.0677533 |
SULT1B1 | -0.08324181 |
FAM177B | -0.089640959 |
UQCRFS1 | -0.151520918 |
FKBP5 | -0.206617298 |
PSRC1 | -0.401296281 |
Based on the coefficients calculated for each gene by the LASSO-COX algorithm and the expression values of the 16 signature genes for each CRC patient, we were able to compute the Score for each patient according to the formula in the Methods section. Patients were divided into two groups with high Score and low Score according to the median value (Fig. 4a). Survival times for each high Score patient and low Score patient were displayed in Fig. 4b. Overall, patients with high Score had the shorter survival time and those with low Score had the longer survival time. Moreover, the expression of the 16 signature genes in patients with high and low Score was exhibited in Fig. 4c. Kaplan-Meier curve showed that the survival probability of patients with higher Scores was significantly lower than that of patients with lower Scores (HR = 2.97; 95% CI = 1.98–4.47; P < 0.001; Fig. 4d). The area under ROC curve (AUC) was 0.712 in year 1, 0.676 in year 3, and 0.721 in year 5, which means that this risk scoring model has a high value for prognostic guidance (Fig. 4e).
In order to verify the stability of the model, the same algorithm was used to analyze the verification set (GSE41258). Similarly, the validation set patients were categorized into high and low Score based on the median Score (Fig. 4f), whereas the survival time of the high Score patients remained longer overall, while the survival time of the low Score patients continued to be shorter on the whole (Fig. 4g). In addition, the expression of 16 signature genes in high and low Score patients was demonstrated in Fig. 4h. Consistent with results obtained in the training set TCGA-COADREAD cohort, patients in the higher Score group had a shorter survival time compared to those in the lower Score group (HR = 1.75; 95% CI = 1.18–2.61; P = 0.005; Fig. 4i). In addition, in terms of survival effect prediction, the 1-year AUC was 0.724, 3-year AUC was 0.640, and 5-year AUC was 0.603, which reaffirmed the excellent efficacy of this risk score model in predicting prognosis (Fig. 4j).
In order to exclude the influence of other factors, univariate and multivariate Cox analyses were used to determine whether Score was an independent prognostic factor for OS. In univariate Cox analysis, Score obtained by the training set TCGA-COADREAD cohort was significantly correlated with OS (High vs. Low; HR = 2.97, 95% CI = 1.98–4.47, P < 0.001; Fig. 5a). After adjusting for other confounding factors, multivariate Cox analysis showed that Score was still an independent predictor of OS (High vs. Low; HR = 3.47, 95% CI = 1.74–6.93, P < 0.001), as shown in Fig. 5b.
Similarly, univariate and multivariate Cox analysis was used in the verification set. In univariate Cox analysis, Score obtained by the validation set cohort was significantly correlated with OS (High vs. Low; HR = 1.75, 95% CI = 1.18–2.61, P = 0.006; Fig. 5c). After adjusting for other confounding factors, multivariate Cox analysis showed that Score was still an independent predictor of OS (High vs. Low; HR = 1.52, 95% CI = 1.01–2.29, P = 0.043), as shown in Fig. 5d.
3.4 Nomogram construction
We combined Score with five different clinical features (age, sex, stage, lymphatic invasion and venous invasion) to construct a nomogram to predict the probability of 1-year, 3-year, and 5-year OS. Each factor was assigned in proportion to its risk contribution to survival (Fig. 6a), and calibration curves showed that the combined model (nomogram) showed high accuracy over 1 -, 3 -, and 5-year OS (Fig. 6b). Compared with a single prognostic factor, the nomogram constructed using a combination model may be a better predictor of patients' OS (Fig. 6c). We analyzed the robustness of Score in different clinical features, and the results were shown in Fig. 6d. In most clinical groups, we observed that the prognosis of the patients with low Scores was better than that of the patients with high Scores (HR < 1, P < 0.05). We also compared the differences in Scores among six different clinical feature groups, as shown in Fig. 6e. Scores were higher in stage III/IV patients than in stage I/II patients, in patients with lymphatic invasion than in patients without lymphatic invasion, and in patients with venous invasion than in patients without venous invasion. Additionally, there were no significant differences in Scores among different age, sex and colon polyps.
3.5 Single-cell analysis
The role and value of the Score was further validated using the single-cell analysis of colorectal cancer. The single-cell analysis of 23 colorectal cancer tumor samples showed that 40,000 cells were retained from the original 47,285 cells after quality control. Figure 7a and Fig. 7b showed the landscape of cell distribution before batch effect and after debatching effect, respectively. The distribution of the six cell types (B cells, Epithelial cells, Mast cells, Myeloids, Stromal cells and T cells) can be derived from the gene expression values of each cell (Fig. 7c). The same formula was used to calculate the Score of each cell, and the results were displayed by using UMAP, as shown in Fig. 7d. Among the six cell types, Stromal cells had the highest Scores, while T cells had the lowest (Fig. 7e). The differences in Scores between the other five cells and the Epithelial cells were shown in Fig. 7f. Among them, compared with Epithelial cells, the Scores of Mast cells, Myeloids and Stromal cells were relatively high (log2FC > 0), while the Scores of B cells and T cells were relatively low (log2FC < 0).
3.6 Evaluation of TME
We further explored the tumor immune microenvironment between the high and low Scores groups with the aim of identifying the underlying immune mechanisms. Different immune cell subsets were quantified using CIBERSORTx, and rank sum tests were used to compare the significance of infiltration degree between groups. We found that the infiltration degree of B cells memory, Dendritic cells activated Macrophages M0 cells, Plasma cells, T cells CD4 memory resting, and T cells regulatory were significantly different in higer and lower Score groups. The results are shown in Fig. 8a.
We also investigated the differences in the expression of immune checkpoints in the higher and lower Score groups. Expression of all active, inhibit and two-side immune checkpoint genes differed significantly between high and low Scores (Fig. 8b and Supplementary Fig. 2).
We also explored whether prognostic signature could predict a patient's response to immune checkpoint blocking therapy. Immunotherapy data for a metastatic melanoma was used [19], it was found that patients with high Scores have a better prognosis than those with low Scores (P = 0.038; Fig. 8c). Furthermore, patients who responded to immunotherapy drugs had higher Scores than those who did not respond to the drugs (P = 0.026; Fig. 8d). Among the high Scores patients, 28.57% responded to immunotherapy, while 71.43% did not. In addition, all patients in the low Scores patients did not respond to immunotherapy (Fig. 8e). Finally, the high Scores group had higher TIDE scores than the low Scores group in TCGA-COADREAD (P = 3.9e-11; Fig. 8f). In conclusion, the immunotherapy effect of the high Scores group was better than that of the low Scores group. Therefore, Scores can be used as a novel marker to guide immunotherapy in colorectal cancer patients.