3.1 Molecular immune subtypes in ccRCC patients
Six datasets, GSE15641 (32 ccRCC samples), GSE36895 (29 ccRCC samples), GSE40435 (101 ccRCC samples), GSE46699 (67 ccRCC samples), GSE53757 (72 ccRCC samples) and TCGA (530 ccRCC samples) were downloaded. The gene expression matrix of these six datasets was used to calculate the immune cells enrichment scores (ssGSEA score) by adopting ssGSEA method. The survival analysis of immune cells enrichment scores in TCGA dataset showed that 6 immune cells including activated CD4 T cells, activated CD8 T cells were significant survival-related biomarkers, and the patients with high levels of them had worse OS than those in low levels group (Supplementary Fig. 1). The PCA results of these six datasets gene expression matrix indicated the obvious batch effects since these datasets displayed a significant difference (Supplementary Fig. 2A). But the PCA results of immune cells enrichment scores of these six datasets (normalized ssGSEA score) demonstrated that differences among datasets were eliminated (Supplementary Fig. 2B).
Consensus clustering of a total of 831 ccRCC patients from six datasets using immune cells enrichment scores were performed. Two main immune subtypes were identified and named as subtype1/subtype2 (Fig. 1A). The tracking plots showed that 2 was the best value of subtypes number (Fig. 1B) while cumulative distribution function (CDF) results indicated 3 (Supplementary Fig. 3A-B). Since the sample numbers in subtype3 to subtype6 were too small (Fig. 1B), two main immune subtypes were finally identified. The distribution of immune subtypes (subtype1 and subtype2) among different datasets was shown in Table 1. As shown in Fig. 1C-D, both in the overall survival (OS) and progression-free survival (PFS) analysis, differences in the survival curves between the subtype1 and subtype2 were statistically significant (P.value = 0.027 and P.value = 0.014, respectively). Patients in subtype1 had a better prognosis than subtype2 patients. Subsequently, ssGSEA scores indicated that subtype2 samples were highly infiltrated with innate and adaptive immune cells including B cells, CD8 T cells, CD4 T cells, macrophages, NK cells, and regulatory T cells (Tregs), while subtype1 samples only showed a high level of neutrophils (Fig. 2). Besides, some immune checkpoint therapy biomarkers such as CD8A, PDL1, PD1, and tumor mutational burden (TMB) were also enriched in subtype2 (Supplementary Fig. 4).
Table 1
The distribution of immune subtypes among different datasets.
|
GSE15641
|
GSE36895
|
GSE40435
|
GSE46699
|
GSE53757
|
TCGA
|
ccRCC samples
|
32
|
29
|
101
|
67
|
72
|
530
|
Subtype1
|
18
|
15
|
46
|
38
|
36
|
301
|
Subtype2
|
14
|
14
|
55
|
29
|
36
|
229
|
3.2 DEGs and enriched pathways between immune subtypes
DEGs between immune subtypes were analyzed. A total of 614 DEGs were found to be highly expressed in subtype2 and 522 DEGs were defined as down-regulated DEGs in subtype2. The volcano plot of TCGA cohort was shown in Supplementary Figure 5. Metabolism pathways such as oxidative phosphorylation, fatty acid metabolism, retinol metabolism and tyrosine metabolism were mostly enriched in subtype1 (Supplementary Table 1). On the other hand, Immune-related pathways, involving natural killer cell mediated cytotoxicity, T cell receptor signaling pathway, antigen processing and presentation were mostly enriched in subtype2 (Supplementary Table 2).
3.3 Construction of Co-Expression Network
The TCGA dataset was selected for WGCNA since the clinical information of other datasets were not available, and the expression matrix of DEGs from TCGA was used to construct the co-expression network. Based on the results of scale-free topology fitting indices R2 and mean connectivity (Figure 3A-B), the best value of β was 3 since it could construct a scale-free network. A total of 11 different modules, ranging in size from 30 to 525 genes, was provided by WGCNA results (Figure 3C). Among these modules, the turquoise module was selected since it had the highest correlation value with the immune subtype (correlation = 0.62; P.value < 0.01) (Figure 3D). Besides, the blue module was also selected since it had a significantly negative value of correlation with the immune subtype (correlation = -0.55; P.value < 0.01) (Figure 3D).
3.4 Identification of protein-protein interactions (PPIs) and hub genes
The PPI networks of the turquoise module and the blue module were individually retrieved from the STRING database and then visualized by Cytoscape software. Subsequently, ten hub genes (FOXP3, CTLA4, PTPRC, CD28, CD19, LCK, CD27, CD2, IFNG, and CD5) from the network of the turquoise module were selected with the cut-off value of Degree > 10, using the cytoHubba plug of Cytoscape (Figure 4A). The survival analysis results of these hub genes revealed that high expression levels of CTLA4, FOXP3, IFNG, and CD19 were associated with the worse overall survival (Supplementary Figure 6). Similarly, ten hub genes (AGTR1, CRP, G6PC, IGFBP1, MGAM, PCK1, PLG, REN, SLC5A1, and WT1) from the network of the blue module were identified with the cut-off value of Degree > 4 (Figure 4B). The survival analysis result revealed that high expression levels of three genes (CRP, IGFBP1, and WT1) and low expression levels of seven genes (AGTR1, G6PC, MGAM, PCK1, PLG, REN, and SLC5A1) were associated with the worse overall survival (Supplementary Figure 7).
3.5 Validation of immune subtypes in the independent cohort
The two subtypes were further validated in an external cohort of IMvigor210, using a random forest model. The hub genes (CTLA4, FOXP3, IFNG, and CD19) from turquoise module were selected to construct a random forest model for predicting immune subtype by gene expression. IMvigor210 contains 348 tumor patients who received treatment with the immune checkpoint inhibitor therapy (Atezolizumab). Clinical data of these 348 tumor patients were described in Supplementary Table 3.
The pipeline of machine learning (random forest) workflow was plotted in Figure 5. In the training phase, the input data (TCGA dataset samples with their subtype information and four genes expression matrix) was randomly separated into the training dataset (70%) and the testing dataset (30%). The parameter tuning results showed that 2 and 300 were the best value for ‘mtry’ and ‘ntree’ because of their highest value of AUC (Figure 6A-B). The random forest is then trained with the best parameter (mtry=2, ntree=300). In the testing phase, the AUC value in the testing dataset indicated a good prediction performance with 0.78 (Figure 6C). Subsequently, the immune subtype of patients from IMvigor210 cohort was predicted by their expression data profiles of four hub genes (CTLA4, FOXP3, IFNG, and CD19). Patients in subtype2 behaved better overall response rate to Atezolizumab, about 29%, whereas subtype1 worst ORR, about 16% (Figure 6D). The overall survival analysis results in IMvigor210 cohort confirmed that patients with subtype2 had better prognosis than subtype1 patients (Figure 6E). Consistent with results from TCGA cohort, subtype2 in the IMvigor210 cohort were characterized as high expression of various immunotherapy indicators (CD8A, PDL1, TIGIT, CTLA4, CYT, IFNG, LAG3, PDCD1, TMB) in Supplementary Figure 8.
3.6 Web tool development
Using the RStudio shiny package, a web application (https://immunotype.shinyapps.io/ISPRF/) was built for the prediction of immune subtypes. In this web application, expression profiles of four hub genes (CTLA4, FOXP3, IFNG, and CD19) were required as the input data (Figure 7A). The input data will be sent to the servers where the application pre-processes the data, including four steps: (1) combining the input data with the training dataset; (2) transforming the matrix into the one-hot matrix by the median value of each gene; (3) deleting the training dataset. 4) after pre-processing the input data, this application predicts the probability of immune subtypes using the random forest model. Then, the immune subtype that results in the highest probability is picked as the predicted immune subtype. In Figure 7B, the interface shows an example of predicting the immune subtype by four genes expression.
3.7 Identification of potential drugs for the immunotherapy-resistant subtype
Since patients from immune subtype1 might have a low response rate to immune checkpoint antibodies, some potential drugs are needed. Thus, hub genes in the blue module which had a negative correlation with the immune subtype could be the targets for the identification of potential drugs. According to the above significantly survival prognosis results, three hub genes from the blue module (CRP, IGFBP1, and WT1) were selected for further analysis because of their negative effect on the prognosis. A total of 6 small molecules, 1 monoclonal antibody, and 1 synthetic peptide for targeting these hub genes were provided by the DGIdb website that contained drug-gene interactions (Table 2). Pharmacologic properties of 6 small molecule drugs were unearthed under Discovery Studio 2019 software (Table 3): 1) the aqueous solubility results showed that no drug was characterized with low aqueous solubility ability; 2) only one drug was high penetrant for blood-brain barrier; 3) all small molecules drugs were non-inhibitor of CYP2D6 which was responsible for drug metabolism; 4) 2 drugs, ZINC150338696 and ZINC169294721, were non-toxic drugs based on hepatotoxicity prediction results; 5) 1 drug had the good intestinal absorption level; 6) as to plasma protein binding, 3 drugs were predicted to be absorbent strong. Based on the above results, ZINC169294721 was selected as the potential small molecule drug for immune subtype1 patients since it had the good aqueous-solubility ability and was non-toxic.