Immune-related genes predict better survival in lung cancer cohorts with immunotherapy treatment
To investigate the transcriptional profile that predicts the effect of immunotherapy on lung cancer patients, we searched and collected four published immunotherapy-treated NSCLC cohorts with high-throughput RNA sequencing and disease progression data, including GSE126044, GSE135222, GSE190265, GSE190266. Principal component analysis (PCA) suggested that the data sets were highly separated from each other, indicating a batch effect between different experiments, and was removed using the “ComBat” function of the "sva" packages (Fig. 2A,B). A total of 156 patients with 9336 common genes were retained as integrated data for subsequent studies.
Univariate Cox regression was applied to each gene with a p-value cutoff of 0.05. A total of 584 genes were identified, of which 179 genes were prognostically favorable (hazard ratio < 1) and 405 genes were unfavorable (hazard ratio > 1), significantly influencing disease progression. We then apply gene enrichment analysis based on Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) datasets to the favorable and unfavorable genes, respectively. The favorable genes were mainly enriched in lymphocyte-mediated immunity and leukocyte-mediated immunity pathways in GO datasets and cytokine-cytokine receptor interaction and natural killer cell-mediated cytotoxicity pathways in KEGG datasets (Fig. 2C,D), while the unfavorable genes were mainly enriched in epidermis development and skin development pathways in GO and stem cell and cancer-related pathways in KEGG datasets (Fig. 2E,F).
A highly efficient prognostic model was developed using extensive machine learning based on immune-related genes
Considering the significant genes mainly enriched in immune functions in immunotherapy-treated cohorts, we intersected them with genes of the "immune system process" pathway from GO datasets and found 117 common genes (Fig. 3A). To construct a generalized and robust effect prediction model, we consulted current widely used survival machine learning methods (also known as "learners") and applied the “mlr3” package for unified input and output data, repeatable data splitting, and efficacy comparison. The Kaplan-Meier survival learner was set as the baseline model, while 22 survival machine learners with variable importance estimation were ultimately selected for interpretability.
In this study, we adopted the C-index as the main efficacy indicator. The integrated data were stratified by research and randomly divided into training and testing sets at a ratio of 4:1. All learners were trained, and then prediction was performed on the training and test sets. Meanwhile, 10-fold cross-validation (CV) was used for resampling and prediction. The aggregated C-index of CV was also calculated by “mlr3”. Among 22 learners, Accelerated Oblique Random Forests (AORSF) had the best overall performance in training, testing, and CV sets with C-index of 0.864, 0.748, and 0.647, respectively (Fig. 3B). Therefore, AORSF was finally selected as our survival prediction learner.
The importance rank of genes was ordered by the "anova" method, and the number of genes included in the AORSF model was evaluated in a 10-fold CV. We found that the 38-gene model performed best with the highest aggregated C-index and established the predictor (Fig. 3C). The Kaplan-Meier survival curve demonstrated that the model accurately estimated the progression risk in both the training and test sets (Fig. 3D). The relative importance of the genes is shown in the bar graph, while the confidence interval of the hazard ratio for each gene is shown in the forest plot (Fig. 4A).
The AORSF model accurately distinguishes between populations at different risk for disease progression and the immune environments in which they differ
The risk scores (RS) calculated by the AORSF model accurately predicted the risk of disease progression in both training data (p-value < 0.0001) and test data (p-value = 0.00053). We then applied RS to the entire integrated data and divided the data into high (n = 111) and low (n = 45) risk groups based on the "surv_cutpoint" function to further investigate the transcriptomic differences. Survival analysis confirmed a significantly longer progression-free survival in the low-risk group (p < 0.0001, Fig. 4B).
Gene set variation analysis (GSVA) on the integrated data showed that samples in the low-risk group had a stronger immune activation state as pathways including major histocompatibility complex (MHC)-presenting antigens, macrophage stimulation, antibody-dependent cellular cytotoxicity were enriched. In contrast, samples in the high-risk group had higher scores in cell cycle-related metabolism such as DNA replication synthesis and cyclic nucleotide catabolism (Fig. 4C).
Higher levels of immune checkpoint expression including programmed cell death 1 ligand 1 (PD-L1, also known as CD274), cytotoxic T-lymphocyte associated protein 4 (CTLA-4), hepatitis A virus cellular receptor 2 (HAVCR2), and lymphocyte activating 3 (LAG3) were detected in the low-risk group (Fig. 4D). Stromal score, immune score, and ESTIMATE score calculated by ESTIMATE packages were also higher in the low-risk group (Fig. 4E). In addition, immune cell infiltration estimated by single sample gene set enrichment analysis (ssGSEA) showed that samples in the low-risk group contained a higher proportion of myeloid cells (including macrophages, myeloid-derived suppressor cells, activated dendritic cells), cytotoxic cells (including activated CD8 T cells, effector memory CD8 T cells, natural killer cells), and helper T cells (including type 1 T helper cells, regulatory T cells, Fig. 4F). In summary, the comprehensive evaluation of the microenvironment indicated that samples in the low-risk group were more active in the anti-tumor immune process involving antigen presentation and cytotoxicity.
Genes contributing to different prognoses were expressed by different cells in the tumor microenvironment
We further analyzed the single-cell RNA sequencing data from GSE207422, a lung cancer cohort treated with combined immunotherapy and chemotherapy, for more transcriptomic detail and function of different cell populations. A total of 92330 cells were collected after quality control and clustered into 8 major populations based on their signature genes (Fig. 5A, B). An interesting phenomenon was that the prognostically unfavorable genes were mainly expressed by epithelial cells, while myeloid cells expressed the most favorable genes (Fig. 5C). To quantify the expression level of each cell, we scored the data using the "AUCell" package with unfavorable and favorable genes, respectively. The AUCell score of the cell population validated the transcriptomic profile of epithelial and myeloid cells (Fig. 5D, E).
Prognostically unfavorable genes were mainly expressed by malignant epithelial cells and were predictive of a poorer outcome of immunotherapy in lung cancer
The copy number variation (CNV) status of epithelial cells was scored using the “inferCNV” package (Fig. 5F). Clustered with endothelial cells and fibroblasts by CNV score were identified as normal epithelial cells, while the rest were malignant (Fig. 5G). The AUCell score of unfavorable genes was apparently higher in malignant cells, indicating that these genes characterize malignant cells (Fig. 5H). To investigate intrinsic tumor heterogeneity, we then clustered malignant cells into five tumor cell populations, of which tumor cluster A had the highest unfavorable AUCell scores (Fig. 6A) and was the population we focused on later. Furthermore, samples from patients with stable disease had higher AUCell scores than those with partial remission disease (Fig. 6B), demonstrating that the unfavorable genes predicted poorer outcomes of immunotherapy in lung cancer.
Specific tumor clusters possessed distinct transcriptional programs
Differential genes of each tumor cluster were analyzed by "Findallmarkers" function of "Seurat" package with log fold change threshold set to 0.25 and expressed by at least 10% cells. Top 10 genes with largest change rate were displayed in heat map and volcano graph (Fig. 6C, D). The top 50 genes of each cluster with p-value less than 0.05 were then included in the grouped enrichment analysis. The results showed that tumor cell A and C had the program of skin and epidermis development, keratinocyte differentiation, while tumor cell A was particularly responsive to necrosis factor. Tumor cell B was associated with apoptotic process. Tumor cell D showed a variety of immune responses, consistent with the fact that it was obtained from a patient with a major pathological remission. Tumor cell E was enriched in ribosome-related process (Fig. 6E).
Transcriptional signatures characterized by GDF15 enriching in epithelial migration program in tumor cell A with highest AUCell score
We then applied non-negative matrix factorization (NMF) as our dimension reduction method to all malignant cells to identify their specific characteristics. Malignant cells were clustered into 5 groups according to the results of Seurat's reduction (Fig. 7A). These groups were characterized by 5 factors, of which factor 5 was highly specific for tumor cell A (Fig. 7B). The top 50 genes in factor 5 were enriched in the epithelial cell migration pathway, indicating an active migration program in tumor cell A (Fig. 7C). These genes were then submitted to STRING datasets for protein-protein interaction (PPI) network analysis, in which growth differentiation factor 15 (GDF15) was the hub gene with the highest connectivity (Fig. 7D).
Tumor cell A had an active transcription factor that promotes cell proliferation
To further investigate the origin of these transcriptional differences, we used pySCENIC to calculate transcription factor (TF) regulon activity (Fig. 7E). The 10 most active TFs of tumor cell A included GADD45A, HOXD1, JUN, TFAP4, EGF1, FOSB, FPS, E2F1, EZH2, E2F7. NetworkAnalyst found that E2F1, which is highly associated with cell cycle, DNA repair and cell proliferation, played a central role in these 10 TFs (Fig. 7F).
Tumor cell A had a higher developmental potency
The “cytoTRACE 2” package was used to estimate developmental potency. This package reduced the dimension and calculated the potency score for each cell. Cells of Tumor cells A generally had highest predicted potency scores, indicating that they are least differentiated and have highest stemness (Fig. 8A).
Tumor cell A communicated less with T cells
Besides the internal features, we also quantified the communication strength between malignant cells and CD8 + T cells, as the CD8 + T cells were the direct recognizer and killer in the tumor environment (Fig. 8B). “CellChat” packages showed that major histocompatibility complex I (MHC-I) was the main pathway that malignant cells sent to T cells (Fig. 8C). In detail, tumor cells A had relatively weak interactions between CD8 and HLA, indicating that they had less opportunity to be recognized by CD8 + T cells (Fig. 8D).
Prognostically favorable genes were mainly expressed by CXCL9 + macrophages
We then focused on myeloid cells as they had the highest AUCell score of prognostically favorable genes. Myeloid cells were clustered into dendritic cells, monocytes, and macrophages based on their characteristic markers (Fig. 9A,B). A group of macrophages expressing high levels of CXCL9 and CXCL10 had the highest AUCell score and was selected as the target for further investigation (Fig. 9C,D). Group analysis showed that myeloid cells from patients with partial remission disease had a higher AUCell score of favorable genes than those with stable disease (Fig. 9E,F).
We then identified the differential genes of each myeloid cluster and used them as input for the enrichment analysis (Fig. 9G). CXCL9 + macrophages, the population we focused on, mainly activated immune-related pathways including bacterial response, adaptive immune response and immune response activating signaling. Compared to other populations, CXCL9 + macrophages exclusively activated the pathways of mononuclear cell migration, response to type II interferon, and cellular response to chemokine. These results demonstrate an important role for CXCL9 + macrophages in the anti-tumor immune response (Fig. 9H).
Macrophages are classical antigen-presenting cells that signal T cells for local recruitment and activation. Therefore, Cellchat was used for macrophage-T cell communication (Fig. 10A), which showed the strongest MHC I-CD8A signals from CXCL9 + macrophages to CD8 + T cells and natural killer (NK) cells or gamma delta T cells (Fig. 10D-G), and a relatively strong MHC II-CD4 signal from CXCL9 + macrophages to CD4 + T cells (Fig. 10B-C). The signal strength indicated that CXCL9 + macrophages had the strongest ability to stimulate T cells/NK cells.
The pattern of tumor cell A remarkably predicted the overall survival of melanoma patients with immunotherapy
An immunotherapy melanoma cohort - GSE91061 was used as external validation data. GDF15, the hub gene of tumor cell A, was observed at a lower expression level in both pre- and on-immunotherapy patients with better survival (Fig. 11A,B). To quantify the infiltration of cells with transcriptional programs similar to tumor cell A, we applied BayesPrism to reconstruct the cell fraction of the melanoma cohort using the single-cell dataset GSE207422. Both pre- and on-treatment patients with a lower proportion of tumor cell A-like cells also had better survival (Fig. 11C,D). However, the proportion of CXCL9 + macrophages did not predict melanoma prognosis (Fig. 11E, F).