3.1 Screening and Identification of NK Clusters in scRNA-seq Data
The flowchart depicting the design of this study is presented in Fig. 1. After an initial screening of the scRNA-seq data, 23,673 cells were obtained. Following log-normalization and dimensionality reduction, we identified 20 subclusters. Subsequently, utilizing five marker genes (NKG7, KLRD1, KLRB1GNLY, and GZMB), we successfully identified 680 NK populations (Fig. S1A, B).
To refine our analysis, we further clustered and downscaled the cells within the 680 NK populations, ultimately identifying five distinct NK clusters (Figure S1C, D). By observing the TSNE plots of the four sample distributions (Fig. 2A), we were able to generate and focus on these five NK clusters for subsequent analysis (Fig. 2B).
Examining the expression profiles of the 737 differentially expressed genes (DEGs) within the five NK clusters, we identified the top five DEGs, which were considered as marker genes for the NKF clusters (Fig. 2C). The distribution ratio of these five clusters within each cohort is illustrated in Fig. 2D.
Moreover, through KEGG analysis (Fig. 2E), we discovered that these DEGs were significantly enriched in various pathways, including the TNF signaling pathway, T-cell receptor signaling pathway, and NK cell-mediated cytotoxicity. Additionally, based on the CNV characteristics, we observed that the five NK clusters comprised a total of 680 tumor cells and normal cells (Fig. 2F).
3.2 Exploration of Cancer-Related Pathways in NK Clusters
To gain further insights into the relationship between NK clusters and tumor progression, we conducted GSVA scoring of 10 pathways associated with tumorigenesis within the 5 NK clusters (Fig. 3A). Notably, the proportion of malignant cells in NK_0, NK_1, and NK_3 clusters was significantly higher compared to the other two clusters (Fig. 3B), and there were significant differences between each cluster.
Furthermore, we analyzed the GSVA scores of the 10 tumor-associated pathways in both malignant and non-malignant cells within each NK cluster. Interestingly, we observed significant differences in the RAS pathway within the NK_0 cluster and the NOTCH pathway within the NK_1 cluster (Fig. 3C-G).
To investigate the potential correlation between NK clusters and prognosis, we initially computed the ssGSEA scores of marker genes for each NK cluster utilizing the TCGA cohort. Strikingly, the results revealed that all five NK clusters exhibited higher scores in normal samples compared to tumor samples (Fig. 4A).
Next, we categorized the LUAD samples from the TCGA dataset into high and low NK score groups based on optimal cutoff values determined by the survminer R package. Intriguingly, within the high-NK score group, the samples belonging to NK_1, NK_2, NK_3, and NK_4 groups displayed a more favorable prognosis compared to the low-NK score group. However, it is noteworthy that the prognosis of LUAD was not associated with the NK_0 cluster (Fig. 4B-F). Although there were differences in NK_0 enrichment between LUAD and normal samples, our findings suggest that the contribution of NK_0 clusters to LUAD progression is minimal.
3.3 Identification of NK cell-related core genes
To identify core genes related to NK cells, we conducted a comprehensive analysis by screening differentially expressed genes (DEGs) between tumor tissues and normal tissues. Out of the 19,495 DEGs identified, 12,525 were up-regulated and 6,970 were down-regulated. Remarkably, 725 genes exhibited significant correlation with prognostically relevant NK clusters. Subsequently, we performed one-way Cox regression analysis to assess the prognostic value of each gene, resulting in the identification of 133 genes with prognostic significance (Fig. 5A, B).
Enrichment analysis of these genes revealed their involvement in various biological processes (BP), cellular components (CC), molecular functions (MF), and KEGG pathways (Fig. 5C). However, it is important to note that the discussion section should be completed first to provide context and specific results before filling in the details of the enrichment analysis.
To further refine the gene selection, we employed Lasso Cox regression analysis, which ultimately filtered down to 13 genes with a lambda value of 0.0311 (Fig. 5D, E). These 13 genes, namely ANOS1, ADGRE3, ADAMTS8, CCT3, CD79A, CX3CR1, DPEP2, FMO3, FBP1, OASL, RRAS, SELENOP, and SFTPC, were included in the risk signature after multifactorial Cox regression analysis using stepwise regression (Fig. 5F). The final risk signature formula consists of the following 13 genes with corresponding coefficients: RiskScore = "-0.121 * CD79A − 0.056 * DPEP2 + 0.094 * OASL + 0.363 * RRAS + 0.052 * ADAMTS8–0.282 * ADGRE3–0.039 * ANOS1 + 0.155 * CCT3 + 0.075 * CX3CR1–0.145 * FBP1–0.115 * FMO3–0.087 * SELENOP − 0.021 * SFTPC".
After Z-mean normalization, we calculated the risk scores for each sample and subsequently classified them into high and low-risk groups.
In the TCGA cohort, the area under the curve (AUC) values for 1- to 5-year survival ranged from 0.66 to 0.71. Similarly, in the GSE3141 cohort, the AUC values ranged from 0.64 to 0.72, while in the GSE31210 cohort, they ranged from 0.65 to 0.69. In the GSE37745 cohort, the AUC values ranged from 0.57 to 0.71, and in the GSE50081 cohort, they ranged from 0.70 to 0.77. Lastly, in the GSE68465 cohort, the AUC values ranged from 0.60 to 0.68 (Fig. 6A-F).
Furthermore, the Kaplan-Meier survival analyses demonstrated a similar pattern of survival outcomes in both the TCGA and GEO cohorts. High-risk patients exhibited significantly worse survival outcomes compared to low-risk patients (Fig. 6A-F).
3.4 Mutation and pathway analysis of key genes
In our investigation of the risk signature genes, we examined SNV mutations in these 13 genes. Interestingly, all genes except RRAS exhibited SNV mutations in a greater number of samples (Fig. 7A). We further analyzed the probability of co-occurrence of mutations between these key genes and the 10 most frequently mutated genes. Figure 7B demonstrates that while no significant mutation co-occurrence was observed among these 12 genes, significant co-occurrence was observed between ADGRE3 and TTN, CX3CR1 and MUC16, RYR2 and ZFHX4, ANOS1 and LRP1B, DPEP2 and CSMD3, as well as RYR2 and ZFHX4. We also used OncogenicPathways to count known oncogenic signalling pathways in the TCGA cohort (Fig. 7C).
To gain further insights into the association between these risk genes and LUAD, we investigated their correlations with various molecular features of LUAD. Remarkably, most of the genes exhibited significant negative associations with aneuploidy scores, homologous recombination defects, score alterations, fragment counts, and non-silencing mutation rates. Notably, CCTS was the only gene significantly positively associated with these molecular features (Fig. 7D).
Additionally, we explored the potential pathways associated with each risk gene. The analysis revealed significant associations between these 13 genes and a total of 21 pathways. Noteworthy pathways included the B cell receptor signaling pathway, the chemokine signaling pathway, and leukocyte migration (Fig. 7E, 7F).
3.5 Analysis of key genes in relation to immunity
Our data analysis revealed notable findings regarding the relationship between key genes and immunity. Specifically, we observed that CCT3 exhibited a significant negative correlation with matrix score, immune score, and estimated score. Conversely, the remaining 12 genes displayed significant positive correlations with these scores (Fig. 8A).
To further explore these associations, we categorized each gene based on its median expression and compared the three scores among different expression groups. Notably, for the ANOS1 gene, the high-expression group exhibited significantly lower scores compared to the low-expression group, while the high-expression groups of the remaining genes displayed significantly higher scores compared to the low-expression group (Fig. 8B).
Furthermore, our correlation analysis demonstrated significant associations between the 13 key genes and various immune cells, including B cells, T cells, and NK cells (Fig. 8C). Moreover, we observed significant differences in several immune cells between the high and low expression groups of the risk genes (Fig. 8D).
3.6 Sensitivity analysis of risk signature for PD-L1 blockade immunotherapy
The emergence of T-cell immunotherapy as a promising approach for cancer treatment has highlighted the importance of evaluating the prognostic value of risk characteristics in the context of immune checkpoint therapy. To assess this, we examined the IMvigor210 and GSE78220 cohorts.
Within the IMvigor210 cohort, consisting of 348 patients with varying responses to anti-PD-L1 receptor blockers, including complete remission (CR), partial remission (PR), stable disease (SD), and progressive disease (PD), our analysis revealed compelling results. Patients in the low-risk group experienced a significant clinical benefit and demonstrated significantly longer overall survival compared to those in the high-risk group (Fig. 9A, p = 0.0017). Furthermore, the risk scores were higher in patients with SD/PD compared to those with CR/PR (Fig. 9B). Additionally, the high-risk group exhibited a higher percentage of SD/PD compared to the low-risk group (Fig. 9C).
Intriguingly, the survival analysis demonstrated a significant difference between stage I + II patients (Fig. 9D, p = 0.041) and stage III + IV patients (Fig. 9E, p = 0.018) within different risk groups. This suggests that risk scores were particularly sensitive in distinguishing patients with different stages of LUAD.
Similarly, in the GSE78220 cohort, we observed that patients in the low-risk group had significantly longer overall survival compared to those in the high-risk group (Fig. 9F, p = 0.0067). Notably, patients with PD exhibited higher risk scores compared to those with CR/PR (Fig. 9G), and the high-risk group displayed a higher percentage of PD cases than the low-risk group.
3.7 Independent risk factor identification and histogram construction
In order to enhance the predictive performance of risk features, we conducted univariate and multivariate COX regression analyses to integrate clinicopathological factors and risk scores. The results from the multivariate COX regression analysis revealed that the risk signature (HR = 2.69, 95% CI: 1.93–3.74, P < 0.001) and metastatic status (HR = 2.06, 95% CI: 1.39–3.05, P = 0.002) were the most crucial independent prognostic factors for LUAD.
Based on these findings, we developed a histogram that incorporated both staging and risk scores (Fig. 10A-C). This histogram demonstrated effectiveness in accurately predicting actual survival outcomes, as confirmed by the calibration curve (Fig. 10D). Furthermore, the decision curve analysis (DCA) indicated that the discriminative power of the histogram in identifying high-risk patients surpassed that of the risk score and staging alone (Fig. 10E). Additionally, the time-dependent receiver operating characteristic (TimeROC) analysis demonstrated that both the risk score and the histogram exhibited higher area under the curve (AUC) values compared to other metrics in the TCGA cohort (Fig. 10F).