3.1 Differences in Transcriptomic Profiles of Gastric Tissues in Individuals with and without GC and identification of survival-related genes
According to the cut-off criteria (fold change > 2 and adjusted P value < 0.05), a total of 3911 differentially expressed genes (2021 up-regulated and 1901 down-regulated) were identified in the gastric tissues of the training cohort (Fig. 1A, Supplementary Table S1). Unsupervised clustering based on these differentially expressed gene profiles indicated that the classification of samples was largely consistent with their respective groups (Fig. 1B), suggesting that the gastric tissues of cancer patients exhibited distinct transcriptomic profiles compared to the normal group. Using univariate Cox regression analysis, we identified 43 LMAGs that were significantly associated with survival in the training cohort (p < 0.05). Most of these genes were found to be poor prognostic factors in GC (Fig. 1C). The correlation of their expression levels is shown in Fig. 1D, with the majority displaying a significant positive correlation. The chromosomal distribution of these LMAGs is shown in Fig. 1E, with most located on chromosomes 1 to 12. CNV refers to alterations in the number of copies of a specific gene or genomic region, frequently occurring in tumors due to the various forms of genomic variations, including chromosomal CNVs, within tumor cells. CNV gain indicates an increase in the number of copies of a region, usually due to repeated DNA replication in that area. CNV loss indicates a decrease in the number of copies of a region, often due to the loss of DNA in that area[19]. In tumors, CNV gains or losses are often associated with tumor development and treatment response. For example, CNV gain may lead to the overexpression of oncogenes, promoting tumor growth and spread, whereas CNV loss may result in decreased expression of tumor suppressor genes, facilitating tumor progression. Therefore, detecting CNV gains or losses is crucial for diagnosis, prognosis evaluation, and treatment selection[20]. We found that in GC patients, most LMAGs exhibited stronger CNV gains than losses, suggesting their involvement in tumor growth and spread, contributing to poor prognosis (Fig. 1F).
3.2 Gene clustering analysis
Consensus clustering analysis was conducted on the TCGA GC dataset consisting of 335 samples. The CDF curve and the area under the curve changes indicated that the optimal number of clusters is 2 (Fig. 2A). All GC patients were divided into ARGcluster A and ARGcluster B. PCA significantly distinguished between these two molecular subtypes, indicating the reliability of the clustering (Fig. 2B). Survival analysis of the training cohort showed that the prognosis of the ARGcluster A group was significantly worse than that of the ARGcluster B group (p < 0.001) (Fig. 2C). Furthermore, there were significant differences in the expression of LMAGs between ARGcluster A and ARGcluster B groups (p < 0.05) (Fig. 2D). The heatmap of unsupervised clustering of the top 100 differentially expressed genes in GC patients also indicated distinct transcriptional profiles between ARGcluster A and ARGcluster B. Moreover, the analysis of the correlation between the two clusters and clinical characteristics indicated that the relationship of these top 100 DEGs with TNM stage, age, and gender was not significant (p > 0.05) (Fig. 3E). Next, we used "CIBERSORT" to analyze the relative proportions of 22 types of immune cells. As shown in Fig. 2F, 18 types of immune cells had different infiltration levels between ARGcluster A and ARGcluster B. There was a higher infiltration of immature B cells, natural killer T cells, natural killer cells, plasmacytoid dendritic cells, regulatory T cells, follicular helper T cells, and type 1 helper T cells in the ARGcluster A group, suggesting that immune infiltration may play an important role in poor prognosis.
3.3 Enrichment analysis
Functional annotation of differentially expressed genes was performed. The results showed that gene sets related to signaling, cardiovascular diseases, cellular outcomes, and metabolism were primarily enriched in ARGcluster A, while ARGcluster B was enriched not only in metabolism-related pathways but also in DNA replication and repair-related pathways (Fig. 2G). Gene set enrichment analysis indicated that metabolism-related gene sets such as "Calcium signaling pathway", "Dilated cardiomyopathy", "DNA replication", "ECM-receptor interaction" and "Focal adhesion" were predominantly enriched in ARGcluster B (Supplementary Fig. 1).
To avoid potential overfitting, we conducted LASSO Cox regression analysis (Fig. 3A). Ultimately, we selected 14 essential prognostic LMAGs for modeling based on the following formula:
Risk Score=[Expression (SERPINE1)×(0.159821454227009)] +[Expression (MLXIPL) ×(− 0.100918248708407)]+ [Expression (PDK4)×(0.11526650006906)]+
[Expression (BPI)×(− 0.176600664517854)]+[Expression (PROC) ×(0.146951174843476)] +[Expression (TNFRSF9) ×(− 0.429986834334003)]+[Expression (TYRP1) ×(0.132383812828114)] +[Expression (CALCR) ×(0.487315951765447)]+[Expression (TUBB3) ×(0.222312584421539)]+[Expression (CORIN) +[Expression (RTN4RL2) ×(0.284232367720243)]+[Expression (MMP11) ×(− 0.355369600424733)] ×(0.153006941051159)]+[Expression (PAEP) ×(0.22333853930601)] +[Expression (FOXD4) ×(− 0.299546648377551)].
Using the optimal cut-off values determined by the R package 'survminer' (version 0.4.9), each dataset was automatically divided into low-risk and high-risk groups. Kaplan-Meier analysis demonstrated that the prognosis of patients in the high-risk group was significantly worse than that of the low-risk group in the training set, test set, and entire dataset (p < 0.001) (Fig. 3B). The risk score prediction for 1-year, 3-year, and 5-year mortality rates were also evaluated. The area under the receiver operating characteristic curve (AUC) values for the training set were 0.702, 0.761, and 0.757, respectively. For the test set, the AUC values were 0.621, 0.638, and 0.634, respectively. For the entire dataset, the AUC values were 0.702, 0.761, and 0.757, respectively (Fig. 3C). Multivariate Cox regression analysis was used to validate the impact of clinical characteristics and risk scores on patient survival. The risk score was identified as an independent risk factor (Fig. 3D). As shown in Fig. 4E, patient age, N stage, and risk score were associated with the prognosis of GC patients. Age (HR = 1.03, 95% CI = 1.017–1.04, P < 0.001), N2 stage (HR = 1.88, 95% CI = 1.344–2.62, P < 0.001), N3 stage (HR = 2.59, 95% CI = 1.770–3.78, P < 0.001), and risk score (HR = 1.12, 95% CI = 1.085–1.16, P < 0.001) were all independent predictors of prognosis for GC patients. Unsupervised clustering analysis of lipid metabolism-related gene expression profiles based on risk grouping also showed differences between the two groups (Fig. 3E).
Additionally, a nomogram was developed using the identified independent predictors to forecast the 1-year, 3-year, and 5-year survival probabilities of GC patients in the training cohort. The cumulative risk curves and calibration curves indicated that the nomogram had a strong predictive capability for the prognosis of GC patients (Fig. 4A, B, C).
The results of the consensus clustering analysis were consistent with the risk scores, with ARGcluster A having significantly higher risk scores than ARGcluster B (p < 0.001) (Fig. 4D). The Sankey diagram showed that the majority of patients in ARGcluster A belonged to the high-risk group, while most patients in ARGcluster B were classified into the low-risk group (Fig. 4E). To predict the prognosis of GC patients, we developed a comprehensive evaluation model that included age, staging, TMB, and the nomogram. The nomogram demonstrated optimal predictive performance for 1-year, 3-year, and 5-year survival outcomes (Fig. 4F).
3.4 Immune infiltration
The tumor microenvironment is primarily composed of immune cells, extracellular matrix, various growth factors, inflammatory factors, and specific physicochemical characteristics, all of which significantly influence disease diagnosis, survival outcomes, and clinical treatment sensitivity. By analyzing the relationship between key genes and immune infiltration in the GC cohort, we further explored the potential molecular mechanisms by which these genes influence GC progression. Figures 5A and 5B showed the proportions of immune cells in each patient and the correlations between these immune cells, respectively. T cells CD4 memory activated and T cells CD8 had the highest positive correlation (r = 0.42), while T cells CD4 memory resting and T cells CD8 had the highest negative correlation (r=-0.51). Furthermore, the study identified seven significantly different immune cell types between the two groups, including five lymphoid cells (CD8 T cells, CD4 memory resting T cells, CD4 memory activated T cells, and follicular helper T cells) and two myeloid cells (M1 macrophages, and resting mast cells). This indicates that immune cells may play an important role in poor prognosis. Correlation network analysis of infiltrated immune cells within the two groups was conducted to further explore the immune cell types most associated with lipid metabolism. Most types of immune cells were negatively correlated with CD8 T cells (Fig. 5B, C).
The prognostic model can directly affect the content of various immune infiltrating cells in the tissue, such as B cells, M1 macrophages, M2 macrophages, monocytes, CD8 + T cells, and others (Fig. 5D, E). Comparative results of the tumor microenvironment between the two groups indicated that the stromal score of the low-risk group was significantly lower than that of the high-risk group (Fig. 5F), while the immune cell infiltration score was significantly higher in the low-risk group. There was no significant difference in tumor purity between the two groups, indicating that the differences in the tumor microenvironment between the two groups were mainly reflected in stromal cells and immune cells. Using the TISCH database for single-cell analysis of GC from the GSE134520 dataset (Fig. 6A). Expression of 13 LMAGs including SERPINE1, MLXIPL, PDK4, BPI, PROC, TNFRSF9, TYRP1, TUBB3, CORIN, RTN4RL2, MMP11, PAEP, FOXD4, except CALCR (Fig. 6B).