Identification of prognostic genes from the training set
To develop a new model for precisely predicting GC prognosis, we selected the ACRG dataset that has detailed clinical information and gene expression profiles (Fig. 1a). We evaluated the impact of sample size on prognostic power for two genes, TEAD1 and GZMB [30, 31], which have been reported as prognostic factors for GC patients previously. The data showed that about 150 patients were required for reliable assessment of prognostic power (Supplementary Fig. 1a and b). Therefore, we randomly split 300 GC patients from ACRG cohort into the training (n = 150) and validation sets (n = 150) (Table 1). The univariate Cox proportional regression analysis was used to identify prognostic genes in the training set. As a result, 2069 genes were considered as survival associated genes (P < 0.01, 10000 permutations; Supplementary Table 2). To eliminate the noise caused by other factors, these genes were further fitted into a multivariate Cox proportional regression model, adjusted by patients’ clinical characteristics including the AJCC stages, age, gender, and Lauren’s subtypes. Finally, 558 genes whose expression was significantly associated with survival independently were identified in the training set (P < 0.01; Supplementary Table 2).
GRS for GC patients' prognosis prediction
To obtain the minimal set of genes to build GRS for GC prognosis prediction, we applied the LASSO penalty algorithm to assess the prognostic value of previously identified 558 genes (Supplementary Fig. 1c). After LASSO selection, nine genes were retained (Fig. 1b, Supplementary Table 2). One gene whose expression was significantly associated with favorable prognosis was ST8SIA5 (ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5). The remaining eight genes whose expression were significantly associated with adverse outcomes were STARD3NL (STARD3 N-terminal like), GSDME (gasdermin E), TMEM245 (transmembrane protein 245), VAT1 (vesicle amine transport 1), CCDC92 (coiled-coil domain containing 92), TSPYL5 (TSPY like 5), APOD (apolipoprotein D), and CYS1 (cystin 1). Coefficients for these nine genes were determined by the multivariate Cox regression model, and GRS was then calculated in terms of the normalized expression levels of these nine genes (Fig. 1c).
The patients from the training set were assigned to high (n = 75) and low GRS groups (n = 75) using the median GRS value as the cutoff. Kaplan-Meier analysis showed there existed a significant difference in 5-year overall survival between high and low GRS groups (HR = 2.70, 95% CI = 2.07 to 3.52, P = 2.35e-12) (Fig. 1d). Further univariate Cox analysis revealed that GRS remained prognostic in each subgroup (Fig. 1e). Moreover, after calculating the correlations between GRS and global gene expression profiles in the training set, we found that 1205 genes were found to be significantly correlated with GRS (absolute R > 0.4, P < 0.01) (Supplementary Table S4). These genes were clustered into two largest clusters in the training set, which were further compared with gene sets from Molecular Signatures Database to assess the enrichment of biological pathways and processes. The results indicated that cluster 1 shared genes associated with extracellular matrix and genes expressed in stem cells, and cluster 2 was overlapped with cell cycle related genes and genes highly expressed in the early stage of cancer (Fig. 1f; Supplementary Table 3).
To further validate the prognostic value of GRS, GC patients from ACRG validation set were stratified into high and low GRS groups by the cutoff (the median GRS value of the training set, the same below). GRS was significantly associated with overall survival (HR = 1.49, 95% CI = 1.21 to 1.83, P = 1.66e-4), which was further confirmed in whole ACRG cohort (HR = 1.89, 95% CI = 1.61 to 2.22, P = 3.70e-14) (Fig. 2a and b). To further evaluate the performance of GRS, 192 patients from the Singapore cohort, 433 patients from the Korea cohort, and 388 patients from TCGA cohort were stratified into high and low GRS groups according to the cutoff, respectively. GRS remained significantly associated with GC prognosis in all the cohorts (HR = 1.31, 95% CI = 1.09 to 1.57, P = 4.77e-3 in Korea cohort; HR = 1.46, 95% CI = 1.20 to 1.77, P = 1.40e-4 in Singapore cohort; HR = 1.29, 95% CI = 1.10 to 1.52, P = 2.21e-3 in TCGA cohort) (Fig. 2c-e). The multivariate Cox analysis showed that GRS was a prognostic signature independent of the AJCC staging system, age, gender, and Lauren’s subtypes (Supplementary Table 4). Moreover, GRS was also prognostic within subgroups of patients harboring wild-type or mutant forms of TP53, MUC16, ARID1A, or PIK3CA in ACRG and TCGA cohorts whose gene mutation statuses were available (Supplementary Table 5).
To conveniently apply GRS in clinical practice, we employed qRT-PCR assays on fresh frozen tumor specimens for the nine GRS genes and one addition housekeeping gene RNU6-1 that lacks prognostic association and displays stable expression [32]. Specimens were obtained from 109 patients with GC who underwent gastrectomy from 2008 to 2013 at Zhejiang Cancer Hospital, termed as the Zhejiang cohort (Table 1). GRS was shown to remain significantly prognostic in Zhejiang cohort (Table 2; Supplementary Table 4). The patients with low GRS had longer survival time than that of high GRS patients (HR = 1.40, 95% CI = 1.12 to 1.75, P = 2.93e-3) (Fig. 2f). Further multivariate Cox analysis showed that GRS was associated with GC prognosis independent of age, gender, and the AJCC stage in Zhejiang cohort (Supplementary Table 4). Taken together, these data suggest that GRS may be applied for GC prognosis prediction in clinical practice across different platforms, such as microarray, RNA sequencing, and qRT-PCR.
CGRS for prognosis prediction of GC patients
Given that the AJCC stages and age are significantly associated with GC prognosis, and GRS is a prognostic factor independent of the AJCC staging system and age. We integrated GRS with clinical variables to create CGRS for predicting GC survival. First, SEER database that contains 33250 GC patients was used to determine coefficients for different AJCC stages and age by the multivariate Cox regression model. The data showed that clinical risk score (CRS) for each patient could be calculated by the following formula, CRS = 0.021*Age (years) + AJCC stage, where the values for different stages are 0 (stage I), 0.31 (stage II), 0.75 (stage III), and 1.56 (stage IV), respectively (Fig. 2g). The univariate and multivariate Cox analyses, as well as the Kaplan-Meier curve, showed that CRS was significantly associated with GC prognosis in all cohorts we studied (Table 2; Supplementary Fig. 2).
Since there was no significant difference in patients’ distribution between ACRG training set and SEER set (Supplementary Table 6), we integrated CRS with GRS into CGRS through the formula determined by multivariate Cox regression model (CGRS = 1.25*CRS + 0.88*GRS) in ACRG training set (Fig. 2h). CGRS was validated to be significantly associated with GC prognosis when stratified GC patients into high and low CGRS groups according to the median value from the ACRG training set (HR = 2.70, 95% CI = 1.57 to 2.16, P = 2.53e-19) (Fig. 2i). Moreover, CGRS showed strong robustness in predicting overall survival of GC patients in internal (HR = 1.80, 95% CI = 1.51 to 2.16, P = 3.21e-10 in ACRG validation set; HR = 2.12, 95% CI = 1.85 to 2.42, P = 1.27e-27 in whole ACRG cohort) and external validation sets (HR = 2.10, 95% CI = 1.71 to 2.57, P = 2.33e-13 in Singapore cohort; HR = 1.72, 95% CI = 1.44 to 2.05, P = 1.61e-9 in TCGA cohort; HR = 2.72, 95% CI = 1.71 to 4.33, P = 2.00e-5 in Zhejiang cohort) (Fig. 2j-n). Further univariate and multivariate Cox analyses confirmed the survival prediction power of CGRS (Supplementary Table 7). Additionally, CGRS remained prognostic within subgroups of patients harboring wild-type or mutant forms of TP53, MUC16, ARID1A, or PIK3CA in ACRG and TCGA cohorts whose gene mutation statuses were available (Supplementary Table 8). Together, these results reveal that CGRS can be used to assess GC prognosis independent of other clinical characteristics including AJCC stages, age, gender, and Lauren’s subtypes across different platforms.
The prognosis prediction of GRS and CGRS in different AJCC stages
The AJCC staging system is generally considered as the golden standard for evaluating GC prognosis in current clinical practice [33]; however, it remains some deficiencies in predicting patients with similar clinical and pathological characteristics [34, 35]. In this study, we applied our GRS and CGRS in GC patients within the same stage. Due to the small population of stage I GC patients, the performance of GRS and CGRS was fluctuated in different cohorts (Supplementary Fig. 3). For stage II GC patients, GRS and CGRS were significantly associated with GC prognosis in several independent cohorts when stratified the stage II GC patients into high and low risk groups, however they failed in Singapore cohort and ACRG validation sets due to relatively fewer patients (Supplementary Fig. 4). Patients with GC are often diagnosed at advanced stage, and stage III accounts for about 35% of GC cases [36, 37]. Both GRS and CGRS were able to classify stage III patients into high and low risk groups with statistically significantly different survival time in all of the training and validation sets (all HRs > 1, all P < 0.05; Fig. 3). Further multivariate Cox analysis confirmed the robustness of GRS and CGRS in stage III GC patients (Table 2; Supplementary Table 9). Finally, we examined the prediction power of GRS and CGRS in stage IV GC patients, the performance of GRS and CGRS were unstable in different cohorts because of relatively small population size (Supplementary Fig. 5). Together, these data indicate that both GRS and CGRS are able to predict the prognosis of stage III GC patients, and can be important complements for the AJCC staging system.
The association between GRS, CGRS and molecular subtypes
Emerging studies have established several molecular subtype systems of GC in the past few years [12, 13, 18]. Here, we systematically analyzed the association between our risk scores and molecular subtypes of GC. In TCGA study, GC can be divided into four molecular subtypes. Though there is no significant relevance between clinical outcome and TCGA subtypes, the microsatellite instability (MSI) group that has relatively favorable outcome exhibited lower value of GRS and CGRS (Supplementary Fig. 6a-c). Further analysis indicated that CGRS and GRS were negatively correlated with the levels of mutation load and DNA methylation (Supplementary Fig. 6g-i), which was consistent with GC patients with high mutation or methylation loads tend to have better prognosis (Supplementary Fig. 6d-f). According to Singapore study, the metabolic subtype of GC patients that have relatively longer survival time acquired lower value of GRS and CGRS than other subtypes. The invasive subtype of GC patients that showed relatively poor prognosis displayed high value of GRS and CGRS (Supplementary Fig. 7a-c). In ACRG cohort, GC patients have been classified into four molecular subtypes with different clinical outcomes. The MSS/EMT subtype that has the poorest outcome acquired relatively higher value of GRS and CGRS (Supplementary Fig. 7d-f). Taken together, these results suggest that our CGRS and GRS are significantly associated with molecular subtypes with significant survival differences.
Comparisons with other established GC signatures
To investigate the prediction accuracy of GRS and CGRS in GC prognosis, we compared the prediction power of GRS and CGRS with other three published gene signatures [15, 16, 38]. All of the three signatures were significantly associated with GC prognosis in multiple cohorts (Supplementary Table 10). Since GRS and CGRS contained no overlap genes with other signatures, we computed ROC of signatures and the AJCC staging system in four cohorts. GRS had larger area under the curve (AUC) according to ROC analysis compared with published signatures (Fig 4a and b). Further prediction error curve analysis also indicated that GRS showed lower prediction error rate in evaluating GC prognosis (Supplementary Fig. 8). However, GRS showed no advantages in predicting GC prognosis compared with the AJCC staging system. Moreover, CGRS that integrated GRS with clinical characteristics could predict GC prognosis with more sensitivity and specificity according to the ROC analysis (Fig. 4a and b). The prediction error curve analysis also revealed that CGRS had relatively lower prediction error rate in four independent cohorts (Supplementary Fig. 8). The above results demonstrate that CGRS has more advantages in predicting GC prognosis compared with the AJCC staging system and several published signatures in most cohorts we obtained.
Potential clinical application of CGRS
To facilitate the clinical applications of CGRS, we generated an easy-to-use nomogram for predicting the 1-, 3- and 5-year overall survival probability of GC patients using CGRS (Fig. 5a). The nomogram was evaluated for its calibration by plotting predicted probabilities at 1, 3, and 5 years, respectively. The overall survival probability predicted by nomogram was close to the observed probability at these three thresholds (Fig. 5b). Furthermore, the decision curve analysis showed that CGRS could bring more benefits for high risk GC patients in clinical applications (Fig. 5c). Moreover, we developed an online tool for conveniently applying CGRS in clinical practice (http://39.100.117.92/CGRS/). In the web application, the oncologists only need to select the data type, and then input age, the AJCC stage, and nine gene expression values of an individual GC patient. When clicking the Calculate button, 1-, 3- and 5-year overall survival predicted probabilities will be calculated for the patient. These findings indicate that the easy-to-use nomogram and web application may accelerate the application of CGRS in predicting GC prognosis in clinical practice.
Biological pathways involved in GC prognosis
To investigate the biological processes and pathways involved in GC prognosis, we dichotomized the patients from ACRG and TCGA cohorts into high and low CGRS groups according to the median CGRS value of ACRG training set, respectively. GSEA was subsequently performed to identify prognostic biological processes and pathways. Functional networks based on significantly enriched gene sets were built by enrichment map (FDR < 0.05) (Fig. 6a and b; Supplementary Table. 11). Intriguingly, the cell cycle, RNA transcription, apoptosis and cell metabolism pathways were significantly enriched in low CGRS patients from ACRG (Fig. 6c-f) and TCGA cohorts (Figure. 6i-6l). However, extracellular matrix pathways that play important roles in tumor invasion and metastasis were significantly enriched in high CGRS patients (Fig. 6g and m). Furthermore, T cell receptors were also significantly enriched in high CGRS patients, which indicated that high CGRS patients might have more neoantigens for immunotherapy (Fig.6h and n). Taken together, these data suggest that genes correlated with cell cycle and tumor microenvironment might be involved in GC prognosis.