Data acquisition and processing
Retrospective research and comprehensive analysis of the relationship between IRGPs and the prognosis of GC patients were performed based on public data sets. The clinical data of GC patients and the corresponding gene expression profile data were obtained from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo). In total, six GC cohorts were found, including GSE13911, GSE118916, GSE84437, GSE13861, GSE26901, and GSE66229. Specifically, GSE13911 and GSE118916 were used to identify differently expressed immune-related genes (DEIRGs) between GC and normal tissues. GSE84437 set was used to construct a prognostic signature and the remaining three sets were used as validation sets to verify the effectiveness of the IRGPs signature. For training and validation sets, patients without complete survival data will not be included in this study.
Identification of tumor-associated IRGs and enrichment analysis
There are gene expression profiles of both GC tissues and adjacent normal gastric tissues in GSE13911 (T:38, N:31) and GSE118916 (T:15, N15). In previous studies, comparing biomarkers in different pathological conditions to determine tumor-related biomarkers has been widely used [19]. Thus, we performed the analysis by the GEO2R tool, and those genes with adjusted P < 0.05 and |Log2 FC|>1 were defined as differentially expressed genes (DEGs). Then, we downloaded IRGs from the ImmPort database (https://www.immport.org/shared/)[20] to identify DEIRGs. After that, the interaction DEIRGs between GSE13911 and GSE118916 were used for later studies.
To initially understand the mechanism and function involved in DEIRGs, these DEIRGs were then incorporated into the enrichment analysis, including Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways with metascape [21].
Construction of a prognostic signature based on IRGPs
It was reported that IRGPs have a robust prognostic ability in different cancers [17, 22]. the cohort of GSE84437 (n=433) that contain both IRGs and clinical data was used as the training set to form various IRGPs (Supplementary file1). And each IRGP was formed by comparing the gene expression level of a specific sample or profile of IRGs. Specifically, in a pairwise comparison, if the expression of the first element is greater than the latter element, the output is 1. Otherwise, the output is 0. This method of forming IRGPs can avoid the difficulties caused by different gene expression profiles, and does not need to be standardized for personal evaluation. And those IRGPs with a constant value (more than 80% of the samples are assigned the same score on the training set) would be eliminated in the subsequent model construction because they do not provide enough discriminatory information [23].
To identify the prognostic value of IRGPs in GC patients, the univariate Cox analysis was used to identify OS-related IRGPs in the training set, and those with p<0.05 were considered as candidate prognostic IRGPs. Then, based on the candidate IRGPs identified in the univariate Cox analysis, LASSO penalized Cox regression was performed to choose the most significant pairs [24]. Finally, multivariate Cox analysis was used to select the most appropriate OS-related pairs and construct an IRGPs signature. To assess the performance of the IRGPs signature for GC patients, we generated receiver operating characteristic (ROC) curves at 1-, 3-, 5-years [25]. In addition, based on the individual risk score, GC patients were divided into high- and low-risk groups by the median risk score. Then, the Kaplan-Meier (K-M) survival curve was used to compare the differences in OS between the two groups with the log-rank test. Moreover, the subgroup analysis was applied for validation of the signature in different subgroups, including age (age<65, age>65), gender (female, male), N stage (N0-1, N3-4), T stage (T1-3, T4).
Validation of the expression of IRGPs in Oncomine
Oncomine is a cancer microarray database and integrated data-mining platform (http://www.oncomine.org) that contains more than 65 gene expression data sets comprised of 48 million gene expression measurements. By using expression data and clinical data from this database, all the IRGs of OS-related IRGs in the prognostic signature were explored to verify their differential expression in GC and normal gastric tissues.
Validation of the IRGPs signature
In order to verify the prognostic value of the IRGPs signature, three validation sets were applied, including GSE13861 (n=65), GSE26901 (n=109), and GSE66229 (n=300). According to the risk score formula in the training cohort, the risk score of each patient in the validation cohort was calculated. Based on the optimal cut off value identified in the training cohort, all patients in the validation cohort were divided into high- and low-risk groups. The K-M survival curves with the log-rank test were used to compare the differences of prognosis between these two groups.
The explosion of immune features of IRGPs in GC
CIBERSORT is a common algorithm to obtain cell composition from volume tumors or gene expression profiles [26], which was used in our study to understanding the enrichment of immune cells in different risk groups. For each sample, CIBERSORT infers the relative proportion of 22 infiltrating immune cells, including T cells, B cells, natural killer cells, macrophages, plasma cells, neutrophils, dendritic cells (DCs) and so on. Additionally, associations between IRGPs and a variety of immune checkpoints were explored. The associations between risk groups and infiltration levels of 22 kinds of immune cells and immune checkpoints were assessed using the Wilconson’s rank-sum test or Kruskal-Wallis test. A Circos diagram was generated by R software to display the genes in IRGPs on the corresponding chromosomes and visualize their interactions.
Development of a nomogram based on the immune and clinical prognostic factors
In order to improve the clinical utilization of IRGPs signature for GC patients, we conducted the univariate Cox analysis based on risk score and clinical variables in the training set, including age, sex, T stage, and N stage. As for the N stage, it was found that the prognosis of GC patients with N0 stage was significantly better than other stage patients in previous studies, and the prognosis of N1, N2, and N3 differed less [27]. Therefore, we divided the N stage into N0 and N1-3 groups. Then, the variables with p <0.05 were included in the multivariate Cox analysis to identify the independent OS-related factors. Based on independent OS-related factors, a nomogram was developed by rms package. The concordance index (C-index), the calibration curve, and decision curve analyses (DCA) were used to evaluate the performance of the nomogram.
Statistical analysis
All statistical analyses and graphics in our study were performed using R software(version 3.6.1). The student's t-test or Wilcoxon rank-sum test was used to comparing the mean standard error of the mean of continuous variables among groups. The Cox proportional hazards regression model was used for univariate and multivariate survival analysis. The gene model was conducted with multivariate Cox analysis, and the ROC curves were conducted by “survival ROC”. A p-value of less than 0.05 (two sides) was considered as statistical significance.