Gene expression and clinical data from TCGA and GEO cohorts
Transcriptome expression profiles and corresponding clinical information of gastric cancer were downloaded from the Genomic Data Commons Data Portal of The Cancer Genome Atlas (TCGA, https://www.cancergenome.nih.gov/). The expression data was FPKM (fragments per kilobase million) type, which contained 376 gastric cancer cases as of December 2018. Samples without clinical information or with overall survival time <30 days were excluded. Genes with FPKM of zero in more than half of samples were also eliminated. Only expression profiles of immune-related genes were included.
For the second dataset, the microarray data and clinical information of GSE62254 were downloaded from Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/), which contained 300 gastric cancer samples. Only primary tumor data and expression profiles of immune-related genes were included, and samples with overall survival time <30 days were removed.
Immune-related gene pair construction
We constructed a prognostic signature based on immune-related genes from data downloaded from the InnateDB database (https://www.innatedb.com/). The data included various endogenous immune-related genes (IRGs) of several species supported by the literature and manually corrected [14]. In this study, human IRGs were collected, and a total of 1039 genes were eligible after removing genes with repeated names. The list of immune-related genes is given in Additional File 1.
To assess the difference between RNA-seq data from TCGA and microarray data from GEO, we used IRG and IRGP values to perform correlation clustering analyses for samples from the two cohorts, respectively.
First, a number of IRGPs were constructed, with the number being (1039 × 1038)/2. The IRGP values were calculated based on the specific gene expression level to generate a score. The IRGP score was assigned as 1 if IRG1 was less than IRG2; otherwise, it was 0. Then, the IRGP score of each sample in both the TCGA and GEO cohorts was calculated. Because of platform factors and biologically preferential transcription [15], some IRGPs had constant values (0 or 1). To improve the accuracy of the results, we removed IRGP scores with constant values (0 or 1) for all samples in both cohorts.
Integrating IRGPs from different datasets and grouping
To build a prognostic signature, the IRGPs shared by the two datasets were extracted to divide them into different sets. The TCGA data were divided into a training set and validation set. The IRGPs of GEO were used as an independent testing set.
To prevent the bias of random allocation from affecting the stability of the subsequent model, samples in the TCGA cohort were repeatedly grouped 100 times in different randomizations. The group sampling was in accordance with a training set:validation set ratio of 0.5:0.5. Criteria for selecting the most appropriate training set and testing set should be that (1) the two groups were similar in age distribution, clinical stage, follow-up time, and patient death rate, and (2) the number of the two classifications of the two data sets randomly clustered was close.
Construction and validation of an immune-related prognostic model
We developed the immune-related prognostic model using the training set. The significant different IRGPs were defined by univariate Cox regression analysis with p value = 0.05. Because a traditional Cox regression model could not be used directly to select highly correlated genes, least absolute shrinkage and selection operator (LASSO) was used to shrink the regression coefficients; LASSO is a type of biased estimation with complex collinearity data and it better solves the multicollinearity problem in regression analysis [16-17]. LASSO L1-penalized Cox analysis was performed by using glmnet R package, and a relatively small number of potential parameters were selected. Finally, an immune-related prognostic model was constructed using the regression coefficients derived from multivariate Cox regression analysis. Then, the risk score of each sample was calculated. To separate patients into low- or high-risk groups, the median risk score was determined as the cut-off value. The prognostic value of the IRGP model was evaluated using receiver operating characteristic (ROC) regression and Kaplan-Meier survival analysis in different cohorts. Then, the immune-related prognostic model was validated in other data sets.
Functional annotation and enrichment analysis of the immune-related genes in the model
To gain biological understanding of the IRGPs, functions of all the immune-related genes in IRGPs were explored. Then, enrichment analysis was conducted using Gene Ontology (GO). The immunological pathways of genes in Kyoto Encyclopedia of Genes and Genomes (KEGG) were also determined using an R package (clusterProfiler, v3.8).
Construction and validation of a composite immune-clinical prognostic model
Prognostic risk models were constructed using clinical features of TNM, grade, age, and stage from the TCGA dataset. Then, we integrated traditional prognostic clinical factors and the IRGP risk model to create a composite immune-clinical prognostic index by applying Cox regression in the TCGA dataset using the R package rms.
To individualize the predicted survival probability for 1, 3, 5, or 7 years, we constructed a nomogram based on the clinical characters and RiskScore results of the multivariate analysis. The nomogram, which uses the length of a line to indicate the degrees of different factors affecting the result, is a method that can display the results of a risk model more intuitively and effectively. It was convenient to apply in the prediction of outcomes. We constructed the nomogram model with the clinical features TNM, stage, age, and RiskScore.
A concordance index (C-index) was used to determine the discrimination of the nomogram, which was assessed by a bootstrap approach with 1000 resamples [18]. Furthermore, the predictive accuracies of the nomogram and other clinical characteristics were compared by using the C-index. ROC curve analysis was also assessed to predict the prognostic values of models by age, stage, and IRGPs. Then, we validated the accuracy of the nomogram in predicting prognosis using different datasets.
Statistical analysis
All analyses were conducted using R software (version 3.5.2; https://www.r-project.org/). Univariate Cox proportional hazards regression analysis of the relationship of IRGP and clinical factors with overall survival was assessed using log-rank test. LASSO regression analysis was used to further condense the quantity of factors significantly associated with overall survival in univariate analyses. Survival R package was used for Kaplan-Meier curve analysis. The C-index was estimated using the R rms package, which was also used to build the nomogram, including significant clinical characteristics and calibration plots. Survival rate of ROC analysis was calculated by time ROC R package. All statistical tests were two-sided, and p values < 0.05 were considered statistically significant.