Data acquisition
The research strategy is shown in Fig. S1. Raw files of three GPL-related gene expression profiles (GSE55696 [12], GSE87666 [13] and GSE130823 [14]) were downloaded from the GEO database on NCBI (www.ncbi.nlm.nih.gov/geo/). GSE55696 performed on the Agilent-014850 Whole Human Genome Microarray 4x44K G4112F includes 19 LGIN, 20 HGIN, 19 EGC and 19 chronic gastritis (CG) samples. GSE87666 and GSE130823 both performed on the Agilent-039494 SurePrint G3 Human GE v2 8x60K Microarray 039381 include 7 LGIN, 9 HGIN, 6 EGC and 17 LGIN, 14 HGIN, 15 EGC, 47 paired inflammation controls respectively. In order to investigate the difference between individuals from different pathological stage, data of 47 paired inflammation controls of GSE130823 were excluded.
Raw files of three GC data sets on the GEO database with clinical information (GSE66229 [15] (normal controls (NC) n = 100, GC n = 300), GSE15459 [16] (GC n = 200), GSE34942 [17] (GC n = 56)) performed on the same microarray platform of [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array were downloaded. RNA sequencing data of 375 stomach adenocarcinoma (STAD) samples with clinical information from The Cancer Genome Atlas (TCGA) and 175 stomach NC from Genotype-Tissue Expression (GTEx) datasets were obtained from the UCSC Xena browser (https://xenabrowser.net/) [18].
Characteristics of datasets in this study were displayed in Table 1.
Data preprocessing
Raw data of the three GPL-related gene expression profiles were extracted. Background correction and normalization were performed by “limma” package of R respectively [19].
Data of the three GC data sets were extracted and normalized using the RMA algorithm. Then they were transformed to the form of log base 2 and consolidated as a GC meta-GEO cohort totally with 100 NC and 556 GC. The batch effects were removed by the “combat” function of “sva” package of R [20].
As for GTEx and TCGA datasets, RNA sequencing data (FPKM values) were transformed into log2 (FPKM+1). Then, they were merged as a STAD TCGA-GTEx cohort and normalized according to the “normalizeBetweenArrays” function of the package of “limma” in R software so that the expression values have similar distribution across a set of arrays [19].
Estimation of immune cells abundance
Immune Cell Abundance Identifier (ImmuCellAI, http://bioinfo.life.hust.edu.cn/web/ImmuCellAI/) is a gene set signature-based method for precisely estimating the abundance of 24 immune cell types, especially T-cell properties (18 T-cell subsets). It is reported that it has the best performance in immune cell abundance estimation [10], compared with other five methods (xCell [21], CIBERSORT [22], EPIC [23], MCP‐counter [24], and TIMER [25]). Therefore, gene expression data of GPL-EGC cohorts (GSE55696, GSE87666 and GSE130823), GC meta-GEO cohort and GC TCGA-GTEx cohort were submitted to ImmuCellAI to acquire the estimation of immune cells abundance.
Identification of key immune cells associated with GPL progression to EGC
We used “stage” to represent the severity of the pathology of tissues. In following correlation calculation, CG was represented by "1", LGIN was represented by "2", HGIN was represented by "3" and EGC was represented by "4". The relationship between immune cell abundance with pathological stage was preliminarily analyzed by one-way analysis of variance (ANOVA) test with p < 0.05 set as the cut-off. Then, correlation between immune cell abundance and pathological stage was calculated using Spearman’s correlation test. Common immune cells with a Spearman’s r greater than 0.5 and a p-value less than 0.05 in GSE55696 and GSE87666 cohorts were considered as key immune cells. After further verification between GC and NC in the GC meta-GEO cohort and STAD TCGA-GTEx cohort, immune cells with correspondingly consistent alterations were subjected for identifying related gene co-expression network.
Construction of gene co-expression networks
In order to remove outlier samples, sample clustering was performed based on Pearson’s correlation matrices and the average linkage method. Through utilizing the ‘WGCNA’ package in R[11], top 25% genes with the largest variance differences were used to construct weight gene co-expression networks in GSE55696 and GSE87666 datasets respectively. Firstly, a similarity matrix was established based on the Pearson’s correlation value between the paired genes. Then, adjacency matrix was created using following formula [26]: amn = |cmn|β (amn: adjacency matrix between gene m and gene n, cmn = Pearson’s correlation between paired genes, β: softpower value) and converted into a topological overlap matrix. The value of soft threshold power was chosen when the scale free topology scale R^2 exceeding 0.85. Genes with similar expression patterns were categorized into modules by average linkage hierarchical clustering with a module minimum size set as 100.
Identification of hub modules associated with key immune cells
Module eigengenes (MEs) was the major component of a gene module. To identify the modules significantly associated with key immune cells infiltration, we calculated the correlation between MEs and the infiltration level of key immune cells using Pearson’s correlation test in GSE55696 and GSE87666 cohorts respectively, with p-value < 0.05 set as the cut-off. Then, we took intersection of the modules with same correlation direction in GSE55696 and GSE87666 cohorts respectively, and performed KEGG pathway enrichment analysis on the overlapping genes to understand their potential functions to assist in determining the hub modules, with a cut-off criterion of adjusted p-value < 0.05.
Identification of candidate genes
After hub modules identification, candidate hub genes were identified. Module hub genes, which are highly connected intra-modular genes, have the high Module Membership (MM) scores to the respective module. MM is the parameter that correlates the expression profile of a gene with the MEs of a module which can measure the distance between genes and a given module. The absolute value of gene significance (GS), measuring the Pearson’s correlation between a given gene and the clinical trait, can also serve as the parameter to filter out the hub genes. We set the |MM| ≥ 0.70 and the |GS| ≥ 0.50 for hub genes in co-expression network. Meanwhile all genes in the hub module were uploaded to the Search Tool for the Retrieval of Interacting Genes (STRING; https://string-db.org/) database [27] to construct protein-protein interaction (PPI) network with the species limited to ‘Homo sapiens’ and confidence > 0.9. We used Cytoscape to visualize the network (https://cytoscape.org/) [28]. Genes with node degree ranking in top 10% were considered as central nodes in PPI network. Both in the hub modules of GSE55696 and GSE87666 datasets, common genes serve as hub genes in co-expression network and central nodes in PPI network were considered as the candidate hub genes, which was visualized in the form of Venn diagram (http://bioinformatics.psb.ugent.be/webtools/Venn/).
Identification of real hub genes
For the candidate hub genes, in STAD TCGA-GTEx and GC meta-GEO cohort, we compared their expression level between GC and NC and performed prognosis analysis with the quartile setting as the group cutoff. Genes with significant difference between GC and NC and prognostic significance were identified as real hub genes. P-value was defined as significantly different when it was less than 0.05 in above analysis.
Real hub genes expression patterns and relationship with pathological stage and key immune cells in GPL-EGC datasets and GC dataset
The expression level of real hub genes was plotted as box plots and the relationship with pathological stage was analyzed by one-way ANOVA test and Spearman correlation test in primary GPL-EGC datasets (GSE55696 and GSE87666). Correlation between hub genes and key immune cells was evaluated using Pearson’s correlation test. P-value was defined as significantly different when it was less than 0.05 in above analysis.
Verification of hub genes and key immune cells as well as their relationship
In order to confirm the pattern of key immune cells infiltration and hub genes expression as well as their relationship, we plotted the expression level of hub genes and the abundance of key immune cells across different pathological stages of LGIN, HGIN, and EGC in the independent cohort GSE130823 and analyzed by one-way ANOVA test and Spearman’s correlation test. Their correlation was also verified using Pearson’s correlation test in the independent GPL-EGC cohort GSE130823 and GC datasets (STAD TCGA-GTEx and GC meta-GEO cohort). P-value was defined as significantly different when it was less than 0.05 in above analysis.