Data acquisition and preprocessing
The research strategy is shown in Fig. 1. Raw files of three GPL-related gene expression profiles (GSE55696 [12], GSE87666 [13] and GSE130823 [14]) were downloaded from the GEO database on NCBI (www.ncbi.nlm.nih.gov/geo/). GSE55696 performed on the Agilent-014850 Whole Human Genome Microarray 4 × 44K G4112F includes 19 LGIN, 20 HGIN, 19 EGC and 19 chronic gastritis (CG) samples. GSE87666 and GSE130823 both performed on the Agilent-039494 SurePrint G3 Human GE v2 8 × 60K Microarray 039381 include 7 LGIN, 9 HGIN, 6 EGC and 17 LGIN, 14 HGIN, 15 EGC, 47 paired inflammation controls respectively. In order to investigate the difference between individuals from different pathological stage, data of 47 paired inflammation controls of GSE130823 were excluded.
Raw files of three GC data sets on the GEO database with clinical information (GSE66229 [15] (normal controls (NC) n = 100, GC n = 300), GSE15459 [16] (GC n = 200), GSE34942 [17] (GC n = 56)) performed on the same microarray platform of [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array were downloaded. RNA sequencing data of 375 stomach adenocarcinoma (STAD) samples with clinical information from The Cancer Genome Atlas (TCGA) and 175 stomach NC from Genotype-Tissue Expression (GTEx) datasets were obtained from the UCSC Xena browser (https://xenabrowser.net/) [18].
Characteristics of datasets in this study were displayed in Table 1.
Table 1
Characteristics of datasets in this study.
ID
|
Platform
|
Number of samples
|
GSE55696
|
Agilent-014850 Whole Human Genome Microarray 4 × 44K G4112F (GPL6480)
|
19 CG, 19 LGIN, 20 HGIN and 19 EGC
|
GSE87666
|
Agilent-039494 SurePrint G3 Human GE v2 8 × 60K Microarray 039381 (GPL17077)
|
7 LGIN, 9 HGIN and 6 EGC
|
GSE130823
|
Agilent-039494 SurePrint G3 Human GE v2 8 × 60K Microarray 039381 (GPL17077)
|
47 paired inflammation controls (excluded), 17 LGIN, 14 HGIN and 15 EGC
|
GSE66229
|
[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array (GPL570)
|
100 NC and 300 GC
|
GSE15459
|
[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array (GPL570)
|
200 GC
|
GSE34942
|
[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array (GPL570)
|
56 GC
|
TCGA-STAD
|
Illumina Genome Analyzer
|
375 GC
|
GTEX-STOMACH
|
Illumina Genome Analyzer
|
175 NC
|
NC: normal control; CG: chronic gastritis; LGIN: low-grade intraepithelial neoplasia; HGIN: high-grade intraepithelial neoplasia; EGC: early gastric cancer; GC: gastric cancer; |
Data Preprocessing
Raw data of the three GPL-related gene expression profiles were extracted. Background correction and normalization were performed by “limma” package of R respectively [19].
Data of the three GC data sets were extracted and normalized using the RMA algorithm. Then they were transformed to the form of log base 2 and consolidated as a GC meta-GEO cohort totally with 100 NC and 556 GC. The batch effects were removed by the “combat” function of “sva” package of R [20].
As for GTEx and TCGA datasets, RNA sequencing data (FPKM values) were transformed into log2 (FPKM + 1). Then, they were merged as a STAD TCGA-GTEx cohort and normalized according to the “normalizeBetweenArrays” function of the package of “limma” in R software so that the expression values have similar distribution across a set of arrays [19].
Estimation Of Immune Cells Abundance
Immune Cell Abundance Identifier (ImmuCellAI, http://bioinfo.life.hust.edu.cn/web/ImmuCellAI/) is a gene set signature-based method for precisely estimating the abundance of 24 immune cell types, especially T-cell properties (18 T-cell subsets). It is reported that it has the best performance in immune cell abundance estimation [10], compared with other five methods (xCell [21], CIBERSORT [22], EPIC [23], MCP-counter [24], and TIMER [25]). Therefore, gene expression data of GSE55696, GSE87666, GSE130823, GC meta-GEO cohort and GC TCGA-GTEx cohort were submitted to ImmuCellAI to acquire the estimation of immune cells abundance.
Identification of key immune cells associated with GPL progression to EGC
We used “stage” to represent the severity of the pathology of tissues. In the following correlation calculation, CG was represented by "1", LGIN was represented by "2", HGIN was represented by "3" and EGC was represented by "4". The relationship between immune cell abundance with pathological stage was preliminarily analyzed by one-way analysis of variance (ANOVA) test with p < 0.05 set as the cut-off. Then, correlation between immune cell abundance and pathological stage was calculated using Spearman’s correlation test. Common immune cells with a Spearman’s r greater than 0.5 and a p-value less than 0.05 in GSE55696 and GSE87666 cohorts were considered as key immune cells. After further verification between GC and NC in the GC meta-GEO cohort and STAD TCGA-GTEx cohort, immune cells with correspondingly consistent alterations were subjected for identifying related gene co-expression network.
Construction Of Gene Co-expression Networks
According to sample clustering, there was an outlier sample in GSE87666 while none in GSE55696 (Fig. 4a and 4b). After removing the outlier sample (GSM2338022), 4359 genes in GSE55696 and 4663 genes in GSE87666 with the highest expression variance (top 25%) were selected for subsequent WGCNA. β = 6 (scale-free R2 = 0.87) and β = 13 (scale-free R2 = 0.86) were the lowest power fit scale-free index over 0.85 and determined as the soft-thresholding power parameter to ensure a scale-free network in GSE55696 and GSE87666 respectively (Fig. 4c-f). Eventually, genes with similar expression patterns were grouped into eight co-expression modules respectively in GSE55696 and GSE87666 (Fig. 5a and 5c). The grey module is a set of genes that cannot be clustered to any module. In GSE55696, the modules of blue (r = 0.46, p = 3e-05) and yellow (r = 0.26, p = 0.02) were significantly positively correlated with the abundance of NKT cell while pink (r = -0.40, p = 3e-04), green (r = -0.75, p = 5e-15), and red (r = -0.50, p = 5e-06) displayed significantly negative (Fig. 5b). In GSE87666, the red module (r = 0.83, p = 4e − 06) was significantly positively correlated with the abundance of NKT cell while the green module (r = -0.65, p = 0.002) displayed significantly negative (Fig. 5d).
Identification Of Hub Modules Associated With Key Immune Cells
Module eigengenes (MEs) was the major component of a gene module. To identify the modules significantly associated with key immune cells infiltration, we calculated the correlation between MEs and the infiltration level of key immune cells using Pearson’s correlation test in GSE55696 and GSE87666 cohorts respectively, with p-value < 0.05 set as the cut-off. Then, we took intersection of the modules with same correlation direction in GSE55696 and GSE87666 cohorts respectively, and performed KEGG pathway enrichment analysis on the overlapping genes to understand their potential functions to assist in determining the hub modules, with a cut-off criterion of adjusted p-value < 0.05.
Identification Of Candidate Genes
After hub modules identification, candidate hub genes were identified. Module hub genes, which are highly connected intra-modular genes, have the high Module Membership (MM) scores to the respective module. MM is the parameter that correlates the expression profile of a gene with the MEs of a module which can measure the distance between genes and a given module. The absolute value of gene significance (GS), measuring the Pearson’s correlation between a given gene and the clinical trait, can also serve as the parameter to filter out the hub genes. We set the |MM| ≥ 0.70 and the |GS| ≥ 0.50 for hub genes in co-expression network. Meanwhile all genes in the hub module were uploaded to the Search Tool for the Retrieval of Interacting Genes (STRING; https://string-db.org/) database [27] to construct protein-protein interaction (PPI) network with the species limited to ‘Homo sapiens’ and confidence > 0.9. We used Cytoscape to visualize the network (https://cytoscape.org/) [28]. Genes with node degree ranking in top 10% were considered as central nodes in PPI network. Both in the hub modules of GSE55696 and GSE87666 datasets, common genes serve as hub genes in co-expression network and central nodes in PPI network were considered as the candidate hub genes, which was visualized in the form of Venn diagram (http://bioinformatics.psb.ugent.be/webtools/Venn/).
Identification Of Real Hub Genes
The expression level comparison between GC and NC and survival analysis of 4 candidate genes were conducted to identify the real hub genes. In the GC meta-GEO cohort, CXCR4 was significantly upregulated in GC and had poor prognosis (Fig. 7a and 7e). For CD53, the expression between GC and NC and the prognosis had no significant difference (Fig. 7b and 7f). IL10RA and CD19 were significantly downregulated in GC but the prognosis was not significant (Fig. 7c, 7d, 7 g and 7 h). In the STAD TCGA-GTEx cohort, CXCR4 was significantly upregulated in GC and had poor prognosis (Fig. 8a and 8e). CD53 and CD19 were significantly upregulated in GC but the prognosis was not significant (Fig. 8b, 8f, 8d and 8 h). For IL10RA, there was no significant difference in expression between GC and NC and prognosis (Fig. 8c and 8 g). Combining the results of above two cohorts, CXCR4 was identified as the real hub gene.
CXCR4 expression patterns and relationship with pathological stage and NKT cell
Data showed that CXCR4 expression was increased with GPL progression (GSE55696: p in one-way ANOVA test < 0.0001, Fig. 9a; GSE87666, p in one-way ANOVA test = 0.0376, Fig. 9d) and significantly positively correlated with tumorigenesis (GSE55696: r = 0.6157, p < 0.0001, Fig. 9b; GSE87666, r = 0.4705, p = 0.0271, Fig. 9e). Pearson’s correlation test indicated that CXCR4 was significantly negatively correlated with NKT cell infiltration during GPL progression (GSE55696: r = -0.7726, p < 0.0001, Fig. 9c; GSE87666, r = -0.6305, p = 0.0017, Fig. 9f).
Verification of the pattern of NKT cell infiltration, CXCR4 expression and their relationship
The pattern of NKT cell infiltration and CXCR4 expression during GPL progression were verified in the independent GPL cohort GSE130823. Their relationship was also verified in the independent GPL cohort GSE130823 and GC datasets (TCGA-GTEx and GC meta-GEO cohort). During GPL progression, data indicated that the abundance of NKT was gradually decreased with GPL progression (p in one-way ANOVA test = 0.0007, Fig. 10a) and significantly negatively correlated with tumorigenesis (r = -0.5799, p < 0.0001, Fig. 10b) while CXCR4 expression was opposite (p in one-way ANOVA test = 0.0004, Fig. 10c; r = 0.5799, p < 0.0001, Fig. 10d). CXCR4 expression also displayed a significantly negative correlation with NKT cell infiltration both in GPL progression to EGC (r = -0.8070, p < 0.0001, Fig. 10e) and GC (GC meta-GEO cohort: r = -0.3651, p < 0.0001, TCGA-STAD cohort: r = -0.2518, p < 0.0001 Fig. 10f-g). Above analysis results were consistent with those in GSE55696 and GSE87666 cohorts.