Collection of public datasets
Gene expression dataset GSE54129 (accession number) was got from the GEO repository, which is an online international resource for the retrieval of functional genomic data [22]. In total, GSE54129 (platform: GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array) includes data from 111 human GC tissues from 111 patients with GC that underwent subtotal gastrectomy and 21 noncancerous gastric tissues from 21 volunteers who underwent gastroscopy for a health examination.
TCGA database (http://firebrowse.org/) is a leading comprehensive repository of cancer genomic profiles that stores data about multi-dimensional major cancer-causing genomic alterations in various cancers [23]. We also downloaded the stomach adenocarcinoma (STAD) patients' clinical and RNA-Seq data (Level_3__RSEM_genes_normalized__data) from TCGA database, which includes a total of 415 GC samples and 35 controls. Copy number variation (CNV) data (CopyNumber_Gistic2.Level_4) was also obtained from TCGA database.
Data processing and identification
The R package affy [24] (version 1.50.0) and the limma package [25] (version 3.10.3l) were applied to process the data downloaded from the GEO database. The Robust Multi-array Average (RMA), background correction, normalization, and expression calculation was performed. The platform annotation file was used for probe annotation and a few probes that were not annotated with gene symbols were removed. In the case of multiple probes targeting the same gene, the mean expression value of those probes was considered as the final expression of the gene. For the data downloaded from TCGA database, the expression profile data was presented as the processed RSEM value, which can be used directly with a log2 conversion.
Subsequently, an empirical Bayesian method was implemented in the limma package [25] for identification of DEGs between GC samples and controls in GSE54129 and the dataset from TCGA database, respectively. A gene was considered differentially expressed when this analysis satisfied the conditions of Benjamini & Hochberg (BH)-corrected p-value < 0.05 and |log fold change (FC)|> 2.
Finally, the VENNY2.1.0 (http://bioinfogp.cnb.csic.es/tools/venny/index.html) online analysis tool was used to select common DEGs identified in the two gene datasets.
Functional and pathway enrichment analysis
The commonly used enrichment analysis tool DAVID [26] (version 6.8) was applied to perform functional enrichment analysis of common DEGs, including Gene Ontology (GO) Biological Process (BP) enrichment analysis [27] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis [28]. The cut-off criteria for included genes was a p-value<0.05 and a gene count ≥2.
PPI network analysis
We determined the PPI relationships of the identified common DEGs using the Search Tool for the Retrieval of Interacting Genes (STRING) (version: 10.0) [29]. The PPI score was set to medium confidence (0.4) and the organism was set to Homo sapiens. Cytoscape software was used to visualize the resulting complex networks (version: 3.2.0) [30]. Subsequently, the CytoNCA plug-in [31] (version 2.1.6) was used to perform PPI node topology analysis (parameter =‘without weight’). Each node’s ranked score was used to identify the key nodes involved in the protein interaction relationships in the PPI network.
Driver gene prediction
It is generally thought that nodes in a gene interaction network are more important and more likely to be key genes in a network if they are more closely related to other genes [32]. DawnRank [33] applies this theory in its algorithm and assigns high score values to genes that significantly affect the abnormal expression of downstream genes. A gene’s rank score can then be used to determine which genes in the sample can be used as driver genes.
We extracted gene expression data from the common DEGs in the TCGA STAD cancer samples and control samples, as well as gene mutations in tumor samples. A PPI network was constructed as an adjacency matrix (e.g., node i and node j were connected, Aij was considered to be 1, otherwise Aij = 0) in order to predict and analyze cancer driver genes. The DawnRank algorithm was used to obtain the gene rank score for each gene in each patient. The higher the score, the more likely it is that the gene is a driver gene. Finally, functional enrichment analysis of the identified driver genes was conducted.
Prediction of small molecule drugs
The Drug-Gene Interaction database (DGIdb) mines multiple existing resources for drug–gene interactions and generates assumptions about how genes may be developed as therapeutic targets or prioritized for drug development [34]. In the present study, we used DGIdb2.0 [34] to predict drug-gene interactions for the identified GC driver genes. The preset filter was set to ‘FDA Approved.’ Cytoscape was then used to build a drug-gene network map.
Survival analysis
To perform survival analysis, we used TCGA clinical information, including overall survival (OS), survival status, and disease samples screened. The STAD samples were divided into high expression group and low expression group according to the mean gene expression. Log-ranktest was done and the statistical significance threshold set to p < 0.05, and the relationship between survival and driver genes were determined.