Identification and screening of DEGs between LUAD tissue and normal tissue in each dataset
We employed a multistep strategy to explore candidate genes, which are important in LUAD pathogenesis by integrating bioinformatics tools (figure 1). After the preliminary search, we identified 4 datasets; GSE30219, GSE33532, GSE32863 and GSE43458 based on the inclusion criteria. GSE32863 and GSE43458 datasets contained only LUAD patient gene expression profiles, whereas only LUAD pathological subtype patients in GSE30219 and GSE33532 were screened. A total of 263 LUAD and 123 normal lung tissue samples were obtained in this study. The information of the dataset used in this study is summarized in table 1.
After screening the datasets based on the cut-off criteria, a set of 9014 DEGs including 3182 and 5832 up- and down-regulated genes were identified between LUAD and normal lung tissue (table 1). There were 2656 DEGs, including 857 upregulated and 1799 downregulated genes in GSE30219, 3756 DEGs, including 1362 upregulated and 2394 downregulated in GSE33532, 1026 DEGs, including 319 upregulated and 707 downregulated in GSE43458, and 1576 DEGs, including 644 upregulated and 932 downregulated genes in GSE32863 dataset. The volcano plots representing the distribution of the DEGs in each dataset are shown in figure 2A-D. Next, we further identified the genes, which were overlapped among all the four datasets. A total of 73 and 259 overlapping upregulated and downregulated genes, respectively were identified among all the datasets, which are represented in Venn diagrams (figure 2E and 2F, table 2). Moreover, among these commonly shared DEGs we identified the top 20 upregulated and downregulated genes and displayed on the heatmap (figure 2G).
Functional pathway enrichment analysis
GO and KEGG pathway enrichment analysis deciphers the function and relationship of the genes with various biological functions. To assess the functional roles of the 73 upregulated and 257 downregulated genes, DAVID database was used (table S2). Based on the p-value and cut-off criteria, the DEGs were classified into three functions; biological function, molecular function (MF) and cellular function (CF). The biological function (BF) category of the GO analysis revealed the enrichment of the upregulated genes in cellular and metabolic processes, cell division, tissue development and regulation of biological processes (figure S1A). Regarding the CF, DEGs were enriched in extracellular region part, proteinaceous extracellular matrix, cell-cell junction. (figure S1B). Moreover, the DEGs were enriched in protein and chromatin binding and catalytic activity in MF category of the GO analysis (figure S1C). Circos plot showing GO enrichment of the 20 upregulated genes illustrates the role of these genes in collagen catabolic and metabolic processes, extracellular structure organization and tissue development (figure 3A). Based on KEGG enrichment analysis, the upregulated genes significantly enriched in the cytoplasm and extracellular region (figure 3B). Moreover, the downregulated genes were enriched in multicellular organismal processes and regulation of biological processes, metabolic process, immune system (BF) (figure S1D) extracellular matrix, extracellular region part, (CF) (figure S1E), and binding and catalytic activity (MF) (figure S1F). Circos plot showing GO enrichment of the 20 downregulated genes shows the regulation of vasculature and system development, angiogenesis, cell motility and regulation of cell proliferation (figure 3C). Based on KEGG enrichment analysis, the downregulated genes significantly enriched in multicellular organismal processes, developmental processes and extracellular region (figure 3D).
Protein-protein interaction (PPI) network analysis
The establishment of PPI network interaction of these DEGs were constructed based on the information obtained from STRING database. A list of 330 DEGs were submitted to STRING database and a network was obtained, which consisted a total number of 935 edges and 292 nodes (figure S2A). Significant cluster module identification via MCODE-plugin in Cytoscape revealed 4 modules containing 55 genes in the whole PPI network (figure 4A, 4B, 4C and 4D). We calculated the top-ranked hub genes by using the MCC. The top 4 functional clusters of modules were selected (module 1, MCODE score=12; module 2, MCODE score=9; module 3, MCODE score=5.0; module 4, MCODE score=5). The number of edges, nodes and genes in 4 modules are listed in table 3. DAVID database was used to perform the KEGG pathway analysis of the genes in modules (table S3). The genes in module 1 were involved in cell cycle, cell division, p53 signaling pathways. The genes in module 2 were enriched in HIF-1, TNF and PI3-Akt signaling pathways. The genes in module 3 were involved in vascular smooth muscle contraction, protein digestion and absorption, relaxin signaling pathway and pathways in cancer. Moreover, the genes in module 4 were enriched in adherens junction, acute myeloid leukemia, PPAR signaling pathway and FoxO signaling pathway. Next, the cytohubba plugin in cytoscape was applied to identify the hub genes in the modules and obtained 10 hub genes (KIF20A, MELK, CCNB2, CDC20, TOP2A, PECAM1, CDH5, IL6, ZBTB16 and RNF144B) for LUAD. Among the 10 hub genes, 5 genes were upregulated and 5 genes were downregulated. STRING database was used to construct the protein interaction network of the hub genes with each other and with neighboring genes (figure S2B and S2C). The interaction network of the hub genes comprised up of 10 nodes, 17 edges with a p-value of 3.57e-05. In addition, hub genes coexpression analysis revealed possible interaction of among the hub genes (figure 4E).
Validation of the prognostic and diagnostic potentials of the hub genes
In order to corroborate the importance of the and hub genes in LUAD, the prognostic potential of the hub genes was investigated by using KM plotter database. The overall survival (OS) analysis curves of the hub genes are represented in figure 5A-5J. Among the 10 hub genes, the upregulation of KIF20A [HR=1.66 (1.46-1.88)], MELK [HR= 1.61 (1.41-1.82)], CCNB2 [1. 99 (1.74-2.26)], CDC20 [1.82 (1.6-2.07)] and TOP2A [1. 55 (1.36-1.76)] predicted the poorer OS of the LUAD patients, whereas the lower expression of PECAM [HR=0.61 (0.52-0.72)], CDH5 [HR=0.69 (0.61-0.78)], LEPREL1 [HR= 0.76 (0.67-0.86)], RFN144B [HR=0.58 (0.49-0.7)] and ZBTB16 [HR=0.72 (0.62-0.7)] were significantly related to the poorer OS. Moreover, the upregulation of KIF20A, MELK, CCNB2, CDC20 and TOP2A predicted a poor progression-free survival (figure S6A-6E), whereas the lower expression of PECAM1, CDH5, ZBTB16 and RNF144B predicted the poorer PFS (figure S6F-6I).
In addition, we verified the expression and prognostic potentials of the 330 DEGs using GEPIA and KM Plot, respectively and found a set of 49 genes with significantly high expression and prognostic value (figure S3A). The hub genes with highest hazard rate (HR) and high expression including the UBE2T, TYMS, PRC1, CDC45, and TPX2 with HR of 2.23, 2.32, 1.91, 1.84 and 1.86, respectively, which indicated poor prognosis in LUAD patients (figure S3B-S3F and table 4). Similarly, a set of 109 genes with low expression and the good prognosis was obtained (figure S3G) including PTPRM, PRICKLE2, CPED, LPL, and DCN with an HR of 0.53, 0.48, 0.49, 0.65 and 0.61, respectively (figure S3H-S3L). Next, we determined the diagnostic value of the hub genes based on their TCGA-based gene expression profiles and found that these genes possess high diagnostic accuracy with AUC (area under the receiver operating characteristic curve) values of 0.98, 0.97, 0.98, 0.98, 0.99, 0.98, 0.98, 0,93, 0.87 and 0.96 value ranging for KIF20A, MELK, CCNB2, CDC20, TOP2A, PECAM1, CDH5, IL6, ZBTB16 and RNF144B, respectively (figure S4).
Validation of the expression of hub genes
In order to validate the expression of hub genes in TCGA, fragments per kilobase of transcript per million mapped reads (FPKM) gene expression profiles from the TCGA database available at (https://portal.gdc.cancer.gov/) were downloaded filtered and merged to produce one file, which contained the gene expression data for both LUAD and normal samples. The expression of the hub genes was assessed. The result showed that KIF20A, MELK, CCNB2, CDC20 and TOP2A genes were significantly upregulated in LUAD patients compared to the normal samples (figure 6A). However, PECAM1, CDH5, LEPREL1, ZBTB16 and RNF144B were downregulated in LUAD patients (figure 6B).
Next, the expression of hub genes was determined in two lung cancer cell lines; A549 and PC9, and normal bronchial epithelial cells BEAS-2B. We found that the expression of KIF20A, MELK, CCNB2, CDC20 genes was upregulated in lung cancer cell lines compared to the normal bronchial epithelium cell (figure 6C), whereas, the expression of PECAM1, CDH5, LEPREL1 genes was reduced in the lung cancer cell lines compared to BEAS-2B (figure 6D).
Verification of the expression of hub genes on protein level
In addition, the expression of the hub genes was assessed at protein level in the LUAD and normal lung tissue and using HPA database. The expression of KIF20A, MELK, CCNB2, CDC20 and TOP2A was highly expressed in LUAD tissue compared to the normal lung tissue (figure 7A-7E). Contrarily, a reduced expression of PECAM1, CDH5, LEPREL1, ZBTB16 and RNF144B in LUAD as compared to the normal lung tissue (figure 7F-7J). The clinical information of the immunohistochemistry (IHC) in human proteome atlas is available in table S4.
Hub gene-drug interaction network
Using DGIdb for exploring the drug-gene interaction, a list of 117 potential drugs for treating LUAD was identified (table 5). These drugs are all FDA-approved drugs. Among the hub genes, TOP2A was found to interact with 97/118 drugs, either as an inhibitor or in an unknown mechanism, followed by MELK, which interacted with 18 drugs (table 5). The downstream interaction networks of TOP2A and MELK was constructed with STITCH (figure 8A-8B). TOP2A might have a downstream effect on Ciprofloxacin, Dexrazoxane, Etoposide and Amsacrine (figure 8A), whereas MELK might have effect on Dovitinib, Nintedanib, Sunitinib and Staurosporine (figure 8B).
miRNA network analysis:
Given the role of miRNA in regulating biological processes, we identified the miRNAs targets of the hub genes using mirRTarbase and GCBI databases. The miRNA network was constructed and visualized with Cytoscape, which consisted of 9 hub genes and 165 interacting miRNAs. The interaction of the miRNAs with their corresponding hub genes are represented in figure 9A. Based on their degree of closeness, the top 10 molecules in the interaction network were filtered. RNF144B, ZBTB16 and CDH5, LEPREL1 and TOP2A with a degree score of 30,30,27,25 and 16, respectively were the top 5 miRNA targeting genes. Moreover, hsa-miR-16-5p, hsa-miR-128-3p, hsa-let-7f-5p, hsa-let-7c-5p and hsa-let-7a-5p, were the top 5 interactive miRNAs with a score of 52, 47, 46, 45.95 and 45.96, respectively (figure 9B). The miRNA network of the 9 hub genes constructed by using GCBI database is shown in figure S5.
Expression of hub genes in different cancer stages
The expression of the identified hub genes in different lung cancer stages were assessed with GEPIA database. We observed a significant difference in the expression of KIF20A (F value= 6.03 Pr >F=0.00049), CDC20 (F value= 4.95 Pr >F=0.00217), MELK (F value= 4.56 Pr >F=0.00368), CCNB2 (F value= 7.43 Pr >F=7,16E-05), and TOP2A in different stages of lung cancer (F value= 2.88 Pr >F=0.003) (figure 10A-10E). However, the differential expression of PECAM1, CDH5, LEPREL1, ZBTB16 and RFN144B was insignificant (figure 10F-10J). The expression of CDC20, MELK and TOP2A decreased with continuous progression of LUAD (figure 10B, 10C and 10E). Whereas, the expression of KIF20 was higher in stage1 and stage 4 compared to stage 2 and stage 3 (figure 10A), and the expression of CCNB2 was highest in stage 2 (figure 10D).
Genetic alteration of the hub genes
Investigation of the mutation in genes associated with biological functions are important. Several mutations in these genes can lead to LUCA [20] We investigated the genetic alteration in the hub genes using Cbioportal and identified various alteration in 19% of patients or samples. The highest frequency of alteration was observed in MELK, CDC20, TOP2A, PECAM1 and CDH5 (7%, 3%, 2.6%, 2.6% and 1.3%), respectively. These alterations included missense mutation, truncating mutation, amplification, deep deletion and no alteration (figure 11A). Among the alterations, amplification accounted for the highest percentage of the alterations followed by deep deletion and mutations (figure 11B).