Identification of robust DEGs in breast cancer by the RRA analysis
Differentially expressed genes (DEGs) of eight datasets from the GEO database were integrated to perform RRA analysis, and the characteristics for each dataset was shown in Table 1. We used |log2FC| > 1 and FDR < 0.05 as screening criteria to obtain the robust (DEGs) between breast cancer tissues and normal tissues. A total of 512 robust DEGs were identified, containing 202 up-regulated genes and 310 down-regulated genes. COL11A1 (P = 2.47E-19, adjusted P = 6.53E-15, logFC= 2.86) and S100P (P = 1.24E-17, adjusted P = 3.28E-13, logFC=3.50) were the two most significant up-regulated genes. Meanwhile, LEP (P = 2.68E-14, adjusted P = 7.06E- 10, logFC= -3.12) and FGF2 (P = 2.84E-14, adjusted P = 7.48E- 10, logFC= -1.87) were the two most significant down-regulated gene in breast cancer tissues. Fig 1 shows the top 20 most significant up-regulated and down-regulated robust DEGs obtained by RRA methods from these eight different datasets.
GO functional enrichment analysis and KEGG pathway enrichment analysis of robust DEGs
To further understand the biological functions of those 512 robust DEGs, GO functional enrichment analysis and KEGG pathways analysis were performed. These DEGs were significantly enriched in GO terms related to biological process including extracellular structure organization, extracellular matrix organization, ossification, mitotic nuclear division and regulation of lipid metabolic process (Fig. 2a). Cellular component GO terms were mainly distributed in collagen−containing extracellular matrix, extracellular matrix component, lipid drople and fibrillar collagen trimer (Fig. 2b). The molecular function GO terms consisting of extracellular matrix structural constituent, glycosaminoglycan binding, heparin binding, sulfur compound binding growth factor binding those DGEs were significantly enriched (Fig. 2c). What′s more, KEGG pathway enrichment analysis revealed that PI3K-Akt signaling pathway, PPAR signaling pathway, ECM−receptor interaction, Relaxin signaling pathway, IL−17 signaling pathway, AMPK signaling pathway were significantly associated with these robust DEGs (Fig. 2d)
PPI network analysis of robust DEGs
Next, the PPI network of the 512 DEGs was constructed via STRING database, including 493 nodes and 2993 edges. To identify the hub genes in this PPI network, we ranked the network nodes using 9 topological analysis methods including both local- and global-based algorithms from cytoHubba plugin of Cytoscape, and finally found the EZH2 score ranked in the top 10 by 9 above-cited consisting of “MCC”, “MNC”, “Degree”, “BottleNeck”,“EcCentricity”, “Closeness”, “Radiality”, “Betweenness” and “Stress” algorithms (Table 2). Subsequently, gene module analysis was conducted using MCODE plugin in Cytoscape, and the hub gene EZH2 was also found in module 1 (Additional file 1: Figure S1), which is the most important module (MCODE score = 31.942) in all modules. Lastly, GEPIA database analysis showed that the mRNA expression levels of EZH2 were significantly higher in breast cancer tissues than normal breast tissues (Additional file 1: Figure S2) and the expression levels of EZH2 was significantly associated with the OS of patients in breast cancer (Additional file 1: Figure S3). GSEA demonstrated that high expression level group of EZH2 was significantly enriched in “Cell cycle” and “DNA replication”, which have been confirmed as tumor cell proliferation related pathways [37](Additional file 1: Figure S4).
WGCNA identifies key gene module and four novel key genes in breast cancer
We performed WGCNA on the TCGA_BRCA dataset that incorporate 3,769 up-regulation DEGs (p value <0.05) derived from the RRA analysis to find the key modules with breast cancer. After series of quality assessment for gene expression matrix, we set soft threshold as 5 (scale free R2 = 0.97, slope= −1.92) to construct and validate a scale-free network(Fig. 3a-b). By setting minimal module size as 50 genes and cut height as 0.25 to merge similar modules, seven modules were obtained eventually (Fig. 3c; non-clustering DEGs shown in gray). From the heatmap of module–trait correlations (Fig. 3d), we identified that the blue module (cor = 0.44, p = 4e-55) and brown module (cor = 0.46, p = 3e-63) were most correlated with breast cancer (Fig. 3e-f). The blue module contained 920 genes and the brown module contained 730 genes. Next, we set the filter standard of hub gene associated with breast cancer with module membership (MM) value >0.6 and gene significance (GS) value >0.3, and 64 hub genes from the blue module and 38 hub genes from the brown module met the eligibility criteria (Table 3). Among these 102 hub genes defined by GS and MM, four genes CENPL, ISG20L2, MRPL3, and LSM4 were selected for analysis due to they were never reported before in breast cancer.
Correlation analysis of the four novel hub genes with clinicopathological variables in breast cancer
In view of the METABRIC dataset contains a relatively large number of samples and rich clinical information, we explored the relationships between the expression levels of the novel four hub genes and the clinicopathological characteristics in this dataset. Clinicopathological variables in METABRIC dataset mainly include age at diagnosis, grade, tumor stage, tumor size and the number of positive lymph nodes. The results of spearman correlation analysis were shown in Fig. 4, showing the expression levels of hub genes (CENPL, ISG20L2, MRPL3, and LSM4) were notably correlated with clinicopathological variables in breast cancer samples (P < 0.05). Higher expression levels of hub genes of all four selected hub genes (CENPL, ISG20L2, MRPL3, and LSM4) were associated with higher grade or latter tumor stage. And higher expression levels of CENPL, ISG20L2, and LSM4 correlated with bigger tumor size. While higher expression of CENPL, ISG20L2 were correlated with more the number of positive lymph nodes.
Validation of the expression differences of the four novel key genes in breast cancercompared to normal tissues
Based on the TCGA_BRCA and match TCGA normal and GTEx data of GEPIA database, the mRNA expression levels of these genes (CENPL, ISG20L2, MRPL3 and LSM4) were also significantly higher in breast cancer tissues than normal tissues (Additional file 1: Figure S5). Immunohistochemistry analysis from The Human Protein Atlas database also demonstrated the up-regulated of those hub genes expression in protein level in breast lobular carcinoma (Fig. 5). In order to elucidate the underlying mechanisms of abnormal up-regulation of these four hub genes in breast cancer, we investigated the association between gene expression and their methylation levels. DiseaseMeth version 2.0 analysis displayed that the mean methylation levels of CENPL, MRPL3 and LSM4 were all significantly reduced in breast cancer compared with normal breast tissues (p<0.05) (Fig. 6a, 6c and 6d). While the mean methylation levels of ISG20L2 significantly increased in breast cancer compared to normal breast tissues (p<0.05) (Fig. 4b). Additionally, Genetic alterations of CENPL, ISG20L2, MRPL3, and LSM4 were further examined in cBiportal database, as shown in Fig. 6e, the four hub genes were altered in 570 (26%) of 2173 breast cancer patients. And CENPL and ISG20L2 altered the most (20%), showing gene amplification was the main alteration type.
Identifying the diagnostic performance and prognostic value of eachhub gene in breast cancer
We first performed ROC analysis to assess the diagnostic performances of the four hub genes for detecting breast cancer in TCGA_BRCA dataset, and their AUC values (CENPL AUC: 0.934, LSM4 AUC: 0.948, MRPL3 AUC: 0.891, ISG20L2 AUC: 0.918) were showed in Fig. 7a, indicating good diagnostic performance. Subsequently, ROC analysis in GEO datasets further validate the diagnostic value of these four hub genes: AUC values showed in Fig. 7b, CENPL AUC: 0.83, LSM4 AUC: 0.913, MRPL3 AUC: 0.841, ISG20L2 AUC: 0.951, meaning that the four hub genes all yielded good prediction performance. Plus, we used TCGA_BRCA dataset and METABRC dataset to perform OS rates analysis to evaluate the prognostic values of these four novel hub genes in breast cancer patients. Although the two datasets composition were inconsistent, the Kaplan-Meier curves still showed that the difference between high expression groups and low expression groups were significant (all P < 0.05), and the higher expression levels of these four hub genes were significantly associated with the poor OS of breast cancer patients (HR ˃1, Fig. 8).
GSEA and GSVA exhibit a tight relationship between the four hub genes and tumor cell proliferation
To further elucidate the lurking biological functions of CENPL, ISG20L2, MRPL3 and LSM4 in breast cancer occurrence and development, we conducted GSEA and GSVA analyses based on METABRIC dataset. The results of GSEA were shown in Fig. 9, the genes in high expression groups of CENPL, ISG20L2, MRPL3, and LSM4 were all significantly enriched in tumor cell proliferation related pathways such as “cell cycle” and “DNA replication” pathways. Meanwhile, GSVA results substantiated that these cell proliferation-associated genesets were significantly up-regulated in the high-expression groups of CENPL, ISG20L2, MRPL3 and LSM4 (Fig. 10).
RNA in situ detection
We measured the expression abundance and spatial localization of the five hub genes by RNA in situ detection technology. The results show that the signal of five hub genes were mainly distributed in the cytoplasm and nucleus, and the amount of signal originates from hybridization to each probe varied greatly. Among these five genes, ISG20L2 showed the highest expression level, while CENPL and LSM4 have fewer signal (Fig. 11a-d). Compared to normal mammary epithelial cell (MCF10A), the RCPs of each hub gene in both breast cancer cell lines (MCF7 and MD-MB-231) show a significant increase, but the RCPs of ISG20L2 reduced and the RCPs of LSM4 no change was observed in SKBR3 cell (Fig. 11e-i). Considering the five hub genes, which were identified based on PPI networks and WGCNA, share the same signaling pathways during breast cancer progression, we conducted correlation analysis between novel four hub genes (CENPL, ISG20L2, LSM4 and MRPL3) and EZH2. The results of correlations analysis are shown in Table 4, the expression levels of each hub gene (CENPL, ISG20L2, MRPL3, and LSM4) was correlated with EZH2 in three different breast cancer cell lines (p < 0.01). Furthermore, we also performed the correlations analysis in GEPIA database to assess the four hub genes correlation with EZH2 in breast cancer and the results remained satisfactory (Additional file 1: Figure S6a-d, p < 0.0001).