Biological significance of expression similarity
We first applied EC analysis to GO terms to examine the biological significance of expression similarity in the transcriptome datasets of the 1,028 samples downloaded in this study (Table S1). EC analysis is a statistical method for detecting whether the transcriptional profiles of genes belonging to a predefined functional gene set are interrelated[10]. EC scores measure expression similarity within a predefined functional gene set[10]; therefore, when most genes with the same GO term are co-expressed with each other, a higher EC score is obtained. In our EC analysis, EC scores were higher in all three GO groups (biological process [BP], molecular function [MF], and cellular component [CC]) than in random sampling (Fig 1). Among the three GO groups, CC showed the highest expression similarity: approximately 28.55% of its categories showed higher EC scores than random sampling at a threshold EC score of 0.15, compared with approximately 25.86% and 25.62% in BP and MF, respectively (Fig 1 and Table S2).
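The idea behind an EC score can be illustrated with a minimal sketch. It assumes one common formulation, in which the EC score of a gene set is the fraction of its gene pairs whose Pearson correlation reaches a cutoff, and in which the null distribution comes from random gene sets of the same size. The exact definition used in [10] may differ, and the names `ec_score`, `random_ec_scores`, and `pcc_cutoff` are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sx * sy)


def ec_score(expr, genes, pcc_cutoff):
    """Fraction of gene pairs in `genes` whose PCC is at least `pcc_cutoff`.

    Assumed definition; `expr` maps gene name -> list of expression values.
    """
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if pearson(expr[a], expr[b]) >= pcc_cutoff)
    return hits / len(pairs)


def random_ec_scores(expr, set_size, pcc_cutoff, n_draws, seed=0):
    """EC scores of randomly drawn gene sets of the same size (the null)."""
    rng = random.Random(seed)
    all_genes = sorted(expr)
    return [ec_score(expr, rng.sample(all_genes, set_size), pcc_cutoff)
            for _ in range(n_draws)]
```

A GO category would then count as more coherent than random when its `ec_score` exceeds the scores typically returned by `random_ec_scores` for sets of the same size.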
Construction of the breast cancer network
The PCC threshold was set to 0.722, corresponding to the median of all PCC values above the 99th percentile of the random PCC distribution obtained for 1,000 random genes. We then determined the size of the human breast cancer network generated from the 1,028 downloaded samples. The final network dataset used in this study contained 40,750 genes (nodes) and 209,928 gene pairs (edges). From the full human breast cancer co-expression network, we could construct a subnetwork around a gene of interest, which served as a guide gene, and used MR-based truncation to retain genes with very close expression profiles.
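The thresholding and truncation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the percentile uses the nearest-rank rule, MR is taken as the geometric mean of the two directional correlation ranks, and the MR cutoff `mr_max` is a hypothetical parameter because the text does not report one. `pcc` is assumed to be a nested dictionary with `pcc[a][b]` holding the correlation between genes `a` and `b`.

```python
import math
import statistics


def pcc_threshold(random_pccs, pct=99):
    """Median of the random PCC values lying above the given percentile."""
    ordered = sorted(random_pccs)
    # Nearest-rank percentile: smallest value with at least pct% of data at or below it.
    k = max(0, math.ceil(pct * len(ordered) / 100) - 1)
    cutoff = ordered[k]
    tail = [v for v in ordered if v > cutoff]
    return statistics.median(tail) if tail else cutoff


def mutual_rank(pcc, a, b):
    """Geometric mean of the rank of b among a's partners and of a among b's."""
    def rank(src, dst):
        partners = sorted((g for g in pcc[src] if g != src),
                          key=lambda g: pcc[src][g], reverse=True)
        return partners.index(dst) + 1  # rank 1 = strongest correlation
    return math.sqrt(rank(a, b) * rank(b, a))


def neighbours(pcc, guide, pcc_min, mr_max):
    """Genes linked to `guide` that pass both the PCC and the MR cutoffs."""
    return sorted(g for g in pcc[guide]
                  if g != guide
                  and pcc[guide][g] >= pcc_min
                  and mutual_rank(pcc, guide, g) <= mr_max)
```

Calling `neighbours(pcc, "ITGA11", 0.722, mr_max)` with a suitable `mr_max` would return the genes retained around that guide gene. Ranking in both directions means a pair is kept only when each gene ranks highly among the other's partners, which is the motivation usually given for MR-based truncation.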
The ECM, glycoproteins, signal transduction, and secretion play important roles in tissue development, cancer formation, and invasion. To test whether our co-expression network analysis could identify useful transcriptional networks related to human breast cancer development, we selected 11 genes involved in these pathways or functional processes and asked which genes were co-expressed with them across all samples. After screening by the PCC cutoff and MR, a complex network of 72 genes was obtained (Fig 2), and a co-expression network centered on Integrin α 11 (ITGA11) (guide gene; green circle in Fig 2) was constructed. The blue circles represent the remaining 10 guide genes. As shown in Fig 2 and Table S3, the remaining 10 guide genes are strongly positively correlated with ITGA11. The genes in the network were summarized by UniProtKB keyword enrichment analysis (Table 1). As shown in Table 1, ITGA11 shares similar functions with the 10 other guide genes and is found in the same pathways.
GO analysis and KEGG pathway analysis of the breast cancer network
The strong association among the 11 guide genes in the primary expression cluster indicates that genes involved in breast cancer can be tightly co-regulated at the expression level; thus, breast cancer development may be an appropriate subject for co-expression analysis. The results of the GO analysis also support this (Fig 3). In the biological process analysis, the network genes were enriched for terms related to collagen catabolism (e.g., “collagen catabolic process”, “collagen fibril organization”, and “skeletal system development”) and for terms related to the structural components of the ECM (e.g., “extracellular matrix organization”, “cell adhesion”, “extracellular matrix disassembly”, “cell–matrix adhesion”, and “integrin-mediated signaling pathway”) (Fig 3a). The cellular components of the genes obtained in the breast cancer network were mainly associated with the ECM (e.g., “extracellular matrix”, “proteinaceous extracellular matrix”, “collagen trimer”, “extracellular region”, and “extracellular space”; Fig 3b). Additionally, the molecular functions of the obtained genes were primarily related to binding (e.g., “collagen binding”, “integrin binding”, “heparin binding”, and “calcium ion binding”; Fig 3c).
Furthermore, the GO analysis (Fig 3) suggests that various other genes can be included in the breast cancer network (white circles, Fig 2). This indicates that genes not previously reported in breast cancer can be mined from existing data, and that new interactions between these genes can be identified from connections in the network (Fig 2). Because cancer development is a very specific biological event, co-expression network analysis may be particularly well suited to identifying genetic interactions in breast cancer.
In addition, we performed a KEGG pathway analysis of the genes obtained from the breast cancer co-expression network. Consistent with the GO analysis above, pathways related to ECM-receptor interaction, focal adhesion, and protein digestion and absorption were enriched. Unexpectedly, pathways for infectious diseases, such as amoebiasis, were also enriched among the genes obtained by co-expression analysis in this study (Table S4). Together, these results support the gene predictions for breast cancer made by our co-expression analysis, which may be useful for subsequent studies and in the design of medical treatments.
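Term and pathway enrichment of this kind is commonly assessed with a one-sided hypergeometric test. The sketch below shows that calculation as a general illustration. The text does not state which tool or statistic was used for the GO and KEGG analyses, so the choice of test is an assumption and the function name `hypergeom_pvalue` is hypothetical.

```python
from math import comb


def hypergeom_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when drawing n_selected genes without replacement.

    n_background: genes in the background set
    n_annotated:  background genes annotated to the pathway or term
    n_selected:   genes in the network being tested
    n_overlap:    network genes annotated to the pathway or term
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total
```

For example, with the 40,750 background genes and the 72-gene network reported here, `hypergeom_pvalue(40750, n_annotated, 72, n_overlap)` would give the enrichment p-value for one pathway once its size and overlap are known. In practice such p-values are usually corrected for multiple testing across all pathways.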
Identification and validation of hub genes
Based on the UniProtKB keyword enrichment analysis, 62 genes highly correlated with ITGA11 were identified as hub genes (Table 1). Survival analysis of the hub genes was performed using Kaplan-Meier Plotter[9]. Patients were stratified into a high-level group and a low-level group according to the expression ratio. The customized cutoff-high and cutoff-low values for the eight genes were ADAMTS12 (72:28), CEMIP (43:57), COL11A1 (51:49), CTHRC1 (25:75), ITGA11 (54:46), LOXL1 (50:50), LUM (73:27), and P4HA3 (76:24). Among them, CEMIP, COL11A1, CTHRC1, ITGA11, LUM, and P4HA3 were negatively associated with overall survival, whereas ADAMTS12 and LOXL1 were positively associated with overall survival at the early stage (Fig 4). However, expression of the eight genes showed no significant association with disease-free survival (Fig S1).
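The stratification and survival estimate can be sketched as follows. This is a generic Kaplan-Meier product-limit estimator written for illustration, not the implementation behind the Kaplan-Meier Plotter tool. The function names are hypothetical, and mapping a reported ratio onto the `high_fraction` argument is an assumption, since the text does not spell out how each ratio is applied.

```python
def split_by_expression(expr, high_fraction):
    """Label the top `high_fraction` of samples by expression as 'high'.

    `expr` maps sample id -> expression value of one gene.
    """
    order = sorted(expr, key=expr.get, reverse=True)
    n_high = round(high_fraction * len(order))
    high = set(order[:n_high])
    return {s: ("high" if s in high else "low") for s in expr}


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with an event.

    `events[i]` is True for a death and False for a censored observation.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Product-limit step: multiply by the fraction surviving this time point.
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving  # deaths and censored samples both leave the risk set
    return curve
```

Running `kaplan_meier` separately on the high and low groups gives the two curves that are compared for each gene. Testing whether they differ requires an additional statistic, such as a log-rank test, which this sketch does not include.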