Patient sample data collection and processing
Public gene-expression data and clinical annotations were downloaded from the cBioPortal online database (http://www.cbioportal.org/). The dataset we selected was the Metastatic Breast Cancer Project. Samples were selected according to the following criteria: 1) samples annotated with bone, brain, liver, lung, or ovary metastasis were included; 2) samples with missing data were excluded. We then selected the top 30% most variable genes for our study. In the end, a total of 80 patient samples and 10,083 genes were included in our study.
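The gene-filtering step can be illustrated with a minimal sketch. It assumes that "most variable" means ranking genes by the variance of their expression across samples, which the text does not state explicitly; the function name, gene labels, and toy values are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(expr, fraction=0.30):
    """Return the names of the most variable genes.

    expr maps a gene name to its expression values across samples.
    Ranking by variance is an assumption; other variability measures
    (e.g. median absolute deviation) could be used instead.
    """
    variances = {gene: pvariance(values) for gene, values in expr.items()}
    n_keep = max(1, int(len(variances) * fraction))
    # Sort by decreasing variance; break ties by gene name for determinism.
    ranked = sorted(variances, key=lambda g: (-variances[g], g))
    return ranked[:n_keep]

# Toy data: ten hypothetical genes measured in four samples.
expr = {f"G{k}": [0.0, float(k), 0.0, float(k)] for k in range(10)}
selected = top_variable_genes(expr, fraction=0.30)
print(selected)
```

With ten toy genes and a fraction of 0.30, three genes are retained.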
Weighted gene co-expression network construction
In the present study, the soft-threshold power β was set to 7. Subsequently, the adjacency matrix was transformed into a topological overlap matrix (TOM). Next, we performed hierarchical clustering to identify modules, with each module containing at least 20 genes (minModuleSize = 20). Finally, we calculated the module eigengenes, hierarchically clustered the modules, and merged similar modules.
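As an illustration of the first two steps, the following sketch computes a soft-thresholded adjacency matrix and the corresponding topological overlap on toy data. It assumes an unsigned network, in which adjacency is the absolute Pearson correlation raised to the power β; the text does not state the network type, and the function names and profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(profiles, beta=7):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)|^beta, zero diagonal."""
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(profiles[i], profiles[j])) ** beta
            adj[i][j] = adj[j][i] = a
    return adj

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity; diagonal is zero
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            t = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = t
    return tom

# Toy profiles: genes 0 and 1 are perfectly correlated, gene 2 is not.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
adj = adjacency(profiles, beta=7)
tom = topological_overlap(adj)
```

Hierarchical clustering would then typically be run on the dissimilarity 1 - TOM; that step is omitted here.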
Identification of clinically significant modules
A co-expression module is defined as a group of genes with high topological overlap similarity; genes in the same module generally have a higher degree of co-expression. In this study, two measures were used to identify the important modules associated with clinical traits. First, the module eigengene (ME), the first principal component of the module, describes the expression pattern of the module in each sample. Second, module membership (MM), the correlation coefficient between a gene's expression and the module eigengene, describes how reliably a gene belongs to a module. Finally, the correlation between the modules and the clinical data was calculated to identify clinically significant modules.
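The three quantities above can be sketched on toy data as follows. The eigengene is approximated here by power iteration on standardized expression profiles, started from the mean profile so that its sign follows the average expression; this is an illustrative stand-in rather than the procedure used in the study, and all names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(v):
    """Center to mean 0 and scale to unit sample standard deviation."""
    n = len(v)
    m = sum(v) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in v) / (n - 1))
    return [(x - m) / sd for x in v]

def module_eigengene(profiles, n_iter=200):
    """First principal component across samples, via power iteration.

    Assumes the mean standardized profile is not the zero vector and is
    not orthogonal to the leading component.
    """
    z = [standardize(p) for p in profiles]
    n = len(z[0])
    me = [sum(g[s] for g in z) / len(z) for s in range(n)]
    for _ in range(n_iter):
        loadings = [sum(g[s] * me[s] for s in range(n)) for g in z]
        me = [sum(l * g[s] for l, g in zip(loadings, z)) for s in range(n)]
        norm = math.sqrt(sum(x * x for x in me))
        me = [x / norm for x in me]
    return me

# Toy module of three genes in five samples, plus a binary clinical trait.
module = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 5.0, 8.0, 10.0],
    [1.5, 2.5, 2.0, 4.5, 5.5],
]
trait = [0, 0, 0, 1, 1]  # hypothetical coding, e.g. 1 = lung metastasis

me = module_eigengene(module)
mm = [pearson(gene, me) for gene in module]   # module membership per gene
module_trait_cor = pearson(me, trait)         # module-trait correlation
```

In a full analysis this would be repeated for every module and every clinical trait, with a significance test attached to each correlation.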
Gene Ontology Enrichment and KEGG Pathway Analysis
WebGestalt (http://www.webgestalt.org/) is a web tool for functional enrichment analysis that helps users interpret the biological functions of genes and proteins; it supports three well-established and complementary enrichment methods. We used WebGestalt to perform GO enrichment and KEGG pathway analysis of the genes in the royalblue module. An adjusted P < 0.05 was used as the threshold for identifying significant terms.
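The thresholding on adjusted P values can be illustrated with a short sketch. It assumes the Benjamini-Hochberg procedure as the adjustment method, which the text does not specify; the term names and raw P values are invented for illustration only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical enrichment terms with made-up raw P values.
terms = ["term_A", "term_B", "term_C", "term_D"]
raw_p = [0.01, 0.04, 0.03, 0.20]

adj_p = benjamini_hochberg(raw_p)
significant = [t for t, p in zip(terms, adj_p) if p < 0.05]
print(significant)
```

Note that a term with a raw P value below 0.05 may no longer pass the threshold after adjustment.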
Multivariate Cox regression
Multivariate Cox regression was performed using SPSS software. A total of 39 genes were screened to identify the optimal prognostic signature for breast cancer with lung metastasis. All data were evaluated by Pearson’s chi-square test using SPSS software.
GEPIA Database
GEPIA (http://gepia.cancer-pku.cn/index.html) is an online database that facilitates the standardized analysis of RNA-seq data from 9,736 tumor samples and 8,587 normal control samples in the TCGA and GTEx datasets. In our study, we used this database to analyze the transcription levels of hub genes in breast cancer samples and normal samples. The P value cutoff was set at 0.01.
Kaplan-Meier Plotter Analysis
In our study, we used the Kaplan-Meier plotter (http://kmplot.com/analysis/), an online database for exploring the impact of genes on patient survival in different types of cancer, to verify the prognostic value of hub genes in breast cancer patients.
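For readers unfamiliar with the underlying estimator, the following is a minimal sketch of the Kaplan-Meier survival estimate for a single patient group. It does not reproduce the online tool, which additionally splits patients by gene expression and compares the groups; the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time per patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

# Hypothetical follow-up data (months) for five patients.
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored patients leave the risk set without lowering the survival estimate, which is why the curve only steps down at observed event times.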
bc-GenExMiner v4.0
Breast Cancer Gene-Expression Miner v4.0 (bc-GenExMiner; http://bcgenex.centregauducheau.fr/BC-GEM/GEM-Accueil.php?js=1) is an online tool containing published annotated genomic data, comprising 36 annotated genomic datasets and 5,861 patients with breast cancer. Based on these data, it can be used as a statistical mining tool to estimate Pearson's correlation through its correlation module.