60 genes and 54 lncRNAs were identified prognosis-associated
After the first quality check by WGCNA R package, 7 tumor samples and 49 normal low-quality samples were removed from subsequent analysis, the remaining samples needed to be integrated and processed. Statistical analysis software R was used for preprocessing the differential expression analysis of microarray data. The data from the statistical results included the 753 differentially expressed genes,1315 differentially expressed lncRNAs. The volcano plot is used to show the significantly different expressions of the two sets of samples. The p-value and fold change values obtained by accurate T-test statistical analysis were used to draw the volcano map between the two groups (Figure1A, FDR < 0.05 and |log2 fold change (FC)| ≥ 2). In the volcano plot, one of the coordinates shows the negative log of p-values computed by the T-test, and the other shows the converted log of p-values compared to the two conditions. Similarly, we can also find that compared with the normal sample, the breast cancer samples down-regulated more in the differentially expressed genes and lncRNAs. Single variable Cox proportional risk regression analysis was performed to screen genes significantly associated with overall survival (OS) in the TCGA breast cancer dataset. A total of 874 genes and 3262 lncRNAs were significantly associated with prognosis. The set analysis software was used to take the intersection of the two (Venn diagram) to obtain 60 genes and 54 lncRNAs, which were not only differentially expressed between normal and breast cancer tissues, but also had prognostic differences in cancer tissues (Figure1B). Only the intersected genes and lncRNAs were selected for the subsequent bioinformatics.
In order to inquire about the potential signal pathways related to prognosis related genes in breast cancer, we screened the top30 prognosis related DEGs and analyzed them with GO and KEGG by R-packet “limma”, and the screening criteria were | lgfc | > 2, and adj. P < 0.05. GO analysis results showed that prognosis related DEGs can be enriched in basic biological processes(BP), including neurological system process, sensory perception; in cellular component, extracellular space and chromosome were enriched; in molecular function, receptor binding and protein dimerization activity were enriched. KEGG analysis showed that the top30 prognosis related DEGs were mainly enriched in systemic lupus erythematosus, cytokine-cytokine receptor interaction, salivary secretion and chemokine signaling pathway and cardiac muscle contraction pathways (Figure 1C).
Although accumulating studies have attempted to reveal the functional significance of lncRNAs, the biological roles of most lncRNAs are still unknown. Biological processes and cellular regulation networks are very complex, involving the interactions of various molecules, such as proteins, RNAs, and DNAs. We constructed an lncRNA/ mRNA co-expression network based on the correlation coefficient of lncRNAs and mRNAs and investigated the potential interaction between mRNAs and lncRNAs. The co-expression network was composed of 27 differentially expressed lncRNAs, 32 differentially expressed mRNAs and 11 network nodes (Figure 1D). The network showed that several lncRNAs (AC104211.2, RBMS3-AS3, AC002401.3, AC063919.1 and BOK-AS1) correlated with a great number of targeted mRNAs, and vice versa. This co-expression network also indicated that one lncRNA (AC104211.2) could target 15 network nodes and one coding gene (LVRN) could target 17 network nodes. In addition, the second ranked lncRNA (RBMS3-AS3) and mRNA (KLHL33), could both target 15 network node. Taken together, these results suggested the closer inter-regulation of lncRNAs and mRNAs in breast cancer. Through the co-expression network we obtained 31 hub genes, including 15 lncRNAs (AC104211.2, RBMS3-AS3, AC002401.3, AC063919.1, BOK-AS1, AL020994.3, LINC01497, LINC02766, AC135584.1, LINC01612, AP003031.3, LINC01402, AC025280.1, AC096637.3, LINC00922) and 16 genes (LVRN, KLHL33, HIST1H2AM, PENK, SLC24A2, HIST1H2AD, HIST1H2AJ, MIR23A, MIR4269, C20orf85, CDC20B, CST2, HIST1H1T, HIST1H2AL, OR2B6, RDM1).
Preliminary Identification of Optimal Prognostic Biomarkers
LASSO logistic regression were used to screen the characteristic genes by appling the glmnet package. The 31 genes entered the LASSO regression analysis (Figure 2.A, B) and then further analyzed by Stepwise Multivariate Cox Regression analysis to establish a prognostic model for patients with breast cancer on the basis of lncRNA and gene expression levels. Finally, 3 lncRNAs and 2 genes (LINC01497, LINC02766, LINC02528, C20orf85, CST1) related to the prognosis of breast cancer were obtained. The risk scores for patients were calculated on the basis of the relevant RNA expression level and risk coefficient of each gene, and patients were categorized as low or high risk according to the optimal cut-off (Figure 2C). The AUC of risk score was 0.634, which proved that the Cox model had an ability to predict prognosis and was an independent prognostic indicator (Figure 2D). Scatter plot was drawn to show gene expression profiles in high-risk and low-risk groups (Figure2f). Kaplan-Meier curves of OS in all patients with breast cancers based on 3 lncRNAs and 2 genes expression showed that the survival time of patients with low-risk score was significantly longer than that of patients with high-risk score (Figure 2E).
The 5 biomarkers were independent prognostic indicators
In an attempt to confirm the independent prognostic impact of individual lncRNA or gene, we performed univariate COX regression of the 5 screened variables, the results showed that the 5 biomarkers coul be independent prognostic indicator (Figure 3A). Calculated by multivariate Cox regression model, hazard ratio with 95% confidence interval (95% CI) for independent 3 lncRNAs and 2 genes signature were shown in forest plot. The forest map results showed that the genes with HR>1 (ENSG00000237560(LINC01497), ENSG00000229484(LINC02766)) are considered to be dangerous genes, while the genes with HR<1(ENSG00000124237(C20orf85), ENSG00000170373 (CST1), ENSG00000226004(LINC02528)) to be a protective gene (Figure 3B).
Composition differences of TME between high and low risk breast cancer groups and their associations with prognosis
Of all breast cancer samples, high and low risk groups were eligible based on xCell analysis. The results showed that there were 44 cell types differently expressed between high and low risk groups, and the immune score and stroma score were also different between the two groups (Figure 4A, p<0.05). The two most differently expressed components of the TME in breast cancer tissues were Epithelial cells and HSCs. Specifically, the proportions of Adipocytes, Endothelial cells, HSCs, Fibroblasts were significantly lower in low risk score tissues compared with the high risk score tissues, while the proportions of M1 macrophages, MSCs, Th2 cells were significantly higher (Figure 4B). Correlations among the components ranged from weak to moderate. Obviously, Astrocytes showed highly positive correlations with CD8+ Tcm and MSCs, Myocytes showed highly positive correlations with Mast cells, Macrophages M2 and Osteoblast (Figure 4C). Then single variable Cox proportional risk regression analysis was performed to screen cells types significantly associated with overall survival (OS). The results showed that there were 13 differently expressed cell types between high and low risk groups were associated with survival, of which, aDC, CLP, Melanocytes, NKT, pDC, Tgd.cells were protection factors, while Endothelial cells, Hepatocytes, ly Endothelial cells, mv. Endothelial cells, Neurons, Neutrophils, Preadipocytes were associated with poor prognosis (Figure 5), indicating that the TME compositions have high sensitivity and specificity to predict the prognosis of breast cancer patients.
Validation of the 5 indicators expressions Using RT-PCR
To verify the reliability of the aberrant lncRNAs and mRNAs found in the TCGA database, a RT-PCR assay was used to detect the expression levels of the 3 lncRNAs and 2 mRNAs in 25 paired tumor tissues and paracancer tissues from patients with breast cancer. The results indicated that the levels of LINC01497 and LINC02766 were decreased in the tumor tissues, and the LINC02528,c20orf85 and CST1were upregulated(Figure 6A,n=10,tumor vs normal, Student’s t test,*p<0.05). The results were consistent with the expressions in TGCA BRCA data.
Construction of the breast cancer-specific ceRNA Network
The biological roles of most lncRNAs are still unknown, accumulating studies have attempted to reveal the functional significance of lncRNAs on gene by modulating miRNAs. To establish the ceRNA network, we obtained the miRNAs that might interact with both the 3lncRNAs and 2 genes. Based on the results of the online prediction, by integrating them, a ceRNA network containing 3 lncRNAs, 2 mRNAs, and 158 miRNAs was finally constructed (Figure 6B). A total of 286 pairs of miRNA-lncRNA interactions containing 147 miRNAs and 3 lncRNAs, 70 pairs of miRNA-mRNA interactions containing 52 miRNAs and 2 mRNAs were contained in the ceRNA network. The degree of nodes in the ceRNA network was calculated and red color presents a highest degree, and purple presents a lowest degree.