Identification of EMT associated lncRNAs in TNBC Cohort
Our study has downloaded the original mRNA expression data of TNBC(FRKM)from TCGA database, and extracted 21 EMT related regulators. Firstly, we screened a total of 14 differential EMT genes in the expression profile between tumor group and normal group by differential analysis (logFC > 1 & logFC < -1, and p < 0.05), including 7 up-regulated genes and 7 down-regulated genes (Fig. 1A). After that, the expression data of 3234 LncRNAs from TNBC, as well as data from EMT genes, were screened by correlation analysis to find the LncRNAs highly correlated with EMT (|r| >0.3 & P value < 0.001). It revealed that a total of 1033 LncRNAs were highly associated with EMT (Table S1). Among them, 20 LncRNAs and 14 EMT genes were randomly selected to show the correlation in the form of heat map (Fig. 1B). Finally, the significantly down-regulated LncRNAs were screened out (the expression level was 0 in more than half of the samples, or the average expression level was less than 0.3 in the samples), and ultimately 536 LncRNAs were used as candidate gene sets for further modeling and analysis.
Gain Of Prognostic Genes And Construction Of Prognostic Model
In order to further identify the key genes in the screened LncRNAs set, we collected clinical information of TNBC patients and screened out the feature genes in TNBC by Cox univariate regression and lasso regression feature selection algorithm (Fig. 2A-2C). It was demonstrated that 285 LncRNAs (shared genes of candidate gene set and external validated data set) were screened by Cox univariate regression analysis to find the prognostic genes (Table S2), in which 22 prognostic genes with significance (P value < 0.05) were obtained as follows: YTHDF3-AS1, UBE2E2-AS1, SOCS2-AS1, TINCR, A2M-AS1, CYB561D2, TUG1, NIFK-AS1, LINC00667, NDUFB2-AS1, CASC15, PINK1-AS, ZSCAN16-AS1, EPB41L4A-AS1, TRIM52-AS1, LINC00839, ASB16-AS1, RGS5, LINC01023, SLC16A1-AS1, MBNL1-AS1, LINC01315. The patients from TCGA were randomly divided into training dataset and testing dataset at a ratio of 4:1, and we used lasso regression analysis to get the best risk score value for further analysis (Risk Score = NIFK-AS1 x (-0.33684413) + LINC01315 x (-0.322333697) + LINC00667 x (-0.288743288) + ASB16-AS1 x (-0.161362422) + PINK1-AS x (-0.07986676) + RGS5 x 0.169608076 + UBE2E2-AS1 x 0.26526301 + YTHDF3-AS1 x 0.268478279 + ZSCAN16-AS1 x 0.271742665 + SOCS2-AS1 x 0.371413484 + TINCR x 0.398092567 + NDUFB2-AS1 x 0.484513929). According to the median of risk score, patients were divided into high-risk group and low-risk group (median value of TCGA training dataset: -0.209563547920045; median value of TCGA testing dataset: -0.294563095125158) and analyzed by Kaplan-Meier curve. The overall survival (OS) of high-risk group in both sets was significantly lower than low-risk group (Fig. 2D, 2E). Additionally, ROC curve showed that the C-index of both sets are 0.91 and 0.79 respectively, (Fig. 2F, 2G), indicating the model’s better verification efficiency.
Clinical Predictive Value of the Model based on Multi-Omics Analysis
Tumor microenvironment is mainly composed of tumor-associated fibroblasts, immune cells, extracellular matrix, multiple growth factors, inflammatory factors, specific physical and chemical characteristics, and cancer cells. Tumor microenvironment significantly affects the diagnosis, survival outcome, and sensitivity of clinical treatment in cancers. Through analyzing the relationship between risk score and tumor immune infiltration, we further investigated the potential molecular mechanism of risk score in TNBC development, which demonstrated that risk score was positive related with Macrophages M2, Mast cells resting, NK cells activated, Mast cells activated, etc., and negative correlated with T cells CD4 memory activated, Dendritic cells resting, T cells CD4 memory resting, B cells naive, etc. (Fig. 3A). Since surgery combined with chemotherapy is effective in early breast cancer, our research was based on the drug sensitivity data of GDSC database, and the sensitivity of each tumor sample was predicted by R-package "pRRophetic" to further explore the relationship between risk score and sensitivity of common chemotherapy drugs. The results showed that risk score significantly affected the sensitivity of patients to Bicalutamide, Bryostatin.1, Dasatinib, Gefitinib, Lapatinib, and Metformin (Fig. 3B). By investigating the mutation spectrum of high/low-risk groups, we found that there was significant difference between the two groups in the mutation proportion of multiple genes (Fig. 3C).
Prognostic Model Related Signal Mechanism
Subsequently, we analyzed the signaling pathways involved in high/low-risk models to explore the potential molecular mechanism of risk score affecting tumor progression. Results of GSVA revealed that the differential pathways of the two groups were mainly enriched in UV_RESPONSE_UP, ADIPOGENESIS, UNFOLDED_PROTEIN_RESPONSE, HYPOXIA, APICAL_SURFACE, and other pathways (Fig. 4A). Finally, we found that there were significant enrichments in various related pathways through GSEA analysis. Some of the highly significant signaling pathways were shown (Fig. 4B, 4C), which suggested that the disturbance of these signaling pathways in high/low-risk groups affected the prognosis of TNBC.
Robustness Analysis By External Datasets
We downloaded the data of TNBC patients with survival data processed in GEO database (GSE135565、GSE103091), predicted the clinical classification of TNBC base on the model, evaluated the survival differences between two groups through Kaplan-Meier analysis, and investigated the stability of the prediction model. The results demonstrated that the OS of high-risk group was obviously lower than low-risk group in both GEO external verification sets (Fig. 5A, 5B). In order to verify the accuracy of the model, we did ROC curve analysis using external data sets, which showed that the model had a strong efficiency on prognostic prediction (GSE135565-C-index = 0.72, GSE103091-C-index = 0. 65) (Fig. 5C, 5D).
Risk And Independent Prognostic Analysis
Since the samples were divided into high/low-risk groups by the median value of risk score, the results of regression analysis were displayed by nomogram. Among which, the logistic regression analysis revealed that the different stages of TNBC were obviously associated with the distribution of risk score value obtained by our model analysis in all the samples (Fig. 6A). Through the analysis of general linear model (GLM) and Cox proportional hazards (CoxPH) model, we found that the distribution of risk score and several clinical parameters (such as age, stage, T, etc.) had different contributions to the scoring in distinct stages of cancer (Fig. 6B, 6D). At the same time, we also did some prediction analyses on the five-year and seven-year periods (Fig. 6C), and identified that risk score was an independent prognostic factor for TNBC patients through univariate and bivariate analysis (Fig. 6E, 6F).
Correlation Analysis Of Risk And Multiple Clinical Parameters
We grouped all the risk score values by different clinical parameters, which was shown in the form of boxplot graph (Fig. 7A-7D), and found that these risk scores were significant among the groups with multiple clinical indicators through Kruskal-Wallis test (P < 0.05) (Fig. 7A, 7C). As the risk score rose, the stage grade and lymph node involvement increased.