Derivation of de novo pan-sarcoma EMT molecular subtypes from the perspective of EMT signature
First, we collected EMT process-related genes that were curated by Tuan et al., Rokavec et al, Kandimalla et al., Koplev et al., and Hollern et al.’s pan-cancer studies. In total, 630 genes were annotated and combined into a merged EMT signature. Detailed gene symbols and gene types (epithelial/mesenchymal marker) are delineated in Supplementary file 1. As shown in Fig. 1A and 1B, a total of 1440 sarcoma patients of various histological subtypes were enrolled in the present study, in which RNA-seq expression data was contained in TCGA-SARC and TARGET-OS, and microarray expression data based on the same platform was contained in the other datasets. To obtain a comprehensive understanding of the pan-sarcoma EMT molecular subtypes, we combined transcriptomic profiling and available clinical information of 17 datasets that were tested on the same platform (GSE13433, GSE142162, GSE14827, GSE17618, GSE20196, GSE20559, GSE23980, GSE34620, GSE34800, GSE37371, GSE66533, GSE71118, GSE87437, E-MEXP-1922, E-MEXP-3628, E-MEXP-964, E-TABM-1202) (Fig S1A). A large pan-sarcoma expression dataset that contains 1085 samples with over 12 subtypes was involved for further clustering analysis.
Through an unsupervised consensus clustering method on the expression pattern of the merged EMT signature, we classified sarcoma patients into distinct EMT molecular subtypes, where 636 patients were assigned to EMT Cluster_1 (EMT_C1), and 449 patients were assigned to Cluster_2 (Supplementary file 2). Consensus matrix and silhouette analysis (average width: 0.93) showed satisfactory clustering results (Fig. 2A and 2B). To reveal the association of EMT molecular subtypes and prognosis of sarcoma patients, we performed Kaplan-Meier survival analysis on patients with matched expression profiling and clinical information. For Chibon et al.’s Sarcoma cohort (GSE71118), we obtained a p-value of 0.003596 from the log-rank test, indicating that patients of EMT_C2 had significantly worse metastasis-free survival (MFS) (Fig. 2C). A consistent result was also found as shown in Fig. 2D that patients of EMT_C2 had significantly worse overall survival (OS) in Williamson et al.’s rhabdomyosarcoma cohort (E-TABM-1202, log-rank p = 0.02605). Furthermore, more patients with metastatic diseases were observed in EMT_C2 (Fig. 2E, 49% vs 35%, p = 0.019). To analyze the biological processes and pathways variation of the distinct EMT molecular subtypes, we implemented gene set variation analysis (GSVA). As shown in Fig. 2F, oxidative damage, TGF-β signaling, and several immune-related pathways including interleukin-10 (IL-10) signaling, type II-interferon (IFNG) signaling, T/B cell receptor signaling pathway, and NK cell chemotaxis/cytotoxicity were significantly enriched in the EMT_C1 group while mRNA capping/processing/splicing, nucleolus organization, and several DNA damage repair related pathways including mismatch repair and base excision repair were significantly enriched in the EMT_C2 group. Accumulating studies have reported the potential association between TME-infiltrating immune cells and dysregulated EMT/MET in the tumor. Thus, we applied the xCell tool, a novel gene signature-based ssGSEA method to estimate the overall TME infiltration status, and found that both stromal and immune scores of EMT_C1 were significantly higher than those of EMT_C2 (Fig. 2G). Moreover, CIBERSORTx, a deconvolution algorithm was applied to assess the infiltrating abundance of various immune cell types between EMT subgroups (Fig. 2H and Fig S1B, C). Activated memory CD4+ T cells, activated NK cells, γδ-T cells and CD8+ T cells showed high infiltration in the EMT_C1 group, whereas regulatory T cells (Tregs), resting NK cells and activated Dendritic cells were more abundant in the EMT_C2 group. In addition, we found that patients of EMT_C1 possessed higher EMT scores, which indicated a tendency to mesenchymal phenotype (Fig S1D).
WGCNA and identification of lncRNAs associated to EMT molecular subtypes
We used variance-stability-transformed expression data via DESeq2 as the input data for WGCNA. The best β value in the co-expression network was calculated to be 7 (Fig S2A-C). A total of 21 gene modules were finally determined after dynamic tree cutting and module merging processes (Fig. 3A, Fig S2E and Supplementary file 3). As shown in the module-trait relationship, many modules were found significantly correlated (P-value < 0.05) with the EMT clusters (Fig. 3B). We screened modules with relatively high correlation coefficients (≥ 0.3). Furthermore, after the intramodular analysis, we finally determined five gene modules which have a good correlation of module membership and gene significance for the EMT molecular subtype (Fig. 3C and Fig S2F). According to gene biotype annotation of Ensemble GRCh38.104, 72 lncRNAs in the five gene modules were identified as EMT molecular subtype-associated lncRNAs.
Identification of immune-related lncRNAs across pan-sarcoma types
To identify candidate lncRNA modifiers that are relevant with tumor immune across pan-sarcoma types, we proposed a three-line parallel computational approach, which involves correlations of lncRNAs expression to 1) immune marker genes expression, 2) immune-related pathway activity, and 3) abundance of TME-infiltrating immune cells. Briefly, the Pearson correlation test on normalized lncRNA expression and corresponding term was performed for each step as shown in the schematic diagram of Fig. 3D. LncRNAs in the correlation pairs with a Pearson correlation coefficient ≥ 0.3 and an adjusted p-value < 0.05 were selected. A total of 37 lncRNAs were identified to be robust candidates involved in tumor immunity across pan-sarcoma types (Supplementary file 3).
Construction and validation of a pan-sarcoma EILncRNA signature scoring model
As shown in Fig. 3E, we finally determined 26 lncRNAs that are concurrently related to EMT molecular subtype and tumor immunity across pan-sarcoma types (EILncRNA). Considering the heterogeneity of sarcoma subtypes and the complexity of interaction between EMT and tumor immunity, we proposed to develop an EILncRNA signature-based scoring model (EILncSig) to quantitatively estimate the cross-talk characteristics of EMT, tumor immune microenvironment (TIME) and tumor immunity for individual sarcoma patients. We selected Chibon et al.'s sarcoma dataset (GSE71118) as the training cohort, which has the largest sample size (n = 311) with clinical information (MFS) in the present study. We performed univariate Cox proportional hazards regression analysis to clarify the prognostic significance of the 26 EILncRNAs. A total of seven EILncRNAs (MIR22HG, LINC01140, LBX2-AS1, WWP1-AS1, AFTPH-DT, MIR155HG, and MCM3AP-AS1) were then selected to construct the EILncRNA signature-based scoring model. EILncSig score was computed as the sum of the normalized expression of the seven EILncRNAs weighted by corresponding multivariate Cox regression coefficients (Supplementary file 4).
As shown in the time-dependent receiver operating characteristic (ROC) curve analysis for MFS prediction, areas under the curve (AUC) were 0.714, 0.684, and 0.680 for 1, 3, and 5 years, respectively. By using the optimal cutoff value of EILncSig score, patients in the training cohort were stratified to high- and low-EILncSig groups. Kaplan-Meier survival analysis showed patients of the high-EILncSig group had significantly worse MFS (log-rank P = 2.708e-9) (Fig. 4A1). The distribution of the EILncSig score and the seven-EILncRNA expression between high- and low- EILncSig groups was shown in Fig. 4A2.
To validate whether the EILncSig scoring model acquires robust effectiveness across pan-sarcoma patients, we enrolled three independent datasets as testing cohorts (Williamson et al.’s rhabdomyosarcoma cohort, E-TABM-1202, n = 101; TCGA-SARC sarcoma, n = 259; TARGET-OS osteosarcoma, n = 95) for further validation. The risk score for each patient was calculated and all patients were stratified into the high- and low-risk groups. As for Williamson et al.’s rhabdomyosarcoma cohort, the time-dependent ROC curve analysis indicates EILncSig as a prognostic predictor for OS. Kaplan-Meier survival analysis showed significantly worse OS of patients in the high-risk group (log-rank P = 0.01509, Fig. 4B). Consistent results of time-dependent ROC curve and Kaplan-Meier survival analyses on both OS and relapse-free survival (RFS) were also successfully validated in the other 2 validation cohorts (TCGA-SARC and TARGET-OS) as shown in Fig. 4C and 4D.
To confirm whether the EILncSig scoring stratification could be an independent prognostic factor of other clinical features, patients from TCGA-SARC and TARGET-OS with available clinicopathologic parameters were involved in univariate and multivariate Cox regression analyses (Supplementary file 4) to test the performance of the EILncSig after being adjusted by clinicopathologic parameters such as age, gender, tumor metastasis, tumor grade, etc. As shown of the multivariate Cox regression analyses in Fig. 4C4 and 4D4, the HRs of high-EILncSig versus low-EILncSig for OS were 2.598 (P = 0.00613; 95% CI: 1.313–5.143) in TCGA-SARC testing cohort, and 3.938 (P = 0.04687; 95% CI: 1.019–15.217) in TARGET-OS testing cohort, respectively. Therefore, the EILncSig was identified as an independent factor for OS and RFS prediction. Taken together, the results of the training and testing cohorts indicated that the EILncSig scoring model could be an excellent model for predicting the prognosis of sarcoma patients, which may aid in formulating precise therapeutic strategies for patients with sarcoma.
We further examined the associations of EILncSig scores and multiple tumor characteristics across pan-sarcoma patients. Chibon et al. established a prognostic gene expression signature, complexity model in sarcomas (CINSARC), to improve sarcoma patients grading. As shown in Fig. 4E, higher EILncSig scores were found in the CINSARC_C2 group (p = 4.6e-8). As for the TCGA-SARC cohort, relapse patients and patients with metastasis were found with higher EILncSig scores (p = 0.0078 and 0.00043, Fig. 4F1-2). A congruent result was also found in the integrative clustering (iCluster) molecular subtypes of sarcoma identified by Alexander et al. The iCluster_C1 group in which patients have the worst prognosis, possesses higher EILncSig scores, whereas the iCluster_C3 group has the lowest EILncSig scores (p<2e-16, Fig. 4F3). In addition, we found that EILncSig scores were positively correlated to EMT scores, and the EMT_C2 cluster had higher EILncSig scores in the combined pan-sarcoma dataset (Fig S3A, B), which demonstrated the significant association between EILncSig and EMT molecular phenotype across pan-sarcoma patients.
TME and immune patterns associated with EILncSig in sarcoma
Bagaev et al. developed 29 sets of gene expression signatures describing pan-cancer TME characteristics and applied them in exploring TME patterns in pan-cancer patients. Four TME subtypes (immune-enriched, fibrotic (IE/F); immune-enriched, non-fibrotic (IE); fibrotic (F); and depleted (D), respectively) were defined to demonstrate the role of TME in cancer progression and metastasis. We selected sarcoma datasets (TCGA-SARC, TARGET-OS, GSE71118, E-TABM-1202) to analysis the characteristics of TME across pan-sarcoma patients. After computing EILncSig scores and assigning patients to high- and low-EILncSig levels within each cohort, all patients were included in the clustering analysis of the TME pattern. We utilized an unsupervised clustering method to assign the pan-sarcoma patients into 4 groups by using robustly standardized GSVA enrichment scores of the 29 functional gene expression signatures (FGES) sets (Supplementary file 5 and Fig S3C). As shown in the heatmap (Fig. 5A), sarcoma patients of distinct FGES characteristics along with high- and low- EILncSig stratifications were distributed among the four TME patterns. We utilized the t-SNE analysis to demonstrate the definite diversity of sarcoma patients per TME pattern (Fig. 5B). Furthermore, high- and low- EILncSig stratifications and four TME patterns presented significant concordant relationships among sarcoma patients (Fig. 5C). Consistent with previous results, the TME-Depleted pattern with the worst prognosis covered 50% of the high-EILncSig group whereas TME-IE and IE/F patterns representing better prognosis were more enriched in the low-EILncSig group.
Thorsson et al. identified immune subtypes (wound healing, IFN-γ dominant, inflammatory, lymphocyte depleted, and TGF-β dominant) to define pan-cancer immune response patterns that impact prognosis and tumor-immune interactions. We collected the immune subtype information of TCGA-SARC samples (five immune subtypes involved totally) (Supplementary file 5). As shown in Fig. 5D, there is a significant difference in EILncSig scores among five immune subtypes (p = 0.00034), where extremely low EILncSig scores were in the TGFβ-Dominant immune subtype. Additionally, further analysis revealed that expression levels of five EILncRNAs (lncRNAs WWP1-AS1, AFTPH-DT, LBX2-AS1, MCM3AP-AS1 and miR155HG) were also significantly different among five immune subtypes (Fig. 5E and Fig S3D). In the aspect of TME-infiltrating immune cells estimated by CIBERSORTx (Supplementary file 5 and Fig S3E), CD8+ T cells, activated memory CD4+ T cells, Tregs, γδ-T cells, monocytes and macrophages (M1 and M2) showed high infiltration in the low-EILncSig group of better prognoses, whereas resting NK cells and Dendritic cells (resting and activated) were more abundant in the high- EILncSig group (Fig. 5F). Furthermore, the Spearman correlation analysis showed that the EILncSig score was negatively correlated with CD8+ T cells, activated memory CD4+ T cells, and activated NK cells, while positively correlated with resting NK cells (Fig. 5G).
The transcriptomic alteration, SNV, and sCNA associated with EILncSig in sarcoma
Given the development of EILncSig is originated from the lncRNA modulation in the pan-sarcoma cross-talk of EMT molecular and tumor immune characteristics. We further assessed the potential value of EILncSig in the perception of transcriptomic genomic alterations in sarcoma. First, we performed DEG analysis on the 259 samples of the TCGA-SARC dataset via DESeq2, and found that 6621 genes (3384 upregulated and 3237 downregulated) were significantly differentially expressed in the high-EILncSig group (Fig. 6A and Fig S4A). As shown in the heatmap of Fig. 6B, 186 EMT-related genes belong to the DEGs set, in which a major part of mesenchymal marker genes upregulated in the low-EILncSig group. This result is also consistent with the positive correlation between EILncSig scores and EMT scores in the combined pan-sarcoma microarray dataset. We further used DESeq2 Wald statistic as a rank list for pre-ranked gene set enrichment analysis (GSEA). As shown in Fig. 6C, ridge plots of GSEA revealed that several gene sets, including DNA damage repair, TP53 activity regulation, histone methylation and protein acetylation, were enriched in the high-EILncSig group, whereas tumor immune activity-related gene sets, such as immune response regulation, cytokine production, interferons and interleukins signaling, were enriched in the low-EILncSig group.
We analyzed the somatic mutation data of samples with matched EILncSig scores from TCGA-SARC, with 98 and 137 patients in the high- and low-EILncSig groups, respectively (Fig S4B, C and Fig. 6D). TP53 mutation was found as top1 mutation both in the high- and low-EILncSig groups. However, a higher mutation frequency (47% vs. 32%) was observed in the high-EILncSig group. The mutation frequency of RB1, a well-known tumor suppressor gene, was much higher in the high-EILncSig group (ranks 2nd) than that in the low-EILncSig group. Another widely-studied cancer-related gene TTN were also found mutated with relatively high differential frequencies in the low-EILncSig group. We disclosed specific mutation sites of TP53, RB1 and TTN corresponding to their amino acid location between the high- and low-EILncSig groups (Fig. 6E, F and Fig S4D).
As for the somatic copy number alteration (sCNA), we evaluated its divergence associated with EILncSig by using GISTIC 2.0, which involved 258 samples with matched EILncSig scores in TCGA-SARC (Fig S4E). As shown in Fig. 6G, higher copy number deletion events were found in the high-EILncSig group while no significant difference of amplification was observed. Additionally, there existed a significantly positive correlation between copy number deletion events and EILncSig scores (R = 0.241, P = 8.99e-05) (Fig S4F and Fig. 6H). As the previous GSEA showed that DDR-related pathways were found activated in the EILncSig-high group, these results pointed out that the EILncSig might potentially reflect the genome instability in sarcoma. Moreover, we implemented functions of GISTIC 2.0 to identify recurrent focal sCNA regions. As shown in Fig. 6I, there existed multiple obvious amplification peaks in the low-EILncSig group, while amplifications on chromosomes 8, 13 and 17 and deletions on chromosomes 1, 13 and 17, were found with higher absolute G-scores in the high-EILncSig group. We identified that several sCNA peaks were distinctly detected in the high-EILncSig group, such as focal amplification peaks, such as the well-studied cancer-driven gene MYC (8q.24.21), several oncogenic genes TFDP1, CUL4A, GAS6 (13q34), DNA damage response related genes TOP3A, ALKBH5 (17p11.2), along with focal deletion peaks including the tumor suppressor gene TP73 (1p36.32) (Supplementary file 6).
EILncSig as a potential predictor of immunotherapy response
Accumulating studies are focusing on identifying robust indicators of immunotherapy response of cancer patients. Predictive efficacy of biomarkers such as expression of certain immune checkpoint inhibitors (ICI), tumor neoantigen burden (TNB), and microsatellite instability (MSI) have been studied in specific cancer types. The clinical development of cancer immunotherapy and the advances in genomic analysis also validated the important role of the TME in response to ICB therapy. Considering the association of EILncSig with immune-infiltrating cells and immune processes activation, we evaluated the potential capacity of EILncSig as a predictor of immunotherapy response. Previous studies have demonstrated that there existed complex cross-talk among tumor immune response, immune infiltration, and expression of ICI genes.
Herein, we firstly compared the expression of several common ICI genes between patients stratified by EILncSig in the TCGA-SARC dataset as shown in Fig. 7A and Fig S5, the expressions of multiple ICI genes including CTLA-4 and PD-1 were significantly higher in the low-EILncSig group. Considering the globally high level of immune infiltration of the low-EILncSig group, ICI genes that are majorly expressed in immune cells are supposed to be of abundant expression. However, we found that expressions of PD-L1, LAG-3, SIGLEC6 and IDO2 had no difference between EILncSig groups, and the expression of VTCN1 was even significantly higher in the high-EILncSig group. In addition, VTCN1 expression was positively correlated to the EILncSig scores (Fig. 7B).
Next, we examined the capacity of the EILncSig to predict the ICB therapy response on an independent clinical cohort. Kim et al.’s cohort (GSE176307), a publicly accessible PD1/PD-L1 therapy dataset with RNA-Seq and follow-up data, was used in this study. Patients were stratified to high- and low-EILncSig groups in the same method (Supplementary file 7). The time-dependent ROC curve analysis showed that EILncSig scores could be used to predict patients’ PFS and OS. The Kaplan-Meier survival analysis revealed patients in the high-EILncSig group had worse OS and PFS after ICB therapy (log-rank P = 0.03753 and 0.01187, Fig. 7C and 7D). Moreover, a lower percentage of high-EILncSig patients achieved complete/partial response (CR/PR) while a higher percentage suffered from stable/progressed disease (SD/PD) compared to the low-EILncSig group (p = 0.018, Fig. 7E). Taken together, low-EILncSig patients were provided with significant clinical benefits, better therapeutic responses, and markable prolonged survival after ICB therapy.
Discovery of potential drugs that target EILncSig in sarcoma
Exploring the complex molecular interactions and regulatory mechanisms of tumor immunity is indeed the exact direction to improving immunotherapeutic efficacy. However, it is noteworthy that the combination of immunotherapy and classical chemotherapeutic drugs could be an achievable approach to promote the effectiveness of immunotherapy. Herein, we mined the CMAP database and interactively analyzed large-scale pharmacogenetic data with molecular characteristics of EILncSig, to discover drugs that may have the potential capacity to convert sarcoma from high-EILncSig into low-EILncSig status (Fig. 8A and Supplementary file 7).
As shown in Fig. 8B, promising drugs with positive connective scores were predicted, such as topoisomerase I inhibitor irinotecan, retinoid drugs isotretinoin, Ca2+ ionophore ionomycin, and antimetabolite drug tioguanine were predicted to be. Although these drugs have different molecular targets, an increasing number of recent publications have validated the potential of these drugs in immune modulation. For example, He et al. developed a PD-L1-targeting immune liposome (P-Lipo) for co-delivering irinotecan and JQ1, which can successfully elicit antitumor immunity in colorectal cancer through inducing immunogenic cell death (ICD) by irinotecan and interfering in the immunosuppressive PD-1/PD-L1 pathway by JQ1[59]. The antitumor immunity or immune-enhancing effect of specific compounds still need to be further validated in sarcoma, while we surmised that these results may be supportive to expand novel combination strategies of classic drugs with immunotherapy for sarcoma patients, and provide fundamental basis for further experiments and clinical trials.