Study design and workflow
The study design and workflow is shown in Figure 1. Briefly, a total of 197 patients with primary tumor tissues and matched normal tissues were enrolled in this study at Fudan University Shanghai Cancer Center (FUSCC). Among them, 98 were AIS (24) or MIA (74) LUAD, termed pre/minimally invasive, and 99 were stage I (83) or IIIA (16) invasive LUAD, termed invasive, according to the 7th TNM staging (Figure 1A). RNA sequencing (RNA-seq) and whole-exome sequencing (WES) were carried out for the 197 (98+99) tumor-normal pairs of samples (Figure 1B). First, we identified 264 DEGs and 25 DMGs between pre/minimally invasive and invasive LUAD corresponding to the transcriptomic and genomic alterations, respectively (Figure 1C). Secondly, we identified 11 and 35 pathways that were enriched with the DEGs and DMGs, respectively. Strikingly, focal adhesion (FA), involving a total of 199 genes including 15 DEGs and 1 DMGs (with an overlap of COL11A1), was the only pathway that was significantly perturbed both transcriptomically and genomically (Figure 1C). Thirdly, two genes (COL11A1 and THBS2) that were most significantly differentially expressed between invasive and AIS/MIA, and that at the same time showed no difference in expression between AIS/MIA and normal, were identified to develop a model (FA2) using an unsupervised consensus clustering method called Partition Around Medoids (PAM), clearly separating stage I invasive LUAD patients into two distinct subtypes (S1 and S2) (Figure 1D). Finally, the differences between subtypes S1 and S2 in molecular characteristics including genomics, transcriptomics, proteomics, tumor microenvironment and clinical outcomes were comprehensively evaluated and confirmed with multiple data sets previously reported in the literature (Figure 1E).
Genomic and transcriptomic alterations from pre/minimally invasive to invasive LUAD identified focal adhesion (FA) pathway and elevated expression of COL11A1 and THBS2 as key changes of LUAD progression
AIS and MIA are similar in genomic and transcriptomic characters. The principal component analysis (PCA) suggested that the expression profiles of AIS and MIA were similar to each other and were closer to that of invasive LUAD than to normal lung tissue (Figure 2A). Meanwhile, almost no significant DMGs and DEGs between AIS and MIA could be identified (Figures S1A and S1B). Therefore, we combined AIS and MIA into one group (AIS/MIA) in the following analyses.
We set out to determine important and reliable disease progression-associated pathways by first detecting DEGs and DMGs between AIS/MIA and invasive LUAD. We thus identified 264 DEGs (|log2FC| >=1 and P < 0.05) and 25 DMGs (P < 0.05) (Figures 2B, 2C, S1C and S1D). Except for BRAF (AIS/MIA vs invasive, 8% vs 1%), the other 24 DMGs, such as TP53 (AIS/MIA vs invasive, 6% vs 38%), showed much higher mutation frequency in invasive LUAD than in AIS/MIA (Figures 2C, S1C and S1D). We performed Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis using the 264 DEGs and 25 DMGs separately. As a result, the focal adhesion (FA) pathway was commonly shared by the 11 DEG-enriched pathways and the 35 DMG-enriched pathways (Figures 2D, S1E and S1F). It has been reported that the FA complex is a bridge between cells and the extracellular matrix, and plays an important role in cell proliferation, invasion, and migration16. We further identified 15 DEGs from the 199 FA pathway genes, which were downloaded from the molecular signature database (MsigDB) (Figure 2E). COL11A1, which was the only gene shared by the 15 DEGs and the 25 DMGs in the FA pathway, may play an important role in the progression of AIS/MIA to invasive LUAD (Figure 2F).
We hypothesize that genes whose expression is significantly increased only from AIS/MIA to invasive (corresponding to good and bad prognosis, respectively), but not from normal to AIS/MIA (both with good prognosis), may play a more prominent role in disease progression and prognosis. By setting a more stringent log2FC > 1.5 cutoff, we retained five genes (SPP1, COL11A1, COL1A1, COMP, and THBS2) with significantly increased expression level from AIS/MIA to invasive. However, two (SPP1 and COMP) of them had already showed significantly higher expression level in AIS/MIA than in normal, and therefore were removed from further consideration (Figure 2G). This process led to three remaining genes, namely COL11A1, THBS2, and COL1A1 (Figure 2H). Considering that COL1A1 and COL11A1 are from the same gene family with similar functions, we only used COL11A1 and THBS2 for subsequent molecular subtyping analysis of stage I LUAD. Obviously, we can see that the expression levels of COL11A1 and THBS2 both increased significantly from normal/AIS/MIA to stage IA (Figure 2I), whereas no appreciable change in expression was observed from normal to AIS to MIA.
Unsupervised consensus clustering classified stage I LUAD into AIS/MIA-like subtype S1 and AIS/MIA-diverging subtype S2 based solely on the expression of COL11A1 and THBS2
We assumed that stage I LUAD patients may be further divided into multiple molecular subtypes including those similar or dissimilar to AIS/MIA in different degrees based on the expression of the two FA genes (FA2). Thus, we used the unsupervised Partition Around Medoids (PAM) consensus clustering method, which is well-known for its robust performance in identifying the true underlying number of clusters within a data set, to cluster stage I LUAD patients using the expression profiles of COL11A1 and THBS2. The expression levels of COL11A1 and THBS2 were used to calculate the Euclidean distance between each sample and the center point of each cluster. After the number of clusters was evaluated from 2 to 10, we identified two subtypes (clusters) named S1 (low expression of COL11A1 and THBS2) and S2 (high expression of COL11A1 and THBS2). This clustering showed the clearest cut (between-cluster distance) and highest Average Silhouette Width (ASW)14,15, a popular cluster validation index to estimate the number of clusters within a data set (Figures 3A and 3B).
To evaluate the molecular subtypes underlying the whole set of 394 samples including normal, AIS, MIA, and invasive LUAD, we performed PAM clustering using the expression profiles of COL11A1 and THBS2. Consistent with clustering of stage I LUAD samples alone, the average silhouette width indicated that the optimal number of clusters was two (Figure 3C). It was interestingly and gratifying to notice that 100% normal, 95.8% AIS, 94.6% MIA, 64.3% IA, 40.7% IB, and 37.5% IIIA were assigned to S1 (Figures 3D and 3E). These results indicated that S1 was closer to AIS/MIA, and that as the disease stage progresses, more and more patients became S2-like.
Distinct molecular differences between S1 and S2 subtypes
We extensively explored the differences in molecular characteristics between S1 and S2 subtypes within stage I LUAD, and included AIS/MIA as a control group in subsequent analyses. We identified seven genes with significantly different mutation frequency among AIS/MIA, S1, and S2 using Fisher’s exact test (Figure 4A). Four (EGFR, TP53, TTN, and CSMD3) of the seven genes were among the top 20 most frequently mutated genes in our data sets (Figure S2A). Except for EGFR and MGAM, the mutation frequency of the other five genes (TP53, TTN, CSMD3, DST, and FSCB) significantly increased in S2 over S1 (Figure 4B). Similarly, tumor mutation burden (TMB) gradually increased from AIS/MIA to S1 to S2 (Figure 4C). The same trend was also seen in APOBEC-related mutations, although only AIS/MIA vs S2 was significantly different (Figure 4D). These results indicated that S1 was genetically closer to AIS/MIA than S2.
Consistent with the trend of genomic characteristics, transcriptomic analysis also indicated that S1 was similar to AIS/MIA. PCA demonstrated that gene expression profiling of S1 was closer to AIS/MIA than S2 (Figure 4E). We further compared the expression profiles between AIS/MIA, S1 and S2, and identified 83 DEGs between AIS/MIA and S1, 881 DEGs between AIS/MIA and S2, and 383 DEGs between S1 and S2 (Figure 4F). Several pathways (ECM-receptor interaction, protein digestion and absorption, and focal adhesion) were enriched with most genes upregulated in S2 (Figure S2B). Meanwhile, the expression of all 15 FA pathway DEGs between AIS/MIA and invasive LUAD (Figure 2E) were also significantly altered between S1 and S2 (Figure S2C). We further explored cancer-associated biological functions of DEGs between AIS/MIA, S1, and S2 using Get Set Variation Analysis (GSVA). A total of 22 hallmarks of cancer were identified using hallmark gene sets from MSigDB (Figure 4G). The enrichment scores of these identified hallmarks showed continuous changes from AIS/MIA, S1, S2, to IIIA, indicating that S1 may be an intermediate biological stage during the development of AIS/MIA to S2 or IIIA.
Finally, we explored the differences in tumor microenvironment (TME) between AIS/MIA, S1, and S2. We analyzed the composition of TME using EPIC17 and MCP-counter18, two widely used software packages for such purposes. Associations between CAF and molecular subtypes were observed in that S2 with COL11A1 upregulation had more activated CAF than S1 (Figure 4H). Consistent with published research, COL11A1 in CAF was increased compared with normal fibroblasts in non-small cell lung cancer (NSCLC)19. CAF, one of the most abundant cell types in tumor tissues, was closely associated with promoting lung cancer development20–22. In fact, many clinical studies aimed at inhibiting the interplay between CAF and tumors are ongoing22. Meanwhile, several studies had suggested that CAF may promote cancer invasion by remodeling extracellular matrix (ECM)22. We also observed that CAF and the ECM-receptor interaction pathways were more active in S2 (Figures 4H and S2B), suggesting that CAF and ECM played important roles in LUAD progression. Therefore, patients with S2, which showed more activated CAF than S1 and AIS/MIA, may benefit from treatment aimed at preventing CAF activation.
Distinct S1 and S2 subtypes were also observed with proteogenomic data
We subsequently reanalyzed the multi-omics data from the study of Gittelle et al.23 to explore differences in proteogenomic characteristics between S1 and S2 subtypes. Gene mutation and expression data were downloaded from Genomic Data Commons (GDC)24 and proteomics and phosphoproteomics data were downloaded from Clinical Proteomic Tumor Analysis Consortium (CPTAC)25. In this unique data set, PAM consensus clustering was also used to classify stage I LUAD based on the expression of COL11A1 and THBS2. Again, the optimal number of clusters was determined to be two after the average silhouette width was evaluated from 2 to 10 (Figure S3A). Therefore, PAM consensus clustering was performed to identify two subtypes based on the expression of COL11A1 and THBS2 of stage I patients. Consistent with what were observed in the FUSCC data set, S2 showed more mutation events, more dead or relapse events than S1 (Figures 5A and 5B). The mutation frequency of TP53, RYR2, USH2A, KRAS, and XIRP2 was much higher in S2 than S1 (Figure 5B). Moreover, the event of copy number variations, such as amplification peaks, was less common in S1 than S2 (Figure 5C). Consistent with our FUSCC data set, the genomes of S1 were relatively simpler than S2.
Quantitative omics, including transcriptomics, proteomics, and phosphorylated proteomics analysis confirmed the distinct differences between S1 and S2. We performed differential expression analysis between S1 and S2 and identified 371 DEGs, 64 differentially expressed proteins (DEPs), and 121 differentially expressed phosphoproteins (DEPPs) (Figure S3C). To further explore biological functions associated with DEGs and DEPs, we performed KEGG pathway enrichment analysis. We found that DEGs and DEPs between S1 and S2 were both enriched in protein digestion and absorption, ECM−receptor interaction, focal adhesion, bladder cancer, and steroid hormone biosynthesis pathways (Figure 5D). Meanwhile, we also found that S2 showed more activated CAF than S1 (Figure S3D). Consistent with what we identified from our FUSCC data set, more activated CAFs and ECM−receptor interaction were also identified for S2 in the Gittelle et al. data set.
We observed a strong correlation between gene and protein expression levels for both COL11A1 and THBS2 (Figures 5E, S3E and S3F), indicating that protein expression, like gene expression, may also be used for molecular subtyping of stage I LUAD patients. To explore the relationship between clinical outcomes and molecular subtypes identified by consensus clustering based on the protein expression of COL11A1 and THBS2, we downloaded proteomics data and the corresponding clinical information from the study of Xu et al.26. We identified two subtypes closely associated with relapse-free survival (RFS) (P < 0.001), although the association with overall survival did not achieve statistical significance (P = 0.31, Figure 5F).
AIS/MIA-like S1 subtype had better prognosis than AIS/MIA-diverging S2 subtype as validated with 11 external data sets
Similar molecular characteristics between S1 and AIS/MIA indicated that they may exhibit similarly excellent prognosis. In addition to our FUSCC data set, we downloaded 11 external published data sets containing both gene-expression data and prognosis information of stage I LUAD (Table S1). A total of 1,368 patients of stage I LUAD were involved in the 12 data sets. When each of the 12 data sets was subjected to the PAM analysis using the expression data of COL11A1 and THBS2, the optimal number of clusters for all data sets was found to be two (Figure S4), indicating the true number of underlying molecular subtypes of stage I LUAD. After subtyping each data set, we merged the 1,368 patients from the 12 data sets together and performed survival analysis (Figure 6). As we expected, survival analysis indicated that subtype S1 had better overall survival (Figure 6A, P < 0.0001) and relapse-free survival (Figure 6D, P < 0.001) than S2. Multivariate Cox analysis also indicated that the two-class subtyping was an independent prognostic variable (Figures S5A and S5B). We further compared the prognostic predictive performance of the two-class subtyping for patients within stage IA and IB separately. As a result, the OS and RFS of S1 were significantly better than S2 for stage IA patients (Figures 6B and 6E, P < 0.001). However, the prognosis predictive performance of two-class subtyping for stage IB was not as good as in stage IA (Figures 6C and 6E). Combining IA and S1 would help identify patients who may be truly at low risk (Figures S5C and S5D). These findings also suggested that the FA2 model had better prognostic stratification for early-stage IA.
To comprehensively evaluate and compare the prognostic predictive power of the simple FA2 model involving COL11A1 and THBS2, we identified 42 published prognosis gene signatures of lung cancer through literature research and the review of Tang et al.7,9,27, with mean and median number of genes of 65 and 42, respectively (Table S3). PAM consensus clustering was performed using the expression profiles of the 42 gene signatures to identify two subtypes for the stage I patients within each of the 12 data sets. To facilitate fare comparisons between different signatures (models), the optimal number of clusters was chosen to be two for all models. We merged the 1,368 patients from the 12 data sets together after we finished subtype classification for each data set. AUC of time-dependent receiver-operating characteristics (ROC) curve and concordance index (C-index) were used to evaluate the performance of the 42 literature signatures plus the FA2 signature with the 1,368 patients. The 43 models were ranked by the mean of AUC and C-index. As a result, the FA2 model, with the least number of genes, ranked top three (OS, Figure S6A) and top five (RFS, Figure S6B) in prognostic predictive power for stage IA patients only. For all stage I patients, the FA2 model ranked top five (OS, Figure S7A) and top 15 (RFS, Figure S7B). These results further suggested that FA2 genes were biologically important and had a good predictive performance for prognosis of stage I patients, especially stage IA patients.