Patients’ characteristics of TCGA and GTEx: We utilized the publicly available TCGA and GTEx RNA-Seq database as our primary sources of GBM tumour and normal brain tissue transcriptomic data, respectively. We downloaded the datasets containing RNA-Seq gene expression profiles and clinical information of 166 patients from TCGA-GBM and 408 normal brain tissues from GTEx database. The combined data were stratified based on gender, age and treatment as shown in Table 1. Out of a total of 166 GBM cases, 104 cases (62.7%) were male and 56 cases (33.7%) were female. GBM is more prevalent in patients aged ≥ 60 years old which accounts for 42.8% of total cases in the TCGA GBM cohort. Fifty-two patients (31.3%) have undergone treatments whereas 62.1% of cases did not have any treatment data. Unfortunately, the clinical data for the GTEx normal brain samples are not publicly available.
Identification of differentially expressed genes in glioblastoma: The analysis pipeline employed in this study is depicted in Figure 1. Briefly, the RNA-Seq raw read counts from the two large compendiums, TCGA and GTEx were utilized to identify the differentially expressed genes between GBM and normal brain tissues. Since most GBM cases are generally found in the supratentorial region of the brain such as the cerebral hemisphere [16], we only extracted the RNA-Seq profiles of this region namely the cortex, frontal cortex, anterior cingulate cortex as per GTEx description. We performed t-distributed stochastic neighbour embedding (t-SNE) analysis to reflect the directionality of transcripts expression among GBM tumour and normal brain tissues read count values. The t-SNE plot showed that all RNA-Seq profiles of all GTEx cortex region clustered together while the GBM RNA-Seq profiles form a separate cluster, thus confirming distinct expression patterns between these groups (Figure 2A). In total, RNA expression data from 18,021 genes were obtained from these combined TCGA and GTEx dataset but only 13,548 genes passed the quality control check. By applying the cut-off criteria log2 fold change |2| and adjusted p-value <0.05, we identified 2381 genes as significantly differentially expressed genes (DEGs) in GBM, of which 648 genes were upregulated and 1733 genes were downregulated (Figure 2B). The detailed information of the differential gene expression analysis is listed in Supplementary Table S1.
Functional enrichment analysis and classification of DEGs: The significant DEGs were then subjected to functional enrichment analysis using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) tools to define their properties and putative biological relevance in GBM. Interestingly, the GO cellular component analysis of both upregulated and downregulated DEGs showed enrichment of cell surface and membrane-associated proteins (Supplementary Figure S1A and S1B). The KEGG pathway enrichment indicated that the upregulated DEGs are involved in pathways related to infectious diseases, pathway in cancer and cell adhesion (Supplementary Figure S1C). Downregulated genes mainly involve in neuroactive ligand-receptor interaction and major cellular signalling pathways (Supplementary Figure S1D).
Identification of GBM cell-surface antigen candidates: The DEGs were then further filtered and classified into the surfaceome gene set as previously defined by Bausch-Fluck et. al [13], Cunha et. al [17] and Lee et. al [18]. These studies utilized different criteria and stringency in curating the surfaceome gene list. From the overall DEGs in GBM, we identified 395 common cell surface genes within these three surfaceome definitions, including 124 upregulated and 271 downregulated genes (Supplementary Figure S2A and Supplementary Table S2). We further classified the surfaceome according to their main subclasses, which are receptors, transporters, enzymes, miscellaneous and unclassified, as previously reported by Almén et. al [19]. Among the defined surfaceome subclasses, 42.8% of the significant differentially expressed surfaceome in GBM belong to the receptor subclass (Supplementary Figure S2B). KEGG analysis of the GBM-enriched cell surface proteins identified pathways related to immune defence and infectious disease pathways while GBM-deficient cell surface genes are enriched in pathways related to in neuroactive ligand-receptor interaction and major cellular signalling pathways (Supplementary Figure S3A and S3B). These findings are almost similar to the enrichment analysis of overall DEGs in GBM (Supplementary Figure S1C and S1D) suggesting that surfaceome has significant roles in dictating GBM cellular activities.
Identification of GBM cell-surface signature by integration of proteomics and transcriptomics data analysis: Thus far, we have (i) classified the overall DEGs in GBM using transcriptomics data and (ii) highlighted the differentially expressed cell-surface genes in GBM. Even though this transcriptomics analysis is very informative for biomarker discovery, we aimed to add another layer of analysis to select for a more high-confidence cell surface signature for GBM. To attain this, we integrated our transcriptomics analysis data with the publicly available proteomics data. This integration will validate the cell surface genes prediction and eliminate the possible discrepancy between the expression levels of mRNAs and proteins due to post-transcriptional and post-translational modifications. Thus, we gathered the publicly available quantitative mass spectrometry analysis data for both GBM tissues and cell lines. We postulated that GBM tissues and cell lines might have different cell surface repertoires and therefore it is important to stratify between these two sources. Additionally, GBM cell lines cell surface signature, as identified in this present study, could be validated experimentally in future functional studies.
Mass spectrometry analysis of five GBM cell lines revealed the upregulation of EGFR, CD44, PTPRJ, SLC1A5, F2R, and TSPAN6 proteins in these samples [12], whereby the expression level of these proteins were in concordance with our transcriptomics data analysis (Figure 3). For tissue proteomics, we found several studies that performed comparative GBM vs. normal brain tissues proteome profiling [11, 20–23]. However, some of these studies either identified only a limited number of proteins or the data are not downloadable. Only one study by Polisetty et al. that has identified a large number of proteins in their proteome profiling study that included 1834 high-confidence membrane proteins with more than 2-fold change [11]. We, therefore, used this dataset where we performed integrative analysis with our analyzed transcriptomics data and identified 10 overlapped genes, MRC2, FCGR3A, HLA-DRA, CD44, CD74, MSR1, CD163, EGFR, ITGB2, PTPRZ1 (Figure 3). The mRNA expression levels correlated with the protein expression levels except the PTPRZ1 where the mRNA levels showed upregulation while proteomics data showed downregulation (Supplementary Table S3 and S4). In total, there are 14 genes from the combined tissues and cell lines proteomics that overlapped with our transcriptomics data (Figure 3). It is important to note that proteins identification in mass spectrometry can be limiting due to protein isolation methods, proteins solubility, and other intrinsic variations that affect the proteins abundance as well as the sensitivity and detection capability of the MS instrumentation [24, 25]. Thus, these limitations may underestimate the results between transcriptomics prediction and proteomics discoveries.
Surfaceome protein-protein interaction network cluster analysis and prioritization of high-confidence GBM cell surface markers: We set out to further analyze the GBM-enriched cell surface markers using protein-protein interaction (PPI) network analysis. This is to better understand the interplay between the cell surface genes within the identified DEGs as well as with other genes. More importantly, this would enable us to further select the genes that are highly interconnected from the integrated proteomics and transcriptomics analysis. Network analysis of the identified differentially expressed cell surface protein genes was performed using NetworkAnalyst [14] to determine the relationship between genes according to the network topological parameters such as degree and betweenness. These parameters reflect the role and property of proteins within the network. The nodes and edges in the PPI network represent the proteins and their interactions, respectively. The GBM-enriched cell surface proteins network contains 1,321 nodes and 1,767 edges interactions based on a number of validated features including functional experiments, co-expression analysis, text mining, neighbourhood, gene fusion and databases (Figure 4A). We identified 87-gene modules of clusters and the top cluster genes with more than 30 interactions include VCAM1, EGFR, TGFBR1, CD44, NGFR, ITGB2, DCC, PTPRJ, ANBCA1, HLA-DRA, CCR5 and CSF1R (Figure 4A and Supplementary Table S5). Vascular Cell Adhesion Molecule 1 (VCAM1) has the highest interacting cluster as it was found to have 426 degree with 422,712.18 betweenness score. We subsequently mapped the 14 genes identified from the integrated transcriptomics and proteomics data analysis (Figure 3) with the top genes that have at least 20 interactions from the PPI network analysis. We found 6 genes that were in common between these two datasets which represent the high-confidence GBM predictive surfaceome markers (Figure 4B).
Validation of high-confidence GBM signature gene and survival-expression correlation analysis: Next, we validated the expression profiles of the identified 6 high-confidence cell surface markers using an independent database, Gene Expression Profiling Interactive Analysis (GEPIA) [26]. GEPIA also combines the TCGA and GTEx gene expression data that were processed from raw reads count and unified using its own pipeline. In line with our findings, the identified GBM cell surface signature genes were confirmed to be significantly upregulated in the GBM GEPIA database (Supplementary Figure S4A – S4F). To investigate whether the expression level of these signature genes would modulate/influence GBM patients’ prognosis, we first performed the overall survival analyses on GBM patients who had high or low expression of each of these 6 genes (Supplementary Figure S5A – S5F).
However, there were no significant differences in the overall survival between patients who had high or low expression of these 6 individual prioritized genes. Since GBM patients have low overall survival rate (average <2 years’ survival post-diagnosis), we postulated that it would be more appropriate to look at the disease-free survival endpoint rather than the overall survival. Moreover, the overall survival endpoint is more suited for a longer follow-up period (typically 5 years) for the data to be meaningful [27]. Hence, we examined the disease-free survival profile of the GBM patients in a similar fashion. We found that high expression of CD44, PTPRJ and HLA-DRA were significantly correlated (p<0.05) with poor disease-free survival in GBM patients (Supplementary Figure S6A – S6F).
In addition to performing survival analysis on the individual gene, we also assessed whether combining the level of all 6 GBM signature genes as a group could predict the GBM patients’ overall survival and disease-free survival. We observed that there was no statistically significant difference in the overall survival and disease-free survival between patients who had high expression and low expression of the signature group (Supplementary Figure S7A – S7B). Interestingly, by combining only CD44, PTPRJ and HLA-DRA in the gene signature, we found that subjects with high expression of this signature group had significantly poor disease-free survival (p<0.0084) compared to patients who had low expression of these genes (Supplementary Figure S8B). However, there was still no significant difference in the overall survival between GBM patients in this signature group (Supplementary Figure S8A).
Co-expression network of CD44: CD44 is a transmembrane receptor and has multifaceted functions in both normal and disease physiology. OMICS studies have identified CD44 to be overexpressed in many types of cancer including glioblastoma [28, 29]. Based on our analysis, CD44 seems particularly important as it can be both identified in transcriptomics and proteomics-based approaches, among the top hub gene and whose high expression correlate with poor disease-free survival. We performed a co-expression network analysis to further interrogate its association with other genes using our transcriptomics. The nodes represent in the network analysis represent genes, while the edges represent Pearson correlation above r>0.75. The neighbouring genes connected to CD44 was extracted and shown in Figure 5A. There are 27 genes in this complex connected to CD44. Among the highly correlated genes are ELK3, CLIC4, GALNT2, TNC, and VIM. All genes in this CD44 co-expression cluster are highly expressed in GBM compared to normal brain samples (Figure 5B), further corroborating the biological relevance of CD44 in supporting GBM pathogenesis.
Identification of drugs targeting GBM signature and CD44 network: We next determined whether there are any clinically approved drugs targeting the identified high confidence GBM cell surface markers (Supplementary Figure S4) and components of the constructed CD44 co-expression network (Figure 5A). To achieve this objective, we utilized the Drugbank database (https://go.drugbank.com/) and our searches yielded several approved drugs that can be potentially effective or repurposed to target CD44, EGFR, C1R, CALR and TNFSR1A (Table 2). Hyaluronic acid, for example, is a clinically approved ligand for CD44 and this drug has been administered in the clinic to treat disease such as osteoarthritis [30]. Excessive hyaluronic acid administration has been demonstrated to inhibit tumour growth, possibly by impeding cell-cell interaction [31]. Besides, the use of nanomaterials to enhance the efficiency of hyaluronic acid delivery for cancer therapy is also actively being explored [32, 33]. Thus, the promising features of hyaluronic acid in mediating enhanced drugs or genes delivery to cancer cells via the overexpressed CD44 receptor could potentially be applied and developed for novel GBM therapeutic strategies. In regards to EGFR, several inhibitors and monoclonal antibodies have already been therapeutically approved to target this protein due to its roles as an important driver of tumorigenesis in many cancer types [34].
Moreover, of the 28 components of CD44 co-expression network (Figure 5A), only C1R, CALR, and TNFSR1A have drugs that can modulate them (Table 2). For instance, 3 drugs can be used or repurposed to target C1R. The pharmacological activity of Palivizumab to bind C1R subcomponent is under investigation, whereas the conestat alfa and human C1-esterase inhibitor can directly target C1R subcomponent and disrupt the complement system activation.