The literature frequency for various brain cancer subtypes
Based on our comprehensive literature curation, we cleaned the associations between brain cancer genes and the literature before conducting further analyses. As shown in Figure 2A, 27 genes were each supported by more than 20 PubMed abstracts. However, 883 of the 1,421 genes implicated in brain cancer (62%) were supported by only a single mention in the literature, so the functions of those genes need further experimental validation. Using cancer subtype keywords, we assigned the 1,421 genes to different subtypes. A gene could be associated with multiple cancer subtypes, and each subtype association has its own literature-based evidence (Table S2). As shown in Figure 2B, the top three keywords were glioma (associated with 582 genes), lymphoma (associated with 450 genes), and medulloblastoma (associated with 245 genes). To explore the genetic heterogeneity of brain cancer, we grouped the curated subtype information. For example, astrocytoma, oligodendroglioma, ependymoma, GBM, LGG, ganglioglioma, and oligoastrocytoma were all grouped as gliomas, and medulloblastoma was grouped with neuroectodermal tumors. This grouping yielded 809 glioma-related genes and 354 neuroectodermal tumor-related genes in those two major subtype groups.
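The grouping step described above can be illustrated with a minimal sketch. The keyword-to-group mapping follows the examples given in the text; the inclusion of the keyword "glioma" itself in the glioma group, the function name, and the gene identifiers are assumptions made for illustration and are not part of BCGene.

```python
from collections import defaultdict

# Mapping of subtype keywords to high-level groups, following the examples
# in the text. Mapping the keyword "glioma" to its own group is an assumption.
SUBTYPE_TO_GROUP = {
    "astrocytoma": "glioma",
    "oligodendroglioma": "glioma",
    "ependymoma": "glioma",
    "GBM": "glioma",
    "LGG": "glioma",
    "ganglioglioma": "glioma",
    "oligoastrocytoma": "glioma",
    "glioma": "glioma",
    "medulloblastoma": "neuroectodermal tumor",
}


def group_genes(gene_subtype_pairs):
    """Collect the set of genes linked to each high-level subtype group."""
    groups = defaultdict(set)
    for gene, subtype in gene_subtype_pairs:
        group = SUBTYPE_TO_GROUP.get(subtype)
        if group is not None:  # keywords without a known group are skipped
            groups[group].add(gene)
    return dict(groups)


# Hypothetical (gene, subtype keyword) pairs, for illustration only.
example_pairs = [
    ("GENE_A", "astrocytoma"),
    ("GENE_A", "GBM"),
    ("GENE_B", "medulloblastoma"),
    ("GENE_C", "LGG"),
]
grouped = group_genes(example_pairs)
print({name: sorted(genes) for name, genes in grouped.items()})
```

Because each group is stored as a set, a gene curated under several keywords of the same group (GENE_A here) is counted once, which is why a group total can be smaller than the sum of its keyword totals.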
After we curated 227 and 25 genes for GBM and LGG, respectively, we summarized all the GBM and LGG CNVs on the gene pages in BCGene. To demonstrate how well our data identify potential tumor suppressors and oncogenes, we first identified 85 GBM-associated candidate tumor suppressors with predominant copy number loss (ratio of copy number loss to copy number gain > 2.0) and 39 GBM-associated candidate oncogenes with predominant copy number gain (ratio of copy number gain to copy number loss > 2.0). Then, by cross-mapping these genes to the tumor suppressor and oncogene databases (TSGene 2.0 [14] and ONGene [7], respectively) (Figure 2C), we found that 23 GBM genes with more frequent copy number loss are known tumor suppressor genes, and another 15 GBM genes with more frequent copy number gain are known oncogenes.
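The ratio criterion above can be sketched as follows. The function name, the labels, and the treatment of zero counts are assumptions: the text states only the 2.0 threshold, not how genes with no gain (or no loss) were handled. Here the comparison is written as a multiplication, so a gene with losses and no gains is treated as exceeding the threshold.

```python
def classify_by_cnv(loss_count, gain_count, threshold=2.0):
    """Label a gene from its numbers of copy number loss and gain events.

    The ratio test loss/gain > threshold is written as
    loss > threshold * gain to avoid division by zero (an assumption).
    """
    if loss_count > threshold * gain_count:
        return "candidate tumor suppressor"
    if gain_count > threshold * loss_count:
        return "candidate oncogene"
    return "unclassified"


# Hypothetical counts, for illustration only.
print(classify_by_cnv(30, 10))  # loss/gain = 3.0
print(classify_by_cnv(10, 30))  # gain/loss = 3.0
print(classify_by_cnv(20, 10))  # ratio is exactly 2.0, not above it
```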
Functional enrichment of those genes shared by different subtype groups
To examine the genetic heterogeneity of the high-level cancer subtype groups, we overlapped their associated genes to compare the common and unique genetic features of the five subtype groups (glioma, lymphoma, meningioma, neuroectodermal tumor, and pituitary tumor) (Figure 3A) and found 44 genes belonging to four or more groups. Gene ontology enrichment analysis revealed that those 44 genes are highly associated with 12 functional categories (Figure 3B). Some of those categories are closely related to cancer, such as negative regulation of programmed cell death (Benjamini and Hochberg false discovery rate (FDR) corrected p-value = 4.35E-05), DNA metabolism regulation (FDR corrected p-value = 1.42E-04), and regulation of the mitotic G1/S transition (FDR corrected p-value = 3.79E-04). A particularly interesting finding was the response to hypoxia (FDR corrected p-value = 3.31E-04). In general, hypoxia is important in drug resistance and poor survival [15]. Therefore, targeting hypoxia might be a practical way to improve the survival of patients with astrocytoma and GBM [16].
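The overlap step that yields genes belonging to four or more groups can be sketched as a simple membership count. The function name and the gene sets below are hypothetical and serve only to show the counting logic.

```python
from collections import Counter


def genes_shared_by(groups, min_groups=4):
    """Return genes that appear in at least `min_groups` subtype groups."""
    counts = Counter(gene for genes in groups.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_groups)


# Hypothetical gene sets for the five subtype groups named in the text.
example_groups = {
    "glioma": {"GENE_A", "GENE_B", "GENE_C"},
    "lymphoma": {"GENE_A", "GENE_B"},
    "meningioma": {"GENE_A", "GENE_B", "GENE_D"},
    "neuroectodermal tumor": {"GENE_A", "GENE_C"},
    "pituitary tumor": {"GENE_A", "GENE_B"},
}
shared = genes_shared_by(example_groups)
print(shared)
```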
KEGG pathway analysis further highlighted several important cancer-related signaling pathways, such as the PI3K-Akt signaling pathway (corrected p-value = 8.04E-05), pathways in cancer (corrected p-value = 5.32E-10), proteoglycans in cancer (corrected p-value = 3.33E-06), and the advanced glycation end products-receptor for advanced glycation end products pathway (corrected p-value = 1.201E-05). More interestingly, signaling by interleukins (corrected p-value = 3.7E-05) and cytokine signaling in the immune system (corrected p-value = 1.06E-03) highlighted the importance of interleukins in the progression of brain cancer. Previous observations confirmed that many cytokines (mainly interleukins) are involved in brain cancer aggressiveness and the generation of disease-associated pain [17]. In summary, our functional analyses demonstrated that subtype-specific gene mining using the BCGene database can be used to identify genes common to different brain cancer subtypes and to explore potential shared molecular mechanisms.
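For readers unfamiliar with the Benjamini and Hochberg correction reported for the gene ontology terms, a minimal sketch of the standard step-up adjustment follows. This illustrates the general procedure, not the enrichment tool used in this study; the function name and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is no larger than the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
corrected = benjamini_hochberg(raw)
print([round(p, 4) for p in corrected])
```

Each raw p-value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone.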
Potential prognostic applications
To further explore potential prognostic applications of the curated brain cancer-implicated genes, we overlapped the 44 shared genes with 18 brain cancer datasets that have survival outcomes and are included in the prognostic database PRECOG [18] (Figure 3C). Those datasets were grouped into two categories: 12 related to glioma and 6 related to non-glioma. For each gene, PRECOG calculated z-scores that characterize the relationship between gene expression and clinical outcome. In general, a positive z-score for a gene in a specific dataset indicates that higher expression is associated with adverse survival, while a negative z-score indicates that higher expression is associated with favorable survival. By clustering the z-scores, all genes could be ordered into three clusters. We then used signal-to-noise ratios (the ratio of the level of a desired signal to the level of background noise) to compare each gene between the glioma and non-glioma groups. In the first cluster, PTGS2 had the highest signal-to-noise ratio (0.63), meaning that its signal exceeded its noise, which makes it more useful for distinguishing the glioma and non-glioma groups. In contrast, TP53 in the second cluster had a negative signal-to-noise ratio (-0.79), meaning that its signal was lower than its noise. Additionally, the fold change of the z-scores between the two groups was 2,696.87 for PTGS2 but only 0.03 for TP53, so PTGS2 may be a better differential prognostic indicator than TP53 [19]. In summary, these distinguishing links to different subtypes may provide evidence for distinct mechanisms related to the survival of patients with different cancer subtypes.
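The text does not give the exact formula used for the signal-to-noise ratio. As an illustration only, the sketch below assumes the two-class form often used in expression analysis: the difference between the group means divided by the sum of the group standard deviations. The function name and the z-scores are hypothetical, and the values reported above may have been computed differently.

```python
from statistics import mean, stdev


def signal_to_noise(group_a, group_b):
    """Difference of group means over the sum of group standard deviations.

    This two-class definition is an assumption; each group needs at least
    two values so that a sample standard deviation can be computed.
    """
    noise = stdev(group_a) + stdev(group_b)
    if noise == 0:
        raise ValueError("both groups have zero spread; ratio is undefined")
    return (mean(group_a) - mean(group_b)) / noise


# Hypothetical z-scores of one gene across datasets (not taken from PRECOG).
glioma_z = [1.0, 2.0, 3.0]
non_glioma_z = [-1.0, 0.0, 1.0]
ratio = signal_to_noise(glioma_z, non_glioma_z)
print(round(ratio, 2))
```

Under this definition the sign of the ratio shows which group has the higher mean z-score, so swapping the two groups flips the sign.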
Identification of top-ranked genes with evidence mentioned only once in the literature
To further explore the relevance of the curated genes to brain cancer, we ranked all 1,421 genes using the 27 most reliable brain cancer genes as a training set. These 27 genes were considered reliable because each had 20 or more evidentiary mentions in the literature. The ranking assigns a relative importance to the remaining 1,394 genes (1,421 minus 27) in our database (Table S3). Because they are functionally similar to the 27 genes in the training set, the 100 top-ranked genes are likely important in brain cancer development. Of those top-ranked genes, 33 were supported by only a single mention in the literature. We therefore consider the roles of those 33 genes in brain cancer development to be likely underestimated.
To investigate the potential oncogenic roles of those 33 genes, we used the large-scale cancer genomics datasets in cBioPortal [10]. Altogether, we combined 2,997 samples from 14 independent studies, including four datasets related to medulloblastoma, two datasets related to glioma, two GBM studies, two LGG studies, a study of anaplastic oligodendroglioma and anaplastic oligoastrocytoma, a study of a brain tumor patient-derived xenograft, an investigation of pilocytic astrocytoma, and a dataset of pheochromocytoma and paraganglioma. As shown in Figure 4, sample-based mutational patterns revealed 536 samples (18% of the total 2,997 samples) that had at least one genetic mutation in one of the 33 genes. After closely examining their subtype information (Figure 5A), we found that the 33 genes were highly mutated in the glioma and GBM datasets but had relatively low mutation rates in the four datasets related to medulloblastoma. Interestingly, mutations in those 33 genes were associated with markedly shorter patient survival (Figure 5B). Among the 2,303 patients with survival information, 467 had one or more genetic mutations in the 33 genes. The median survival of those 467 patients was 24.59 months, whereas that of the remaining 1,836 patients was 42.20 months, a statistically significant difference (log-rank test, p = 2.30E-08).
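The survival comparison above relies on a log-rank test. The sketch below implements the standard two-group log-rank statistic using only the Python standard library, to show what the test compares. It does not reproduce the analysis in this study; the function name and the survival times are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    `events_*` flags whether the event (death) was observed (True) or the
    observation was censored (False).
    """
    # Pool observations as (time, event_observed, group) triples.
    data = [(t, bool(e), "a") for t, e in zip(times_a, events_a)]
    data += [(t, bool(e), "b") for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({time for time, event, _ in data if event}):
        n = sum(1 for time, _, _ in data if time >= t)  # at risk overall
        n_a = sum(1 for time, _, g in data if time >= t and g == "a")
        d = sum(1 for time, event, _ in data if time == t and event)
        d_a = sum(
            1 for time, event, g in data if time == t and event and g == "a"
        )
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of events in group a
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical survival times in months; all events observed.
mutated = [1, 2, 3, 4, 5]
unmutated = [10, 11, 12, 13, 14]
chi2, p_value = logrank_test(mutated, [1] * 5, unmutated, [1] * 5)
print(chi2, p_value)
```

At each event time the test compares the deaths observed in one group with the number expected if both groups shared the same hazard, so it uses the full survival curves rather than only the medians.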
Among the 536 samples with genetic mutations in one or more of the 33 genes, the top-ranked gene, CDK4, was mutated in 202 samples (8% of the 2,997 samples), and the second-ranked gene, MAP3K1, was mutated in 79 samples (2.8%), of which 8 also had a CDK4 mutation. Since the mutated genes in that mutational pattern are almost mutually exclusive, they may have complementary roles in the progression of brain cancer [20]. As shown in Figure 6A, CDK4 amplification in five samples coincided with mRNA up-regulation, and four of the five samples had low methylation, which could have contributed to the increased mRNA expression (Figure 6C). However, the correlation patterns of MAP3K1 were strikingly different from those of CDK4 (Figure 6B, D). Altogether, CDK4 provides a good example of consistent mRNA up-regulation based on both amplification and methylation patterns, and MAP3K1 may be a good candidate for evaluating the progression of some brain cancers, but those possibilities need further study.