Overview of known functional variants
We extracted a set of 2505 missense variants from databases and the literature within 441 human kinases that we classified according to their impact on kinase enzymatic activity (Figure S1A; Table S1A). There was a total of 375 in 140 kinases (247 or 66% in the kinase domain) missense variants that lead to constitutive activation or greatly increased activity (hereafter activating). This class of variant is most famously seen in certain cancers, where a key (most often) somatic variant leads to greatly increased signalling as part of the pathogenesis of the disease and is thus often a drug target. However, our survey found more activating variants in genetic diseases (144 compared to 67 considering the 211 true disease variants; 126 of these are non-tumour/cancer predisposition variants; Table S1A), including kinases activating in Parkinson’s disease (kinase LRRK2), Immunodeficiency (SYK), or Pfeiffer syndrome (FGFR2).
There were 1028 variants (718 or 70% in the kinase domain) in 289 kinases that have been shown to lead to either a complete loss of function or decreased enzyme activity (deactivating). This type of variant is less common in cancer, with just 27 of 233 (including seven tumour predisposition variants) of the true disease variants being from somatic tumours. Loss-of-function variants predominate in kinase variants causative of genetic diseases (Table S1A).
We considered 98 variants (94 or 96% in the kinase domain) in 17 kinases that arose owing to cancer drug resistance17 (Figure S1A,D). These are more homogeneous than the other class as resistance is highly context-dependent (i.e. on the inhibitor, the kinase mechanism, etc.) and they very often overlap (see below) with either activating or deactivating variants. We thus considered them separately from the other classes throughout the analysis.
Lastly, we defined a set of 1004 variants (203 or 20% in the kinase domain) in 304 kinases that we presumed to be neutral based on their constitutional appearance with a high frequency in healthy humans18 (MAF ≥ 0.001; homozygous counts ≥ 2).
There are 168 instances of the same position in the same kinase having different known effects, of which many are also related to phosphorylation. These positions are highlighted in Table S1A.
The above trends on the training set largely hold for the variants within our test set (Table S1B).
We constructed an alignment of the catalytic domain plus maximally 30 residues at the N- and C- terminus for 464 human kinases (Methods) that we used to position all these variants onto a common positional frame. We also marked regions on this alignment according to the canonical functional regions of protein kinases15 (Fig. 1) that we refer to below.
Inspection of the location of the variants of different types reveals immediate trends (Fig. 1; Table S1A,B). First, as mentioned above, functional variants are greatly enriched in the catalytic domain (Table S1C). This is true for activating, deactivating and resistance variants in both the above training (and testing; see below) datasets, in addition to genetic disease variants from UniProt and somatic variants from COSMIC, both of which increase as evidence for disease/function increases (either evidence from UniProt or variant frequency in COSMIC; Table S1C). The opposite is true of neutral variants, which show a slight tendency to avoid the catalytic domain that increases slightly with the frequency of the variant in the population (Table S1C) as might be expected if they were truly neutral. There is a small, but interesting subset of functional variants outside of the catalytic domain, including some variants with very high sample counts in COSMIC. Most of these are Cysteine losses or gains in the extracellular part of transmembrane tyrosine kinases that have been shown to lead to activating by inter-subunit disulphide bond formation (e.g. 19).
Resistance and activating variants tend to avoid the most conserved and functionally core parts of the kinase catalytic domain (Fig. 1, Figure S1B) and often overlap with each other. In contrast, deactivating variants very often hit key parts of the enzymatic machinery, particularly the catalytic Lysine and the Aspartate residues in the “HRD'' and “DFG'' motifs (Figure S1E). For instance, BTK p.Lys430Glu (catalytic Lys) abolishes kinase activity, leading to X-linked agammaglobulinemia, a rare genetic disorder characterized by the body's inability to produce normal B cells20 and DAPK3 p.Asp161Asn (in the DFG motif) greatly reduces kinase activity, promoting cell survival and cell proliferation in ovarian mucinous carcinoma21. We found 15 positions within the alignment where known activating and deactivating variants overlap (at least 2 counts of each type and at least 5 variants). 10 of these positions lie within the A-loop, and 2 each in the N- or C-terminal tails of the kinase domain. For example, one alignment position in the activation loop has 23 activating and 60 deactivating variants (Fig. 1, Figure S1E). Inspection shows that these are almost always at phosphorylation sites, where most often activation is accomplished by mutation to a negative charge (Asp/Glu) and deactivation by a loss of the phosphorylatable residue (e.g. to Ala or similar) (Figure S1E). For example, IKBKB p.Ser181Glu, mutating the uncharged Serine to a negative charge, leads to full activation of its kinase activity and activation of NF-kappa-B pathway22, while LATS2 p.Ser872Ala (at the equivalent alignment position), mutating the phosphorylatable Serine to an unphosphortylatable Alanine, is reported to lead to loss of its tumour suppressor activity in mice23.
There are also many variants outside of, but near to the catalytic domain, particularly at the N- (76 variants) and C-terminal (61 variants) tails, with roughly the same proportion (of activating, deactivating, etc.) as the entire dataset, though with no resistance variants in the C-terminal tail (Fig. 1, Figure S1A). Just under half (43%) of these are at phosphosites, as might be expected, given that many kinases are phosphorylated in their tails as part of the activation process. Resistance mutations occur almost exclusively at or near the ATP binding pocket (N-lobe, catalytic and activation loops) and indeed we found none in or after the C-lobe (Figure S1A, C).
We initially explored separate analyses of Tyrosine and Serine/Threonine kinases. However, we observed considerable overlap in the datasets in the two classes, for instance, in terms of where the variants were on the canonical kinase structure (Figure S2A, B). Moreover, since certain datasets tend to skew more heavily to one class (e.g. resistance variants to Tyrosine kinases), we reasoned that the separation would also make our predictors (below) less effective owing to data paucity.
A machine-learning trained predictor for kinase variant function
The patterns clearly visible above suggested that it would be possible to derive a predictor of variant type by way of machine learning. Accordingly, we applied multiple machine learning algorithms (Gradient Boosting Classifier, Random Forest, Gaussian Naive Bayes, Support Vector Classifier, Multi-Layer Perceptron Classifier and an ensemble of these) to develop three contrasting predictors based on seven types of sequence and structural features (see Methods) (Table S2E, Figure S3). Based on the observations above, we only considered variants that are within the kinase domain or within 30 amino acids of the N- or C-terminus (excluding these residues that overlap with other domains; see Methods).
The first predictor, activating vs deactivating, represents a typical situation when one has what is believed to be a functional variant (e.g. observed many times in a cohort or dataset) and wishes to distinguish these two possibilities. The second, activating, deactivating or neutral, is more reflective of a situation where one does not know if a variant is functional at all and thus one needs to predict neutrals. The third predictor, resistance vs neutral, predicts if a given mutation is resistant or not. We avoided contrasting resistance to activating or deactivating due to the considerable overlap between activating and resistance (above) and because resistance variants were from an entirely distinct source (Figure S2A, B). Note that this also means that predictions of resistance should probably be considered alongside the other two predictors since there will necessarily be a tendency to predict many activating or deactivating sites as resistant.
All predictors were trained using sequence and structural-derived features obtained for the activating, deactivating, resistance and neutral variants above (within the catalytic domain region) and then tested on a smaller, not-overlapping test set of variants (Methods).
The best results were obtained using Gradient Boosting Classifier (GBC) with random forest (RF) being only marginally worse (Table S2E, Figure S3). Numbers and predictions in the sections that follow are from the best-performing GBC data apart from where stated. AUC values from ROC analysis during cross-validation (0.91–0.95) were comparable with those obtained on 145 variants from an independent test set absent from the training phase (0.80–0.94;Table S2B; Figure S4A). The biggest difference seen when predicting activating vs neutral (0.91 training, 0.80 testing), which is possibly to do with an increase in somatic relative to germline activating variants in the test set compared to the training.
As our dataset is unbalanced in terms of the number of each type of kinase variant (e.g. there are more than twice as many deactivating as activating variants) we also computed balanced accuracy (values between 0.72 and 0.85 for all predictors) and MCC (values between 0.38 and 0.69) all of which suggested good overall performance.
We also tested whether the training set was unfairly biassed owing to variants at the same position (i.e. to a different mutant residue) as those in the training set, though the differences were marginal (Figure S4B). The difference between GBC and RF on the test set was also small, with the former better for most, but not all predictors (Figure S4A,C).
The better performance for these decision tree based approaches agreed with our impressions from the dataset analysis above; namely that there are diverse contextual reasons for a position to be activating or deactivating (e.g. see discussion of phosphosites above). This suggests that more additive approaches (e.g. Naive Bayes) would be less optimal to distinguish these sites.
The fact that predictions of resistance have such a strong performance is somewhat surprising as inhibitors vary greatly and variants can be highly specific to one of a set of similar drugs. However, there are nevertheless features of resistance variants that seem to distinguish them from others. It is clear that the data (from COSMIC) are biassed towards ATP-site inhibitors and it is likely that performance on non-canonical inhibitors would be worse.
We also compared our results to existing predictors of variant impact 5,6,24. It is important to emphasise that these other predictors do not attempt to distinguish different types of functional variants (activating/deactivating/resistance) but rather to distinguish broadly structurally/functionally disruptive variants (i.e. including all three types of variants in one) from neutral. When grouping activating, deactivating and resistance together, mimicking the more general prediction of “pathogenicity” used by other prediction tools, our approach performs similarly to other predictors (Table S2D, Figure S4D) on variants within the kinase domain, though clearly AlphaMissense is best overall. Interestingly, PolyPhen2 and PMUT fared worse on activating and resistance variants when considered separately, probably as these are likely under-represented in sets of “damaging” or “pathogenic” variants used to train older methods. It is crucial to emphasise that our goal is not to predict pathogenicity better than these predictors, but to distinguish more specific functional consequences that these generic predictors do not, and that the best-use case is to use our predictor alongside a more generic predictor of pathogenicity for variants that lie in the protein kinase domain (as indeed is often how clinically relevant candidate kinase variants are currently identified). Nevertheless, the fact that these results are similar to these other methods gives us confidence in our machine learning strategy for the other predictors. Figure S4D suggests that taking an AlphaMissense approach to these data might give still better results when attempting (e.g.) to discriminate activating from deactivating and neutral.
We also attempted to compare our predictions to an earlier method to predict kinase activating variants16 however, the need to specify particular experimental structures for each prediction made this problematic. To validate that the predictors did not overfit, we conducted a randomization test (see Methods). The performance ([Table S2C)] of randomized predictors (AUC close to 0.5) was worse than the original predictors, suggesting that our approach is not susceptible to overfitting.
The most important features for machine learning (determined using Gini importance) driving the predictions were found to be conservation across human kinases, homologs, paralogs, orthologs, and post-translational modification (PTM) information at the variant site (Figure S5). Interestingly, certain features have distinct contextual importance. Conservation across different sets of homologs (e.g. orthologs, paralogs, etc.) and PTMs are key to distinguishing activating and deactivating from neutral variants (Figure S5A), they are insufficient for distinguishing between activating and deactivating variants (e.g. consider variants occurring at phosphosites) (Figure S5B). For this, conservation across all kinases is highly relevant, since the most conserved positions are known to disrupt the kinase activity (e.g. catalytic Lys, DFG-motif, etc) and are more likely to indicate deactivating positions, whereas positions conserved, for example, in mammalian orthologs (Figure S1B), could be either activating, resistance or deactivating. We suspect that this is the reason why different conservation values play such distinct roles across the various predictors. We found the ATP binding information to be most relevant for resistance variants (Figure S5C), which is expected since most kinase inhibitors are known to be ATP-competitive25.
To aid in the visualisation and accessibility of the prediction results and known information about a given mutation, we developed a web application (activark.russelllab.org) which allows users to input multiple variants, rank them according to different predictors and peruse them in depth alongside data from our curated variant set. For this purpose we also developed a technique to rapidly annotate alignment sections.
Known and potentially new functional variants within existing datasets
We applied our approach to kinase variants from three large datasets: somatic cancer variants from COSMIC17, hereditary disease variants from UniProt14 and naturally occurring (healthy) variants from gnomAD18. Overall, the disease variants sets show enrichment in the kinase domain that increases as one raises the level of certainty (i.e. requiring an increasing number of COSMIC samples or evidence in UniProt annotations; Table S1A,B). Neutral variants, in contrast, show a slight tendency to avoid the kinase domain as might be expected (Table S1A,B). These observations support the general notion of a predictor to discriminate functional variants lying in the kinase domain.
The overall proportions of activating, deactivating and neutral predictions for these sets agreed with our expectations (Fig. 2C). For instance, activating variants were more pronounced in the somatic set, in contrast to deactivating that were more often in hereditary diseases, reflecting the high proportion of kinases among oncogenes and the general tendency for hereditary diseases to lead to loss of function rather than gain. Reassuringly, the natural variants had the greatest proportion of neutral predictions and very few activating or deactivating instances. Resistance variants (that were not already predicted as activating or deactivating) are nearly absent in hereditary diseases with the greatest proportion in somatic and a smattering in hereditary variants.
Within somatic variants in COSMIC17, deactivating variants from our curated set are comparatively rare, with only 20 observed in total (including 6 known tumour suppressors) across 187 samples, with very low individual sample counts (all < 50; 14 < 10; Table S3A). Surprisingly, known activating variants (even known somatic variants) also often have low sample counts in COSMIC. Although there are several that are seen thousands of times mostly due to a predominance of tumours driven by well-known cancer driver genes (e.g. JAK2 p.Val617Phe, EGFR p.Leu858Arg, KIT p.Asp816Val, BRAF p.Val600Glu/Lys), nearly half (54/110) of both activating and resistance variants are seen with counts of 50 or fewer; just under a third (33/110) seen with 10 counts or fewer (Table S3A). These include activating variants BRAF p.Leu597Val (seen 16 times) and MAP2K1 p.Gln56Pro (33 times) both of which are well-established oncogenic variants26,27. This suggests the possibility that many other rare constitutively active variants might lie within the existing data with comparatively low sample counts. We explore some of these in greater detail below.
COSMIC also provided some additional insights into the nature of variants in kinases and some support for the efficacy of our predictor. Considering the set of kinases (of 68 in total) that are unambiguously assigned as oncogenes (45) or tumour suppressors (9) in the COSMIC Cancer Gene Census, and excluding variants already known to be deactivating or activating, we found a reasonable separation in activating or deactivating predictions when comparing oncogenes to tumour suppressors (Fig. 3A; note only three tumour suppressors and 12 oncogenes had at least four variants remaining after filtering knowns). Moreover, when considering the fraction of unclassified variants predicted as one or the other, it is clear that oncogenes have a higher fraction of activating and tumour suppressors of deactivating, as expected (Fig. 3B). This is interesting as information about the kinase itself was not part of the predictor and suggests that frequently observed variants in previously known oncogenes or tumour suppressors are likely to have the expected (i.e. activating or deactivating respectively) effect.
This analysis also highlighted several very strongly predicting activating variants in well-established oncogenes or the opposite in tumour suppressors (red and green text in Fig. 3A). This has clear implications for both types of cancer genes. Naturally, activating variants can lead to personalised treatments in the form of specific kinase inhibitors. However, knowledge of deactivating variants in tumour suppressors can also have important implications for diagnostics, for instance, both CHEK2 and STK11 variants are difficult to assess in the context of hereditary cancers28. We also observed a handful of instances where an oncogene has clearly deactivating, frequently observed variants, notably BRAF p.Asp594Gly/Gln in the DFG motif and p.Lys483Glu at the catalytic lysine, though these are well understood (and genuine exceptions)29.
Considering kinases not in the Cancer Gene census and for which sufficient variants are seen in COSMIC (black triangles in Fig. 3B), the predicted fraction of activating/deactivating variants argues that CSNK2A1, PRKCB and TGFBR1 are more likely to be tumour suppressors and PAK5, MAP2K3, EPHA3 and EPHA7 oncogenes. This largely agrees with what has been recently postulated in the literature except for EPHA7 that is most often referred to as a tumour suppressor, though it also has roles in promoting tumours30. It is important to emphasise that these findings are biassed towards the data that are currently in the COSMIC database.
We also performed a similar analysis of all kinase hereditary disease variants in the UniProt database that are not already annotated as activating or deactivating (Fig. 3B; Table S3B). We first identified kinase:disease pairs with at least two variants previously characterised as activating or deactivating and for which at least three predictions were made by the system. This gave three pairs being only activating, 13 only deactivating and two having both. When considering predictions for the uncharacterized variants (excluding those variants known to be activating/deactivating) we see a reasonable separation between only-activating and deactivating (Fig. 3C). As above for cancer genes, a number of kinase:disease pairs lacking prior characterization fall into discrete regions of the plot, thus suggesting whether the pathology is dictated by kinase activation or deactivation. For example, mutations in Microtubule-associated serine/threonine-protein kinase 3 (MAST3) are implicated in developmental and epileptic encephalopathy (DEE108)31. We predicted 6 out of 6 variants in MAST3 linked to DEE108 as activating (albeit with one marginal activating/neutral prediction). Two of these (p.Gly510Ser and p.Gly515Ser) are associated with increased phosphorylation of a MAST3 target protein and thus are likely a gain of function31. In contrast, we predict 5 out of 5 variants in FGFR1 associated with Hartsfield syndrome (HRTFDS) to be deactivating, and inspection shows four very likely are as they involve highly conserved active site residues such glycines in the N-terminal Glycine-rich loop (p.Gly490Arg) or within the catalytic loop/HRD motif (p.Asp623Tyr, p.Arg627Thr, p.Asn628Lys).
Among variants seen in healthy humans (gnomAD18), we found 53690 kinase domain variants of which 194 display a MAF ≥ 1%. From those, we found 7 variants with ≤ 5 homozygous counts (Table S3C). The high allele frequency together with depleted homozygous counts argues that such variants may be functional32. For example, p.Val124Gly in CDK1 is conserved across orthologs and lies within the activation loop predicted to be deactivated, as it is only 2 positions N-terminal of the HRD motif, where many deactivating mutations are seen in other kinases (STK1133, CSFR134 and others). Elsewhere, the gain of a negative charge in the TNK1 variant p.Ala299Asp is predicted to be activating as it is 1 position N-terminal of known phosphosites in AKT1 and IKBKB (Table S1G). Our predictions together with the absence/low frequency of homozygous counts suggest that these and other variants could be functional.
Predictions for all kinase domain variants, including the subsets from the datasets above, are available for download at activark.russelllab.org/datasets.
Experimental tests of selected kinase variant predictions
To illustrate how the predictor can be used to determine interesting candidates for further experimental study, we selected 9 variants from four kinases for rapid experimental tests (Table S4). Seven of these came from the COSMIC analysis above, including four predicted activating variants (one of which was already known) and three predicted deactivating variants. We added two additional deactivating variants in the form of well-known alanine variants of key active site residues. We could not identify any clear data source of previously unknown resistance variant candidates on which to perform a similar screen. Moreover, experimentally testing resistance requires information on particular small-molecules that are often missing from COSMIC, and those for which these data were available were our (limited) source of resistance variants for training the method.
For each variant, we transfected T-REx-293 cells with tetracycline-compatible pDest30 vectors containing the gene of interest (wild type or mutant). After 24 h of tetracycline induction, we extracted RNA and assessed gene expression for biological replicates via microarrays (Fig S10). We compared induced cells to their transfected and non-induced counterparts and induced mutants to induce wild-type kinase. While T-REx-293 cells should not be used to investigate particular diseases (e.g. lymphoma) we used this standardised cell line to detect changes in kinase activity. For PIM1 we additionally used commercially available phospho-antibodies to assess the phosphorylation levels of key sites targeted by the kinases. For CHEK2 we also assessed sensitivity to ROS-induced DNA damage and for MAP2K3 we investigated mitochondrial activity using the Mitotracker dye.
We saw little or no signal from our methods from the predicted deactivating variants, as might be expected given the traditional difficulties in establishing a loss of function. However, for one, the high expression similarity between the uncharacterized (predicted either resistance or deactivating) MAP2K1 p.Val211Asp and the known deactivating p.Lys97Ala suggests loss of activity, particularly as the known activating variant p.Gln56Pro looks very different (Figure S8; Table S4). For three of the four predicted activating variants we saw clear differences to wild type that we discuss in the next sections. Activating variants are of particular interest in that they can immediately suggest treatment by way of specific inhibitors. These (and resistance variants) are thus of the clearest clinical relevance.
The PIM1 p.Ser97Asn lymphoma variant leads to increased or constitutive activation
Ser97 in PIM1 lies in the C-helix (Fig. 4A top). Within COSMIC, Ser97 is the position in PIM1 with the most missense variants (97) of which 37 are p.Ser97Asn and all in haematopoietic and lymphoid cancers35. This variant is predicted to be weakly activating by the random forest predictor. The loss of Asn at the equivalent position in NEK7 (p.Asn90Lys, p.Asn90Arg and p.Asn90Ala)36 leads to a strong reduction in its kinase activity, suggesting that an Asn might be favoured at this position. There are also two known activating variants N-terminal to this position in ALK (p.Phe1174Leu/Val)37.
Overexpression of mutant PIM1 showed no significant expression differences between p.Ser97Asn compared to wild type after 24 h of induction. We additionally investigated the phospho-Thr180 levels of MAPK14/p38, a well-known protein downstream of PIM1 signalling. We detected a two-fold depletion in phosphorylation of MAPK14/p38 Thr180 (via ɑ-p38/pThr180 antibody), which would be expected if PIM1 activity was elevated (Fig. 4A, bottom, Figure S5A-B, Table S4A, B) as MAPK14/p38 activation is dependent on ASK1 activation, which PIM1 inhibits38.
MYC overexpression via enhancer hijacking is the hallmark of several lymphoid cancers, particularly Burkitt Lymphoma39. PIM1 phosphorylation of MYC is known to enhance its stability40. There are thus good arguments for why increased PIM1 function would be expected in these tumours as either a primary or secondary driver event. PIM1 is thought to be intrinsically active (i.e. it does not require other factors to become active)41. It is possible that p.Ser97Asn is augmenting an already active enzyme in the context of these tumours, possibly to stabilise MYC. This raises the possibility that the other COSMIC PIM1 variants seen in these tumours and predicted to be activating by our method (e.g. p.Gly28Asp, p.Glu135Lys, p.Pro33Ser; see Fig. 3B) also increase its activity. Indeed, structural variants of MYC are largely absent in samples from a Follicular Lymphoma cohort where PIM1 variants are enriched35. All of this argues that p.Ser97Asn and likely many of the other variants are indeed increasing PIM1 enzymatic activity as a possible driver of Lymphoid tumours. PIM1 is an established oncogene, though this is generally because of observed elevation of expression, for instance in prostate cancer42. Our findings and the growing number of missense observations argue that activating variants might also play a role in PIM1-mediated pathogenesis.
MAP2K3 p.Ala84Thr: a rare constitutively active driver in various cancers that greatly depletes mitochondrial gene expression
MAP2K3 (MEK3, MKK3) p.Ala84Thr is seen in 24 samples within COSMIC, with the most common (10 samples) from head and neck cancers44 in respiratory and upper-digestive tract tumours and predicted to be activating. Ala84 lies in the ꞵ2/3 loop and is eight residues C-terminal of the conserved GxGxxG motif (Fig. 4B, top).
Inspection of other kinases shows evidence of the activating variants in the same region (Fig. 4B, top). For instance, PRKD1 (protein kinase Db1) has an activating variant at this position: p.Arg603His, seen in telangiectasia-ectodermal dysplasia-brachydactyly-cardiac anomaly syndrome shows constitutive catalytic activity29. Variants at positions two residues N-terminal to Ala84 lead to constitutive activation in JAK2 (p.Arg867Gln43) and CDK4 (p.Arg24Cys44), as do variants C-terminal of this position, such as MET p.Asn1100Tyr45 or ZAP70 p.Lys362Glu46. In addition, several kinases have phosphorylation sites in this region (Fig. 4B, top). For example, the Arabidopsis kinase SnRk2 is autophosphorylated at a position equivalent to Ser-86 in MAP2K3 (Ser-4347).
Over-expression of the mutant MAP2K3 showed a marked difference in gene expression compared to the wild-type. Specifically, only p.Ala84Thr (and not wild-type or other variants, Fig. S8A) showed a drastic reduction (adj. p-value < 10− 20, Figure S7B) in the expression of hundreds of mitochondrial genes (Fig. 4B, bottom; Figure S6C, Figure S7A; Table S4C-E). Mitochondrial gene under-expression is seen when comparing the mutant to WT at 24h or when comparing the mutant at 24 h to 0h (nothing is significant when comparing WT at 24h to 0h apart from roughly 4-fold overexpression of MAP2K3). This is strong support for this mutation activating MAP2K3 as several studies have shown that down-regulation produces the opposite effect. Deletion of this enzyme in mice leads to an increase in mitochondria number and function48,49 and hyaluronan-mediated suppression of MAP2K3 expression in human mesenchymal stem cells similarly led to an increase in mitochondrial number and membrane potential50. The fact that we are seeing the opposite behaviour and that we see no effect of the wild type supports the idea that this mutant has elevated kinase activity. Assessing mitochondrial activity using MitoTracker (Thermo Fisher) supported this finding, showing that the tetracycline-induced p.Ala84Thr had significantly lower activity than the un-induced (Fig S9, Table S4H). To our knowledge, no mitochondrial phenotype has to date been reported during the over-expression of MAP2K3. Though the true nature of this variant in the context of cancer might be different (i.e. in contrast to T-REx-293 cells), it has been proposed that general suppression of respiratory (i.e. mitochondrial) gene expression is seen in many cancer types51.
Breast cancer patients showing elevated expression of MAP2K3 have worse survival rates, particularly triple-negative, and the kinase is proposed to be oncogenic in driving MYC in certain patients52. MAP2K3 p.Ala84Thr has previously been classified as benign in a large screen of cancer somatic and germline genomes21, likely as the kinase itself did not stand out as a major driver across the pan-cancer set. This variant shows a suspiciously high frequency in gnomAD (though never homozygous making it possibly not viable in two copies32), though it is intriguingly absent from the 1000 genomes dataset53 (the gnomAD frequency would suggest over 300 counts in 1000 genome), making its population status somewhat unclear. Many samples for this variant in gnomAD are also marked as having failed the inbreeding coefficient filter. Moreover, p.Ala84Thr is one of only two of eleven variants with an allele frequency > 0.001 that were not filtered out completely (indeed p.Ala84Val is filtered out), which further questions the legitimacy of this record. Regardless, the gene-expression and MitoTracker results support the notion that this variant could lead to increased MAP2K3 activity.
The curious case of CHEK2 p.Lys373Glu
We predicted p.Lys373Glu as a putative activating variant in checkpoint kinase 2 (CHEK2; Fig. 4C, top). Through phosphorylation of numerous substrates, CHEK2 regulates cell cycle arrest, DNA repair and apoptosis upon DNA damage, thus acting as a tumour suppressor (e.g. 54). Most somatic or cancer-predisposition variants in this kinase have been shown to result in loss or decreased kinase activity (e.g. 55,56).
Within COSMIC, this is the most common variant (present in 102 samples; the next is 47) seen in the large intestine, nervous system, kidney and other cancers. Lys373 lies three residues C-terminal of the conserved DFG motif. This region in other kinases contains a mixture of activating and deactivating variants. For example, the exact equivalent variant leads to constitutive or increased activity in IKBKB (p.Lys171Glu57) and an Arg to Gln change in ALK is a known constitutively active mutation (p.Arg1275Gln57). In contrast, a gain of a Lysine at this position can be deactivating, such as seen in PASK (p.Ala1151Lys58). The loss of a positive charge at this position has also been observed to be deactivating, for example in CDK8 (p.Arg178Gln)59. Deactivating variants are less similar, for example, p.His371Tyr two positions N-terminal in CHEK2 in breast cancer60 and the equivalent N-terminal position p.Leu597Val in BRAF was shown to be activating61.
There are thus good arguments for why this modification changing a lysine to glutamate might be activating in CHEK2 despite its role as a tumour suppressor. Running against this, activation of CHEK2 has recently been shown to confer resistance to oxaliplatin treatment in colorectal cancer62 and overexpression of CHEK2 was linked to worse survival in adrenocortical carcinoma63. Intriguingly, CHEK2 p.Lys373Glu is strongly correlated with patients’ progression-free survival in high-grade serous ovarian carcinoma post-olaparib treatment64. Taken together, this might suggest that the outcome of CHEK2 signalling perturbation is tissue-specific.
We could see no gene expression difference between induced over-expressed CHEK2 p.Lys373Glu and the wild-type enzyme or a kinase-dead variant (CHEK2 p.Thr68Ala65\). However, there is a pronounced difference between both variants and the CHEK2 WT when monitoring cell counts over a period of 72 hours with periodic treatment with the DNA-damaging agent H2O2 (Fig. 4C, bottom; Table S4F,G). Both variants appear to show a perturbation of the enzyme in that they have higher cell numbers than the wild type, resulting in increased apoptosis resistance. Thus it is clear, as has been shown previously66, that p.Lys373Glu is likely deactivating in T-REx-293 cells.
Inspection of CHEK2 structures offers some possible insights as to why this predicted activating variant is, in fact, deactivating. Lys-373 lies at the dimer interface in both dimeric forms of the enzyme67,68, and is thought to play a key role in CHEK2 activation68, in a manner that likely differs from most other kinases. The context of CHEK2 activation (not considered in our predictor) is thus likely why this was wrongly predicted. We are currently experimenting with adding additional features related to dimer-contacts, though this requires a more complete set of reliable homo/hetero-dimer structures than are currently available.