Discriminating activating, deactivating and resistance variants in protein kinases

doi:10.21203/rs.3.rs-5001235/v1

Research Article

This work is licensed under a CC BY 4.0 License

We present a data-driven approach to predict the functional consequence of genetic changes in protein kinases. We first created a large curated dataset of 375 activating/gain-of-function, 1028 deactivating/loss, 98 resistance and 1004 neutral protein variants in 441 human kinases by scouring the literature and various databases. For any variant, we defined a vector of 7 types of sequence, evolutionary and structural features. We used these vectors to train machine learning predictors of kinase variant classes that obtain excellent performance (Mean AUC = 0.941), which we then applied to uncharacterized variants found in somatic cancer samples, hereditary diseases and genomes from healthy individuals. Encouragingly, we predicted a greater tendency of activating variants in cancers, deactivating in hereditary diseases and few of both in healthy individuals. Using this method on clinical data can identify potential functional variants. In cancer samples we experimentally assessed the impact of several such mutations, including potential activating variants p.Ser97Asn in PIM1, where phosphorylation analysis suggests an increase in activity, and p.Ala84Thr in MAP2K3, where gene expression and mitochondrial staining show a reduction in mitochondrial function when contrasting mutant to wild type, the opposite having been observed previously during deletion experiments. We provide an online application to study any variant in the kinase domain that provides prediction scores in addition to a detailed list of what is known across all kinases near the position of interest. Besides supporting the interpretation of genomic variants of unknown significance, knowledge of kinase activation can lead to immediate therapeutic suggestions; we thus believe our approach will be a key component in the repertoire of tools for personalised medicine.

The growth in high-throughput sequencing in biomedicine presents clinicians and biomedical researchers with new challenges^1,2. Among these is the need to assess which, among hundreds or thousands of genetic variants detected in a typical sequencing experiment, are associated with the disease under study. Variants of unknown significance (VUS) are currently the clear majority of variants found in somatic tumour sequencing or clinical genetics^3,4 and the increased availability and affordability of sequencing means that this situation will remain for several years.

Tools for characterising variants go back to the 1990s and to this day tend to rely mostly on sequence conservation, sometimes combined with other structure or sequence features (e.g. ^4–6), to assess whether a particular variant will affect protein function. Variants linked to diseases are typically characterised as “damaging” or “pathogenic” with few additional insights into what they are actually doing in terms of protein function. Certain more recent tools offer additional insights from protein structure (e.g. ^7,8) or protein interactions (e.g. ⁹), but here too there is limited specificity beyond the fact that a variant lies at or near a protein interface or functional region.

Decades of molecular biology research have produced an abundance of results that can help interpret genetic variants in more detail. This includes known¹⁰ and predicted (e.g. ¹¹) protein structures, post-translational modifications (e.g. ¹²), proteomics, interaction discovery (e.g. ¹³) and decades of manual curation on protein function (e.g. ¹⁴). If appropriately integrated, these data can provide key insights as to what a particular region or position in a protein does and what the consequences of perturbing it might be. Moreover, proteins fortunately often fall into large families, meaning that insights learned from one member can be transferred to another.

Here we focus solely on protein kinases, one of the most important protein families in biomedicine, and present an approach that supplements popular predictors of variant impact by predicting whether a variant in the protein kinase domain is activating, resistance-causing, deactivating, or neutral. Kinases are one of the only families where there appear to be certain trends in terms of activating or resistance variants and where the data are sufficient to derive such predictors. GTPases (particularly G-proteins and Ras-like proteins) and G-protein coupled receptors are also possibilities, though for the former there are well established key positions that suggest gain-of-function and for the latter there are comparatively few observations of constitutive activation (e.g. ¹⁵). In the future, when many more gain- or loss-of-function variants are available for other families, it might be possible to make proteome-wide predictors.

Unlike other kinase function predictors¹⁶, our approach does not require specific structural information or knowledge of inhibitors, exploits a wide range of prior curated data on kinase-altering variants and post-translational modifications, and requires nothing from the user other than the particular variant. To demonstrate that this computational approach can also be used to guide clinically relevant research, we show experimental evidence for a selection of variants in the kinases PIM1, MAP2K3 and CHEK2 suggesting that these kinases are indeed altered in function, with two showing likely increased activity, an observation that can immediately suggest treatments. The analytical and experimental approach, and the associated web application, will be of immediate use to those interested in identifying functional kinase variants that could dictate treatment or diagnostic decisions.

Overview of known functional variants

We extracted a set of 2505 missense variants from databases and the literature within 441 human kinases that we classified according to their impact on kinase enzymatic activity (Figure S1A; Table S1A). A total of 375 missense variants in 140 kinases (247 or 66% in the kinase domain) lead to constitutive activation or greatly increased activity (hereafter activating). This class of variant is most famously seen in certain cancers, where a key (most often) somatic variant leads to greatly increased signalling as part of the pathogenesis of the disease and is thus often a drug target. However, our survey found more activating variants in genetic diseases (144 compared to 67 considering the 211 true disease variants; 126 of these are non-tumour/cancer predisposition variants; Table S1A), including activating variants in kinases implicated in Parkinson’s disease (LRRK2), immunodeficiency (SYK) or Pfeiffer syndrome (FGFR2).

There were 1028 variants (718 or 70% in the kinase domain) in 289 kinases that have been shown to lead to either a complete loss of function or decreased enzyme activity (deactivating). This type of variant is less common in cancer, with just 27 of 233 (including seven tumour predisposition variants) of the true disease variants being from somatic tumours. Loss-of-function variants predominate in kinase variants causative of genetic diseases (Table S1A).

We considered 98 variants (94 or 96% in the kinase domain) in 17 kinases that arose owing to cancer drug resistance¹⁷ (Figure S1A,D). These are more homogeneous than the other class as resistance is highly context-dependent (i.e. on the inhibitor, the kinase mechanism, etc.) and they very often overlap (see below) with either activating or deactivating variants. We thus considered them separately from the other classes throughout the analysis.

Lastly, we defined a set of 1004 variants (203 or 20% in the kinase domain) in 304 kinases that we presumed to be neutral based on their constitutional appearance with a high frequency in healthy humans¹⁸ (MAF ≥ 0.001; homozygous counts ≥ 2).
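The neutral-set criterion above can be written as a simple filter. The following is a minimal sketch, assuming each population variant is available as a record with an allele frequency and a homozygous count; the class name, field names and example records are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass


@dataclass
class PopulationVariant:
    """Hypothetical record for one variant seen in healthy individuals."""
    kinase: str
    change: str         # protein-level change (made-up examples below)
    allele_freq: float  # minor allele frequency (MAF)
    hom_count: int      # number of homozygous individuals


def is_presumed_neutral(v, min_maf=0.001, min_hom=2):
    """Apply the thresholds quoted in the text: MAF >= 0.001 and homozygous counts >= 2."""
    return v.allele_freq >= min_maf and v.hom_count >= min_hom


# Made-up records, for illustration only.
variants = [
    PopulationVariant("KINASE_A", "p.Ala10Val", 0.0200, 15),  # passes both thresholds
    PopulationVariant("KINASE_B", "p.Gly20Ser", 0.0005, 3),   # too rare
    PopulationVariant("KINASE_C", "p.Arg30Gln", 0.0100, 0),   # no homozygotes
]

neutral = [v for v in variants if is_presumed_neutral(v)]
print([f"{v.kinase} {v.change}" for v in neutral])
```

Requiring homozygous carriers as well as a minimum frequency is a conservative choice: a variant tolerated in both copies in healthy individuals is less likely to affect function.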

There are 168 instances of the same position in the same kinase having different known effects, of which many are also related to phosphorylation. These positions are highlighted in Table S1A.

The above trends on the training set largely hold for the variants within our test set (Table S1B).

We constructed an alignment of the catalytic domain plus maximally 30 residues at the N- and C- terminus for 464 human kinases (Methods) that we used to position all these variants onto a common positional frame. We also marked regions on this alignment according to the canonical functional regions of protein kinases¹⁵ (Fig. 1) that we refer to below.

Inspection of the location of the variants of different types reveals immediate trends (Fig. 1; Table S1A,B). First, as mentioned above, functional variants are greatly enriched in the catalytic domain (Table S1C). This is true for activating, deactivating and resistance variants in both the above training (and testing; see below) datasets, in addition to genetic disease variants from UniProt and somatic variants from COSMIC, both of which increase as evidence for disease/function increases (either evidence from UniProt or variant frequency in COSMIC; Table S1C). The opposite is true of neutral variants, which show a slight tendency to avoid the catalytic domain that increases slightly with the frequency of the variant in the population (Table S1C) as might be expected if they were truly neutral. There is a small but interesting subset of functional variants outside of the catalytic domain, including some variants with very high sample counts in COSMIC. Most of these are Cysteine losses or gains in the extracellular part of transmembrane tyrosine kinases that have been shown to lead to activation by inter-subunit disulphide bond formation (e.g. ¹⁹).

Resistance and activating variants tend to avoid the most conserved and functionally core parts of the kinase catalytic domain (Fig. 1, Figure S1B) and often overlap with each other. In contrast, deactivating variants very often hit key parts of the enzymatic machinery, particularly the catalytic Lysine and the Aspartate residues in the “HRD” and “DFG” motifs (Figure S1E). For instance, BTK p.Lys430Glu (catalytic Lys) abolishes kinase activity, leading to X-linked agammaglobulinemia, a rare genetic disorder characterized by the body's inability to produce normal B cells²⁰, and DAPK3 p.Asp161Asn (in the DFG motif) greatly reduces kinase activity, promoting cell survival and cell proliferation in ovarian mucinous carcinoma²¹. We found 15 positions within the alignment where known activating and deactivating variants overlap (at least 2 counts of each type and at least 5 variants). 10 of these positions lie within the A-loop, and 2 each in the N- or C-terminal tails of the kinase domain. For example, one alignment position in the activation loop has 23 activating and 60 deactivating variants (Fig. 1, Figure S1E). Inspection shows that these are almost always at phosphorylation sites, where most often activation is accomplished by mutation to a negative charge (Asp/Glu) and deactivation by a loss of the phosphorylatable residue (e.g. to Ala or similar) (Figure S1E). For example, IKBKB p.Ser181Glu, mutating the uncharged Serine to a negative charge, leads to full activation of its kinase activity and activation of the NF-kappa-B pathway²², while LATS2 p.Ser872Ala (at the equivalent alignment position), mutating the phosphorylatable Serine to an unphosphorylatable Alanine, is reported to lead to loss of its tumour suppressor activity in mice²³.
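The overlap criterion (at least 2 activating and 2 deactivating variants, and at least 5 variants, at one alignment position) can be sketched as a counting step. This is an illustrative reading, not the authors' code: it assumes variants have already been mapped to alignment positions, and it assumes that "at least 5 variants" refers to activating plus deactivating variants combined. The function name and the example data are hypothetical.

```python
from collections import Counter, defaultdict


def overlapping_positions(variants, min_each=2, min_total=5):
    """Return alignment positions carrying both activating and deactivating variants.

    variants: iterable of (alignment_position, variant_class) pairs.
    Assumption: the total threshold counts activating + deactivating variants.
    """
    counts = defaultdict(Counter)
    for position, variant_class in variants:
        counts[position][variant_class] += 1

    hits = []
    for position in sorted(counts):
        n_act = counts[position]["activating"]
        n_deact = counts[position]["deactivating"]
        if n_act >= min_each and n_deact >= min_each and n_act + n_deact >= min_total:
            hits.append(position)
    return hits


# Made-up alignment positions and classes, for illustration only.
example = (
    [(100, "activating")] * 3 + [(100, "deactivating")] * 2    # 3 + 2: qualifies
    + [(101, "activating")] * 1 + [(101, "deactivating")] * 6  # only 1 activating
    + [(102, "activating")] * 2 + [(102, "deactivating")] * 2  # only 4 in total
)
print(overlapping_positions(example))
```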

There are also many variants outside of, but near to the catalytic domain, particularly at the N- (76 variants) and C-terminal (61 variants) tails, with roughly the same proportion (of activating, deactivating, etc.) as the entire dataset, though with no resistance variants in the C-terminal tail (Fig. 1, Figure S1A). Just under half (43%) of these are at phosphosites, as might be expected, given that many kinases are phosphorylated in their tails as part of the activation process. Resistance mutations occur almost exclusively at or near the ATP binding pocket (N-lobe, catalytic and activation loops) and indeed we found none in or after the C-lobe (Figure S1A, C).

We initially explored separate analyses of Tyrosine and Serine/Threonine kinases. However, we observed considerable overlap in the datasets in the two classes, for instance, in terms of where the variants were on the canonical kinase structure (Figure S2A, B). Moreover, since certain datasets tend to skew more heavily to one class (e.g. resistance variants to Tyrosine kinases), we reasoned that the separation would also make our predictors (below) less effective owing to data paucity.

A machine-learning trained predictor for kinase variant function

The patterns clearly visible above suggested that it would be possible to derive a predictor of variant type by way of machine learning. Accordingly, we applied multiple machine learning algorithms (Gradient Boosting Classifier, Random Forest, Gaussian Naive Bayes, Support Vector Classifier, Multi-Layer Perceptron Classifier and an ensemble of these) to develop three contrasting predictors based on seven types of sequence and structural features (see Methods) (Table S2E, Figure S3). Based on the observations above, we only considered variants that are within the kinase domain or within 30 amino acids of the N- or C-terminus (excluding those residues that overlap with other domains; see Methods).

The first predictor, activating vs deactivating, represents a typical situation when one has what is believed to be a functional variant (e.g. observed many times in a cohort or dataset) and wishes to distinguish these two possibilities. The second, activating, deactivating or neutral, is more reflective of a situation where one does not know if a variant is functional at all and thus one needs to predict neutrals. The third predictor, resistance vs neutral, predicts if a given mutation is resistant or not. We avoided contrasting resistance to activating or deactivating due to the considerable overlap between activating and resistance (above) and because resistance variants were from an entirely distinct source (Figure S2A, B). Note that this also means that predictions of resistance should probably be considered alongside the other two predictors since there will necessarily be a tendency to predict many activating or deactivating sites as resistant.

All predictors were trained using sequence and structural-derived features obtained for the activating, deactivating, resistance and neutral variants above (within the catalytic domain region) and then tested on a smaller, not-overlapping test set of variants (Methods).

The best results were obtained using Gradient Boosting Classifier (GBC) with random forest (RF) being only marginally worse (Table S2E, Figure S3). Numbers and predictions in the sections that follow are from the best-performing GBC data apart from where stated. AUC values from ROC analysis during cross-validation (0.91–0.95) were comparable with those obtained on 145 variants from an independent test set absent from the training phase (0.80–0.94; Table S2B; Figure S4A). The biggest difference was seen when predicting activating vs neutral (0.91 training, 0.80 testing), which is possibly due to an increase in somatic relative to germline activating variants in the test set compared to the training set.
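For readers less familiar with the AUC values quoted above, the sketch below computes a ROC AUC directly from its rank interpretation: the probability that a randomly chosen positive variant receives a higher score than a randomly chosen negative one, with ties counted as one half. It is a plain-Python illustration on made-up scores, not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: 1 for the positive class (e.g. activating), 0 for the negative class.
    scores: predictor outputs, higher meaning more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))  # 8 of 9 positive/negative pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.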

As our dataset is unbalanced in terms of the number of each type of kinase variant (e.g. there are more than twice as many deactivating as activating variants) we also computed balanced accuracy (values between 0.72 and 0.85 for all predictors) and MCC (values between 0.38 and 0.69) all of which suggested good overall performance.
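To illustrate why these two metrics are informative on unbalanced data, the sketch below computes balanced accuracy and the Matthews correlation coefficient (MCC) from a binary confusion matrix using their standard definitions. The counts are invented for illustration, and the two-class form is a simplification: the three-class predictor would need the multi-class generalisations.

```python
import math


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented counts: 30 positives (e.g. activating) and 90 negatives (e.g. deactivating).
tp, fn = 20, 10
tn, fp = 80, 10

plain_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(plain_accuracy, 3))                     # 0.833
print(round(balanced_accuracy(tp, fp, tn, fn), 3))  # 0.778
print(round(mcc(tp, fp, tn, fn), 3))                # 0.556
```

In this example plain accuracy looks higher than balanced accuracy because the majority class dominates the count, which is the effect the balanced metrics are meant to correct.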

We also tested whether the training set was unfairly biassed owing to variants at the same position (i.e. to a different mutant residue) as those in the training set, though the differences were marginal (Figure S4B). The difference between GBC and RF on the test set was also small, with the former better for most, but not all predictors (Figure S4A,C).

The better performance for these decision tree based approaches agreed with our impressions from the dataset analysis above; namely that there are diverse contextual reasons for a position to be activating or deactivating (e.g. see discussion of phosphosites above). This suggests that more additive approaches (e.g. Naive Bayes) would be less optimal to distinguish these sites.

The fact that predictions of resistance have such a strong performance is somewhat surprising as inhibitors vary greatly and variants can be highly specific to one of a set of similar drugs. However, there are nevertheless features of resistance variants that seem to distinguish them from others. It is clear that the data (from COSMIC) are biassed towards ATP-site inhibitors and it is likely that performance on non-canonical inhibitors would be worse.

We also compared our results to existing predictors of variant impact ^5,6,24. It is important to emphasise that these other predictors do not attempt to distinguish different types of functional variants (activating/deactivating/resistance) but rather to distinguish broadly structurally/functionally disruptive variants (i.e. including all three types of variants in one) from neutral. When grouping activating, deactivating and resistance together, mimicking the more general prediction of “pathogenicity” used by other prediction tools, our approach performs similarly to other predictors (Table S2D, Figure S4D) on variants within the kinase domain, though clearly AlphaMissense is best overall. Interestingly, PolyPhen2 and PMUT fared worse on activating and resistance variants when considered separately, probably as these are likely under-represented in sets of “damaging” or “pathogenic” variants used to train older methods. It is crucial to emphasise that our goal is not to predict pathogenicity better than these predictors, but to distinguish more specific functional consequences that these generic predictors do not, and that the best-use case is to use our predictor alongside a more generic predictor of pathogenicity for variants that lie in the protein kinase domain (as indeed is often how clinically relevant candidate kinase variants are currently identified). Nevertheless, the fact that these results are similar to these other methods gives us confidence in our machine learning strategy for the other predictors. Figure S4D suggests that taking an AlphaMissense approach to these data might give still better results when attempting (e.g.) to discriminate activating from deactivating and neutral.

We also attempted to compare our predictions to an earlier method to predict kinase activating variants¹⁶; however, the need to specify particular experimental structures for each prediction made this problematic. To validate that the predictors did not overfit, we conducted a randomization test (see Methods). The performance (Table S2C) of randomized predictors (AUC close to 0.5) was worse than the original predictors, suggesting that our approach is not susceptible to overfitting.

The most important features for machine learning (determined using Gini importance) driving the predictions were found to be conservation across human kinases, homologs, paralogs, orthologs, and post-translational modification (PTM) information at the variant site (Figure S5). Interestingly, certain features have distinct contextual importance. While conservation across different sets of homologs (e.g. orthologs, paralogs, etc.) and PTMs are key to distinguishing activating and deactivating from neutral variants (Figure S5A), they are insufficient for distinguishing between activating and deactivating variants (e.g. consider variants occurring at phosphosites) (Figure S5B). For this, conservation across all kinases is highly relevant, since variants at the most conserved positions (e.g. catalytic Lys, DFG-motif, etc.) are known to disrupt kinase activity and are more likely to indicate deactivating positions, whereas positions conserved, for example, in mammalian orthologs (Figure S1B), could be either activating, resistance or deactivating. We suspect that this is the reason why different conservation values play such distinct roles across the various predictors. We found the ATP binding information to be most relevant for resistance variants (Figure S5C), which is expected since most kinase inhibitors are known to be ATP-competitive²⁵.

To aid in the visualisation and accessibility of the prediction results and known information about a given mutation, we developed a web application (activark.russelllab.org) which allows users to input multiple variants, rank them according to different predictors and peruse them in depth alongside data from our curated variant set. For this purpose we also developed a technique to rapidly annotate alignment sections.

Known and potentially new functional variants within existing datasets

We applied our approach to kinase variants from three large datasets: somatic cancer variants from COSMIC¹⁷, hereditary disease variants from UniProt¹⁴ and naturally occurring (healthy) variants from gnomAD¹⁸. Overall, the disease variants sets show enrichment in the kinase domain that increases as one raises the level of certainty (i.e. requiring an increasing number of COSMIC samples or evidence in UniProt annotations; Table S1A,B). Neutral variants, in contrast, show a slight tendency to avoid the kinase domain as might be expected (Table S1A,B). These observations support the general notion of a predictor to discriminate functional variants lying in the kinase domain.

The overall proportions of activating, deactivating and neutral predictions for these sets agreed with our expectations (Fig. 2C). For instance, activating variants were more pronounced in the somatic set, in contrast to deactivating that were more often in hereditary diseases, reflecting the high proportion of kinases among oncogenes and the general tendency for hereditary diseases to lead to loss of function rather than gain. Reassuringly, the natural variants had the greatest proportion of neutral predictions and very few activating or deactivating instances. Resistance variants (that were not already predicted as activating or deactivating) are nearly absent in hereditary diseases with the greatest proportion in somatic and a smattering in hereditary variants.

Within somatic variants in COSMIC¹⁷, deactivating variants from our curated set are comparatively rare, with only 20 observed in total (including 6 known tumour suppressors) across 187 samples, with very low individual sample counts (all < 50; 14 < 10; Table S3A). Surprisingly, known activating variants (even known somatic variants) also often have low sample counts in COSMIC. Although there are several that are seen thousands of times mostly due to a predominance of tumours driven by well-known cancer driver genes (e.g. JAK2 p.Val617Phe, EGFR p.Leu858Arg, KIT p.Asp816Val, BRAF p.Val600Glu/Lys), nearly half (54/110) of both activating and resistance variants are seen with counts of 50 or fewer; just under a third (33/110) seen with 10 counts or fewer (Table S3A). These include activating variants BRAF p.Leu597Val (seen 16 times) and MAP2K1 p.Gln56Pro (33 times) both of which are well-established oncogenic variants^26,27. This suggests the possibility that many other rare constitutively active variants might lie within the existing data with comparatively low sample counts. We explore some of these in greater detail below.

COSMIC also provided some additional insights into the nature of variants in kinases and some support for the efficacy of our predictor. Considering the set of kinases (of 68 in total) that are unambiguously assigned as oncogenes (45) or tumour suppressors (9) in the COSMIC Cancer Gene Census, and excluding variants already known to be deactivating or activating, we found a reasonable separation in activating or deactivating predictions when comparing oncogenes to tumour suppressors (Fig. 3A; note only three tumour suppressors and 12 oncogenes had at least four variants remaining after filtering knowns). Moreover, when considering the fraction of unclassified variants predicted as one or the other, it is clear that oncogenes have a higher fraction of activating and tumour suppressors of deactivating, as expected (Fig. 3B). This is interesting as information about the kinase itself was not part of the predictor and suggests that frequently observed variants in previously known oncogenes or tumour suppressors are likely to have the expected (i.e. activating or deactivating respectively) effect.

This analysis also highlighted several very strongly predicted activating variants in well-established oncogenes, or the opposite in tumour suppressors (red and green text in Fig. 3A). This has clear implications for both types of cancer genes. Naturally, activating variants can lead to personalised treatments in the form of specific kinase inhibitors. However, knowledge of deactivating variants in tumour suppressors can also have important implications for diagnostics; for instance, both CHEK2 and STK11 variants are difficult to assess in the context of hereditary cancers²⁸. We also observed a handful of instances where an oncogene has clearly deactivating, frequently observed variants, notably BRAF p.Asp594Gly/Gln in the DFG motif and p.Lys483Glu at the catalytic lysine, though these are well understood (and genuine exceptions)²⁹.

Considering kinases not in the Cancer Gene census and for which sufficient variants are seen in COSMIC (black triangles in Fig. 3B), the predicted fraction of activating/deactivating variants argues that CSNK2A1, PRKCB and TGFBR1 are more likely to be tumour suppressors and PAK5, MAP2K3, EPHA3 and EPHA7 oncogenes. This largely agrees with what has been recently postulated in the literature except for EPHA7 that is most often referred to as a tumour suppressor, though it also has roles in promoting tumours³⁰. It is important to emphasise that these findings are biassed towards the data that are currently in the COSMIC database.

We also performed a similar analysis of all kinase hereditary disease variants in the UniProt database that are not already annotated as activating or deactivating (Fig. 3B; Table S3B). We first identified kinase:disease pairs with at least two variants previously characterised as activating or deactivating and for which at least three predictions were made by the system. This gave three pairs being only activating, 13 only deactivating and two having both. When considering predictions for the uncharacterized variants (excluding those variants known to be activating/deactivating) we see a reasonable separation between only-activating and deactivating (Fig. 3C). As above for cancer genes, a number of kinase:disease pairs lacking prior characterization fall into discrete regions of the plot, thus suggesting whether the pathology is dictated by kinase activation or deactivation. For example, mutations in Microtubule-associated serine/threonine-protein kinase 3 (MAST3) are implicated in developmental and epileptic encephalopathy (DEE108)³¹. We predicted 6 out of 6 variants in MAST3 linked to DEE108 as activating (albeit with one marginal activating/neutral prediction). Two of these (p.Gly510Ser and p.Gly515Ser) are associated with increased phosphorylation of a MAST3 target protein and thus are likely a gain of function³¹. In contrast, we predict 5 out of 5 variants in FGFR1 associated with Hartsfield syndrome (HRTFDS) to be deactivating, and inspection shows four very likely are, as they involve highly conserved active site residues such as glycines in the N-terminal Glycine-rich loop (p.Gly490Arg) or residues within the catalytic loop/HRD motif (p.Asp623Tyr, p.Arg627Thr, p.Asn628Lys).

Among variants seen in healthy humans (gnomAD¹⁸), we found 53690 kinase domain variants of which 194 display a MAF ≥ 1%. From those, we found 7 variants with ≤ 5 homozygous counts (Table S3C). The high allele frequency together with depleted homozygous counts argues that such variants may be functional³². For example, p.Val124Gly in CDK1 affects a position that is conserved across orthologs and lies within the activation loop, and is predicted to be deactivating, as it is only 2 positions N-terminal of the HRD motif, where many deactivating mutations are seen in other kinases (STK11³³, CSFR1³⁴ and others). Elsewhere, the gain of a negative charge in the TNK1 variant p.Ala299Asp is predicted to be activating as it is 1 position N-terminal of known phosphosites in AKT1 and IKBKB (Table S1G). Our predictions together with the absence/low frequency of homozygous counts suggest that these and other variants could be functional.
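The reasoning that a common variant with few homozygotes may be functional can be made concrete with a back-of-the-envelope calculation. The sketch below applies the thresholds quoted in the text (MAF ≥ 1%, ≤ 5 homozygous counts) and compares the observed homozygote count with the Hardy-Weinberg expectation of n·q² for n individuals and allele frequency q. Using the Hardy-Weinberg expectation as the reference is an assumption made here for illustration; the text does not state how depletion was assessed. The cohort size and allele frequency are hypothetical.

```python
def expected_homozygotes(maf, n_individuals):
    """Expected number of homozygous carriers under Hardy-Weinberg equilibrium (n * q^2)."""
    return n_individuals * maf ** 2


def is_candidate(maf, hom_count, min_maf=0.01, max_hom=5):
    """Thresholds quoted in the text: MAF >= 1% and at most 5 homozygous counts."""
    return maf >= min_maf and hom_count <= max_hom


# Hypothetical cohort size and allele frequency, for illustration only.
n_individuals = 50_000
maf = 0.02

expected = expected_homozygotes(maf, n_individuals)
print(round(expected, 1))          # about 20 homozygotes expected at this frequency
print(is_candidate(maf, 3))        # True: common allele, few homozygotes observed
print(is_candidate(maf, 12))       # False: too many homozygotes to pass the threshold
```

A large gap between the expected and observed homozygote counts is what makes such variants worth a closer look, although population structure and sequencing artefacts can also produce it.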

Predictions for all kinase domain variants, including the subsets from the datasets above, are available for download at activark.russelllab.org/datasets.

Experimental tests of selected kinase variant predictions

To illustrate how the predictor can be used to determine interesting candidates for further experimental study, we selected 9 variants from four kinases for rapid experimental tests (Table S4). Seven of these came from the COSMIC analysis above, including four predicted activating variants (one of which was already known) and three predicted deactivating variants. We added two additional deactivating variants in the form of well-known alanine variants of key active site residues. We could not identify any clear data source of previously unknown resistance variant candidates on which to perform a similar screen. Moreover, experimentally testing resistance requires information on particular small-molecules that are often missing from COSMIC, and those for which these data were available were our (limited) source of resistance variants for training the method.

For each variant, we transfected T-REx-293 cells with tetracycline-compatible pDest30 vectors containing the gene of interest (wild type or mutant). After 24 h of tetracycline induction, we extracted RNA and assessed gene expression for biological replicates via microarrays (Figure S10). We compared induced cells to their transfected and non-induced counterparts, and induced mutants to induced wild-type kinase. While T-REx-293 cells should not be used to investigate particular diseases (e.g. lymphoma), we used this standardised cell line to detect changes in kinase activity. For PIM1 we additionally used commercially available phospho-antibodies to assess the phosphorylation levels of key sites targeted by the kinases. For CHEK2 we also assessed sensitivity to ROS-induced DNA damage and for MAP2K3 we investigated mitochondrial activity using the Mitotracker dye.

We saw little or no signal from our methods from the predicted deactivating variants, as might be expected given the traditional difficulties in establishing a loss of function. However, for one, the high expression similarity between the uncharacterized (predicted either resistance or deactivating) MAP2K1 p.Val211Asp and the known deactivating p.Lys97Ala suggests loss of activity, particularly as the known activating variant p.Gln56Pro looks very different (Figure S8; Table S4). For three of the four predicted activating variants we saw clear differences to wild type that we discuss in the next sections. Activating variants are of particular interest in that they can immediately suggest treatment by way of specific inhibitors. These (and resistance variants) are thus of the clearest clinical relevance.

The PIM1 p.Ser97Asn lymphoma variant leads to increased or constitutive activation

Ser97 in PIM1 lies in the C-helix (Fig. 4A top). Within COSMIC, Ser97 is the position in PIM1 with the most missense variants (97) of which 37 are p.Ser97Asn and all in haematopoietic and lymphoid cancers³⁵. This variant is predicted to be weakly activating by the random forest predictor. The loss of Asn at the equivalent position in NEK7 (p.Asn90Lys, p.Asn90Arg and p.Asn90Ala)³⁶ leads to a strong reduction in its kinase activity, suggesting that an Asn might be favoured at this position. There are also two known activating variants N-terminal to this position in ALK (p.Phe1174Leu/Val)³⁷.

Overexpression of mutant PIM1 showed no significant expression differences between p.Ser97Asn and wild type after 24 h of induction. We additionally investigated the phospho-Thr180 levels of MAPK14/p38, a well-known protein downstream of PIM1 signalling. We detected a two-fold depletion in phosphorylation of MAPK14/p38 Thr180 (via α-p38/pThr180 antibody), which would be expected if PIM1 activity was elevated (Fig. 4A, bottom, Figure S5A-B, Table S4A, B) as MAPK14/p38 activation is dependent on ASK1 activation, which PIM1 inhibits³⁸.

MYC overexpression via enhancer hijacking is the hallmark of several lymphoid cancers, particularly Burkitt Lymphoma³⁹. PIM1 phosphorylation of MYC is known to enhance its stability⁴⁰. There are thus good arguments for why increased PIM1 function would be expected in these tumours as either a primary or secondary driver event. PIM1 is thought to be intrinsically active (i.e. it does not require other factors to become active)⁴¹. It is possible that p.Ser97Asn is augmenting an already active enzyme in the context of these tumours, possibly to stabilise MYC. This raises the possibility that the other COSMIC PIM1 variants seen in these tumours and predicted to be activating by our method (e.g. p.Gly28Asp, p.Glu135Lys, p.Pro33Ser; see Fig. 3B) also increase its activity. Indeed, structural variants of MYC are largely absent in samples from a Follicular Lymphoma cohort where PIM1 variants are enriched³⁵. All of this argues that p.Ser97Asn and likely many of the other variants are indeed increasing PIM1 enzymatic activity as a possible driver of Lymphoid tumours. PIM1 is an established oncogene, though this is generally because of observed elevation of expression, for instance in prostate cancer⁴². Our findings and the growing number of missense observations argue that activating variants might also play a role in PIM1-mediated pathogenesis.

MAP2K3 p.Ala84Thr: a rare constitutively active driver in various cancers that greatly depletes mitochondrial gene expression

MAP2K3 (MEK3, MKK3) p.Ala84Thr is seen in 24 samples within COSMIC, with the most common (10 samples) from head and neck cancers44 in respiratory and upper-digestive tract tumours, and is predicted to be activating. Ala84 lies in the β2/3 loop and is eight residues C-terminal of the conserved GxGxxG motif (Fig. 4B, top).

Inspection of other kinases shows evidence of activating variants in the same region (Fig. 4B, top). For instance, PRKD1 (protein kinase Db1) has an activating variant at this position: p.Arg603His, seen in telangiectasia-ectodermal dysplasia-brachydactyly-cardiac anomaly syndrome, shows constitutive catalytic activity²⁹. Variants at positions two residues N-terminal to Ala84 lead to constitutive activation in JAK2 (p.Arg867Gln⁴³) and CDK4 (p.Arg24Cys⁴⁴), as do variants C-terminal of this position, such as MET p.Asn1100Tyr⁴⁵ or ZAP70 p.Lys362Glu⁴⁶. In addition, several kinases have phosphorylation sites in this region (Fig. 4B, top). For example, the Arabidopsis kinase SnRk2 is autophosphorylated at a position equivalent to Ser-86 in MAP2K3 (Ser-43⁴⁷).

Over-expression of the mutant MAP2K3 showed a marked difference in gene expression compared to the wild-type. Specifically, only p.Ala84Thr (and not wild-type or other variants, Fig. S8A) showed a drastic reduction (adj. p-value < 10⁻²⁰, Figure S7B) in the expression of hundreds of mitochondrial genes (Fig. 4B, bottom; Figure S6C, Figure S7A; Table S4C-E). Mitochondrial gene under-expression is seen when comparing the mutant to WT at 24 h or when comparing the mutant at 24 h to 0 h (nothing is significant when comparing WT at 24 h to 0 h apart from roughly 4-fold overexpression of MAP2K3). This is strong support for this mutation activating MAP2K3 as several studies have shown that down-regulation produces the opposite effect. Deletion of this enzyme in mice leads to an increase in mitochondria number and function⁴⁸,⁴⁹ and hyaluronan-mediated suppression of MAP2K3 expression in human mesenchymal stem cells similarly led to an increase in mitochondrial number and membrane potential⁵⁰. The fact that we are seeing the opposite behaviour and that we see no effect of the wild type supports the idea that this mutant has elevated kinase activity. Assessing mitochondrial activity using MitoTracker (Thermo Fisher) supported this finding, showing that the tetracycline-induced p.Ala84Thr had significantly lower activity than the un-induced (Fig S9, Table S4H). To our knowledge, no mitochondrial phenotype has to date been reported during the over-expression of MAP2K3. Though the true nature of this variant in the context of cancer might be different (i.e. in contrast to T-REx-293 cells), it has been proposed that general suppression of respiratory (i.e. mitochondrial) gene expression is seen in many cancer types⁵¹.

Breast cancer patients showing elevated expression of MAP2K3 have worse survival rates, particularly triple-negative, and the kinase is proposed to be oncogenic in driving MYC in certain patients⁵². MAP2K3 p.Ala84Thr has previously been classified as benign in a large screen of cancer somatic and germline genomes²¹, likely as the kinase itself did not stand out as a major driver across the pan-cancer set. This variant shows a suspiciously high frequency in gnomAD (though never homozygous, making it possibly not viable in two copies³²), though it is intriguingly absent from the 1000 Genomes dataset⁵³ (the gnomAD frequency would suggest over 300 counts in 1000 Genomes), making its population status somewhat unclear. Many samples for this variant in gnomAD are also marked as having failed the inbreeding coefficient filter. Moreover, p.Ala84Thr is one of only two of eleven variants with an allele frequency > 0.001 that were not filtered out completely (indeed p.Ala84Val is filtered out), which further questions the legitimacy of this record. Regardless, the gene-expression and MitoTracker results support the notion that this variant could lead to increased MAP2K3 activity.

The curious case of CHEK2 p.Lys373Glu

We predicted p.Lys373Glu as a putative activating variant in checkpoint kinase 2 (CHEK2; Fig. 4C, top). Through phosphorylation of numerous substrates, CHEK2 regulates cell cycle arrest, DNA repair and apoptosis upon DNA damage, thus acting as a tumour suppressor (e.g. ⁵⁴). Most somatic or cancer-predisposition variants in this kinase have been shown to result in loss or decreased kinase activity (e.g. ⁵⁵,⁵⁶).

Within COSMIC, this is the most common variant (present in 102 samples; the next is 47) seen in the large intestine, nervous system, kidney and other cancers. Lys373 lies three residues C-terminal of the conserved DFG motif. This region in other kinases contains a mixture of activating and deactivating variants. For example, the exact equivalent variant leads to constitutive or increased activity in IKBKB (p.Lys171Glu⁵⁷) and an Arg to Gln change in ALK is a known constitutively active mutation (p.Arg1275Gln⁵⁷). In contrast, a gain of a Lysine at this position can be deactivating, such as seen in PASK (p.Ala1151Lys⁵⁸). The loss of a positive charge at this position has also been observed to be deactivating, for example in CDK8 (p.Arg178Gln)⁵⁹. Deactivating variants are less similar, for example, p.His371Tyr two positions N-terminal in CHEK2 in breast cancer⁶⁰ and the equivalent N-terminal position p.Leu597Val in BRAF was shown to be activating⁶¹.

There are thus good arguments for why this modification changing a lysine to glutamate might be activating in CHEK2 despite its role as a tumour suppressor. Running against this, activation of CHEK2 has recently been shown to confer resistance to oxaliplatin treatment in colorectal cancer⁶² and overexpression of CHEK2 was linked to worse survival in adrenocortical carcinoma⁶³. Intriguingly, CHEK2 p.Lys373Glu is strongly correlated with patients’ progression-free survival in high-grade serous ovarian carcinoma post-olaparib treatment⁶⁴. Taken together, this might suggest that the outcome of CHEK2 signalling perturbation is tissue-specific.

We could see no gene expression difference between induced over-expressed CHEK2 p.Lys373Glu and the wild-type enzyme or a kinase-dead variant (CHEK2 p.Thr68Ala⁶⁵). However, there is a pronounced difference between both variants and the CHEK2 WT when monitoring cell counts over a period of 72 hours with periodic treatment with the DNA-damaging agent H₂O₂ (Fig. 4C, bottom; Table S4F,G). Both variants appear to show a perturbation of the enzyme in that they have higher cell numbers than the wild type, resulting in increased apoptosis resistance. Thus it appears, as has been shown previously⁶⁶, that p.Lys373Glu is likely deactivating in T-REx-293 cells.

Inspection of CHEK2 structures offers some possible insights as to why this predicted activating variant is, in fact, deactivating. Lys-373 lies at the dimer interface in both dimeric forms of the enzyme⁶⁷,⁶⁸, and is thought to play a key role in CHEK2 activation⁶⁸, in a manner that likely differs from most other kinases. The context of CHEK2 activation (not considered in our predictor) is thus likely why this was wrongly predicted. We are currently experimenting with adding additional features related to dimer-contacts, though this requires a more complete set of reliable homo/hetero-dimer structures than are currently available.

Despite great evolutionary diversity, there are still clear common trends within kinases about how variants affect them. For instance, just 14 alignment positions capture roughly 33% (494 out of 1501) of known functional sites. The simple presence of a variant at a position, while often predictive, is not sufficient, however, particularly for variants occurring at phosphosites where the charge on the mutated amino acid (negative or not) determines the likely effect (activating or deactivating).

A major result of this work is the simple overlay of carefully annotated positional information on variants and functions in the context of a well-constructed multiple-sequence alignment (e.g. Figure 1). The ability to exploit these data via machine learning to predict whether variants are activating, deactivating or resistance-related should prove useful to those wishing to interrogate kinase variants in the context of diseases and particularly cancers. Application of our predictor to the entire sets of somatic and hereditary disease variants affecting kinase domain positions both provided additional evidence for the method’s efficacy and showed that variants for particular kinases as a whole tend to be predicted as expected (e.g. oncogenes have mostly activating variants; tumour suppressors deactivating). Given predictions for variants of interest, the comparatively simple tests we performed (i.e. via gene expression, phosphoantibodies or comparatively rapid cell biology tests) demonstrate that it is possible to couple the computational analysis (i.e. using our predictor alongside other general predictors of pathogenicity) to experiments within the time frame of diagnostic and treatment decisions (i.e. 2–3 months, Table S4I).

Our two new likely activating variants (in PIM1 and MAP2K3) have very low sample counts in current cancer datasets. Indeed, considering only tumour-derived, confirmed somatic variants the counts are only 11 for PIM1 p.Ser97Asn and six for MAP2K3 p.Ala84Thr. This suggests that there might be many others previously overlooked (e.g. in Table S3A) owing to their rarity, but which could nevertheless be informative in the context of the particular patients harbouring them. Regardless of frequency, the ability to rapidly identify and characterise such variants of unknown significance can have immediate consequences, particularly for protein kinases. Knowing that a kinase contains an activating or resistance variant can immediately suggest changes to treatment regimens that are more appropriate to the specific patient. For 293 out of 505 kinases there are drugs already on the market, for an additional 7 there are compounds in development and earlier candidates likely for many of the remainder (data from PKIDB⁶⁹; retrieved 30 June 2023). For example, both PIM1 and MAP2K3 lack clinically approved inhibitors, but both have compounds in different stages of development⁴²,⁷⁰. We thus believe that approaches like the one provided here, together with the exponential growth in sequencing and an increasing arsenal of modulation strategies, will assist researchers to rapidly assess novel variants of unknown significance and help medical professionals make personalised medicine a reality.

Human kinase set and alignment

We obtained a set of human proteins containing kinase domains by extracting annotations from UniProt¹⁴ and HMMer⁷¹ searches of Pfam⁷² hidden Markov model profiles (Pkinase and PK_Tyr_Ser-Thr) against human UniProt sequences (Table S1C). For the proteins with multiple kinase domains, we divided the protein sequence into segments corresponding to the number of domains present, ensuring that each segment encompassed the sequence specific to its corresponding domain. This led to a total of 517 kinase domains from 477 proteins (Table S1D). We defined a functional kinase set of 484 domains (454 proteins) to build profiles for aligning sequences by excluding kinases or domains marked as catalytically inactive or pseudokinase in UniProt. The full set of kinases is available for analysis, though we considered it important to exclude functionally ambiguous domains during the training of the predictor.

We aligned the kinase sequences using the hmmalign tool from HMMER (version 3.1b2) against their respective Pfam hidden Markov model profiles (Pkinase and PK_Tyr_Ser-Thr) and trimmed the regions that were outside of the kinase domain. We added a maximum of 30 residues to the N- and C-termini for each kinase sequence, stopping before this if a domain was present. We merged the two alignments using MAFFT (v7.520) (Table S1E) and used the hmmbuild tool (symfrac = 0) to construct a profile hidden Markov model of the merged alignment (Table S1F). The alignment is available at activark.russelllab.org/alignment.

Variant sets

We used a number of databases to obtain what we consider to be a reliable set of variants that modulate kinase function. Specifically, we identified variants that were activating, increasing, deactivating, and decreasing kinase function, those that were associated with drug resistance and a presumed neutral set.

Functional kinase variants and mutagenesis from UniProt

We downloaded all known mutagenesis or disease variant data (n = 6995) within the kinases above from the UniProt¹⁴ database (version 2023_02), considering only single amino acid changes. We then went through the description (and often cited references) of each variant manually to identify clear evidence for kinase activity. We annotated a variant as ‘constitutively active’ or ‘increase’ if it led to constitutive activation or an increase in the kinase activity, respectively. Similarly, we annotated a variant as ‘loss’ or ‘decrease’ if it led to complete loss or a decrease in kinase activity, respectively. This led to a set of 172 (121 in kinase domain) ‘constitutively active’, 203 (126) ‘increase’, 561 (435) ‘loss’ and 467 (283) ‘decrease’ variants in human kinases.

Additional activating variants from PubMed

We downloaded the entirety of PubMed (up to 21 December 2022) and sought sentences in abstracts or titles containing a human gene name synonym (from UniProt) and matches to regular expressions for mutations (one or three-letter codes arranged before and after an integer). We then used BLAST⁷³ alignments of UniProt reviewed and un-reviewed sequences corresponding to human sequences sharing the same canonical gene name to identify putative sequences having the right wild-type amino acid at the right position and then to map these (where possible) back to the canonical UniProt reviewed entry. These variants were ranked according to the number of PubMed entries mentioning them and filtered as to whether they were kinases and whether the variant occurred within the kinase domain as identified by HMMer searches of the Pfam Pkinase hmm profile.
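The text-mining step above can be illustrated with a minimal sketch, assuming mutations are written with one-letter codes (e.g. V600E) or three-letter codes (e.g. Ser97Asn), optionally prefixed by "p.". The pattern `MUTATION_RE` and the helper `find_mutations` are hypothetical names introduced here for illustration and are not the expressions used to build the dataset.

```python
import re

# One-letter codes keyed by three-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA3 = "|".join(THREE_TO_ONE)                        # "Ala|Arg|Asn|..."
AA1 = "[" + "".join(THREE_TO_ONE.values()) + "]"    # "[ARND...]"

# Assumed pattern: optional "p." prefix, wild-type residue, integer position,
# mutant residue, using either three-letter or one-letter codes throughout.
MUTATION_RE = re.compile(
    rf"\b(?:p\.)?(?:(?P<wt3>{AA3})(?P<pos3>\d+)(?P<mut3>{AA3})"
    rf"|(?P<wt1>{AA1})(?P<pos1>\d+)(?P<mut1>{AA1}))\b"
)


def find_mutations(text):
    """Return (wild type, position, mutant) tuples in one-letter code."""
    hits = []
    for m in MUTATION_RE.finditer(text):
        if m.group("wt3"):
            wt = THREE_TO_ONE[m.group("wt3")]
            pos = m.group("pos3")
            mut = THREE_TO_ONE[m.group("mut3")]
        else:
            wt, pos, mut = m.group("wt1"), m.group("pos1"), m.group("mut1")
        if wt != mut:  # ignore matches where both residues are identical
            hits.append((wt, int(pos), mut))
    return hits


print(find_mutations("BRAF V600E and PIM1 p.Ser97Asn were reported"))
```

A pattern of this kind also matches unrelated tokens of the same shape (for example some cell-line names), which is why the wild-type residue is checked against the protein sequence and the candidates are curated manually.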

These 84,802 putative variants were then cross-referenced with the PubMed entries matching searches for “constitutive* AND (activati* OR activate*)” to give 607 candidate constitutively active variants. We went through both lists (manually) to identify variants with clear evidence for each phenomenon and to remove spurious variants arising owing to chance amino acid matches or spurious text matches. This gave an additional 62 constitutively active variants including 42 within the kinase domain (Table S1A).

Candidate driver and resistance variants from COSMIC

We obtained all missense and confirmed somatic variants (and their sample counts) within the kinases above from the COSMIC¹⁷ database (version 97, 29 Nov 2022), having first determined canonical UniProt sequence positions by aligning COSMIC to UniProt sequences with Muscle⁷⁴. We also retrieved the set of 98 resistance variants (and the corresponding kinase inhibitor; 86 within the kinase domain) from the same version of COSMIC, considering only confirmed somatic variants derived from tumour samples (Table S1A).

Neutral variants from gnomAD

We obtained naturally occurring variants within the kinases above from the gnomAD database (version v2.1.1, 124,748 exomes)¹⁸ together with minor allele frequencies (MAF) and counts of heterozygous and homozygous instances. We mapped dbSNP variants in gnomAD to UniProt canonical kinase protein accessions. We converted instances with Minor Allele Frequency (MAF) > 50% by inverting them (100%-MAF). We defined neutral variants as those having a MAF > 0.1% and required all such variants to have at least two homozygous instances to avoid oddities related to exclusively heterozygous (and potentially disease) variants³². This led to a list of 1004 neutral variants with 203 within the kinase domain (Table S1A).
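The neutral-variant criteria above can be sketched as a short filter. This is an illustrative reading of the stated thresholds; the function names `folded_maf` and `is_neutral` are hypothetical, and allele frequencies are assumed to be given as fractions between 0 and 1.

```python
def folded_maf(allele_frequency):
    """Invert frequencies above 50% so the value refers to the minor allele."""
    if allele_frequency > 0.5:
        return 1.0 - allele_frequency
    return allele_frequency


def is_neutral(allele_frequency, n_homozygous, min_maf=0.001, min_homozygous=2):
    """Apply the two criteria: MAF > 0.1% and at least two homozygous instances."""
    return folded_maf(allele_frequency) > min_maf and n_homozygous >= min_homozygous


print(is_neutral(0.002, 3))    # common enough and seen homozygous
print(is_neutral(0.002, 1))    # too few homozygous instances
print(is_neutral(0.9995, 5))   # minor allele is too rare after inversion
```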

A validation set of variants

We compiled a validation variant set by mining PubMed articles published after 21 Dec 2022 for kinase variants, as described above, and removing any that were already in our dataset. We then manually annotated these as to whether they were reported to be activating, deactivating or resistance. From 527 variants in kinases, we identified 173 that were one of these three (62 activating, 56 deactivating and 58 resistance) of which 131 (76%) were inside the catalytic domain (41, 37 and 53). We also attempted to update the data from UniProt, though we did not find a single additional annotated variant associated with increase/gain/decrease/loss/resistance that was not already in our training set.

We also computed a neutral variant test set by updating our gnomAD set (20 April 2024), also ignoring those already in the set above (22 Dec 2022). This led to 673 neutral variants, 166 (25%) in the catalytic domain.

This set is given in Table S2B.

Phosphosites

We obtained the dataset of 7740 post-translational modifications (PTMs) in kinases from PhosphoSitePlus⁸⁴ (retrieved on 11 November 2022). To minimise the possibility of false positives, we limited our analysis to sites with PTMs that were supported by at least one low-throughput or two high-throughput citations (Table S1G).

Sequence conservation & structure features for machine learning

We used seven types of sequence, evolutionary and structural features (Table S2A) to construct a vector for each variant within the kinase domain or within 30 residues of the N- or C-terminus in the datasets above (Fig. 2).

One-hot encoding and charges of wild-type and mutated amino acids

We constructed two distinct vectors that represent the wild-type and mutated amino acids. Each amino acid was encoded by a binary vector of length 20, with a value of 1 at the corresponding position and 0s elsewhere. We constructed an additional vector that encodes the charge on the wild-type and mutated amino acids.
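A minimal sketch of this encoding follows. The alphabetical residue order, the charge assignment (Asp/Glu as −1, Lys/Arg as +1, all others including His as 0) and the function names are assumptions made for illustration; the encoding used for the predictor may differ in these details.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues

# Assumed formal side-chain charges; residues not listed are treated as neutral.
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def one_hot(aa):
    """Binary vector of length 20 with a single 1 at the residue's index."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1
    return vec


def encode_variant(wt, mut):
    """Concatenate wild-type one-hot, mutant one-hot and the two charges."""
    return one_hot(wt) + one_hot(mut) + [CHARGE.get(wt, 0), CHARGE.get(mut, 0)]


vector = encode_variant("K", "E")   # e.g. CHEK2 p.Lys373Glu
print(len(vector), vector[-2:])
```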

Phosphomimetic or acetylation mimicking

A variant was considered phosphomimetic if the amino acid changed from a Ser(S) or Thr(T) to an Asp(D) or Glu(E), and acetylation mimicking if the amino acid changed from Lys(K) to Gln(Q).
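These two definitions translate directly into boolean features, sketched below with hypothetical function names and one-letter residue codes.

```python
PHOSPHO_ACCEPTORS = {"S", "T"}   # Ser, Thr
ACIDIC = {"D", "E"}              # Asp, Glu


def is_phosphomimetic(wt, mut):
    """Ser/Thr replaced by Asp/Glu."""
    return wt in PHOSPHO_ACCEPTORS and mut in ACIDIC


def is_acetylation_mimicking(wt, mut):
    """Lys replaced by Gln."""
    return wt == "K" and mut == "Q"


print(is_phosphomimetic("S", "D"), is_acetylation_mimicking("K", "Q"))
```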

ATP binding pocket

We calculated the number of known ATP binding sites at the position equivalent to the variant in the alignment. We obtained the list of known ATP binding sites in human kinases from UniProt (version 2023_02) (Table S1H).

Post-translational modification information

We incorporated known post-translational modification (PTM) information (see Phosphosites section above) of the variant position and its adjacent positions (window size = 5) as a feature vector, with a length equal to the number of possible PTM types (phosphorylation, acetylation, methylation, etc.). The presence of a specific PTM type was represented by 1, and otherwise as 0. We repeated the procedure to incorporate known PTM information at the alignment position equivalent to the variant position, and its adjacent residues (window size = 5). Each element in the vector encoded the number of kinases harbouring the corresponding PTM type at the given position in the alignment (Table S1G).
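The two PTM feature vectors can be sketched as follows. Several details are assumptions made for illustration: the window of size 5 is read as the variant position plus two residues on either side, positions in the window are collapsed into one value per PTM type, and `PTM_TYPES` lists only a small subset of possible types. All names are hypothetical.

```python
PTM_TYPES = ("phosphorylation", "acetylation", "methylation", "ubiquitination")


def window(position, flank=2):
    """Positions covered by a window of size 2 * flank + 1 around `position`."""
    return range(position - flank, position + flank + 1)


def ptm_presence(ptms, position, flank=2):
    """Per-kinase feature: 1 if a PTM type occurs anywhere in the window.

    `ptms` is a set of (sequence position, PTM type) pairs for one kinase.
    """
    return [
        int(any((p, ptm_type) in ptms for p in window(position, flank)))
        for ptm_type in PTM_TYPES
    ]


def ptm_alignment_counts(aligned_ptms, aln_position, flank=2):
    """Alignment feature: number of kinases with each PTM type in the window.

    `aligned_ptms` is an iterable of (kinase, alignment position, PTM type).
    """
    counts = []
    for ptm_type in PTM_TYPES:
        kinases = {
            kinase
            for kinase, pos, observed in aligned_ptms
            if observed == ptm_type and pos in window(aln_position, flank)
        }
        counts.append(len(kinases))
    return counts


print(ptm_presence({(180, "phosphorylation")}, 181))
```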

Loss/gain of amino acids in known mutations

We also incorporated the number of times an amino acid was observed to be a wild-type (loss) or mutated (gain) in a mutation type (i.e. activating, deactivating, and resistance) at the position equivalent to the variant (and its adjacent residues; window size = 5) in the alignment. We set the count initially to zero for all the amino acids at all alignment positions. For a loss of an amino acid at an alignment position in a mutation type, we decreased the corresponding count by 1, and increased for a gain (Table S1A).
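The counting scheme can be sketched as below: counts start at zero, each known variant subtracts 1 for the lost (wild-type) residue and adds 1 for the gained (mutant) residue at its alignment position. How the counts are then read out for a query variant is an assumption here (the counts for the query's own wild-type and mutant residues across a window of two positions either side), and all names are hypothetical.

```python
from collections import defaultdict


def build_counts(known_variants):
    """Accumulate signed counts per (mutation type, alignment position, residue).

    `known_variants` is an iterable of (alignment position, wt, mut, type).
    """
    counts = defaultdict(int)  # every count starts at zero
    for pos, wt, mut, mutation_type in known_variants:
        counts[(mutation_type, pos, wt)] -= 1   # loss of the wild-type residue
        counts[(mutation_type, pos, mut)] += 1  # gain of the mutant residue
    return counts


def loss_gain_features(counts, pos, wt, mut, mutation_type, flank=2):
    """Counts for the query wt and mut residues at each position in the window."""
    features = []
    for p in range(pos - flank, pos + flank + 1):
        features.append(counts.get((mutation_type, p, wt), 0))
        features.append(counts.get((mutation_type, p, mut), 0))
    return features


known = [(100, "K", "E", "activating"), (100, "K", "N", "activating"),
         (101, "R", "Q", "deactivating")]
counts = build_counts(known)
print(loss_gain_features(counts, 100, "K", "E", "activating", flank=1))
```

In a full feature vector this read-out would be repeated for each mutation type (activating, deactivating and resistance).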

Conservation metrics across different sets of homologs

We extracted log scores for each amino acid and position from the profile hidden Markov model (see section Human kinase set and alignment above) and used the wild-type and mutated scores as features. We did the same for three additional alignments determined after pan-proteome comparisons and ortholog/paralog determination. We divided the orthologs based on the phylogeny into eukaryotes, metazoa, vertebrates, and mammals and used conservation across them as features. Specifically, this included conservation scores from three alignments (all homologs, best-per-species orthologs and exclusive paralogs used previously³²).

Structural features

We used AlphaFold2 structures for each kinase to determine the secondary structure, accessibility and backbone psi/phi angles using DSSP. We used IUPred⁷⁵ to determine disorder scores. We scored intra-protein side-chain-to-side-chain contacts using Mechismo. For all values, we determined log-odds values for each amino acid in each environment and used these values and their mutant-wild-type differences as features (as described previously³²).

All datasets and features are available and can be downloaded as tar files from activark.russelllab.org/datasets.

A machine learning predictor of kinase variant class

We applied a Gradient Boosting Classifier (Scikit-learn library⁷⁶, Python version 3.10) to predict the classification of variants according to all potentially applicable contrasts. This method builds an ensemble of decision trees by sequentially adding weak learners, where each new learner corrects errors made by its predecessors. We did this to test the ability of the system to distinguish variant sets from each other and to arrive at a final set of useful contrasts.

We used the ‘predict_proba’ method from the Scikit-learn library, which calculates the predicted class probabilities for a variant by averaging the probabilities across the trees in the forest. The class with the highest probability is considered the predicted class. To avoid bias arising from data imbalance, we used the class_weight parameter in balanced mode.

We performed stratified 10-fold cross-validation to tune the parameters of the predictors. The parameters tested included the number of boosting stages to perform {10, 25, 75, 100}, maximum depth of individual regression estimators {3,4,5}, minimum samples required at the leaf {3,5,7,10,12}, and the minimum number of samples required to split a node {3,5,7,10,12}. We determined the optimal parameters via grid search using the area under the receiver operating characteristic curve (AUC-ROC) as the evaluation metric. To ensure result robustness, we repeated the procedure 10 times for all the predictors and calculated the average AUC-ROC and standard deviation (Table S2B). We used standard performance metrics including the Matthews Correlation Coefficient (MCC), Recall/Sensitivity (REC/SEN) and Specificity (SPE). To ensure the models did not overfit, we also performed a randomization test⁷⁷ by repeating the above procedure with randomly shuffled labels (Table S2C). We used the feature_importances_ property (Scikit-learn library) to calculate the feature importance (Figure S5).
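For reference, the threshold-based metrics named above follow from the four confusion-matrix counts. The sketch below uses the standard definitions in plain Python for a single binary contrast; the function names are hypothetical, and in practice the equivalent Scikit-learn functions would normally be used.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary contrast."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def performance(y_true, y_pred, positive=1):
    """Matthews correlation coefficient, sensitivity (recall) and specificity."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "MCC": (tp * tn - fp * fn) / denominator if denominator else 0.0,
        "SEN": tp / (tp + fn) if tp + fn else 0.0,
        "SPE": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy labels: 1 = activating, 0 = neutral (illustrative values only).
print(performance([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```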

We finally tested the performance of the predictors on 145 missense variants (Activating: 37, Deactivating: 40, Resistance: 52, Neutral: 146) that were absent from the training set and are known to be both constitutively activating and resistant (Table S2D). Additionally, we predicted the functional consequence of uncharacterized variants in COSMIC and UniProt datasets using our predictors (see Results).

We selected the Gradient Boost Classifier after testing a battery of machine learning methods using the same conditions described above (Random Forest, Gradient Boost Classifier, Support Vector Machine, Neural Network, and Naive Bayes, and an ensemble of these methods). The other approaches showed decreased performance during the cross-validation phase and at best similar performance on the test set (Table S2E). As Random Forest was only marginally worse than Gradient Boost Classifier, we kept both sets of predictors in terms of data provided and the web application.

Web app

To complement the analysis, we developed a web application using the Flask web framework⁷⁸ and JavaScript libraries. The front end of the application was developed using HTML and CSS while the back end was developed using Python (v3.10). The web application is freely available to users at activark.russelllab.org. The site also contains downloads of data used and additional figures to add interpretation.

Modulation of gene expression

We ordered gene sequences (Integrated DNA Technologies) containing the gene of interest with two stop codons flanked by the 5’ attB1 (5’-ACAAGTTTGTACAAAAAAGCAGGCTTC-3’) and 3’ attB2 sequences (5’-ACCCAGCTTTCTTGTACAAAGTGGT-3’). Gateway™ cloning into a pDest30 backbone (Thermo Fisher, #12301016) was performed following supplier instructions (Thermo Fisher, #11789020 and #11791020).

We then grew and maintained T-REx-293 cells (Thermo Fisher, #R71007) in T-REx Standard Culture Medium (high glucose DMEM, Thermo Fisher, #61965026) with 10% v/v FBS (Thermo Fisher, #26140079) and 100 U/mL penicillin and 100 µg/mL streptomycin (Thermo Fisher, #15140122). We transfected with pDest30 vectors to transiently overexpress wildtype or mutant genes in T-REx-293 cells, stably expressing the Tet repressor in the presence of 5 µg/mL blasticidin (Thermo Fisher, #R21001). Transfection was performed following the supplier’s instructions (Thermo Fisher, #L3000001) by transfecting 400,000 cells with 2.5 µg plasmid DNA.

After 96 h of selection pressure with 350 µg/mL geneticin (Thermo Fisher, #10131035), we induced expression using tetracycline (1 µg/mL, Thermo Fisher, #A39246) for 24 h and harvested cells for whole cell RNA isolation (Qiagen, #79254 and #74104).

DNA damage induction assay

We transfected T-REx-293 cells with plasmids containing CHEK2 WT, p.Thr68Ala or p.Lys373Glu to test the sensitivity to DNA damage⁷⁹. After antibiotic selection, we seeded approximately 5 × 10⁵ cells and induced them with tetracycline (1 µg/mL) for 24 h. We exposed cells to 400 µM H₂O₂ (Sigma Aldrich, #1086001000)⁸⁰ for 15 min and counted viable cells 24 h later (trypan blue, Thermo Fisher, #15250061).

Microarray analysis

We confirmed overexpression of target genes by q-RT-PCR (Thermo Fisher, StepOnePlus) and prepared 10 µL total RNA with a concentration of 50 ng/µL in biological replicates for microarray analysis (via the Genomics and Proteomics Core Facility, German Cancer Research Center, 69120 Heidelberg, Germany). We deposited data in the Gene Expression Omnibus (GSE232293).

We analysed raw data using the R package maEndtoEnd⁸¹ and assessed data quality via arrayQualityMetrics⁸², removing any flagged chips before background correction and calibration. To remove low-intensity signals we filtered data by setting a threshold based on median intensities. We defined contrast groups (mutation vs control, Figure S6, Figure S10) and used empirical Bayes statistics to define differential expression (eBayes). For selected sets of significantly dysregulated genes (adjusted p-value ≤ 0.05) we performed pathway enrichment analysis via gProfiler⁸³ (Figure S7b).

Detection of kinase phosphosites

To test human MAPK14/p38 Thr180 and Tyr182 phosphorylation we used an enzyme-linked immunosorbent assay (ELISA) kit (RayBiotech, #CBEL-P38-2). We grew 4 × 10⁴ T-REx-293 cells and then transfected, selected and induced them (in triplicates) in wells of a 96-well plate (VWR, #734-0025), using a volume of 200 µL. The 96-well plate was coated with 20 µL of 0.1 mg/mL Poly-L-Lysine (Sigma Aldrich, #P9155-5MG) for 2 h at RT prior to seeding. 24 h after tetracycline induction, we performed the ELISA procedure following the supplier’s protocol and measured absorbance at 450 nm (Tecan, Spark).

After removing the remaining solvent from the wells, we washed cells with deionized water and then stained them with 50 µL 0.1% (w/v) crystal violet (Sigma Aldrich, #G2039-100G) for 20 min at RT. We then washed cells again with deionized water before destaining them with 100 µL 80% (w/v) ethanol for 30 min at RT. We measured absorbance at 590 nm to determine cell density.

We normalised the resulting α-phospho-p38 values based on the average signal and the relative cell density of each well determined by crystal violet staining. We determined outliers using a Z-score transformation:

$$Z=\frac{x-\mu}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}{\left({x}_{i}-\mu\right)}^{2}}}$$

where x is the value of a single measurement, µ corresponds to the mean of an experimental group and n is the number of measurements in that group. We considered measurements with |Z| ≥ 3 to be outliers and removed the associated values.
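The outlier filter can be sketched in a few lines, assuming the Z-score is the deviation of each measurement from its group mean divided by the sample standard deviation (n − 1 denominator). The function names and the example values are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Z-score of each value relative to the group mean and sample SD."""
    mu = mean(values)
    sd = stdev(values)  # uses the n - 1 denominator
    return [(x - mu) / sd for x in values]


def remove_outliers(values, threshold=3.0):
    """Drop measurements whose absolute Z-score reaches the threshold."""
    if len(values) < 2 or stdev(values) == 0:
        return list(values)  # Z-scores are undefined; keep everything
    return [x for x, z in zip(values, z_scores(values)) if abs(z) < threshold]


group = [1.0] * 11 + [100.0]   # illustrative normalised signals
print(remove_outliers(group))
```

With the n − 1 denominator, |Z| within a group is bounded by (n − 1)/√n, so a threshold of 3 can only be reached in groups of at least 11 measurements.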

Mitotracker Staining

Cells were grown on coverslips (ibidi, #81158) and transfected as described above followed by induction with tetracycline for 48 h. Culture medium was replaced with medium containing 100 mM MitoTracker Red CMXRos (Thermo Fisher, #M7512) for 15 min at 37°C. Staining medium was then replaced with growth medium and cells were imaged immediately on a Nikon Ti2 microscope (Ex 542/20, Em 620/52) (Figure S10).

Competing interests

The authors declare no competing interests.

Author Contribution

G.S., T.S., J-C.G-S. and R.B.R. designed the main study and led writing and figure generation. A.K., G.D.D., P.M., C.L. and R.So. provided key datasets. T.S. and N.B. performed the laboratory experiments. R.Si. helped design the approach. Original idea by R.B.R.

Acknowledgements

This research was supported by the German Research Foundation (DFG) de.NBI, the Wellcome Trust grant 210585/B/18/Z: Impact of missense mutations in recessive Mendelian disease: insight from ciliopathies, the grants TheRaCil (Grant ID 101080717) and PrecisionTox (Grant ID 965406) from the European Commission, the grant PREDICT from the EJP-RD and grant 2018–05882 from the Swedish Research Council (VR). CL is supported by a postdoctoral Beatriu de Pinós grant from Secretaria d’Universitats i Recerca del Departament d’Empresa i Coneixement de la Generalitat de Catalunya and by Marie Sklodowska-Curie COFUND program from H2020 (2018-BP-00055).

Data Availability

Gene expression data are deposited in the Gene Expression Omnibus (GSE232293).

References

Lappalainen, T., Scott, A. J., Brandt, M. & Hall, I. M. Genomic Analysis in the Age of Human Genome Sequencing. Cell 177, 70–84 (2019).
Priestley, P. et al. Pan-cancer whole-genome analyses of metastatic solid tumours. Nature 575, 210–216 (2019).
Lucci-Cordisco, E. et al. Variants of uncertain significance (VUS) in cancer predisposing genes: What are we learning from multigene panels? Eur J Med Genet 65, 104400 (2022).
McLaughlin, H. M. et al. A systematic approach to the reporting of medically relevant findings from whole genome sequencing. BMC Med Genet 15, 134 (2014).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010).
Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).
Betts, M. J. et al. Mechismo: predicting the mechanistic impact of mutations and modifications on molecular interactions. Nucleic acids research 43, e10 (2015).
Mosca, R. et al. dSysMap: exploring the edgetic role of disease mutations. Nature methods 12, 167–8 (2015).
González-Sánchez, J. C., Ibrahim, M. F. R., Leist, I. C., Weise, K. R. & Russell, R. B. Mechnetor: a web server for exploring protein mechanism and the functional context of genetic variants. Nucleic Acids Research 49, W366–W374 (2021).
Burley, S. K. et al. RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Research 47, D464–D474 (2019).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021) doi:10.1038/s41586-021-03819-2.
Johnson, J. L. et al. An atlas of substrate specificities for the human serine/threonine kinome. Nature 613, 759–766 (2023).
Burke, D. F. et al. Towards a structurally resolved human protein interaction network. Nat Struct Mol Biol 30, 216–225 (2023).
UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49, D480–D489 (2021).
Raimondi, F. et al. Genetic variants affecting equivalent protein family positions reflect human diversity. Scientific reports 7, 12771 (2017).
Rodrigues, C. H., Ascher, D. B. & Pires, D. E. Kinact: a computational approach for predicting activating missense mutations in protein kinases. Nucleic Acids Res 46, W127–W132 (2018).
Tate, J. G. et al. COSMIC: The Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Research 47, D941–D947 (2019).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Hatch, N. E., Hudson, M., Seto, M. L., Cunningham, M. L. & Bothwell, M. Intracellular retention, degradation, and signaling of glycosylation-deficient FGFR2 and craniosynostosis syndrome-associated FGFR2C278F. J Biol Chem 281, 27292–27305 (2006).
Vihinen, M. et al. Structural basis for chromosome X-linked agammaglobulinemia: a tyrosine kinase disease. Proc Natl Acad Sci U S A 91, 12803–12807 (1994).
Greenman, C. et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153–158 (2007).
Mercurio, F. et al. IKK-1 and IKK-2: cytokine-activated IkappaB kinases essential for NF-kappaB activation. Science 278, 860–866 (1997).
Li, Y. et al. Lats2, a putative tumor suppressor, inhibits G1/S transition. Oncogene 22, 4398–4405 (2003).
López-Ferrando, V., Gazzo, A., de la Cruz, X., Orozco, M. & Gelpí, J. L. PMut: a web-based tool for the annotation of pathological variants on proteins, 2017 update. Nucleic Acids Res 45, W222–W228 (2017).
Olivieri, C. et al. ATP-competitive inhibitors modulate the substrate binding cooperativity of a kinase by altering its conformational entropy. Sci Adv 8, eabo0696 (2022).
Dahlman, K. B. et al. BRAF(L597) mutations in melanoma are associated with sensitivity to MEK inhibitors. Cancer Discov 2, 791–797 (2012).
Kang, H. et al. Somatic activating mutations in MAP2K1 cause melorheostosis. Nat Commun 9, 1390 (2018).
Batalini, F. et al. Li-Fraumeni syndrome: not a straightforward diagnosis anymore-the interpretation of pathogenic variants of low allele frequency and the differences between germline PVs, mosaicism, and clonal hematopoiesis. Breast Cancer Res 21, 107 (2019).
Alter, S. et al. Telangiectasia-ectodermal dysplasia-brachydactyly-cardiac anomaly syndrome is caused by de novo mutations in protein kinase D1. J Med Genet 58, 415–421 (2021).
Chen, X. et al. The role of EphA7 in different tumors. Clin Transl Oncol 24, 1274–1289 (2022).
Spinelli, E. et al. Pathogenic MAST3 Variants in the STK Domain Are Associated with Epilepsy. Ann Neurol 90, 274–284 (2021).
Schmenger, T., Diwan, G. D., Singh, G., Apic, G. & Russell, R. B. Never-homozygous genetic variants in healthy populations are potential recessive disease candidates. NPJ Genom Med 7, 54 (2022).
Zeqiraj, E., Filippi, B. M., Deak, M., Alessi, D. R. & van Aalten, D. M. F. Structure of the LKB1-STRAD-MO25 complex reveals an allosteric mechanism of kinase activation. Science 326, 1707–1711 (2009).
Rademakers, R. et al. Mutations in the colony stimulating factor 1 receptor (CSF1R) gene cause hereditary diffuse leukoencephalopathy with spheroids. Nat Genet 44, 200–205 (2011).
Mozas, P. et al. Genomic landscape of follicular lymphoma across a wide spectrum of clinical behaviors. Hematol Oncol (2023) doi:10.1002/hon.3132.
Haq, T. et al. Mechanistic basis of Nek7 activation through Nek9 binding and induced dimerization. Nat Commun 6, 8771 (2015).
Mazot, P. et al. The constitutive activity of the ALK mutated at positions F1174 or R1275 impairs receptor trafficking. Oncogene 30, 2017–2025 (2011).
Gu, J. J., Wang, Z., Reeves, R. & Magnuson, N. S. PIM1 phosphorylates and negatively regulates ASK1-mediated apoptosis. Oncogene 28, 4261–4271 (2009).
Schmitz, R., Ceribelli, M., Pittaluga, S., Wright, G. & Staudt, L. M. Oncogenic mechanisms in Burkitt lymphoma. Cold Spring Harb Perspect Med 4, a014282 (2014).
Zhang, Y., Wang, Z., Li, X. & Magnuson, N. S. Pim kinase-dependent inhibition of c-Myc degradation. Oncogene 27, 4809–4819 (2008).
Qian, K. C. et al. Structural basis of constitutive activity and a unique nucleotide binding mode of human Pim-1 kinase. J Biol Chem 280, 6130–6137 (2005).
Luszczak, S. et al. PIM kinase inhibition: co-targeted therapeutic approaches in prostate cancer. Signal Transduct Target Ther 5, 7 (2020).
Maie, K. et al. Progression to polythythemia vera from familial thrombocytosis with germline JAK2 R867Q mutation. Ann Hematol 97, 737–739 (2018).
Blanchet, E. et al. E2F transcription factor-1 regulates oxidative metabolism. Nat Cell Biol 13, 1146–1152 (2011).
Pan, B.-S. et al. MK-2461, a novel multitargeted kinase inhibitor, preferentially inhibits the activated c-Met receptor. Cancer Res 70, 1524–1533 (2010).
Chan, A. Y. et al. A novel human autoimmune syndrome caused by combined hypomorphic and activating mutations in ZAP-70. J Exp Med 213, 155–165 (2016).
Belin, C. et al. Identification of features regulating OST1 kinase activity and OST1 function in guard cells. Plant Physiol 141, 1316–1327 (2006).
Srivastava, A. et al. MKK3 deletion improves mitochondrial quality. Free Radic Biol Med 87, 373–384 (2015).
Srivastava, A., Shinn, A. S., Lee, P. J. & Mannam, P. MKK3 mediates inflammatory response through modulation of mitochondrial function. Free Radic Biol Med 83, 139–148 (2015).
Solis, M. A. et al. Hyaluronan Upregulates Mitochondrial Biogenesis and Reduces Adenoside Triphosphate Production for Efficient Mitochondrial Function in Slow-Proliferating Human Mesenchymal Stem Cells. Stem Cells 34, 2512–2524 (2016).
Reznik, E., Wang, Q., La, K., Schultz, N. & Sander, C. Mitochondrial respiratory gene expression is suppressed in many cancers. Elife 6, e21592 (2017).
Yang, X. et al. High expression of MKK3 is associated with worse clinical outcomes in African American breast cancer patients. J Transl Med 18, 334 (2020).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Bartek, J. & Lukas, J. Chk1 and Chk2 kinases in checkpoint control and cancer. Cancer Cell 3, 421–429 (2003).
Perona, R., Moncho-Amor, V., Machado-Pinilla, R., Belda-Iniesta, C. & Sánchez Pérez, I. Role of CHK2 in cancer development. Clin Transl Oncol 10, 538–542 (2008).
Boonen, R. A. C. M. et al. Functional Analysis Identifies Damaging CHEK2 Missense Variants Associated with Increased Cancer Risk. Cancer Res 82, 615–631 (2022).
Azarova, A. M., Gautam, G. & George, R. E. Emerging importance of ALK in neuroblastoma. Semin Cancer Biol 21, 267–275 (2011).
Kikani, C. K. et al. Structural bases of PAS domain-regulated kinase (PASK) activation in the absence of activation loop phosphorylation. J Biol Chem 285, 41034–41043 (2010).
Calpena, E. et al. De Novo Missense Substitutions in the Gene Encoding CDK8, a Regulator of the Mediator Complex, Cause a Syndromic Developmental Disorder. Am J Hum Genet 104, 709–720 (2019).
Liu, Y. et al. A recurrent CHEK2 p.H371Y mutation is associated with breast cancer risk in Chinese women. Hum Mutat 32, 1000–1003 (2011).
Davies, H. et al. Mutations of the BRAF gene in human cancer. Nature 417, 949–954 (2002).
Hsieh, C.-C. et al. CHK2 activation contributes to the development of oxaliplatin resistance in colorectal cancer. Br J Cancer 127, 1615–1628 (2022).
Subramanian, C. & Cohen, M. S. Over expression of DNA damage and cell cycle dependent proteins are associated with poor survival in patients with adrenocortical carcinoma. Surgery 165, 202–210 (2019).
Hu, D. et al. Mutation profiles in circulating cell-free DNA predict acquired resistance to olaparib in high-grade serous ovarian carcinoma. Cancer Sci 113, 2849–2861 (2022).
Ahn, J.-Y., Li, X., Davis, H. L. & Canman, C. E. Phosphorylation of threonine 68 promotes oligomerization and autophosphorylation of the Chk2 protein kinase via the forkhead-associated domain. J Biol Chem 277, 19389–19395 (2002).
Higashiguchi, M. et al. Clarifying the biological significance of the CHK2 K373E somatic mutation discovered in The Cancer Genome Atlas database. FEBS Lett 590, 4275–4286 (2016).
Cai, Z., Chehab, N. H. & Pavletich, N. P. Structure and activation mechanism of the CHK2 DNA damage checkpoint kinase. Mol Cell 35, 818–829 (2009).
Oliver, A. W. et al. Trans-activation of the DNA-damage signalling protein kinase Chk2 by T-loop exchange. EMBO J 25, 3179–3190 (2006).
Carles, F., Bourg, S., Meyer, C. & Bonnet, P. PKIDB: A Curated, Annotated and Updated Database of Protein Kinase Inhibitors in Clinical Trials. Molecules 23, 908 (2018).
Kwong, A. J. & Scheidt, K. A. Non-’classical’ MEKs: A review of MEK3-7 inhibitors. Bioorg Med Chem Lett 30, 127203 (2020).
Eddy, S. R. A new generation of homology search tools based on probabilistic inference. Genome informatics. International Conference on Genome Informatics 23, 205–11 (2009).
Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412–D419 (2021).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25, 3389–402 (1997).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 32, 1792–7 (2004).
Erdős, G., Pajkos, M. & Dosztányi, Z. IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Research (2021) doi:10.1093/nar/gkab408.
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
Salzberg, S. L. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery 1, 317–328 (1997).
Flask Web Development, 2nd Edition [Book]. https://www.oreilly.com/library/view/flask-web-development/9781491991725/.
Zannini, L., Delia, D. & Buscemi, G. CHK2 kinase in the DNA damage response and beyond. J Mol Cell Biol 6, 442–457 (2014).
Liu, J. et al. Anti-oxidative and anti-apoptosis effects of egg white peptide, Trp-Asn-Trp-Ala-Asp, against H2O2-induced oxidative stress in human embryonic kidney 293 cells. Food Funct 5, 3179–3188 (2014).
Klaus, B. & Reisenauer, S. An end to end workflow for differential gene expression using Affymetrix microarrays. F1000Res 5, 1384 (2016).
Kauffmann, A., Gentleman, R. & Huber, W. arrayQualityMetrics–a bioconductor package for quality assessment of microarray data. Bioinformatics (Oxford, England) 25, 415–6 (2009).
Raudvere, U. et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res 47, W191–W198 (2019).


Supplementary Files

TableS1data.xlsx
Table S1 Variants, kinases, alignment & associated data A) Variants in the training set, from human kinases that are known to be activating, deactivating, neutral and resistance-causing in UniProt, COSMIC, PubMed and gnomAD databases. B) Variants in the test set from PubMed and gnomAD. C) Variant counts in the kinase domain of different data sources. D) All human kinases retrieved from UniProt. Note: Only Serine/Threonine and Tyrosine kinases were used in the development of the predictor. E) Residues at positions within the alignment constructed using the human kinases (see Methods). F) Emission scores of residues in the profile hidden Markov Model built using the alignment of the human kinases (see Methods). G) Known post-translational modification (PTM) sites in human kinases from PhosphositePlus. H) Known ATP binding sites in human kinases from UniProt.
TableS2ML.xlsx
Table S2 Machine learning results A) Description of features used to develop the predictors. B) Machine-learning results during the training and testing phase. C) Machine-learning results during the training and testing phase after randomization of the labels (see Methods). D) Predicted probabilities of the predictors, PolyPhen2 and PMUT on the 212 known activating, deactivating or resistance variants that were excluded from training. E) Comparison of machine-learning methods.
TableS3Activarkoutput.xlsx
Table S3 Prediction results of predictors on known functional variants in COSMIC and UniProt datasets and variants with no/low homozygous counts in gnomAD Results of the predictors on A) Somatic variants in the COSMIC dataset B) Hereditary variants associated with a disease in the UniProt dataset C) Variants with no/low homozygous counts in gnomAD
TableS4ExperimentalData.xlsx
Table S4 Experimental raw data and results A) PIM1 α-MAPK14/p38 Thr180 ELISA raw results. Results were measured on a Tecan Spark. B) Normalised PIM1 results. Values were normalised to -tetracycline controls, corrected with crystal violet measurement to adjust for different cell densities and outliers were determined if their absolute Z-score was ≥ 3. C) Raw microarray data for MAP2K3 wild type and Ala84Thr overexpression following the GEO format. D) Top Table differential gene expression results generated with the R library limma. The applied contrast is of cells overexpressing (tetracycline induction) MAP2K3 Ala84Thr vs. MAP2K3 wild type. E) Gene enrichment results generated with the gProfiler web server. GO:CC - gene ontology by cellular compartment. Enriched GO terms are highlighted in green, and depleted GO terms in purple. F) CHEK2 raw cell counts were determined manually with Neubauer counting chambers. G) Normalised CHEK2 cell counts. Counts of + tet cells were compared to their - tet controls and adjusted to lacZ + tet. H) Mitotracker grey value means. I) Experimental summary.
Supplementalinfo.pdf


Reviews received at journal
15 Nov, 2024
Reviewers agreed at journal
18 Oct, 2024
Reviews received at journal
27 Sep, 2024
Reviewers agreed at journal
13 Sep, 2024
Reviewers invited by journal
12 Sep, 2024
Editor assigned by journal
03 Sep, 2024
Submission checks completed at journal
30 Aug, 2024
First submitted to journal
30 Aug, 2024
