Data Collection
In order to correlate protein properties with the propensity to exhibit ACs, it was necessary to gather accurate ligand binding data for the collected protein kinases. Duplicate molecules and molecules with inaccurately reported structural information (e.g., undefined stereochemistry) were removed. Additionally, we limited the collected inhibitors to molecules whose bioactivities are expressed as Ki values, to minimize the effects of inter-laboratory assay variation commonly seen with other bioactivity indicators (e.g., IC50) [47]. Moreover, ACs were strictly defined as MMPs with a bioactivity difference exceeding 100-fold, provided that the potent AC member has a Ki value below 100 nM. On the protein side, only wild-type protein kinases were considered, and if a certain protein is represented by several entries in the Protein Data Bank, we opted for the one combining the best possible resolution with the longest complete (uncut) peptide chain.
However, since the presence or absence of ACs can be a function of the explored chemical space of a particular protein kinase, it can be argued that the absence of ACs for a certain protein kinase reflects limited medicinal chemistry effort against this target rather than intrinsic factors of the protein kinase itself. We adopted two measures to remedy this issue. Firstly, we only collected protein kinases with a substantial number of reported ligands (Table 1). To propose a reasonable minimal number of ligands for a particular kinase to be included in the modeled list (i.e., Table 1), we looked at protein kinases that combine the highest counts of reported inhibitors with the lowest numbers of ACs. Two kinases met these criteria, namely, LCK (1818 inhibitors and 3 ACs) and FYN (1643 inhibitors and 6 ACs), with AC-to-ligand ratios of 1 in 606 and 1 in 274, respectively, averaging 1 in ca. 440. Accordingly, it can be reasonably assumed that if the binding space of a particular protein kinase has been explored by ≥ 440 ligands without identifying any ACs, then the inhibitor space of this target is likely free of ACs. Based on this assumption, we collected protein kinases with at least 440 reported inhibitors. Still, we noticed five protein kinases with inhibitor counts ranging from 393 to 437 (i.e., FES, BRK, ACK1, PYK2, and AXL; Table 1), two of which (BRK and ACK1) have established ACs and thus can be safely included for ML. The other three are devoid of reported ACs, yet we included them for ML because their inhibitor counts are close to the proposed threshold of 440: PYK2 (428 inhibitors), AXL (437 inhibitors), and FES (393 inhibitors). Additionally, these cases are limited in number (just three) and therefore should contribute little error to ML, if any.
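The threshold arithmetic can be reproduced in a few lines; this sketch uses only the inhibitor and AC counts for LCK and FYN quoted above.

```python
# Ligands-per-AC ratios for the two most AC-resistant, ligand-rich kinases,
# using the counts reported in the text.
counts = {"LCK": (1818, 3), "FYN": (1643, 6)}   # kinase: (inhibitors, ACs)

ratios = {k: inhibitors / acs for k, (inhibitors, acs) in counts.items()}
threshold = sum(ratios.values()) / len(ratios)  # average ligands-per-AC ratio
print(round(ratios["LCK"]), round(ratios["FYN"]), round(threshold))
# prints: 606 274 440
```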
The second measure we took to minimize errors related to limited medicinal chemistry efforts was to collapse the AC counts of each protein kinase into a binary response of either “showing ACs” (binary code of one) or “devoid of ACs” (binary code of zero). This emphasizes that the presence of even a single valid AC indicates the protein is vulnerable to other ACs, and that the counts of reported ACs are irrelevant (in fact, all our attempts to correlate AC counts with protein properties were futile). Although the current absence of ACs for a certain target does not guarantee that none will emerge upon future medicinal chemistry exploration, targets labeled “devoid of ACs” with ligand counts exceeding 440 can reasonably be considered resistant to the AC phenomenon. Accordingly, the binary labels “showing ACs” and “devoid of ACs” are really surrogates for “AC-vulnerable” and “AC-resistant” targets. This conclusion is underscored in Figure 1.
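The collapse into a binary response is a one-line transform; this minimal sketch applies it to the four kinases whose AC counts are mentioned in the text.

```python
# AC counts for four kinases mentioned in the text (PYK2 and AXL have no
# reported ACs); any positive count maps to the "showing ACs" label.
ac_counts = {"LCK": 3, "FYN": 6, "PYK2": 0, "AXL": 0}

# 1 = "showing ACs" (AC-vulnerable), 0 = "devoid of ACs" (AC-resistant)
labels = {kinase: int(n_acs > 0) for kinase, n_acs in ac_counts.items()}
```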
Figure 1 shows the counts of “devoid of ACs” and “showing ACs” protein kinases as a function of the counts of their reported inhibitors in the ChEMBL database. Clearly, the “devoid of ACs” category outnumbers the “showing ACs” category in the first two intervals (390-700 inhibitors). However, protein kinases with higher ligand counts incline towards the “showing ACs” class, despite a significant “devoid of ACs” minority. In the highest ligand-count interval (1301-2514), all 8 collected protein kinases showed ACs among their ligands. Nonetheless, even in this interval, two protein kinases were rather resistant to ACs, i.e., LCK (3 ACs out of 1818 reported inhibitors) and FYN (6 ACs out of 1643 inhibitors). Figure 1 thus supports our proposition that the presence or absence of ACs among the binders of a certain protein kinase points to the level of resistance or vulnerability of that target to ACs. However, since it is hard to set a numerical threshold on the number of ACs required to consider a particular protein kinase AC-resistant or AC-vulnerable, our use of the “devoid of ACs” and “showing ACs” labels is a reasonable surrogate for AC resistance/vulnerability.
Machine Learning
To evaluate the relevance of protein descriptors to the propensity of exhibiting ACs, we initially t-tested the differences between protein kinases “showing ACs” and those “devoid of ACs” with respect to the calculated protein descriptors. Four descriptors showed a significant difference (p < 0.05) between the two groups, namely: the normalized count of hydrogen bonds (N/H-bonds), normalized count of basic residues (N/Basic amino acids), normalized count of hydrophobic residues (N/Hydrophobic amino acids), and normalized count of rotatable bonds (N/Rotatable bonds). Although the number of significantly different descriptors is rather limited (only four), this does not preclude significant differences arising from interactions between descriptors, warranting subsequent evaluation by ML.
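A sketch of this per-descriptor screen, assuming a two-sample t-test as implemented in SciPy; the descriptor values below are synthetic stand-ins, not the study's data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-in values for one calculated descriptor (e.g., N/H-bonds)
# in the two kinase groups; real input would be the per-kinase descriptors.
rng = np.random.default_rng(0)
showing_acs = rng.normal(0.32, 0.03, size=40)    # "showing ACs" group
devoid_of_acs = rng.normal(0.24, 0.03, size=25)  # "devoid of ACs" group

t_stat, p_value = ttest_ind(showing_acs, devoid_of_acs)
keep = p_value < 0.05   # descriptors passing this screen are retained
```

In practice the test is repeated for every calculated descriptor, keeping those with p < 0.05.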
ML studies commenced by splitting the collected kinases into training and testing sets (in Table 1, testing kinases are marked with asterisks). However, due to the limited number of protein kinases, we opted to augment the training and testing lists, separately, using the Synthetic Minority Oversampling Technique (SMOTE) [26]. Notably, SMOTE was used to augment both classes, “showing ACs” and “devoid of ACs” (i.e., not only the minority class), to yield 272 training observations, of which 104 are labeled “devoid of ACs” and 168 “showing ACs”. The SMOTE-enhanced testing set included 64 observations, of which 24 are labeled “devoid of ACs” and 40 “showing ACs”.
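The core SMOTE idea, interpolating between a real observation and one of its nearest neighbours, can be sketched in plain NumPy; this is a simplified stand-in for the published algorithm of ref. [26], and the class size (30 real observations) is illustrative.

```python
import numpy as np

def smote_like(X, n_new, k=5, seed=0):
    """Simplified SMOTE-style augmentation: each synthetic observation lies
    on the segment between a real observation and one of its k nearest
    neighbours (a sketch of the idea, not the exact published algorithm)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dist = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]       # exclude the point itself
        j = rng.choice(neighbours)
        synthetic.append(X[i] + rng.random() * (X[j] - X[i]))
    return np.vstack(synthetic)

# Augment each class separately, as described above; descriptor values and
# the 30/74 split are illustrative, only the total of 104 matches the text.
X_devoid = np.random.default_rng(1).random((30, 4))
X_devoid_aug = np.vstack([X_devoid, smote_like(X_devoid, n_new=74)])
# 30 real + 74 synthetic = 104 "devoid of ACs" training observations
```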
Subsequently, we scanned ten ML algorithms (see Experimental section) using “showing ACs” or “devoid of ACs” as response labels and all calculated protein descriptors as independent variables. Seven MLs yielded accuracies exceeding 70% (based on leave-20%-out cross-validation), warranting subsequent combination with GA to identify the most predictive descriptors. Two MLs maintained significant accuracies and Cohen’s Kappa values despite GA-driven feature reduction, namely: K-star (K*) and XGBoost (XGB). Table 2 summarizes the statistical criteria of the best models.
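The learner scan can be sketched with scikit-learn; the data are synthetic stand-ins, and since K* has no common scikit-learn analogue, kNN and gradient boosting stand in for the instance-based and tree-based learners respectively.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative inputs; real X holds the protein descriptors, real y the
# binary AC labels (1 = "showing ACs").
rng = np.random.default_rng(0)
X = rng.random((100, 6))
y = (X[:, 0] + 0.1 * rng.standard_normal(100) > 0.5).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # leave-20%-out
learners = {"kNN": KNeighborsClassifier(),
            "LogReg": LogisticRegression(max_iter=1000),
            "RF": RandomForestClassifier(random_state=0),
            "GBT": GradientBoostingClassifier(random_state=0)}
accuracies = {name: cross_val_score(clf, X, y, cv=cv).mean()
              for name, clf in learners.items()}
shortlist = [name for name, acc in accuracies.items() if acc > 0.70]
```

Learners on the shortlist would then be combined with GA-driven feature selection.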
The K-star algorithm (K*) is an instance-based classifier that uses an entropy-based distance function [48]. Classification with K* is made by summing the probabilities from the new instance to all members of a category [27]. On the other hand, eXtreme Gradient Boosting (XGBoost, or XGB) is a standardized tree-based ensemble method that relies on an ensemble of weak decision tree (DT)-type models to create subsequent boosted DT-type models with a reduced loss function. The XGB system includes a tree learning algorithm and a theoretically justified weighted quantile sketch procedure with parallel and distributed computing [49].
The accuracies and Cohen’s Kappa values in Table 2 suggest that it is possible to predict the AC propensity of a certain protein kinase by applying a few protein descriptors within certain ML model(s). The Cohen’s Kappa values of the SMOTE-enhanced observations in Table 2 position the corresponding ML models within the moderate-to-substantial interrater reliability range.
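Cohen's Kappa corrects raw accuracy for the agreement expected by chance; a small sketch with scikit-learn and illustrative labels (not the study's predictions) shows the idea.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative true vs. predicted class labels (1 = "showing ACs").
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

kappa = cohen_kappa_score(y_true, y_pred)   # 8/10 raw agreement -> kappa ≈ 0.58
# Common reading: 0.41-0.60 "moderate", 0.61-0.80 "substantial" agreement
```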
However, to exclude the possibility of chance correlation, we reevaluated the models using the original unaugmented training and testing sets and tested their validity using Y-scrambling [50]. The results are summarized in Table 2. Interestingly, both ML models maintained successful statistical criteria even with the original unaugmented data.
In Y-scrambling, 100 randomized bioactivity datasets were generated from either the SMOTE-enhanced training set or the original unaugmented training set. Subsequently, each successful learner with its corresponding features was challenged to use these randomized data to generate ML models of equal or better accuracy compared to the original non-randomized data, as judged by leave-20%-out cross-validation [50]. Table 2 summarizes the results of Y-scrambling, while supporting Tables S1 to S4 show the results in detail. For both successful MLs, the non-randomized data yielded models of significantly superior accuracy and Cohen’s Kappa values compared to all randomized trials. The effect is particularly evident in the Cohen’s Kappa values. Overall, the results support the validity of our ML models.
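The Y-scrambling procedure can be sketched as refitting the learner on randomly permuted labels and checking that the real labels always win; data here are synthetic stand-ins, and far fewer than the study's 100 trials are run for speed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Illustrative inputs with a learnable "real" response.
rng = np.random.default_rng(0)
X = rng.random((80, 5))
y = (X[:, 0] > 0.5).astype(int)

clf = GradientBoostingClassifier(random_state=0)
real_acc = cross_val_score(clf, X, y, cv=5).mean()

# Scrambling trials: same learner and features, randomly permuted labels.
scrambled_accs = [cross_val_score(clf, X, rng.permutation(y), cv=5).mean()
                  for _ in range(20)]
# A non-chance model should beat every scrambled trial:
# real_acc > max(scrambled_accs)
```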
Table 2 points to interesting facts: (i) Most of the significant descriptors, in both models, are normalized, suggesting that protein size is not significant with regard to ACs. (ii) The two ML models emphasize the significance of two classes of amino acid residues, namely, acidic and hydrophobic amino acids. Acidic amino acids are represented by N/Asp, N/Glu, and N/Acidic amino acids, while hydrophobic groups are represented by N/Hydrophobic amino acids and N/Val. (iii) The K* model uniquely emphasizes the significance of N/Rotatable bonds, while the XGB model underlines the influence of the hydrogen bond count on the propensity of having ACs.
The emergence of N/Hydrophobic amino acids, N/Rotatable bonds, and N/H-bonds agrees with the t-test results mentioned earlier. Strangely though, N/Basic amino acids failed to emerge in any of the top models despite differing significantly between the two protein kinase categories (showing and devoid of ACs) according to the t-test. It appears that interactions among the descriptors in the optimal ML models in Table 2 overshadowed the effect of N/Basic amino acids.
Significance of Individual Features
Although the ML models in Table 2 have reasonable accuracies and Cohen’s Kappa values, it is hard to infer the contribution of each protein feature to predicting the class label of a particular testing protein kinase. For example, it is hard to tell how N/Rotatable bonds contributes to the “showing ACs” class within the context of the GA-K* model (Table 2). Therefore, we implemented SHapley Additive exPlanations (SHAP) values to explain the relative contributions of individual descriptors in predicting class labels [23].
SHAP values originated in game theory; within the context of machine learning, they are useful for explaining model predictions. The SHAP approach enables prioritization of the features that determine the predicted classification of a certain observation using any ML model. The SHAP value of a certain feature for a certain observation indicates how much this feature contributed (percentage-wise) to the deviation of the prediction from the base prediction (i.e., the mean prediction over the full sampling data) for this observation. The SHAP values of all features should add up to the difference between the actual prediction and the base (mean) prediction for a certain testing observation.
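The additivity property can be verified exactly in a toy setting: for a linear model f(x) = w·x + b with independent features, the SHAP value of feature i reduces to the closed form w_i(x_i − E[x_i]). The weights and data below are illustrative, not from the study's models.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 3))                   # background ("full sampling") data
w, b = np.array([1.5, -2.0, 0.5]), 0.1

def f(X):
    """Toy linear model f(x) = w.x + b (illustrative weights)."""
    return X @ w + b

x = X[7]                                  # one testing observation
shap_values = w * (x - X.mean(axis=0))    # closed-form SHAP for a linear model
base = f(X).mean()                        # base (mean) prediction

# Additivity: the SHAP values sum to (actual prediction - base prediction).
gap = f(x) - base
```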
Figure 2 shows the average SHAP values for the GA-selected descriptors in the optimal K* and XGB models. Each average SHAP value was calculated as the mean of the SHAP values of the particular descriptor across the “showing ACs” or “devoid of ACs” SMOTE-enhanced testing observations. Clearly, the selected descriptors exhibited SHAP probability contributions consistent with the class categories of the testing observations, i.e., they yielded positive probabilities towards the “showing ACs” label for protein kinases that have ACs, and positive probabilities towards the “devoid of ACs” label for protein kinases that do not.
Figure 2 indicates that the most significant contributor to the probability predictions of the GA-K* model is the normalized number of rotatable bonds (N/Rotatable bonds). On the other hand, the most significant contributor in the GA-XGB model is the normalized number of aspartic acid residues (N/Asp), while the normalized numbers of hydrogen bonds and hydrophobic groups (N/H-bonds and N/Hydrophobic groups, respectively) come second.
Another interesting conclusion from Figure 2 is the orthogonality of the two models, as deduced from the differing probability contributions of the descriptors between the two models. This suggests the possibility of stacking the two ML models in a meta-learning scheme (e.g., consensus voting) [51].