Generation of CatPred-DB, benchmark datasets of in vitro enzyme kinetic parameters
CatPred-DB consists of comprehensive benchmark datasets for training ML models, one each for in vitro kcat, Km and Ki measurements. We used data from BRENDA release 2022_2 and from SABIO-RK as of November 2023. We first parsed the databases to identify entries containing the essential information: at least one kinetic parameter value (kcat, Km, or Ki), the enzyme type (EC number), the organism of origin, and the names of reactants and products. To ensure the accuracy of organism names, we retained entries only if the organism is listed in the NCBI Taxonomy database42. We then mapped each entry to the enzyme’s amino acid sequence identifier using the UniProt database (Methods for details), excluding entries for which any of these annotations are missing or incomplete. Finally, each substrate name was used to obtain a canonical SMILES string corresponding to the 2D atom connectivity. When multiple measurements of a parameter exist for the same enzyme-sequence and substrate-SMILES pair, a single value is retained: the maximum for kcat and the geometric mean for Km and Ki. The maximum is chosen for kcat because it likely maps to the optimal growth conditions (i.e., temperature, pH, etc.). In contrast, Km and Ki values are more directly associated with enzyme-substrate/inhibitor affinities than with the experimental conditions. Taking the geometric mean is equivalent to arithmetically averaging the logarithmically transformed values used during training. Retaining a unique value for each parameter safeguards against the ML method attempting to learn significantly different outputs for the same inputs, which can cause instabilities during training.
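The deduplication rule described above can be sketched as follows. The record layout (`(sequence, smiles, value)` tuples) is a simplification of the actual CatPred-DB pipeline, used here only for illustration:

```python
import math
from collections import defaultdict

def aggregate_measurements(entries, parameter):
    """Collapse repeated measurements for each (enzyme sequence, substrate
    SMILES) pair into a single value: the maximum for kcat, the geometric
    mean for Km and Ki."""
    grouped = defaultdict(list)
    for seq, smiles, value in entries:
        grouped[(seq, smiles)].append(value)
    aggregated = {}
    for pair, values in grouped.items():
        if parameter == "kcat":
            aggregated[pair] = max(values)
        else:
            # Geometric mean = exp of the arithmetic mean of log-values,
            # consistent with averaging the log-transformed training targets.
            aggregated[pair] = math.exp(sum(math.log(v) for v in values) / len(values))
    return aggregated
```

For example, two Km measurements of 1.0 and 100.0 for the same pair collapse to their geometric mean of 10.0, whereas the same two values reported as kcat collapse to the maximum, 100.0.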
CatPred-DB contains 23,197 kcat, 41,174 Km and 11,929 Ki measurements spanning thousands of unique enzymes, organisms, and substrates (Table 1). Each entry is also mapped to a predicted 3D structure of the corresponding enzyme using the AlphaFold-2.0 database11; when a structure is absent from the AlphaFold database, we used ESMFold7 to carry out structure prediction. The coverage statistics of CatPred-DB contrasted with other efforts28–30 are summarized in Table 1. Notably, CatPred-DB significantly expands the enzyme sequence space (up to 60% new sequences) in comparison to the existing ML datasets for kcat and Km. The new sequences span widely across enzyme classes with no bias toward specific EC classes (Fig. 2b). Moreover, kcat and Km entries in CatPred-DB have broader coverage than existing ML datasets across all enzyme families at EC level 1 (Fig. 2c). We therefore envision that the enhanced sequence and EC classification coverage will make CatPred-DB a useful community resource for the systematic development and benchmarking of ML models for enzyme kinetic parameter prediction.
Table 1
Coverage statistics of CatPred-DB vs. other datasets of in vitro enzyme kinetic parameter measurements.
| Dataset | CatPred-DB: kcat | CatPred-DB: Km | CatPred-DB: Ki | Existing: kcat (Li et al.28) | Existing: Km (Kroll et al.29) |
|---|---|---|---|---|---|
| Entries | 23,197 | 41,174 | 11,929 | 17,010 | 11,722 |
| Unique organisms | 1,685 | 2,419 | 652 | 849 | N/A |
| Unique Enzyme Classes (EC) | 2,657 | 3,550 | 1,306 | 1,692 | 3,690# |
| Unique enzyme sequences | 7,183 | 12,355 | 2,829 | 3,219 | 6,990 |
| Unique substrates | 12,290 | 10,535 | 7,146 | 2,696 | 1,566 |

# Predicted Enzyme Classification (EC) numbers using CLEAN
Overview of CatPred training framework
CatPred takes as inputs the enzyme sequence/3D structure along with the SMILES strings of the corresponding substrates (reactants) and outputs machine-learned in vitro kinetic parameters. We used a concatenated SMILES string of all reactant molecules for kcat prediction; for Km or Ki prediction, the SMILES string of the relevant substrate is used. During training, the two sets of inputs are first transformed into their respective feature spaces through separate feature learning modules (Fig. 3a). For enzyme feature learning, CatPred makes use of three approaches that successively add descriptive detail: (1) sequence attention (Seq-Attn), (2) protein language model (pLM) features, and (3) 3D-structure features (Fig. 3c). This is done to properly delineate the contribution of progressively more sophisticated encodings to improved prediction. For substrate feature learning, CatPred utilizes the extensively benchmarked Directed Message Passing Neural Networks41 (D-MPNN). D-MPNNs transform SMILES strings into 2D graphs of atoms with bond connectivity and learn their aggregated representations using graph convolution operations41 (Fig. 3b). For the derivation of Seq-Attn features, the amino-acid sequences of enzymes are encoded into numerical representations using rotary positional embeddings43, akin to the encoding layer used for training the ESM-2 pLM7. The encoded representations are then transformed using self-attention layers44 to capture dependencies and relationships across the length of enzyme sequences (Fig. 3a). The pLM features are extracted using the ESM-27 (Evolutionary Scale Modeling) model pretrained on the UniRef50 dataset. The 3D structural features are extracted using Equivariant Graph Neural Networks (E-GNN40) that operate on amino acid residue graphs. We integrated the E-GNN from Greener et al.45, which was pre-trained using supervised contrastive learning to embed protein structures into a low-dimensional latent space (Fig. 3a). The pre-trained E-GNN’s latent space clusters the embeddings of similar protein structures together while separating dissimilar ones45. We reasoned that using these E-GNN-derived embeddings as features within CatPred can complement the sequence-attention and pLM features. Enzyme features learnt through these modules (Seq-Attn, pLM, E-GNN) are concatenated with the substrate features from the D-MPNN and used to predict the respective targets (log10-transformed kinetic parameters). CatPred uses a probabilistic regression approach46 and therefore provides kinetic parameter predictions as distributions characterized by both a mean and a standard deviation, rather than as single values. Specifically, the concatenated enzyme and substrate features are fed into a fully connected neural network which outputs a mean and a variance for each input (Fig. 3c). The network is trained using a negative log likelihood (NLL) loss function with respect to the CatPred-DB target values.
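Under standard Gaussian assumptions, the NLL objective for such a mean-variance regression head can be written as below. This is a NumPy sketch of the loss formula only; the actual CatPred implementation presumably operates on network tensors within its training loop:

```python
import numpy as np

def gaussian_nll(y_true, mu, var, eps=1e-6):
    """Mean negative log-likelihood of targets under per-sample Gaussian
    predictions N(mu, var). Minimizing this jointly fits the predicted
    mean and variance of each output distribution."""
    var = np.maximum(var, eps)  # guard against collapse to zero variance
    return np.mean(0.5 * (np.log(2 * np.pi * var) + (y_true - mu) ** 2 / var))
```

When the predicted mean matches the target exactly and the predicted variance is 1, the loss reduces to the constant 0.5·log(2π); mismatched means or badly calibrated variances increase it.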
For each dataset in CatPred-DB, the CatPred framework is used to train ML models that minimize the negative log-likelihood loss46 (Methods for details) of the predicted distributions with respect to the corresponding target values. Each CatPred-DB dataset is randomly split into 80-10-10 proportions for training, validation and testing, respectively. Because CatPred uses both enzyme sequences/structures and substrate SMILES as inputs, the splitting is carried out so that no enzyme-substrate pair is repeated across partitions. Adjustable hyperparameters in the framework are either fixed to default values or optimized by evaluating trained CatPred models on the validation sets (Methods). The optimized hyperparameters are used to train the final models CatPred-kcat, CatPred-Km and CatPred-Ki using the training and validation sets, which are then evaluated on the test sets (see below). Production models trained on the full datasets are made available through a Google Colab interface that can be used without requiring any local installation or specialized hardware (Fig. 3d).
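A minimal sketch of such a pair-level 80-10-10 split is shown below. The record layout (dicts with hypothetical `sequence` and `smiles` keys) is an assumption for illustration, not the actual CatPred data format:

```python
import random

def split_by_pair(records, seed=0, fracs=(0.8, 0.1, 0.1)):
    """Split records 80-10-10 at the level of unique (enzyme, substrate)
    pairs, so that no pair appears in more than one partition."""
    pairs = sorted({(r["sequence"], r["smiles"]) for r in records})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(fracs[0] * n), int(fracs[1] * n)
    assignment = {}
    for i, p in enumerate(pairs):
        assignment[p] = ("train" if i < n_train
                         else "valid" if i < n_train + n_val
                         else "test")
    splits = {"train": [], "valid": [], "test": []}
    for r in records:
        # All measurements for a given pair land in the same partition.
        splits[assignment[(r["sequence"], r["smiles"])]].append(r)
    return splits
```

Shuffling unique pairs (rather than raw records) before partitioning is what guarantees that duplicated pairs never straddle the train/validation/test boundary.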
Evaluation of trained CatPred models
Trained CatPred models were evaluated on two test sets: (1) a “held-out” test set and (2) an “out-of-distribution” test set. The evaluation criterion is the coefficient of determination (R2), which quantifies the fraction of the variance in the regression target that is captured by the predicted values. For each kinetic parameter, the held-out test set is a randomly selected subset comprising 10% of the complete CatPred-DB dataset. By definition, the held-out test sets do not contain any enzyme-substrate pairs used for training the models. The out-of-distribution test sets are further subsets of the held-out test sets (approximately 12 to 15% thereof) in which not only the specific enzyme-substrate pairs but also all (nearly) identical enzyme sequences are excluded from the training set (Fig. 4a). By construction, any enzyme sequence in the out-of-distribution set is at most 99% identical (Methods) to any sequence in the training set. Prediction metrics on the held-out test sets therefore reflect fidelity for unseen enzyme-substrate pairs, while the out-of-distribution test sets pose a more stringent challenge by assessing performance on unseen enzymes (excluding even enzymes within 99% sequence identity).
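The out-of-distribution filter can be sketched as below. Note the heavy caveat: `difflib.SequenceMatcher.ratio()` is a crude stand-in for a proper alignment-based sequence identity (the paper's Methods use a dedicated identity computation), and the record layout is a hypothetical simplification:

```python
from difflib import SequenceMatcher

def max_identity(query, train_seqs):
    """Approximate maximum similarity of `query` against all training
    sequences. SequenceMatcher ratio is only a rough proxy for
    alignment-based sequence identity."""
    return max(SequenceMatcher(None, query, s).ratio() for s in train_seqs)

def out_of_distribution(test_records, train_seqs, cutoff=0.99):
    """Keep only test entries whose enzyme sequence falls below the
    identity cutoff against every training sequence."""
    return [r for r in test_records
            if max_identity(r["sequence"], train_seqs) < cutoff]
```

Lowering `cutoff` (e.g. to 0.40) produces the progressively stricter similarity-stratified test sets used later in the analysis.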
We find that CatPred models using substrate features together with both Seq-Attn and pLM features have the best performance across all three enzymatic parameters (Fig. 4b). Notably, using only the substrate features already yields reasonable performance for Km and Ki prediction (R2 of 0.465 and 0.525), on par with previous studies29. While the inclusion of Seq-Attn features alone only slightly improves prediction performance, the combined addition of Seq-Attn and pLM features leads to the best in-class performance for kcat, Km and Ki prediction, with R2 values of 0.607, 0.648 and 0.637, respectively (Fig. 4b). These metrics are at least as good as, or better than, all existing ML models for predicting kcat27,28,30 and Km29,30. It is worth noting that CatPred models that add 3D-structural features extracted from the E-GNN on top of the Seq-Attn and pLM features do not improve prediction performance: the achieved R2 values were 0.607, 0.648 and 0.639 on the held-out test sets for kcat, Km and Ki, respectively (Fig. 4b).
Importantly, CatPred models retained strong prediction performance even on the out-of-distribution test sets for Km (R2 = 0.536) and were somewhat less accurate for kcat and Ki (R2 = 0.390 and 0.409, respectively) (Fig. 4b). We observe that while adding Seq-Attn features improves performance for kcat and Km predictions, the improvements are less pronounced on out-of-distribution sets. This suggests that although the self-attention layers in Seq-Attn can successfully encode enzyme sequences by extracting local and global patterns, they cannot account for the higher-order relationships across sequences that are necessary for generalization to unseen protein sequences. The ESM-2 pLM can capture such features and has already proven capable of encoding evolutionarily rich semantics of protein sequences7,47, explaining its good performance on out-of-distribution samples.
We found that adding Seq-Attn + pLM features reduces the R2 value for Ki prediction on out-of-distribution test sets compared to adding only Seq-Attn features. This seemingly surprising finding is likely due to overfitting of the high-dimensional pLM features on the relatively small Ki dataset (approximately four-fold smaller than the Km dataset, see Table 1), and calls for an expansion of the Ki dataset in the future. It is worth noting that CatPred (R2 = 0.39) performs comparably with TurNuP (R2 = 0.40) on out-of-distribution samples for kcat prediction. To the best of our knowledge, CatPred is the only available model for Km and Ki prediction that has been evaluated on out-of-distribution samples.
Recently, Kroll et al.27 reported that the DLKcat model for kcat prediction showed diminishing performance as a function of the similarity of test enzyme sequences to those of the training set, indicating that the DLKcat model might have “memorized” the training dataset instead of “learning” meaningful patterns. They showed that the DLKcat model exhibited poor predictive performance (R2 = -0.61) on sequences that are significantly dissimilar to those in the training set. Motivated by the need to avoid such behavior, we systematically assessed the reduction in prediction performance of CatPred models as the test sets become increasingly dissimilar to the training set. This analysis revealed that CatPred models for Km prediction maintain robust performance, with an R2 value of 0.48 even on out-of-distribution test sets with sequence similarities below 40%, when pLM features are enabled (Fig. 5b). CatPred predictions for kcat remain reasonable (i.e., R2 = 0.33) even down to a sequence identity cutoff of 40% (Fig. 5a), with the contribution of the pLM encodings being even more pronounced. This suggests that the CatPred models for kcat and Km (with pLM features) have learnt generalizable enzyme attributes that go beyond sequence similarities. In contrast, for CatPred-Ki the benefit of pLM features is not realized, presumably due to overfitting caused by the relatively small training set size. However, using only Substrate and Seq-Attn features, good predictive performance is reached for Ki, with an R2 value of 0.42 even on the test set with < 40% similarity to training sequences (Fig. 5c). For CatPred models using E-GNN features, the corresponding R2 values on the out-of-distribution test sets were 0.389, 0.538 and 0.454 for kcat, Km and Ki respectively (Fig. 4b), indicating no significant improvement over using only Seq-Attn + pLM features. Therefore, the production CatPred models accessible through our Google Colab interface (Fig. 3d) are based on Substrate + Seq-Attn + pLM for kcat and Km, and on Substrate + Seq-Attn only for Ki. All further mentions of CatPred models throughout the manuscript refer to these models unless otherwise explicitly specified.
In the analyses described above we used R2 as the sole metric of prediction quality. We repeated almost all assessments and figures using the mean absolute error (MAE) metric (Supplementary Figure S1), obtaining the same trends. However, neither R2 nor MAE provides immediate feedback to the user as to whether a predicted parameter value is likely to be “order-of-magnitude” accurate. Motivated by the need for such a metric, we introduce a new metric termed p1mag, defined as the percent of test predictions that are within one order (+/-) of magnitude of the measured value. We chose this relatively large window of acceptance because enzyme kinetic parameters span multiple orders of magnitude. Table 2 shows the performance evaluation of CatPred models in terms of R2, MAE and p1mag. Approximately 80%, 87% and 70% of held-out test predictions fall within one order of magnitude of error for kcat, Km and Ki, respectively; these values drop to 63.5%, 82.7% and 58.6% on the out-of-distribution test sets. The p1mag metric provides a confidence level evaluated over an entire subset of measurements. We next describe how the variances predicted by CatPred’s probabilistic regression model can be used to infer a confidence value for each prediction separately. Reliable confidence estimates can help segregate predictions with small errors from those with larger ones.
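Because the targets are log10-transformed, "within one order of magnitude" corresponds to an absolute error of at most 1.0 in log space, so the p1mag definition reduces to a one-liner:

```python
import numpy as np

def p1mag(y_true_log10, y_pred_log10):
    """Percent of predictions within one order of magnitude (+/-) of the
    measurement. On log10-transformed targets, one order of magnitude is
    an absolute error of 1.0."""
    err = np.abs(np.asarray(y_pred_log10) - np.asarray(y_true_log10))
    return 100.0 * np.mean(err <= 1.0)
```

For instance, predictions with log10 errors of 0.5, 0.9, 1.5 and 2.0 give a p1mag of 50%: the first two are within one order of magnitude, the last two are not.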
Table 2
The performance metrics obtained by CatPred models as quantified using the coefficient of determination (R2), the mean absolute error (MAE), and the percent of test-set predictions within one order of magnitude of error (p1mag). Metrics obtained on both the held-out and out-of-distribution test sets are listed.
| Metric | CatPred-kcat: Held-out | CatPred-kcat: Out-of-distribution | CatPred-Km: Held-out | CatPred-Km: Out-of-distribution | CatPred-Ki: Held-out | CatPred-Ki: Out-of-distribution |
|---|---|---|---|---|---|---|
| R2 (higher is better) | 0.608 | 0.390 | 0.648 | 0.536 | 0.552 | 0.461 |
| MAE (lower is better) | 0.703 | 1.002 | 0.548 | 0.649 | 0.997 | 1.050 |
| p1mag (higher is better) | 79.4% | 63.5% | 87.6% | 82.7% | 67.1% | 56.4% |
Uncertainty estimates for predictions using CatPred models
The regression models used in prior work for training ML models of kcat and Km relied on a mean-squared error loss function27–30. This approach precludes quantifying the uncertainty of predictions for individual enzyme-substrate pairs: metrics such as R2, MAE or p1mag are assessed over an entire evaluation set (i.e., held-out or out-of-distribution), not for individual predictions. Either a lack of measurements or noisy data can adversely affect predictions for particular enzyme-substrate pairs, implying that not all predictions have the same fidelity. Using a probabilistic description allows CatPred to quantify the uncertainty in prediction for individual enzyme-substrate pairs. Two sources of uncertainty are encountered (i.e., aleatoric and epistemic39). Aleatoric uncertainty arises from noise in the training data due to randomly occurring experimental error, which leads to uncharacteristic fluctuations in the value of the output even for small changes in the input (Fig. 6c). Epistemic uncertainty arises from the lack (or insufficiency) of training data in certain regions of the input space (Fig. 6c). Aleatoric uncertainty can be captured using the probabilistic regression approach used in CatPred (Methods for details): by training the neural networks with a negative log likelihood (NLL) loss function, each CatPred model outputs a Gaussian distribution characterized by a mean and a variance (Fig. 6a). Epistemic uncertainty, on the other hand, requires estimating the variance in predictions across an ensemble of identical neural network models trained using different initializations (Fig. 6b). Individual models in the ensemble provide dissonant predictions for inputs from regions with insufficient training data (Fig. 6c); the extent of the disagreement thus quantifies the associated epistemic uncertainty.
For each kinetic parameter prediction made by CatPred, the combined uncertainty (sum of aleatoric and epistemic contributions) is provided (Fig. 6b). The aleatoric uncertainty is quantified as the square root of the arithmetic mean of ensemble variances (Fig. 6b) whereas the epistemic uncertainty is the sample standard deviation of the ensemble means (Fig. 6b, also see Methods). It is important to note that because the model training is performed using log10-transformed kinetic parameter values, the corresponding standard deviations estimated are also on a log10-scale (Methods for details). A similar uncertainty description framework was used before in molecular property prediction39.
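Under these definitions, the decomposition from an ensemble's per-model means and variances can be sketched as follows (array shapes are an assumption; Methods give the exact formulas):

```python
import numpy as np

def decompose_uncertainty(ensemble_means, ensemble_vars):
    """Split a deep ensemble's outputs for one input into aleatoric and
    epistemic components (both on the log10 scale of the targets).
    ensemble_means, ensemble_vars: arrays of shape (n_models,)."""
    # Aleatoric: square root of the arithmetic mean of predicted variances.
    aleatoric = np.sqrt(np.mean(ensemble_vars))
    # Epistemic: sample standard deviation of the ensemble means.
    epistemic = np.std(ensemble_means, ddof=1)
    return aleatoric, epistemic
```

An ensemble whose members agree closely on the mean but predict large variances signals noisy data (aleatoric); one whose members disagree on the mean signals an under-sampled input region (epistemic).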
We first verified whether the predicted uncertainty values are consistent with the absolute errors of predictions made by the trained CatPred models on the held-out test sets. The goal was to ensure that the predicted uncertainties can be used to discriminate high-confidence from low-confidence predictions. To this end, the held-out test sets were partitioned into four subsets, each consisting of predictions with uncertainty values below the 100th, 75th, 50th and 25th percentiles, respectively, so that each subset becomes progressively enriched in higher-confidence predictions. The performance metrics R2, MAE and p1mag were calculated separately within each subset (Fig. 6d-f). These analyses were performed on the CatPred production models, i.e., Substrate + Seq-Attn + pLM for kcat and Km and Substrate + Seq-Attn only for Ki. We observe that the prediction metrics improve monotonically as held-out subsets with smaller predicted uncertainties are assessed (Fig. 6d-f). R2 values for the 25th-percentile set improve to 0.78, 0.76 and 0.61 for the CatPred-kcat, CatPred-Km and CatPred-Ki models, respectively. Similarly, the MAE drops by approximately 36% for the 25th-percentile set compared to the 100th-percentile set. This trend is also reflected in the increase in p1mag values (Fig. 6d-f), with more than 90% of predictions in the highest-confidence subset (i.e., the 25th-percentile subset) falling within one order of magnitude of error for kcat and Km prediction. We carried out the same analysis for the out-of-distribution test sets and observed similar trends (Supplementary Figure S2). These results imply that the probabilistic description in CatPred correctly assigns lower standard deviations to predictions associated with higher-confidence evaluation sets.
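The percentile-filtering step of this analysis can be sketched as follows; for brevity only MAE is computed here, but the same masking applies to R2 and p1mag:

```python
import numpy as np

def mae_below_uncertainty_percentile(y_true, y_pred, uncertainties, pct):
    """MAE on the subset of predictions whose predicted uncertainty is at or
    below the given percentile of the full evaluation set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    uncertainties = np.asarray(uncertainties)
    cutoff = np.percentile(uncertainties, pct)
    mask = uncertainties <= cutoff  # keep only the more confident predictions
    return np.mean(np.abs(y_pred[mask] - y_true[mask]))
```

If the uncertainties are well calibrated, the returned MAE should shrink monotonically as `pct` moves from 100 down to 25, mirroring the trend in Fig. 6d-f.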
Google Colab web interface for using CatPred
We developed an easy-to-use interface on Google Colab (https://tiny.cc/catpred) for accessing CatPred. The interface allows computations to be run remotely in a web browser without any local installation. The inputs to CatPred are the amino-acid sequence of the enzyme and the substrate SMILES string. For kcat prediction, the substrate SMILES string must be the concatenation of the SMILES strings of all reactants; as discussed previously, this is needed because not only the primary substrate but also the co-substrates (such as secondary substrates, cofactors, etc.) carry information relevant to kcat prediction. Unsurprisingly, this is not the case for Km and Ki, where only the substrate connectivity information is needed. Once the enzyme parameter of interest is chosen and the inputs are entered, they are validated for correct formatting: if the enzyme sequence contains characters outside the natural amino-acid alphabet or if the SMILES string is invalid, an error prompt asks for re-entry of the inputs. Once the inputs are validated, the predicted parameter value along with the estimated uncertainty (aleatoric and epistemic contributions) is output on the screen. On average, the computation takes ~20 seconds on a CPU and ~10 seconds on a GPU. Figure 7a pictorially illustrates the inputs and outputs for predicting the Km value of a hexokinase (from Homo sapiens) acting on its native substrate D-glucose. The output value of 5.58 mM is within 7% error of the experimentally reported value of 6.3 mM48. In addition, the CatPred interface checks whether the given inputs already occur in the BRENDA and/or SABIO-RK databases to alert the user; matching database entries, if any, are listed (Fig. 7b).
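A minimal sketch of the validation step is shown below. The amino-acid alphabet check mirrors the description above; the SMILES check here is only a crude character filter, since the actual interface presumably parses the string with a cheminformatics toolkit such as RDKit:

```python
import re

# The 20 natural amino acids, one-letter codes.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def validate_inputs(sequence, smiles):
    """Return a list of formatting errors; an empty list means the inputs
    pass the basic checks. A crude stand-in for the interface's validator."""
    errors = []
    if not sequence or (set(sequence.upper()) - VALID_AA):
        errors.append("sequence contains non-natural amino-acid characters")
    # Hypothetical character whitelist; a real validator would parse the SMILES.
    if not smiles or not re.fullmatch(r"[A-Za-z0-9@+\-\[\]\(\)=#/\\.%]+", smiles):
        errors.append("SMILES string contains invalid characters")
    return errors
```

On invalid input the interface would display the returned error messages and prompt for re-entry; an empty error list lets the prediction proceed.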