2.1 Software
The code was implemented in R 4.2.2 (R Development Core Team, 2009) and Python 3.9.0 (Sanner, 1999) on a Windows 11 desktop PC equipped with an NVIDIA 2070 Ti GPU. The integrated development environments (IDEs) used were RStudio 2023.03.0+386 and PyCharm 2021.2.2 (Professional Edition). R and Python are widely used as statistical computing languages because of their robust data analysis capabilities and AI ecosystems. ANNs were implemented using the deep learning libraries TensorFlow and Keras (Pang et al., 2020) in the R environment. TensorFlow, developed by Google, is a leading open-source deep learning library that provides a flexible and scalable platform for building complex neural networks with diverse architectures. Keras, built on top of TensorFlow, offers a high-level API that makes constructing deep learning models user-friendly and efficient. Enzyme representations were generated in the Python programming environment.
The graphs were generated using ggplot2 (Wickham, 2011) in the R computing environment (R Development Core Team, 2009). Statistical comparison of the predictions with the experimentally measured values was also conducted in R. Regression equations were added to the graphs using the ggpmisc library (Aphalo, 2016).
2.2 Data collection and data cleaning
The construction of a sound artificial intelligence model depends on data availability and data quality. Given the high uncertainty in the determination of Vmax values, we chose to use only data originating from SABIO-RK (Wittig et al., 2012) and, more specifically, only information concerning wild-type enzymes. The data from SABIO-RK were collected manually because no library providing an interface to its application programming interface (API) was available in any programming language. This manual collection process yielded 1,795 unique Excel files, each containing information for a single wild-type enzyme. To our knowledge, all information available in the database at the time was downloaded. The raw Excel files are also provided on the GitHub page.
All Excel files contained the same number of columns, simplifying their integration into a common table. To reduce the computational time, parallel computing methodologies (Weston & Calaway, 2015) were employed via the R library doParallel to load and combine the raw Excel files from SABIO-RK. This consolidation resulted in a table with 1,236,908 rows. We then filtered the table to include only rows containing Vmax values and removed any rows with missing values (NA), leaving 858,351 rows. Additionally, we ensured that each row corresponded to a single enzyme by splitting rows that listed multiple enzymes and UniProt IDs. Rows lacking a KEGG reaction ID or UniProt ID were removed, and duplicate records were eliminated. After the data cleaning stage, the resulting SABIO-RK table contained 43,633 unique rows.
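The cleaning steps above can be sketched as follows (an illustrative plain-Python toy with hypothetical column names and IDs; the actual pipeline was implemented in R with doParallel):

```python
def clean_rows(rows):
    """Keep rows with a Vmax value, a KEGG reaction ID and UniProt IDs,
    split multi-enzyme rows into one row per enzyme, and drop duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        if row.get("vmax") is None:  # drop rows without Vmax / with NA
            continue
        if not row.get("kegg_reaction_id") or not row.get("uniprot_ids"):
            continue
        for uid in row["uniprot_ids"]:  # one output row per enzyme
            key = (uid, row["kegg_reaction_id"], row["vmax"])
            if key not in seen:  # eliminate duplicates
                seen.add(key)
                cleaned.append({"uniprot_id": uid,
                                "kegg_reaction_id": row["kegg_reaction_id"],
                                "vmax": row["vmax"]})
    return cleaned

# Hypothetical toy records (IDs are only examples)
raw = [
    {"vmax": 1.2, "kegg_reaction_id": "R00200", "uniprot_ids": ["P00558", "Q9XYZ1"]},
    {"vmax": None, "kegg_reaction_id": "R00200", "uniprot_ids": ["P00558"]},
    {"vmax": 1.2, "kegg_reaction_id": "R00200", "uniprot_ids": ["P00558"]},  # duplicate
    {"vmax": 3.4, "kegg_reaction_id": "", "uniprot_ids": ["P11111"]},        # no KEGG ID
]
print(len(clean_rows(raw)))  # → 2
```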
To access reaction data from the KEGG database, it was necessary to organize all the data locally and make a single request for information rather than continuous requests. This approach ensured faster execution of the algorithm since it eliminated the need to make repeated calls to the KEGG APIs, which could introduce time constraints during the testing phase of the methodology. Specifically, all metabolic conversions (reactions) and endogenous metabolites were organized into two nested lists. This information was retrieved through communication with the KEGG APIs using the KEGGREST (Tenenbaum et al., 2019) and KEGGgraph (Zhang & Wiemann, 2009; J. D. Zhang et al., 2015) libraries. All libraries mentioned in this step were employed within the R computing environment and are part of the Bioconductor software (Gentleman et al., 2004). The resulting datasets from the KEGG database are also available on the GitHub page.
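The local-caching strategy can be sketched as follows (a plain-Python illustration with a hypothetical fetch function; the actual retrieval used the R libraries KEGGREST and KEGGgraph):

```python
import json
import pathlib

def fetch_reaction_from_kegg(reaction_id):
    # Hypothetical stand-in for a single KEGG API call
    # (the real work communicated with the KEGG APIs via KEGGREST/KEGGgraph in R).
    return {"id": reaction_id, "equation": "..."}

class KeggCache:
    """Fetch each KEGG entry at most once and keep it locally,
    avoiding repeated API calls during testing of the methodology."""
    def __init__(self, path="kegg_cache.json"):
        self.path = pathlib.Path(path)
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}
        self.calls = 0  # number of actual API calls made by this instance
    def get(self, reaction_id):
        if reaction_id not in self.store:
            self.store[reaction_id] = fetch_reaction_from_kegg(reaction_id)
            self.calls += 1
            self.path.write_text(json.dumps(self.store))  # persist locally
        return self.store[reaction_id]

cache = KeggCache()
print(cache.get("R00200")["id"])  # prints "R00200"; later calls read the local copy
```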
2.3 Introduction of amino acid sequences
The introduction of enzyme structural information relies on the UniProt IDs provided in the SABIO-RK records. These IDs were used to retrieve amino acid (AA) sequences from the UniProt database (UniProt Consortium, 2019), which serves as a hub for a wealth of protein-related information, including structural information. To retrieve these data, the R library UniprotR (Soudy et al., 2020), provided by Bioconductor (Gentleman et al., 2004), was used.
2.4 Multiple Vmax values for a unique structure and reaction
Since multiple Vmax values may correspond to one enzyme reaction and structure, series lacking a KEGG reaction ID were removed, as previously mentioned. This step facilitated the conversion of substrate and product names to KEGG IDs, enabling the identification of additional duplicate references. Consequently, a dataset consisting of 4,215 unique entries was obtained.
The present dataset was segmented into chunks based on the amino acid sequence, the respective organism, and the Vmax unit. This segmentation yielded a list of 1,472 data frames, some of which contained more than one row. Because identical structures with similar Vmax values could end up in the test set and artificially inflate the apparent performance of the model, it was decided that whenever a data frame contained more than one record, the maximum and minimum Vmax values across all rows would be retained. From these maximum and minimum values, 10,000 values were randomly generated from a uniform distribution, and the geometric mean of each distribution was used as the Vmax value for the corresponding amino acid sequence. The same approach was applied when a single row reported both a maximum and a minimum Vmax value; it was not applied when a unique Vmax value existed for a given sequence and organism. The resulting table therefore contains 1,472 entries with unique sequences. Last but not least, since concentration levels were not provided by SABIO-RK for all the enzymes considered, it was decided not to incorporate this characteristic into the analysis. Consequently, the model presented in the next sections can serve as a reference point for the estimation of enzyme concentrations; had concentration been employed as a predictor, the model could not have been used for such approximations.
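The Vmax aggregation step can be sketched as follows (a minimal plain-Python illustration; the original implementation was in R, and Vmax values are assumed positive so that the geometric mean is defined):

```python
import math
import random

def aggregate_vmax(vmax_min, vmax_max, n=10_000, seed=42):
    """Draw n values uniformly between the minimum and maximum Vmax
    and return their geometric mean (computed via the mean of the logs)."""
    rng = random.Random(seed)
    samples = [rng.uniform(vmax_min, vmax_max) for _ in range(n)]
    log_mean = sum(math.log(x) for x in samples) / n
    return math.exp(log_mean)

gm = aggregate_vmax(0.5, 2.0)
# The geometric mean always lies between the two bounds.
assert 0.5 <= gm <= 2.0
```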
To conclude, duplicated structures catalyzing reactions not included in the training set were organized into a separate dataset to further test the developed models on known structures. This dataset contained 47 rows.
2.5 Definition of molecular fingerprints and reaction fingerprints
Each substrate should have a .mol file describing its structure. Integrating these structure files into the methodology allows for the generation of chemical descriptors and fingerprints that can easily be incorporated into artificial intelligence algorithms as structure representations. Molecular fingerprint definition requires 2D .mol files, which were collected manually from the KEGG (Kanehisa et al., 2017) and ChEBI (Hastings et al., 2016) databases. For the conversion algorithm, the RCDK package, a widely accepted R cheminformatics library (Guha & Cherto, 2017) that provides a plethora of fingerprint options as well as an advanced framework for chemical analysis, was used. If a structure file was not available from either database, the corresponding entry was removed.
We chose four different fingerprints to represent the molecular structures of the substrates employed in the present approach. Each substrate is represented by: the standard 1024-bit fingerprint provided by RCDK, which has been used in multiple studies (Choi et al., 2020; Willighagen et al., 2017); the MACCS key, a 166-bit fingerprint (Durant et al., 2002) that is one of the most commonly used fingerprint types; the PubChem fingerprint, an 881-bit fingerprint (Kim et al., 2019) applied in several studies (Bean et al., 2017; Bender et al., 2007; Dey et al., 2018; Dimitri & Lió, 2017; Jamal et al., 2017; Liu et al., 2012; Mizutani et al., 2012; Poleksic & Xie, 2018; Wang et al., 2019; Yamanishi et al., 2012; W. Zhang et al., 2015; Zhou et al., 2015); and the E-state fingerprint, a 79-bit fingerprint introduced by Hall et al. (1991) that has also been employed in numerous studies (Barigye et al., 2013; Elton et al., 2018; Floris et al., 2014). The resulting reaction fingerprint estimation relies on the work of Schneider et al. (2015) and equation (3), assuming that in this case agents are considered insignificant (4). The computational process used to introduce a reaction fingerprint is described in detail in Figure 1. A similar approach of concatenating individual molecular fingerprints to introduce reaction fingerprints has also been proposed by Kroll et al. (2023) and has been used for other purposes, such as clustering chemical reactions (Probst et al., 2022). Finally, one of the key objectives of the present study is to assess the extent to which the number of bits contained in a reaction fingerprint provides the neural network with more information about the catalyzed metabolic transformation.
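Concretely, the difference-style reaction fingerprint of Schneider et al. (2015) amounts to subtracting the summed substrate fingerprints from the summed product fingerprints, with agents ignored as noted above. A minimal sketch with toy bit vectors (the actual fingerprints were generated with RCDK in R):

```python
def reaction_fingerprint(substrate_fps, product_fps):
    """Difference-style reaction fingerprint: sum of product fingerprints
    minus sum of substrate fingerprints (agents ignored)."""
    n = len(substrate_fps[0])
    rxn = [0] * n
    for fp in product_fps:
        rxn = [a + b for a, b in zip(rxn, fp)]
    for fp in substrate_fps:
        rxn = [a - b for a, b in zip(rxn, fp)]
    return rxn

# Toy 8-bit fingerprints standing in for, e.g., 166-bit MACCS keys
substrates = [[1, 0, 1, 0, 0, 0, 1, 0]]
products   = [[1, 0, 0, 1, 0, 0, 1, 0]]
print(reaction_fingerprint(substrates, products))  # → [0, 0, -1, 1, 0, 0, 0, 0]
```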
2.6 Introduction of ESM-1b representations
The methodology employed in this study to introduce ESM-1b vectors followed the approach proposed by Rives et al. (2021) and was adapted into a custom Python script to conform with our methodological framework. The original models and scripts from Rives et al. (2021) are available on GitHub. It is also important to note that the model utilizes a transformer-based architecture (Lin et al., 2022). Introducing protein language models that operate within an evolutionary context is a fundamental step toward advancing predictive and generative artificial intelligence methodologies for biological research. To achieve this, Rives et al. (2021) utilized unsupervised learning techniques to train a large language model (LLM). Their training set comprised 86 billion amino acids (AAs) sourced from 250 million protein sequences, encompassing a diverse range of evolutionary contexts; this is the UniRef50 dataset described by Suzek et al. (2015). The resulting model effectively captures significant biological insights within its representations, derived exclusively from sequence data. These representations encompass organizational structures that include information on the biochemical properties of amino acids, as well as protein homologies. Moreover, such a model encodes details about secondary and tertiary protein structures, which can be discerned through linear projections. More information can be found in the work of Rives et al. (2021).
The AA sequences retrieved from the UniProt database (UniProt Consortium, 2019), as previously discussed in detail, were organized in .fasta file format (Lipman & Pearson, 1985) by using the seqinR library (Charif et al., 2023) in the R computing environment. These sequences were then processed by the pretrained ESM-1b model to generate numerical protein representations. The resulting arrays, with specific dimensions of (1280, 1, 1), are referred to as ESM-1b representations.
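The FASTA serialization step can be sketched as follows (a minimal plain-Python illustration of the format; the actual files were written with the seqinR library in R, and the identifier and sequence shown are only examples):

```python
def write_fasta(sequences, width=60):
    """Serialize {identifier: amino-acid sequence} pairs to FASTA text,
    wrapping sequence lines at a fixed width."""
    lines = []
    for seq_id, seq in sequences.items():
        lines.append(f">{seq_id}")          # header line with the identifier
        for i in range(0, len(seq), width):  # wrapped sequence lines
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"

# Example UniProt-style ID with a toy sequence
print(write_fasta({"P00558": "MSLSNKLTLDK"}))
```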
The adopted methodology relies on the natural language processing (NLP) modeling framework. Its advantage is that every word in a sentence is converted into a numerical vector that encodes significant information about the content and position of that word. When such models are applied to amino acid (AA) sequences, which carry the biochemical information of proteins, the amino acid takes the place of the word in the standard NLP paradigm, yielding unique numerical representations based on the enzyme structure (Kroll et al., 2022a). As the number of entries in the training datasets increases, the ability of such models to learn representations improves significantly (Baevski et al., 2019; Radford et al., 2019; Rives et al., 2021). Deciphering the information embedded within protein sequence variations has long been a challenge in computational biology. The approaches of Rives et al. (2021) and Alley et al. (2019) bring the scientific community a step closer to achieving that goal. For instance, representations derived from protein language models, such as those generated by the ESM-1b model, can identify secondary and tertiary protein structures, offering valuable insights into proteins.
ProteinBERT, for example, is a variant of the bidirectional encoder representations from transformers (BERT) model specifically tailored for protein sequences. Its aim is to capture global protein representations effectively. The model demonstrates versatility by achieving near state-of-the-art results across numerous protein-related tasks through quick fine-tuning after being trained on the UniProtKB/UniRef90 dataset (Boutet et al., 2016; Suzek et al., 2007). ProteinBERT introduces architectural elements specifically designed for proteins, combining language modeling with Gene Ontology (GO) annotation (GO Consortium, 2012) prediction in its pretraining scheme. The architecture of the model enables efficient processing of long sequences and yields impressive performance across multiple benchmarks, including protein structure prediction and posttranslational modifications, despite being smaller and faster than competing methods (Brandes et al., 2022). Another significant example of larger protein-based models is the ProtT5 model of Elnaggar et al. (2021), whose authors report that such models are able to learn “some of the grammar of the language of life”.
Leveraging the theoretical background and the capability of current models to accurately describe protein structures through numerical representations, this potential is exploited to identify patterns within proteins and predict the maximum velocity of the Michaelis–Menten equation as accurately as possible. Instances from the available literature demonstrate that calculations for the kinetic constants of enzymes can rely on specific numerical representations (Kroll et al., 2021; Kroll et al., 2023). Furthermore, it is worth mentioning that the value of Vmax is also related to the concentration of the enzyme in a biological system. By applying the present model, it becomes possible to indirectly estimate the concentration of an enzyme in a biological system.
2.7 Introduction of enzyme amino acid proportions
Furthermore, we aimed to investigate how the amino acid composition of an enzyme could influence its physicochemical properties, with a focus on the Michaelis–Menten maximal velocity. An additional vector was therefore introduced into the dataset, describing the proportions of the amino acids composing each enzyme; its length is 20, equal to the number of amino acids involved in the construction of proteins (Lopez & Mohiuddin, 2020). Such additional features can enrich the dataset and potentially enhance the predictive power of the developed model.
To achieve this goal, the amino acid sequences collected in previous steps were utilized alongside the PROTR library (Xiao et al., 2015) within the R environment. PROTR is a comprehensive and powerful R package designed for the analysis of protein sequences and their associated physicochemical properties. With a wide range of functions and classes, PROTR offers a rich set of tools for bioinformatics to extract valuable insights from protein information, including the computation of diverse protein descriptors, such as amino acid composition, dipeptide composition, and numerous physicochemical properties. The methodology used to introduce the respective vectors is briefly described in Figure 2. These enzyme representations are known as PROTR or amino acid proportions.
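The amino acid proportion vector can be sketched as follows (a plain-Python illustration of the quantity computed with the PROTR library; the toy sequence is only an example):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 proteinogenic amino acids

def aa_proportions(sequence):
    """Return a length-20 vector with the proportion of each amino acid
    in the given sequence (amino acid composition)."""
    counts = Counter(sequence)
    total = len(sequence)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]

vec = aa_proportions("MSLSNKLTLDK")
assert len(vec) == 20
assert abs(sum(vec) - 1.0) < 1e-9  # proportions sum to one
```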
2.8 Introduction of the training and test sets
In the pursuit of building robust and accurate deep learning models for enzyme kinetics prediction, careful data partitioning is essential to ensure unbiased evaluation and validation of model performance. In the framework of this study, the dataset was divided randomly, with 70% of the data allocated for training purposes, 10% for validation and the remaining 20% reserved for testing. During testing of the optimization algorithm that is described in detail below, it was observed that when 10% of the data were reserved for validation purposes, the model performed significantly better on unseen data compared to the classic 80/20 ratio. As a result, it was decided to adopt the 70/10/20 ratio for the development of the models. This process yielded a training set with 626 entries, a validation set of 90 entries and a test set of 179 entries. It is important to note that only values given in mol/(s*g) were considered in these datasets.
The indication provided by Kroll and Lercher (2023) regarding the presence of similar structures in both the training and test sets underscores the importance of processing the SABIO-RK data in a manner that ensures uniqueness of enzyme structures for a given organism, along with the standardization of Vmax units as described in the previous sections. Consequently, the present data splitting approach ensures that the model learns from a diverse range of data and is tested on unseen samples, facilitating generalization to new data. Careful measures were taken to ensure that the training, validation, and test datasets contained distinct entries, thus avoiding any instances of data leakage or overlap. This meticulous preprocessing enhances the reliability and robustness of the model's performance when applied to real-world scenarios and unseen data, for instance in the development of in silico NAMs. In addition, the models were further tested on a set of similar structures that catalyze other metabolic reactions; this dataset consists of 47 entries and comprises the structures excluded from the training set. Last but not least, the distinction between training, validation and test samples prevents the model from overfitting and encourages it to learn the underlying patterns in enzyme kinetics given the structural inputs provided.
To ensure consistent and reproducible model training, validation, and testing metrics, the randomly selected training, validation, and test indices were locally stored. These files are also available on the GitHub page. By doing so, we could consistently access the same set of training, validation, and testing data during multiple model training runs. This approach prevents fluctuations in the data distribution across different training sessions, allowing for direct comparisons between the developing models, thus ensuring the consistency and robustness of the results presented in this work.
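The reproducible 70/10/20 partition with locally stored indices can be sketched as follows (a plain-Python illustration; the seed, file format, and index layout are assumptions, not the stored files from the GitHub page):

```python
import random

def split_indices(n, seed=2023, path=None):
    """Shuffle 0..n-1 reproducibly and split it 70/10/20 into
    training, validation and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed → same split every run
    n_train = round(0.7 * n)
    n_val = round(0.1 * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    if path:  # optionally persist the indices for reuse across training runs
        with open(path, "w") as fh:
            fh.write(repr({"train": train, "val": val, "test": test}))
    return train, val, test

train, val, test = split_indices(895)  # 895 entries, as in the text
assert not set(train) & set(val) and not set(val) & set(test) and not set(train) & set(test)
```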
2.9 ANN architecture, hyperparameter optimization, and model fitting
The use of deep learning has revolutionized predictive modeling, resulting in powerful and accurate models for a wide array of scientific and industrial applications. In the realm of enzyme kinetics, understanding enzymatic behavior holds paramount importance for numerous processes, such as computational new approach methodologies, systems biology models, pharmacokinetics, and toxicokinetics. TensorFlow offers high-level APIs that streamline the design, training, validation, and evaluation of deep learning models, making it more accessible and user-friendly for both researchers and practitioners. Keras, an integral part of TensorFlow, provides a simple and intuitive interface, allowing users to swiftly prototype complex neural networks. This simplicity is particularly beneficial in enzyme kinetics research, where domain experts may lack expertise in deep learning methodologies. The user-friendly APIs empower researchers to concentrate on the biological aspects of the problem rather than being entangled in the technical complexities of deep learning modeling. The versatility of these tools proves invaluable in enzyme kinetics, where the intricate nature of enzymatic behavior may require different architectural designs to capture specific patterns and interactions. TensorFlow and Keras have emerged as significant tools for developing predictive deep learning models, with a particular focus in this study on predicting the Michaelis‒Menten maximal velocity.
One of the primary challenges in developing deep learning models revolves around optimizing hyperparameters to enhance model reliability and robustness. Traditional methods often involve trial and error. However, this approach can be exceedingly time-consuming, especially when dealing with large datasets and models whose complex architectures require extensive training time; consequently, the benefits of such approaches compared to experimental methods may diminish. Moreover, even high training metrics such as R2 > 0.95 together with very low MAE and RMSE values, while indicating high reliability and performance, do not guarantee that the optimal combination of hyperparameters has been found. Therefore, it is imperative to adopt methodologies that expedite the search for the best possible architecture. In this regard, this work adapts the capabilities of KerasTuner (Pon & Krishna Prakash, 2021) to the computing environment of R[1] through an in-house algorithm to optimize deep learning models for predicting the Michaelis‒Menten maximal velocity.
KerasTuner is a hyperparameter optimization library for Keras that is widely employed in the development of artificial neural network (ANN) models. It enables the automatic exploration of optimal hyperparameter configurations, including but not limited to learning rates, layer units, and dropout rates. Through its utilization, the performance of neural network models can be significantly enhanced. The library offers a user-friendly and adaptable interface for defining search spaces, selecting search algorithms, and conducting hyperparameter optimization experiments (Pon & Krishna Prakash, 2021).
The models presented in this study are constructed using the sequential API of Keras, a high-level deep learning framework based on TensorFlow. For enzyme kinetic prediction, fully connected neural networks (FCNNs) are employed, and the sequential model offers a convenient approach to building such architectures. These models are constructed and optimized using the functions provided by KerasTuner. During optimization, the algorithm selects the optimizer and learning rate, including tuning parameters specific to each optimizer. The algorithm offers a wide range of options, including Adadelta (Adaptive Delta), Adam (Adaptive Moment Estimation), Adamax (Adaptive Moment Estimation with Infinity), Nadam (Nesterov Adaptive Moment Estimation), RMSProp (Root Mean Squared Propagation), FTRL (Follow The Regularized Leader), and SGD (Stochastic Gradient Descent), with distinct learning rate values of 1E-05, 1E-04, 1E-03, 1E-02, and 1E-01. All optimization functions are integrated as built-in functions within the Keras library. The tuning parameters optimized in each case are outlined in Table 1, where the minimum and maximum values denote the lower and upper bounds, respectively, and the step indicates the minimum spacing between two subsequent values.
Table 1
The parameters were fine-tuned using KerasTuner functions. In cases where no tuning parameters were provided, the algorithm defaults to the preset values for each optimizer.
Optimizer | Tuning Parameter | Minimum Value | Maximum Value | Step |
Adadelta (Adaptive Delta) | RHO | 0.01 | 0.99 | 0.01 |
Adam (Adaptive Moment Estimation) | BETA 1 | 0.01 | 0.99 | 0.01 |
 | BETA 2 | 0.001 | 0.999 | 0.001 |
Adamax (Adaptive Moment Estimation with Infinity) | BETA 1 | 0.01 | 0.99 | 0.01 |
 | BETA 2 | 0.001 | 0.999 | 0.001 |
Nadam (Nesterov Adaptive Moment Estimation) | BETA 1 | 0.01 | 0.99 | 0.01 |
 | BETA 2 | 0.001 | 0.999 | 0.001 |
RMSProp (Root Mean Squared Propagation) | RHO | 0.01 | 0.99 | 0.01 |
FTRL (Follow The Regularized Leader) | Power of learning rate | -1 | -0.01 | 0.01 |
SGD (Stochastic Gradient Descent) | (defaults used) | | | |
Additionally, the algorithm is capable of selecting from a plethora of built-in loss functions, including the mean absolute error (MAE), mean squared error (MSE), mean squared logarithmic error, and log-cosh, to determine the most reliable model. However, the only metric function utilized is R-squared, which is also the metric used by the algorithm to choose the best model. An in-house function for estimating R-squared was employed because Keras does not offer R-squared as a built-in metric. This function also serves as a potential option for loss function selection.
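The in-house R-squared metric operates on Keras tensors; the quantity it estimates is sketched here in plain Python:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 1.0
```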
A reliable artificial neural network architecture also involves activation functions and typically consists of more than one layer. Activation functions are of crucial importance in introducing nonlinearity into the deep learning model, allowing it to learn complex patterns from the input data. The present algorithm can select from multiple activation functions, including the rectified linear unit (ReLU), exponential linear unit (ELU), scaled exponential linear unit (SELU), hard sigmoid, linear, sigmoid, softmax, tanh, exponential, Gaussian error linear unit (GELU), and Swish, with their default arguments as provided by the Keras library. Additionally, the architecture allows for up to seven dense layers and two types of dropout layers: a regular layer and a Gaussian layer. The first and last dense layers are always included, and each dense layer can have a different activation function from the others. All layers containing neurons can have from 1 to 2048 neurons, except for the last layer, where the algorithm selects from 1 to 4 neurons and the activation function is set to linear by default. In tests conducted during the development of the algorithm, models consistently performed better when the last layer had a linear activation function (data not shown); the linear activation function is also appropriate for regression tasks where the output must be a continuous value. The algorithm also determines whether to include the five intermediate layers, all of which are equipped with the previously mentioned activation functions and neuron ranges.
To mitigate overfitting, which occurs when the model memorizes the training data rather than learning general patterns, dropout layers were introduced into the KerasTuner algorithm. Dropout randomly deactivates neurons during training, compelling the network to learn more robust and generalized representations and thereby enhancing its generalizability. Dropout also serves as a regularization technique, making the model less sensitive to noise. In addition to the option of including dropout layers in the network architecture, the algorithm provides the flexibility to specify the dropout fraction, ranging between 0.01 and 0.99 with a step of 0.01, for both the Gaussian and regular dropout layers.
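To illustrate the search space described above, the following plain-Python sketch randomly samples one candidate configuration (the actual optimization used KerasTuner's search algorithms; the function names and sampling scheme here are illustrative):

```python
import random

OPTIMIZERS = ["Adadelta", "Adam", "Adamax", "Nadam", "RMSProp", "FTRL", "SGD"]
LEARNING_RATES = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
ACTIVATIONS = ["relu", "elu", "selu", "hard_sigmoid", "linear", "sigmoid",
               "softmax", "tanh", "exponential", "gelu", "swish"]
LOSSES = ["mae", "mse", "msle", "log_cosh"]

def sample_architecture(rng):
    """Draw one candidate configuration from the search space in the text:
    a first dense layer, up to five optional intermediate layers, and a
    final linear layer with 1-4 neurons."""
    n_hidden = rng.randint(0, 5)  # optional intermediate layers
    layers = [{"units": rng.randint(1, 2048), "activation": rng.choice(ACTIVATIONS)}
              for _ in range(1 + n_hidden)]  # first + intermediate layers
    layers.append({"units": rng.randint(1, 4), "activation": "linear"})  # last layer
    return {
        "optimizer": rng.choice(OPTIMIZERS),
        "learning_rate": rng.choice(LEARNING_RATES),
        "loss": rng.choice(LOSSES),
        "dropout": round(rng.uniform(0.01, 0.99), 2),
        "layers": layers,
    }

cfg = sample_architecture(random.Random(0))
assert cfg["layers"][-1]["activation"] == "linear"
```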
Finally, the main criterion for the adoption of a model in each case was a high R-squared on the test dataset (unseen data). The detailed results obtained in each use case, as well as the deep learning models developed, are presented in the next chapter. It is worth mentioning that the number of neurons in each layer gradually decreases toward the final layer. This progressive reduction reflects the ability of the model to extract increasingly abstract and meaningful features from the data as it delves deeper into the network.
2.10 Statistical analysis plan to validate the performance of the developed models
To enhance the credibility of the results presented in this scientific work, it is essential to compare the predictions generated by the deep learning models in the following sections with experimentally recorded values by applying widely accepted statistical methods. To achieve this, two statistical methods have been employed. In the first approach, linear regression is utilized to compare the predictions with the experimental data. Linear regression is commonly employed to assess the performance of predictive models by analyzing their predictions against experimental measurements. This method offers a straightforward yet robust means to establish the relationship between two continuous variables—the model outputs and the corresponding experimental observations. Similar methodologies have also been reported in the literature, such as the work of Choetkiertikul et al. (2018).
The second method employed to enhance the reliability of the predictions generated by the models presented below involves calculating the Pearson correlation coefficient (r). Pearson's coefficient serves as a robust metric for quantifying the strength and direction of the linear relationship between two variables. A value close to 1 indicates a strong positive linear relationship, while values approaching 0 suggest little to no linear correlation. It should be noted that Pearson's coefficient is sensitive to outliers, a limitation worth bearing in mind when dealing with experimental data that may contain noise, such as datasets with multiple variations in Vmax values. Nevertheless, its adoption in comparing deep learning predictions with experimental measurements provides a universally accepted measure of association, and the Pearson correlation coefficient is widely used to assess the performance of deep learning models against experimentally measured values, as broadly reported in the literature (Preuer et al., 2018).
Finally, when assessing the performance of deep learning models, the inclusion of metrics such as the R-squared (coefficient of determination), mean absolute error (MAE), and root mean square error (RMSE) is imperative for comprehensive evaluation. R-squared provides valuable insight into the proportion of variance in the dependent variable that is predictable from the independent variables, thereby offering a measure of model adequacy. MAE, characterized by its simplicity and interpretability, offers a robust indication of the average magnitude of errors present in predictions, facilitating a clear understanding of model accuracy. RMSE, akin to MAE but incorporating the square root of the average squared differences between predictions and actual values, tends to penalize larger errors more significantly, thus providing a nuanced perspective on model performance. The combined utilization of these metrics provides a comprehensive evaluation, allowing for a thorough assessment of both the predictive capability and the precision of the deep learning models presented in this study.
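For reference, the error metrics and the correlation coefficient discussed above can be computed as follows (a plain-Python sketch with toy data):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(x, y):
    """Pearson correlation coefficient between two samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy experimental observations vs. model predictions
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
print(round(mae(obs, pred), 3), round(rmse(obs, pred), 3), round(pearson_r(obs, pred), 3))
```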