Dataset construction
To predict the enzyme active sites, we collected eight enzyme datasets and constructed new training and test sets from them. The eight datasets, namely NN [57], PC [58], HA superfamily [47], EF family [59], EF superfamily, EF fold, T-37, and T-124 [46], collectively contain 987 proteins. T-124, containing 124 proteins, was used as the test set (TS124), while the remaining 863 proteins were used as the training set. To exclude highly similar sequences, chains in the training set sharing > 25% identity with TS124 were removed using MMseqs2 [60], leaving 588 sequences in the training set (Train588). For EC number prediction, following CLEAN [11], more than 220,000 enzyme sequences were extracted from UniProt [61], and a training set of 74,487 sequences for EC number identification was constructed through clustering at 70% identity. Two independent test sets were used to evaluate model performance. The first, NEW-392, contains 392 enzyme sequences from Swiss-Prot entries released after April 2022, covering 177 EC numbers. The second, Price-149, is an experimental dataset of 149 enzyme sequences described by Price et al. [50]. For predicting the enzyme optimum pH, 11,383 enzymes were collected from BRENDA (released in January 2023) [54], which provides the experimentally determined optimum pH of enzyme-catalyzed reactions. After removing similar sequences sharing > 25% identity, 4,110 enzymes remained and were sorted by release date. The most recent 813 sequences (about 20%) were used as the test set (Brenda-test), while the remainder formed the training set (Brenda-train).
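For illustration, the identity-filtering step can be reproduced with a short script. The sketch below is a minimal example, not the authors' exact pipeline: it assumes the training chains have been searched against TS124 with MMseqs2 and that the hits are stored in the default BLAST-style tabular format; the file names are hypothetical.

```python
# Minimal sketch: drop training chains that share > 25% identity with any TS124 chain,
# based on an MMseqs2 tabular output whose first three columns are query, target, pident.

def read_fasta(path):
    """Return {identifier: sequence} from a FASTA file."""
    seqs, name = {}, None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = ""
            elif name:
                seqs[name] += line
    return seqs

train = read_fasta("train_863.fasta")           # hypothetical file name
redundant = set()
with open("hits.m8") as f:                       # hypothetical: train searched against TS124
    for line in f:
        query, _target, pident = line.split("\t")[:3]
        if float(pident) > 25.0:                 # identity threshold from the paper
            redundant.add(query)

train588 = {k: v for k, v in train.items() if k not in redundant}
print(f"{len(train588)} training chains retained")
```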
The architecture of the model
As shown in Fig. 1, protein structures are predicted using ESMFold to construct the protein graph, and sequence embeddings are extracted via ProtTrans; both are fed into a featurizer layer to obtain node and edge features. These features are used to learn geometric embeddings through geometric graph learning. Based on these embeddings, enzyme active sites are predicted and a weight score is assigned to every residue. Using these weight scores, enzyme EC numbers are identified with an attention layer and label diffusion. In addition, to better determine reaction conditions, the model is further extended to optimum pH prediction by incorporating attention pooling.
Featurizer layer
A protein is represented as a radius graph constructed from the \(\:{C}_{\alpha\:}\) atoms of its residues, with the radius defaulting to 10 Å. The protein graph comprises the adjacency matrix together with node and edge features, which are derived from a local coordinate system. The \(\:{\text{C}}_{{\alpha\:}}\), \(\:\text{C}\), and N atoms of residue \(\:i\) are used to build the coordinate system \(\:{Q}_{i}=[{b}_{i},\:{n}_{i},\:{b}_{i}\times\:{n}_{i}]\). Formally, we define:
$$\:\begin{array}{c}{u}_{i}={C}_{{\alpha\:}_{i}}-{N}_{i},\:{\:v}_{i}={C}_{i}-{C}_{{\alpha\:}_{i}},\:{b}_{i}=\frac{{u}_{i}-{v}_{i}}{\parallel\:{u}_{i}-{v}_{i}\parallel\:},\:{n}_{i}=\frac{{u}_{i}\times\:{v}_{i}}{\parallel\:{u}_{i}\times\:{v}_{i}\parallel\:}\:\#\left(1\right)\end{array}$$
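As a minimal illustration of Eq. (1), the local frame \(\:{Q}_{i}\) can be assembled from the backbone coordinates as follows; the NumPy variable names are illustrative.

```python
import numpy as np

def local_frame(n_xyz, ca_xyz, c_xyz):
    """Build the local frame Q_i from the N, Cα, and C coordinates of residue i (Eq. 1)."""
    u = ca_xyz - n_xyz                        # u_i = Cα_i - N_i
    v = c_xyz - ca_xyz                        # v_i = C_i - Cα_i
    b = (u - v) / np.linalg.norm(u - v)       # b_i: normalized difference vector
    n = np.cross(u, v)
    n = n / np.linalg.norm(n)                 # n_i: unit normal of the plane spanned by u, v
    return np.stack([b, n, np.cross(b, n)], axis=1)   # Q_i = [b_i, n_i, b_i x n_i]
```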
Based on the local coordinate system, the node and edge features are defined as follows:
(i) Node features. Two atoms \(\:A,\:B\in\:\left\{{C}_{i},\:{C}_{{\alpha\:}_{i}},\:{N}_{i},\:{O}_{i},\:{R}_{i}\right\}\) are considered, where \(\:{C}_{i}\), \(\:{C}_{{\alpha\:}_{i}}\), \(\:{N}_{i}\), and \(\:{O}_{i}\) denote four backbone atoms of residue \(\:i\) and \(\:{R}_{i}\) denotes the centroid of its side-chain atoms. From A and B, distance, direction, and angle features are computed for each residue. The distance features are \(\:RBF(\parallel\:A-B\parallel\:)\), where \(\:A\ne\:B\) and \(\:RBF\) is a radial basis function. The direction features are defined as \(\:{Q}_{i}^{T}\frac{A-{C}_{{\alpha\:}_{i}}}{\parallel\:A-{C}_{{\alpha\:}_{i}}\parallel\:}\), indicating the direction of the other atoms relative to \(\:{C}_{{\alpha\:}_{i}}\). To adequately capture the geometric information of the backbone, the torsion angles (\(\:{\varphi\:}_{i},\:{\psi\:}_{i},\:{\omega\:}_{i}\)) and bond angles (\(\:{\alpha\:}_{i},\:{\beta\:}_{i},\:{\gamma\:}_{i}\)) are computed, and their sine and cosine values are used as angle features (an illustrative sketch of these geometric features follows the edge-feature definitions below).
To enhance the node features, a pre-trained language model (ProtTrans) was used to extract informative protein embeddings from sequences. ProtTrans is a transformer-based pre-trained language model with 3B parameters, trained on BFD and fine-tuned on UniRef50 using BERT's denoising objective. Beyond the sequence, additional information was extracted from the structures: DSSP was used to compute structural properties, including a one-hot secondary-structure profile and relative solvent accessibility, which further enhance the node features.
(ii) Edge features. For atom pairs \(\:A\in\:\left\{{C}_{i},\:{C}_{{\alpha\:}_{i}},\:{N}_{i},\:{O}_{i},\:{R}_{i}\right\}\) and \(\:D\in\:\left\{{C}_{j},\:{C}_{{\alpha\:}_{j}},\:{N}_{j},\:{O}_{j},\:{R}_{j}\right\}\) from residues \(\:i\) and \(\:j\), respectively, the edge features are defined analogously and include distance, direction, and orientation features. The distance features between residues \(\:i\) and \(\:j\) are \(\:RBF(\parallel\:A-D\parallel\:)\), characterizing the distances of the given atom pairs. The direction features are defined as \(\:{Q}_{i}^{T}\frac{D-{C}_{{\alpha\:}_{i}}}{\parallel\:D-{C}_{{\alpha\:}_{i}}\parallel\:}\), denoting the direction of the atoms in residue \(\:j\) relative to \(\:{C}_{{\alpha\:}_{i}}\). To represent the relative rotation between the two local coordinate systems, \(\:q({Q}_{i}^{T}{Q}_{j})\) is computed as the orientation feature, where \(\:q\) denotes a quaternion encoding function [62].
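The following sketch illustrates how the distance, direction, and orientation features defined above could be computed. The RBF centers and width are illustrative choices, and SciPy's quaternion conversion is used here in place of the encoding cited in [62].

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rbf(distances, d_min=0.0, d_max=20.0, num_bins=16):
    """Expand pairwise distances into radial basis function features."""
    centers = np.linspace(d_min, d_max, num_bins)
    width = (d_max - d_min) / num_bins
    return np.exp(-((np.asarray(distances)[..., None] - centers) / width) ** 2)

def direction_feature(Q_i, atom_xyz, ca_i_xyz):
    """Unit direction from Cα_i to another atom, expressed in the local frame Q_i."""
    vec = atom_xyz - ca_i_xyz
    return Q_i.T @ (vec / np.linalg.norm(vec))

def orientation_feature(Q_i, Q_j):
    """Quaternion encoding of the relative rotation Q_i^T Q_j between two local frames."""
    relative = Q_i.T @ Q_j
    return Rotation.from_matrix(relative).as_quat()   # (x, y, z, w)
```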
Geometric graph learning
The node and edge features obtained from the featurizer layer were fed into several GNN layers for geometric graph learning. To learn multi-scale residue interactions, node update, edge update, and global context attention modules were employed at the node, edge, and global context levels, respectively.
(i) Node update. Because the transformer has proven to be a powerful model for both sequence and graph data [63, 64], its multi-head attention mechanism was employed for efficient message passing. The feature vectors of node \(\:i\) and edge \(\:j\to\:i\) in layer \(\:l\) are denoted \(\:{h}_{i}^{l}\) and \(\:{e}_{ji}^{l}\), and are transformed into a \(\:d\)-dimensional space before the GNN operation. Node \(\:i\) in layer \(\:l\) is updated by message passing as follows:
$$\:\begin{array}{c}{\widehat{h}}_{i}^{l+1}={h}_{i}^{l}+\sum\:_{j\in\:{NB}_{i}\cup\:i}{\alpha\:}_{ji}^{l}\left({W}_{V}^{l}{h}_{j}^{l}+{W}_{E}^{l}{e}_{ji}^{l}\right)\#\left(2\right)\end{array}$$
where the attention weight \(\:{\alpha\:}_{ji}^{l}\) is computed as follows:
$$\:\begin{array}{c}\left\{\begin{array}{c}{w}_{ji}^{l}=\frac{{\left({W}_{Q}^{l}{h}_{i}^{l}\right)}^{T}\left({W}_{K}^{l}{h}_{j}^{l}+{W}_{E}^{l}{e}_{ji}^{l}\right)}{\sqrt{d}}\\\:{\alpha\:}_{ji}^{l}=\frac{{e}^{{w}_{ji}^{l}}}{{\sum\:}_{k\in\:{NB}_{i}\cup\:i}{e}^{{w}_{ki}^{l}}}\end{array}\right.\#\left(3\right)\end{array}$$
where \(\:{W}_{Q}^{l}\), \(\:{W}_{K}^{l}\), and \(\:{W}_{V}^{l}\) are weight matrices that map the node vectors to query, key, and value representations, respectively. The key and value representations are further supplemented by the edge vectors through the weight matrix \(\:{W}_{E}^{l}\), and \(\:{NB}_{i}\) denotes the neighbors of node \(\:i\). The queries, keys, and values are projected multiple times, and the parallel attention functions are performed before their outputs are concatenated (multi-head attention).
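A dense, single-head PyTorch sketch of Eqs. (2)-(3) is given below. The actual model uses a multi-head, graph-sparse implementation; the tensor layout (pairwise edge features, boolean adjacency with self-loops) is an assumption made for brevity.

```python
import torch
import torch.nn as nn

class NodeUpdate(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.W_E = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, h, E, adj):
        # h: [N, d] node features; E: [N, N, d] with E[i, j] = features of edge j -> i;
        # adj: [N, N] boolean adjacency including self-loops, so attention runs over NB_i ∪ {i}
        q = self.W_Q(h)                                    # [N, d] queries
        k = self.W_K(h)[None, :, :] + self.W_E(E)          # [N, N, d] edge-augmented keys
        v = self.W_V(h)[None, :, :] + self.W_E(E)          # [N, N, d] edge-augmented values
        w = (q[:, None, :] * k).sum(-1) / self.d ** 0.5    # [N, N] logits w_ji (Eq. 3)
        w = w.masked_fill(~adj, float("-inf"))
        alpha = torch.softmax(w, dim=1)                    # normalize over neighbors j
        return h + (alpha[..., None] * v).sum(dim=1)       # Eq. (2)
```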
(ii) Edge update. The edge features are updated from the representations of their endpoint nodes to enhance model performance:
$$\:\begin{array}{c}{e}_{ji}^{l+1}={e}_{ji}^{l}+EdgeMLP\left({\widehat{h}}_{j}^{l+1}\parallel\:{e}_{ji}^{l}\parallel\:{\widehat{h}}_{i}^{l+1}\right)\#\left(4\right)\end{array}$$
where \(\:EdgeMLP\) denotes the MLP operation for edge updates and \(\:\parallel\:\) represents the concatenation operation.
(iii) Global context attention. Although local interactions are crucial for learning residue representations, global information has also been shown to be beneficial. However, the computational overhead of full global attention poses a major challenge. To reduce the complexity, a global context vector is computed and then applied to the node representations through gated attention [36]:
$$\:\begin{array}{c}\left\{\begin{array}{c}{c}^{l}=\frac{{\sum\:}_{k=0}^{n-1}{\widehat{h}}_{k}^{l+1}}{n}\\\:{h}_{i}^{l+1}={\widehat{h}}_{i}^{l+1}⨀\sigma\:\left(GateMLP\left({c}^{l}\right)\right)\end{array}\right.\#\left(5\right)\end{array}$$
where \(\:n\) is the number of residues in the protein, \(\:\sigma\:\) is the sigmoid function, \(\:⨀\) denotes the element-wise product, and \(\:GateMLP\) denotes the MLP used for gated attention.
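The edge update (Eq. 4) and global context attention (Eq. 5) can be sketched in the same dense layout; the MLP widths here are illustrative.

```python
import torch
import torch.nn as nn

class EdgeAndContextUpdate(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.gate_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, h_hat, E, adj):
        # h_hat: [N, d] updated node features; E: [N, N, d] edge features; adj: [N, N] bool
        n = h_hat.size(0)
        # Eq. (4): e_ji <- e_ji + EdgeMLP(h_j || e_ji || h_i), evaluated for all pairs at once
        h_j = h_hat[None, :, :].expand(n, n, -1)           # sender features
        h_i = h_hat[:, None, :].expand(n, n, -1)           # receiver features
        E = E + self.edge_mlp(torch.cat([h_j, E, h_i], dim=-1)) * adj[..., None].float()
        # Eq. (5): gate every node with the sigmoid of the mean (global context) vector
        c = h_hat.mean(dim=0)
        h = h_hat * torch.sigmoid(self.gate_mlp(c))
        return h, E
```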
Enzyme active site prediction (GraphEC-AS)
Because enzyme active sites are central to enzyme function, we first predict the active sites before identifying EC numbers. The geometric embeddings obtained from geometric graph learning were fed into an MLP layer to assign each residue a score indicating its likelihood of belonging to an active site. These scores serve as per-residue weights reflecting each residue's importance.
The identification of EC numbers (GraphEC)
Guided by the weight scores generated by GraphEC-AS, an EC number predictor was developed. The previously generated geometric embeddings were fed into an attention layer, where attention functions were performed in parallel using the multi-head attention mechanism. By integrating the multi-head attention outputs with the weight scores, the residue-level information was aggregated to the protein level through a pooling layer, yielding the initial prediction. A label diffusion algorithm, following S2F [44], was then employed to refine this prediction using homologous proteins retrieved by DIAMOND. After label diffusion, the final prediction identifies the EC numbers as a multi-label classification task.
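A rough sketch of this pooling step is shown below: the per-residue active-site scores act as weights when aggregating residue embeddings into a protein-level vector, which is then mapped to the 5,106-dimensional multi-label output. The layer sizes and the use of PyTorch's built-in multi-head attention are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class WeightedECHead(nn.Module):
    def __init__(self, d=256, num_heads=8, num_ec=5106):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=num_heads, batch_first=True)
        self.classifier = nn.Linear(d, num_ec)

    def forward(self, h, active_site_scores):
        # h: [1, L, d] residue embeddings; active_site_scores: [1, L] scores in [0, 1]
        h, _ = self.attn(h, h, h)                           # multi-head attention layer
        weights = active_site_scores / active_site_scores.sum(dim=1, keepdim=True)
        pooled = (h * weights[..., None]).sum(dim=1)        # weighted pooling to protein level
        return torch.sigmoid(self.classifier(pooled))       # initial multi-label EC prediction
```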
Enzyme optimum pH prediction (GraphEC-pH)
Since enzymes require certain environmental conditions to exert their catalytic activity, we further predicted the optimum pH of each enzyme. The pH values were categorized into three groups: acidic (below 5), neutral (5 to 9), and alkaline (above 9). To obtain the representation for predicting the optimum pH, multi-head attention was applied to the geometric embeddings derived from geometric graph learning, followed by an MLP layer that outputs the prediction. Combining the preceding identification of enzyme function with this pH prediction provides more effective guidance for actual experiments.
Hierarchy of catalytic functions
The Enzyme Commission (EC) number is a numerical system used to classify enzymes according to the reactions they catalyze. Each EC number comprises four digits, which hierarchically categorize enzymes by catalytic reaction type and specific substrate [65] (e.g., EC 1.3.1.32 denotes maleylacetate reductase). In this study, we collected 5,106 EC numbers from the training set and defined a multi-label vector of length 5,106, where each position corresponds to a specific EC number.
The protein language model (ProtTrans)
The informative sequence embeddings were generated by the pre-trained language model ProtT5-XL-U50 (ProtTrans [39]). ProtTrans is a transformer-based model built on the T5 architecture [66], pre-trained on UniRef50 [67] to predict masked amino acids. The features from the final layer of the ProtTrans encoder were used to enhance the node representations.
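Per-residue ProtT5 embeddings can be extracted with the Hugging Face transformers library as sketched below; the checkpoint name and preprocessing follow the public ProtT5 examples and may differ from the authors' exact setup.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

sequence = "MSKGEELFTG"                                    # toy sequence
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))        # ProtT5 expects space-separated residues
inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state             # [1, L+1, 1024]
embeddings = hidden[0, : len(sequence)]                    # drop the trailing special token
```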
Protein structure prediction using a language model (ESMFold)
ESMFold [30] is a large language model-based structure predictor with up to 15B parameters, built on the premise that language models can capture evolutionary patterns across millions of sequences. While achieving accurate and fast structure prediction, ESMFold reduces inference time by up to 60-fold compared with the state-of-the-art method. Benefiting from this efficiency, the first evolutionary-scale structural characterization of a metagenomic resource has been presented. In this study, we employed ESMFold to predict protein structures, which were then used in the subsequent geometric graph learning.
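Structure prediction with ESMFold can be run through the public fair-esm package as sketched below; installation, GPU placement, and batching details are omitted.

```python
import torch
import esm

model = esm.pretrained.esmfold_v1().eval()       # load the public ESMFold v1 checkpoint
sequence = "MSKGEELFTG"                          # toy sequence
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)       # predicted structure in PDB format
with open("predicted.pdb", "w") as f:
    f.write(pdb_string)
```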
Label diffusion algorithm
To enhance the initial predictions of EC numbers, a label diffusion algorithm [44, 68] was applied during the testing phase. First, the training-set sequences similar to the test sequences were retrieved using DIAMOND [15]. Second, based on the sequence identities of these protein pairs, a homology network \(\:M\in\:{R}^{T\times\:T}\) was constructed, where \(\:T\) is the number of proteins in the test set plus the number of hits in the training set. Then, to measure the degree to which a pair of proteins belongs to the same community within the homology network, a Jaccard similarity matrix was defined as follows:
$$\:\begin{array}{c}{J}_{ij}=\frac{{\sum\:}_{z}{M}_{iz}{M}_{jz}}{\sum\:_{z}{M}_{iz}+\sum\:_{z}{M}_{jz}-{\sum\:}_{z}{M}_{iz}{M}_{jz}}\#\left(6\right)\end{array}$$
For a target EC number \(\:x\), the \(\:{x}^{th}\) column of the final annotation matrix \(\:S\) (\(\:{S}_{x}\)) was learned by minimizing the cost function \(\:P\left({S}_{x}\right)\):
$$\:\begin{array}{c}P\left({S}_{x}\right)=\:\sum\:_{i=1}^{T}{\left({S}_{ix}-{Y}_{ix}\right)}^{2}+\frac{\epsilon\:}{2}\sum\:_{i=1}^{T}\frac{1}{{d}_{i}}\sum\:_{j=1}^{T}{J}_{ij}{M}_{ij}{\left({S}_{ix}-{S}_{jx}\right)}^{2}\#\left(7\right)\end{array}$$
where \(\:\epsilon\:\) is the regularization parameter. The first term preserves the initial labels (\(\:{Y}_{ix}\)), while the second term enforces consistency among the labels of adjacent nodes. The normalization term \(\:\frac{1}{{d}_{i}}\) is defined as:
$$\:\begin{array}{c}\frac{1}{{d}_{i}}=\frac{1}{{\sum\:}_{j}{J}_{ij}{M}_{ij}}\#\left(8\right)\end{array}$$
Furthermore, we define \(\:{M}^{1}\) as:
$$\:\begin{array}{c}{{M}^{1}}_{ij}=\frac{1}{2}\left(\frac{1}{{d}_{i}}+\frac{1}{{d}_{j}}\right){J}_{ij}{M}_{ij}\#\left(9\right)\end{array}$$
and its Laplacian matrix \(\:L\) is:
$$\:\begin{array}{c}L=DM-{M}^{1}\#\left(10\right)\end{array}$$
where \(\:DM\) is the diagonal degree matrix of \(\:{M}^{1}\). Minimizing \(\:P\left({S}_{x}\right)\) then yields the closed-form solution:
$$\:\begin{array}{c}S={\left(I+\epsilon\:L\right)}^{-1}Y\#\left(11\right)\end{array}$$
where \(\:S\) is the updated annotation matrix, \(\:I\in\:{R}^{T\times\:T}\) indicates an identity matrix, and \(\:Y\) represents the combination of the training set labels along with the initial predictions for the test set.
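Putting Eqs. (6)-(11) together, the diffusion step admits a compact NumPy sketch; the regularization parameter \(\:\epsilon\:\) used here is illustrative.

```python
import numpy as np

def label_diffusion(M, Y, eps=1.0):
    """Diffuse labels Y over the homology network M (Eqs. 6-11)."""
    M = np.asarray(M, dtype=float)
    # Eq. (6): Jaccard similarity of network neighbourhoods
    inter = M @ M.T
    row_sums = M.sum(axis=1)
    union = row_sums[:, None] + row_sums[None, :] - inter
    J = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

    # Eq. (8): per-node normalization 1/d_i
    JM = J * M
    s = JM.sum(axis=1)
    d_inv = np.divide(1.0, s, out=np.zeros_like(s), where=s > 0)

    # Eq. (9): normalized network M^1, and Eq. (10): its Laplacian
    M1 = 0.5 * (d_inv[:, None] + d_inv[None, :]) * JM
    L = np.diag(M1.sum(axis=1)) - M1

    # Eq. (11): closed-form solution S = (I + eps * L)^(-1) Y
    T = M.shape[0]
    return np.linalg.solve(np.eye(T) + eps * L, Y)
```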
Constructing the enzyme pocket from predicted enzyme active sites
The construction of the enzyme pocket involved two steps. First, the predicted enzyme active sites were clustered with k-means, where k was set to 2 empirically. To eliminate false positives, isolated points that formed their own cluster were removed. Second, using the \(\:{C}_{\alpha\:}\) coordinates, the enzyme pocket was defined as the region within 10 Å of the cluster center.
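A possible implementation of this procedure with scikit-learn's KMeans is sketched below; the single-residue filter is one plausible reading of the isolated-point removal described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pocket(active_site_ca, all_ca, radius=10.0):
    """active_site_ca: [M, 3] Cα coords of predicted active-site residues; all_ca: [N, 3]."""
    km = KMeans(n_clusters=2, n_init=10).fit(active_site_ca)
    labels, centers = km.labels_, km.cluster_centers_
    # drop clusters containing a single residue (isolated points, treated as false positives)
    valid = [c for c in range(2) if np.sum(labels == c) > 1]
    centers = centers[valid] if valid else centers
    dist = np.linalg.norm(all_ca[:, None, :] - centers[None, :, :], axis=-1)
    return np.where(dist.min(axis=1) <= radius)[0]    # residue indices within 10 Å of a center
```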
Implementation and evaluation
Five-fold cross-validation was performed on the training data: each time, the model was trained on four folds and validated on the remaining fold, and this procedure was repeated five times with the best model saved at each iteration. After training, several independent tests were used to assess model performance on the different tasks. In enzyme active site prediction, TS124 was employed to compare GraphEC-AS with other methods. The performance of GraphEC in predicting EC numbers was evaluated on NEW-392 and Price-149. To test the accuracy of GraphEC-pH in predicting the enzyme optimum pH, a new independent test set (Brenda-test) was built and two of the latest methods were evaluated on it. During testing, the average of the predictions from the five cross-validation models was used as the final prediction. Specifically, PyTorch 1.13.1 was used to construct the geometric graph network, which consists of a 3-layer GNN with 256 hidden units. The attention layer of GraphEC employs multi-head attention with 8 attention heads. The model was optimized with the Adam optimizer using a binary cross-entropy loss. Training was limited to a maximum of 35 epochs with early stopping (patience of 4), and a dropout rate of 0.1 was applied to prevent overfitting. To comprehensively evaluate model performance, AUC, AUPR, recall, precision, F1-score (F1), and Matthews correlation coefficient (MCC) were utilized, as defined in detail in Supplementary Evaluation metrics.
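The inference-time ensembling can be summarized as follows; the model objects and graph inputs are placeholders for the trained cross-validation models and featurized proteins.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, graph):
    """Average the outputs of the five cross-validation models on one protein graph."""
    preds = [model(graph) for model in models]     # each: per-task prediction tensor
    return torch.stack(preds, dim=0).mean(dim=0)   # final averaged prediction
```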