2.1 Dataset
We used the dataset developed for DeepSuccinylSite20 to train and test our approach. This dataset consists of experimentally verified succinylation sites provided by Hasan et al.15, originally obtained from the UniProtKB/Swiss-Prot36 and NCBI protein sequence databases. The sequences were first subjected to redundancy removal using the CD-HIT37 algorithm with a similarity cut-off of 0.3 against all other proteins in the dataset, resulting in 5009 succinylated sites from 2322 protein sequences. The sequences were then randomly separated into a training set of 2192 proteins and a test set of 124 proteins, containing 4755 and 254 succinylation sites, respectively. Note that the input to ProtT5 is the full protein sequence, whereas the input to the supervised word embedding is a window sequence created around the central residue (positive or negative site) by flanking it with an equal number of residues on the left and the right.
Since five of the positive succinylation sites were near the N- or C-terminus of their protein, we were unable to extract a window of size 33 (the optimal window size identified in the DeepSuccinylSite paper) around those sites. We therefore excluded these five sites, resulting in 4750 succinylation sites. After fixing the positive training and test sequences, negative sets were extracted from the same protein sequences. Specifically, all lysine (K) residues in these proteins that were not annotated as succinylated (i.e., positive sites) were considered negative succinylation sites. This resulted in 50,565 negative sites in the training set and 2977 negative sites in the test set. To deal with the imbalanced training set, we performed random undersampling of the negative training set to obtain the same number of negative sites (4750) as positive sites. The independent test set was kept as is. The final number of sites in the dataset is shown in Table 1, and the dataset is provided in the GitHub repository at https://github.com/KCLabMTU/LMSuccSite. It is worth noting that some existing approaches instead retain the imbalanced training set and use threshold-moving to handle class imbalance.18,38
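As an illustration of this preprocessing, the sketch below extracts 33-residue windows around lysine residues and randomly undersamples the negative class; the function and variable names are our own and are not taken from the LMSuccSite repository.

```python
import random

WINDOW = 33
FLANK = WINDOW // 2  # 16 residues on each side of the central K

def extract_window(sequence, pos):
    """Return the 33-residue window centered on 0-based position `pos`,
    or None if the site is too close to the N- or C-terminus."""
    if pos < FLANK or pos + FLANK >= len(sequence):
        return None
    return sequence[pos - FLANK : pos + FLANK + 1]

def build_sites(sequence, positive_positions):
    """Split the K residues of one protein into positive and negative windows."""
    positives, negatives = [], []
    for pos, residue in enumerate(sequence):
        if residue != "K":
            continue
        window = extract_window(sequence, pos)
        if window is None:
            continue  # skipped, as done for the five terminal positive sites
        (positives if pos in positive_positions else negatives).append(window)
    return positives, negatives

def undersample(negatives, n_positives, seed=42):
    """Randomly undersample negatives to match the number of positive sites."""
    random.seed(seed)
    return random.sample(negatives, n_positives)
```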
Table 1. Dataset description of the training and independent test datasets

| Dataset Type | Positive (Succinylated) | Negative (Non-Succinylated) |
| --- | --- | --- |
| Training Data | 4750 | 50,565 |
| Training Data (after balancing) | 4750 | 4750 |
| Benchmark Independent Test Data | 254 | 2977 |
2.2 Feature Encoding
Since ML/DL models can only operate on numerical inputs, protein sequences must be converted into vectorized representations (i.e., feature vectors). The quality of these features strongly determines the robustness of the resulting predictive models. We therefore leveraged two encoding approaches to identify the best representation of succinylation sites.
Similar to DeepSuccinylSite, the first approach uses a representation based on the local interactions of amino acids in the vicinity of K residues, which is achieved by supervised word embedding (obtained via Keras’s embedding layer)20. In this strategy, the features take into account the influence of upstream and downstream flanking residues within the window sequence (or peptide) centered on the site of interest. The second encoding approach employs a pLM-based transformer model called ProtT532 that extracts a contextualized representation (also known as embeddings) of the site of interest (i.e., the K residue) from the full protein sequence. We extracted embeddings only from the encoder side of ProtT5 in half-precision, as previous work showed that these outperform embeddings from ProtT5’s decoder side. Note that both of these methods take protein sequences directly as input (supervised word embedding takes a window sequence, and ProtT5 takes the full protein sequence), eliminating the need for handcrafted or manually extracted features. Each feature encoding is described in detail below.
Supervised word embedding
The first type of feature used in this work captures local information via supervised word embedding obtained using Keras’s embedding layer, similar to the embedding encoding in DeepSuccinylSite20. Essentially, a window sequence centered on the site of interest (i.e., a K residue), with an equal number of residues upstream and downstream, is taken as input. Based on a comparison of various window sizes using 10-fold cross-validation, a window size of 33 was chosen for subsequent development; this matches the window size identified during the development of DeepSuccinylSite.
This supervised word embedding is able to capture the local relationships between amino acids within a fixed-sized window. Embedding layers in Keras work by treating peptides as documents and the individual amino acids within each peptide as words. Initially, each amino acid in a peptide is represented by a unique integer and the embedding layer is initialized with random weights; hence, a peptide of length n is represented by a vector of length n integers. The output is a dense vector of dimension n × m, where m is the embedding dimension (here set equal to the vocabulary size). The embedding weights are updated during training via backpropagation. This supervised learning of the embedding from a set of protein sequences allows the network to learn the semantic similarity of amino acids within the embedded vector space. Furthermore, each amino acid is mapped to its own vector in this space, preserving the semantic quality of individual amino acids. In practice, the shape of the output vector is a hyperparameter set before training. As in DeepSuccinylSite, we set the vocabulary size to 21 to account for the 20 canonical amino acids plus one virtual amino acid. Therefore, for a given peptide of length 33, the supervised word embedding returns a dense vector of shape (33, 21).
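A minimal sketch of this encoding is shown below, assuming a 21-symbol vocabulary (20 canonical amino acids plus a virtual residue "-") and an embedding dimension of 21 to reproduce the (33, 21) output shape; the function and variable names are illustrative only.

```python
import numpy as np
from tensorflow.keras.layers import Embedding, Input
from tensorflow.keras.models import Model

VOCAB = "ACDEFGHIKLMNPQRSTVWY-"        # 20 canonical amino acids + virtual residue
AA_TO_INT = {aa: i for i, aa in enumerate(VOCAB)}
WINDOW, EMB_DIM = 33, 21

def encode_peptide(peptide):
    """Map a 33-residue window to a vector of 33 integer indices."""
    return np.array([AA_TO_INT.get(aa, AA_TO_INT["-"]) for aa in peptide])

# Trainable embedding layer: (batch, 33) integer indices -> (batch, 33, 21) dense vectors.
inputs = Input(shape=(WINDOW,), dtype="int32")
embedded = Embedding(input_dim=len(VOCAB), output_dim=EMB_DIM)(inputs)
embedder = Model(inputs, embedded)

window_ints = encode_peptide("M" * 16 + "K" + "A" * 16)   # toy 33-mer centered on K
print(embedder.predict(window_ints[None, :]).shape)        # (1, 33, 21)
```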
ProtT5 encoding
Transfer learning is a promising machine learning methodology that transfers knowledge learned on data-rich problems to similar data-limited problems. Protein language models learn summary representations that distill knowledge from large datasets, and this knowledge can improve downstream function prediction through transfer learning. The second type of feature used in this work is based on a pLM that captures global contextual information. Unlike supervised word embedding, the inputs to these models are full-length protein sequences (no window), and the models generate embeddings for all residues in a protein. The pLM we utilized is ProtT532, which was pre-trained in a self-supervised manner on the UniRef50 dataset of roughly 45 × 10⁶ protein sequences. In more detail, ProtT5 was trained by teacher forcing, i.e., inputs and targets were both fed to the model, with the inputs being corrupted protein sequences and the targets being identical to the inputs but shifted to the right (span generation with a span size of 1). It is based on Google’s t5-3b model35, which consists of a 24-layer encoder-decoder architecture with approximately 2.8 × 10⁹ learnable parameters.
In our work, we used the pretrained ProtT532 model to encode these features. The model takes the full protein sequence as input and returns an embedding vector of dimension 1024 for each amino acid. Notably, ProtT5 is a context-dependent encoding approach. Hence, for succinylation site prediction we used the embedding vector (length 1024) corresponding to the site of interest (in our case, the K residue).
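The sketch below illustrates one way to extract such per-residue embeddings with the Hugging Face transformers library, using the half-precision encoder-only checkpoint Rostlab/prot_t5_xl_half_uniref50-enc; this is our illustration of the general procedure, not the authors' exact extraction script.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"   # encoder-only ProtT5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
# Keep half precision on GPU; fall back to float32 on CPU.
model = T5EncoderModel.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device).eval()

def prott5_site_embedding(sequence, site_index):
    """Return the 1024-dimensional ProtT5 embedding of the residue at
    0-based position `site_index` (e.g., a K residue) in `sequence`."""
    # ProtT5 expects space-separated residues; rare amino acids are mapped to X.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
    encoded = tokenizer(prepared, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state   # (1, seq_len + 1, 1024)
    return hidden[0, site_index].cpu().numpy()        # embedding of the site of interest

emb = prott5_site_embedding("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", site_index=1)
print(emb.shape)  # (1024,)
```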
2.3 Model Architecture
Our proposed ensemble deep learning model, termed LMSuccSite, is designed to combine knowledge from the supervised word embedding trained on the succinylation dataset (herein referred to as the embedding module) with knowledge from a protein language model pre-trained on a large dataset of proteins (obtained via ProtT5). A 2D-CNN-based architecture was used for the supervised embedding features, while an artificial neural network (ANN)-based module was used for the ProtT5 features. Finally, an ANN-based meta-classifier was trained on feature sets generated from the outputs of the individual classifiers. Below, we describe the model architecture in detail.
Embedding Module
The input to the embedding module is the output of the supervised word embedding, a dense vector of shape (33, 21), where 33 is the window size and 21 is the vocabulary size. This module consists of a 2D convolutional layer, a 2D max-pooling layer, and a fully connected block (a flatten layer, a dense layer, and an output layer). Two dropout layers were added to the network to avoid overfitting. A rectified linear unit (ReLU) was used as the activation function throughout the network due to its representational sparsity and computational efficiency. The architecture of the embedding module is shown in Figure 1. The network parameters were optimized using the Adam optimizer, owing to its combined benefits of adaptive gradient descent and root mean square propagation, and binary cross-entropy (log loss) was used as the loss function. All the optimal hyperparameters used in the module are reported in Supplementary Table S2. This architecture was chosen based on 10-fold cross-validation on the training set, comparing different architectures and hyperparameter combinations via grid search.
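A hedged Keras sketch of such an embedding module follows; the filter count, kernel size, pooling size, and dropout rates are placeholders, since the optimal values are listed in Supplementary Table S2 rather than in the text, while the 16-unit penultimate dense layer matches the meta-feature size described later.

```python
from tensorflow.keras import layers, models

def build_embedding_module(window=33, vocab=21, emb_dim=21):
    """2D-CNN over the supervised word embedding of a 33-residue window.
    Layer sizes are illustrative placeholders, not the published hyperparameters."""
    inputs = layers.Input(shape=(window,), dtype="int32")
    x = layers.Embedding(input_dim=vocab, output_dim=emb_dim)(inputs)
    x = layers.Reshape((window, emb_dim, 1))(x)            # add channel axis for Conv2D
    x = layers.Conv2D(32, kernel_size=(17, 3), activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Dropout(0.4)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(16, activation="relu", name="embedding_penultimate")(x)
    x = layers.Dropout(0.4)(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    model = models.Model(inputs, outputs, name="embedding_module")
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```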
ProtT5 Module
This module takes as input the global contextual features extracted from the pLM ProtT5 for the residue of interest (i.e., the K residue). The feature dimension is 1024, as explained above. The ProtT5 module is based on an ANN architecture consisting of two hidden layers of sizes 256 and 128, each followed by a dropout layer. The architecture of the ProtT5-based module is shown in Figure 2. This architecture was chosen based on 10-fold cross-validation on the training set, comparing different machine learning and deep learning architectures and hyperparameter combinations via grid search. ReLU activation was used for both hidden layers, and the model was optimized using Adam with binary cross-entropy loss. The parameters associated with this module are described in Supplementary Table S3. Similar to the work of Villegas-Morcello et al.39 and Weissenow et al.,23 we observed that these pLM-based features do not require a complex architecture to obtain competitive performance.
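A corresponding Keras sketch is given below; the dropout rate is an assumed placeholder, while the 1024-dimensional input and the 256/128 hidden layers follow the description above.

```python
from tensorflow.keras import layers, models

def build_prott5_module(input_dim=1024, dropout_rate=0.3):
    """Simple ANN over the 1024-d ProtT5 embedding of the K residue.
    The dropout rate is a placeholder; see Supplementary Table S3 for the paper's values."""
    inputs = layers.Input(shape=(input_dim,))
    x = layers.Dense(256, activation="relu")(inputs)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Dense(128, activation="relu", name="prott5_penultimate")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    model = models.Model(inputs, outputs, name="prott5_module")
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```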
Meta-classifier
To combine the classification capabilities of the two techniques, the embedding module and the ProtT5 module were combined using an ANN as a meta-classifier. Rather than using the final output of each individual module, we concatenated the learned features from the second-to-last layer of each module (i.e., its last hidden layer). During training, we paid special attention to avoiding data leakage. To find the optimal architecture for the meta-classifier, we performed 10-fold cross-validation on the training set, ensuring that no information leaked from the target data into the training set. In a classical stacked generalization methodology, the meta-classifier may learn only from the predictions of the individual (base) models, which can result in data leakage and overestimation of classification performance40.
First, all layers of the two base modules (i.e., the embedding and ProtT5 modules) were frozen, and the meta-features were obtained by concatenating the outputs of the second-to-last layers of both modules. Accordingly, the input size of the meta-classifier was 144 (16 from the embedding module and 128 from the ProtT5 module). The meta-classifier is based on a simple feed-forward neural network (NN) architecture, chosen via 10-fold cross-validation experiments with different combinations of hyperparameters (e.g., number of hidden layers, number of neurons per layer, regularization parameters) using grid search. The hyperparameters used in the meta-classifier are provided in Supplementary Table S4. Furthermore, underfitting and overfitting in each module were carefully controlled using early stopping.
The final architecture of the meta-classifier is simple, consisting of two hidden layers with ReLU activation and an output layer with softmax activation, as shown in Figure 3. Like the base modules, the meta-classifier was optimized using Adam with binary cross-entropy loss.
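Under the assumptions of the two module sketches above, the meta-classifier could be assembled as follows; the hidden-layer widths are placeholders, whereas the frozen base modules and the 144-dimensional concatenated input (16 + 128) follow the description above.

```python
from tensorflow.keras import layers, models

def build_meta_classifier(embedding_module, prott5_module, hidden_sizes=(64, 32)):
    """Stack the frozen base modules and train an ANN on their concatenated
    second-to-last layer outputs (16 + 128 = 144 features).
    Hidden-layer widths are placeholders; see Supplementary Table S4."""
    embedding_module.trainable = False
    prott5_module.trainable = False

    # Re-use the frozen base modules up to their penultimate (last hidden) layers.
    emb_features = models.Model(
        embedding_module.input,
        embedding_module.get_layer("embedding_penultimate").output)
    t5_features = models.Model(
        prott5_module.input,
        prott5_module.get_layer("prott5_penultimate").output)

    window_in = layers.Input(shape=(33,), dtype="int32")
    prott5_in = layers.Input(shape=(1024,))
    merged = layers.Concatenate()([emb_features(window_in), t5_features(prott5_in)])  # 144-d

    x = merged
    for units in hidden_sizes:
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)

    model = models.Model([window_in, prott5_in], outputs, name="meta_classifier")
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```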
2.4 Performance Evaluation
To evaluate the performance of the aforementioned models, we used the standard confusion matrix for binary classification, which consists of four components: True Positives (TP), the number of positive sites predicted as succinylated; True Negatives (TN), the number of negative sites predicted as non-succinylated; False Positives (FP), the number of negative sites predicted as succinylated; and False Negatives (FN), the number of positive sites predicted as non-succinylated. From these four components, evaluation metrics such as Accuracy (ACC), Matthews Correlation Coefficient (MCC), Sensitivity (Sn), Specificity (Sp), and geometric mean (g-mean) were calculated for each experiment (defined in Supplementary Table S1). We also used the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (PrAUC) to further evaluate the discriminating ability of the models.
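These metrics can be computed from predicted probabilities as in the sketch below; the g-mean is taken here as the geometric mean of sensitivity and specificity, which is an assumption since the exact definitions are given in Supplementary Table S1.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute ACC, MCC, Sn, Sp, g-mean, AUROC, and PrAUC from
    true binary labels and predicted positive-class probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Sn": sn,
        "Sp": sp,
        "g-mean": np.sqrt(sn * sp),          # assumed definition of g-mean
        "AUROC": roc_auc_score(y_true, y_prob),
        "PrAUC": average_precision_score(y_true, y_prob),
    }
```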
2.5 Further analysis of the proposed model
To further investigate the proposed model, we performed t-distributed stochastic neighbor embedding (t-SNE) visualization of the learned features and an ablation study.
t-SNE visualization of learned features
The feature vectors obtained from the final hidden layer can be projected into a lower-dimensional latent Cartesian space to visualize information available in the high-dimensional space. Towards this end, proteins or residues can be colored according to a given label; the clearer the distinction between the classes, the more readily that information can be read from the sequence representation. In this regard, we used the t-SNE algorithm (with a perplexity of 30 and a learning rate of 100), which robustly captures nonlinear signals within the data and thereby improves the visible boundary of separation. First, the features generated by the embedding module and the ProtT5 module were visualized in 2-dimensional scatter plots using t-SNE to elucidate their respective boundaries of separation between succinylated and non-succinylated sites. Then, the features learned by the ensemble model were projected using t-SNE to investigate whether the visible boundary of separation improved, which would indicate the usefulness of the features obtained from the pLM.
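A minimal sketch of this visualization with scikit-learn and matplotlib is shown below, using the perplexity and learning rate stated above; `features` and `labels` are placeholder arrays standing in for the penultimate-layer outputs and the site annotations.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project learned features (n_samples, n_features) to 2-D with t-SNE
    and color points by succinylation label (1 = succinylated, 0 = not)."""
    labels = np.asarray(labels)
    coords = TSNE(n_components=2, perplexity=30, learning_rate=100,
                  random_state=0).fit_transform(features)
    for label, color, name in [(1, "tab:red", "Succinylated"),
                               (0, "tab:blue", "Non-succinylated")]:
        mask = labels == label
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, c=color, label=name)
    plt.title(title)
    plt.legend()
    plt.show()
```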
Sensitivity analysis with respect to data size (Ablation study)
Sensitivity analysis is an intuitive technique for quantifying the performance of a model under different inputs or varying environments. It can be performed through what-if analyses that show how changes in the input or other configuration affect the outcome. In this study, we analyzed how model performance changes as the size of the training data gradually increases.
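One way to set up such an analysis is sketched below: the model is retrained on progressively larger stratified subsets of the training data and evaluated on the fixed independent test set; the fractions and the `train_and_evaluate` function are illustrative placeholders rather than the authors' exact protocol.

```python
from sklearn.model_selection import train_test_split

def data_size_sensitivity(X_train, y_train, X_test, y_test, train_and_evaluate,
                          fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=42):
    """Retrain on stratified subsets of the training data of increasing size
    and record test performance; `train_and_evaluate` is a user-supplied
    function returning a metric dict (e.g., MCC, AUROC)."""
    results = {}
    for frac in fractions:
        if frac < 1.0:
            X_sub, _, y_sub, _ = train_test_split(
                X_train, y_train, train_size=frac, stratify=y_train, random_state=seed)
        else:
            X_sub, y_sub = X_train, y_train
        results[frac] = train_and_evaluate(X_sub, y_sub, X_test, y_test)
    return results
```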