2.1 Dataset
We used the dataset developed for DeepSuccinylSite20 to train and test our approach. This dataset consists of experimentally verified succinylation sites provided by Hasan et al.15, originally obtained from the UniProtKB/Swiss-Prot36 and NCBI protein sequence databases. The sequences were first subjected to redundancy removal using the CD-HIT37 algorithm with a similarity cut-off of 0.3 against all other proteins in the dataset, resulting in 5009 succinylated sites from 2322 protein sequences. The sequences were then randomly separated into a training set of 2192 proteins and a test set of 124 proteins, containing 4755 and 254 succinylation sites, respectively. Note that the input to ProtT5 is the full protein sequence, whereas the input to the supervised word embedding is a window sequence created around the central residue (positive or negative site) by flanking it with an equal number of residues on the left and the right.
Since five of the positive succinylation sites were near the N- or C-terminus of their protein, we were unable to extract a window of size 33 (the optimal window size identified in the DeepSuccinylSite paper) around those sites. We therefore excluded these five sites, resulting in 4750 succinylation sites. After fixing the positive training and test sequences, negative sets were extracted from the same protein sequences. Specifically, all lysine (K) residues in these proteins that were not annotated as succinylated (i.e., positive sites) were considered negative succinylation sites. This resulted in 50,565 negative sites in the training set and 2977 negative sites in the test set. To deal with the imbalanced training set, we performed random undersampling of the negative training set to obtain the same number of negative sites (4750) as positive sites. The independent test set was kept as is. The final number of sites in the dataset is shown in Table 1, and the dataset is provided in the GitHub repository at https://github.com/KCLabMTU/LMSuccSite. It is worth noting that some existing approaches instead retain the imbalanced training set and use threshold-moving to handle class imbalance.18,38
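As an illustration of this preprocessing, the sketch below extracts 33-residue windows around lysine residues and randomly undersamples the negative class; the function and variable names are our own and are not taken from the LMSuccSite repository.

```python
import random

WINDOW = 33
FLANK = WINDOW // 2  # 16 residues on each side of the central K

def extract_window(sequence, pos):
    """Return the 33-residue window centered on 0-based position `pos`,
    or None if the site is too close to the N- or C-terminus."""
    if pos < FLANK or pos + FLANK >= len(sequence):
        return None
    return sequence[pos - FLANK : pos + FLANK + 1]

def build_sites(sequence, positive_positions):
    """Split the K residues of one protein into positive and negative windows."""
    positives, negatives = [], []
    for pos, residue in enumerate(sequence):
        if residue != "K":
            continue
        window = extract_window(sequence, pos)
        if window is None:
            continue  # skipped, as done for the five terminal positive sites
        (positives if pos in positive_positions else negatives).append(window)
    return positives, negatives

def undersample(negatives, n_positives, seed=42):
    """Randomly undersample negatives to match the number of positive sites."""
    random.seed(seed)
    return random.sample(negatives, n_positives)
```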
Table 1. Dataset description of the training and independent test datasets

| Dataset Type | Positive (Succinylated) | Negative (Non-Succinylated) |
| --- | --- | --- |
| Training Data | 4750 | 50,565 |
| Training Data (after balancing) | 4750 | 4750 |
| Benchmark Independent Test Data | 254 | 2977 |
2.2 Feature Encoding
Since ML/DL models can only operate on numerical inputs, protein sequences must be converted into vectorized representations (i.e., feature vectors). The quality of these features strongly determines the robustness of the resulting predictive models. We therefore leveraged two encoding approaches to identify the best representation of succinylation sites.
Similar to DeepSuccinylSite, the first approach uses a representation based on the local interactions of amino acids in the vicinity of K residues, which is achieved by supervised word embedding (obtained via Keras’s embedding layer)20. In this strategy, the features take into account the influence of upstream and downstream flanking residues within the window sequence (or peptide) centered on the site of interest. The second encoding approach employs a pLM-based transformer model called ProtT532 that extracts a contextualized representation (also known as embeddings) of the site of interest (i.e., the K residue) from the full protein sequence. We extracted embeddings only from the encoder side of ProtT5 in half-precision, as previous work showed that these outperform embeddings from ProtT5’s decoder side. Note that both of these methods take protein sequences directly as input (supervised word embedding takes a window sequence, and ProtT5 takes the full protein sequence), eliminating the need for handcrafted or manually extracted features. Each feature encoding is described in detail below.
Supervised word embedding
The first type of feature used in this work captures local information via supervised word embedding obtained using Keras’s embedding layer, similar to the embedding encoding in DeepSuccinylSite20. Essentially, a window sequence centered on the site of interest (i.e., a K residue), with an equal number of residues upstream and downstream, is taken as input. Based on a comparison of various window sizes using 10-fold cross-validation, a window size of 33 was chosen for subsequent development; this matches the window size identified during the development of DeepSuccinylSite.
This supervised word embedding is able to capture the local relationships between amino acids within a fixed-sized window. Embedding layers in Keras work by treating peptides as documents and the individual amino acids within each peptide as words. Initially, each amino acid in a peptide is represented by a unique integer and the embedding layer is initialized with random weights; hence, a peptide of length n is represented by a vector of length n integers. The output is a dense vector of dimension n × m, where m is the embedding dimension (here set equal to the vocabulary size). The embedding weights are updated during training via backpropagation. This supervised learning of the embedding from a set of protein sequences allows the network to learn the semantic similarity of amino acids within the embedded vector space. Furthermore, each amino acid is mapped to its own vector in this space, preserving the semantic quality of individual amino acids. In practice, the shape of the output vector is a hyperparameter set before training. As in DeepSuccinylSite, we set the vocabulary size to 21 to account for the 20 canonical amino acids plus one virtual amino acid. Therefore, for a given peptide of length 33, the supervised word embedding returns a dense vector of shape (33, 21).
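A minimal sketch of this encoding is shown below, assuming a 21-symbol vocabulary (20 canonical amino acids plus a virtual residue "-") and an embedding dimension of 21 to reproduce the (33, 21) output shape; the function and variable names are illustrative only.

```python
import numpy as np
from tensorflow.keras.layers import Embedding, Input
from tensorflow.keras.models import Model

VOCAB = "ACDEFGHIKLMNPQRSTVWY-"        # 20 canonical amino acids + virtual residue
AA_TO_INT = {aa: i for i, aa in enumerate(VOCAB)}
WINDOW, EMB_DIM = 33, 21

def encode_peptide(peptide):
    """Map a 33-residue window to a vector of 33 integer indices."""
    return np.array([AA_TO_INT.get(aa, AA_TO_INT["-"]) for aa in peptide])

# Trainable embedding layer: (batch, 33) integer indices -> (batch, 33, 21) dense vectors.
inputs = Input(shape=(WINDOW,), dtype="int32")
embedded = Embedding(input_dim=len(VOCAB), output_dim=EMB_DIM)(inputs)
embedder = Model(inputs, embedded)

window_ints = encode_peptide("M" * 16 + "K" + "A" * 16)   # toy 33-mer centered on K
print(embedder.predict(window_ints[None, :]).shape)        # (1, 33, 21)
```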
ProtT5 encoding
Transfer learning is a promising machine learning methodology that transfers knowledge learned on data-rich problems to similar data-limited problems. Protein language models learn summary representations that distill knowledge from large datasets, and this knowledge can improve downstream function prediction through transfer learning. The second type of feature used in this work is based on a pLM that captures global contextual information. Unlike supervised word embedding, the inputs to these models are full-length protein sequences (no window), and the models generate embeddings for all residues in a protein. The pLM we utilized is ProtT532, which was pre-trained in a self-supervised manner on the UniRef50 dataset of roughly 45 × 10⁶ protein sequences. In more detail, ProtT5 was trained by teacher forcing, i.e., inputs and targets were both fed to the model, with the inputs being corrupted protein sequences and the targets being identical to the inputs but shifted to the right (span generation with a span size of 1). It is based on Google’s t5-3b model35, which consists of a 24-layer encoder-decoder architecture with approximately 2.8 × 10⁹ learnable parameters.
In our work, we used the pretrained ProtT532 model to encode these features. The model takes the full protein sequence as input and returns an embedding vector of dimension 1024 for each amino acid. Notably, ProtT5 is a context-dependent encoding approach. Hence, for succinylation site prediction we used the embedding vector (length 1024) corresponding to the site of interest (in our case, the K residue).
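The sketch below illustrates one way to extract such per-residue embeddings with the Hugging Face transformers library, using the half-precision encoder-only checkpoint Rostlab/prot_t5_xl_half_uniref50-enc; this is our illustration of the general procedure, not the authors' exact extraction script.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"   # encoder-only ProtT5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
# Keep half precision on GPU; fall back to float32 on CPU.
model = T5EncoderModel.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device).eval()

def prott5_site_embedding(sequence, site_index):
    """Return the 1024-dimensional ProtT5 embedding of the residue at
    0-based position `site_index` (e.g., a K residue) in `sequence`."""
    # ProtT5 expects space-separated residues; rare amino acids are mapped to X.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
    encoded = tokenizer(prepared, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state   # (1, seq_len + 1, 1024)
    return hidden[0, site_index].cpu().numpy()        # embedding of the site of interest

emb = prott5_site_embedding("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", site_index=1)
print(emb.shape)  # (1024,)
```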
2.3 Model Architecture
Our proposed ensemble deep learning model, termed LMSuccSite, is designed to combine knowledge from the supervised word embedding trained on the succinylation dataset (herein referred to as the embedding module) with knowledge from a protein language model pre-trained on a large dataset of proteins (obtained via ProtT5). A 2D-CNN-based architecture was used for the supervised embedding features, while an artificial neural network (ANN)-based module was used for the ProtT5 features. Finally, an ANN-based meta-classifier was trained on feature sets generated from the outputs of the individual classifiers. Below, we describe the model architecture in detail.
Embedding Module
The input to the embedding module is the output of the supervised word embedding, a dense vector of shape (33, 21), where 33 is the window size and 21 is the vocabulary size. This module consists of a 2D convolutional layer, a 2D max-pooling layer, and a fully connected block (a flatten layer, a dense layer, and an output layer). Two dropout layers were added to the network to avoid overfitting. A rectified linear unit (ReLU) was used as the activation function throughout the network due to its representational sparsity and computational efficiency. The architecture of the embedding module is shown in Figure 1. The network parameters were optimized using the Adam optimizer, owing to its combined benefits of adaptive gradient descent and root mean square propagation, and binary cross-entropy (log loss) was used as the loss function. All the optimal hyperparameters used in the module are reported in Supplementary Table S2. This architecture was chosen based on 10-fold cross-validation on the training set, comparing different architectures and hyperparameter combinations via grid search.
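A hedged Keras sketch of such an embedding module follows; the filter count, kernel size, pooling size, and dropout rates are placeholders, since the optimal values are listed in Supplementary Table S2 rather than in the text, while the 16-unit penultimate dense layer matches the meta-feature size described later.

```python
from tensorflow.keras import layers, models

def build_embedding_module(window=33, vocab=21, emb_dim=21):
    """2D-CNN over the supervised word embedding of a 33-residue window.
    Layer sizes are illustrative placeholders, not the published hyperparameters."""
    inputs = layers.Input(shape=(window,), dtype="int32")
    x = layers.Embedding(input_dim=vocab, output_dim=emb_dim)(inputs)
    x = layers.Reshape((window, emb_dim, 1))(x)            # add channel axis for Conv2D
    x = layers.Conv2D(32, kernel_size=(17, 3), activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Dropout(0.4)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(16, activation="relu", name="embedding_penultimate")(x)
    x = layers.Dropout(0.4)(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    model = models.Model(inputs, outputs, name="embedding_module")
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```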
ProtT5 Module
This module takes as input the global contextual features extracted from the pLM ProtT5 for the residue of interest (i.e., the K residue). The feature dimension is 1024, as explained above. The ProtT5 module is based on an ANN architecture consisting of two hidden layers of sizes 256 and 128, each followed by a dropout layer. The architecture of the ProtT5-based module is shown in Figure 2. This architecture was chosen based on 10-fold cross-validation on the training set, comparing different machine learning and deep learning architectures and hyperparameter combinations via grid search. ReLU activation was used for both hidden layers, and the model was optimized using Adam with binary cross-entropy loss. The parameters associated with this module are described in Supplementary Table S3. Similar to the work of Villegas-Morcello et al.39 and Weissenow et al.,23 we observed that these pLM-based features do not require a complex architecture to obtain competitive performance.
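A corresponding Keras sketch is given below; the dropout rate is an assumed placeholder, while the 1024-dimensional input and the 256/128 hidden layers follow the description above.

```python
from tensorflow.keras import layers, models

def build_prott5_module(input_dim=1024, dropout_rate=0.3):
    """Simple ANN over the 1024-d ProtT5 embedding of the K residue.
    The dropout rate is a placeholder; see Supplementary Table S3 for the paper's values."""
    inputs = layers.Input(shape=(input_dim,))
    x = layers.Dense(256, activation="relu")(inputs)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Dense(128, activation="relu", name="prott5_penultimate")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    model = models.Model(inputs, outputs, name="prott5_module")
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```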
Meta-classifier
To combine the classification capabilities of the two techniques, the embedding module and the ProtT5 module were combined using an ANN as a meta-classifier. Rather than using the final output of each individual module, we concatenated the learned features from the second-to-last layer of each module (i.e., its last hidden layer). During training, we paid special attention to avoiding data leakage. To find the optimal architecture for the meta-classifier, we performed 10-fold cross-validation on the training set, ensuring that no information leaked from the target data into the training set. In a classical stacked generalization methodology, the meta-classifier may learn only from the predictions of the individual (base) models, which can result in data leakage and overestimation of classification performance40.
First, all layers of the two base modules (i.e., the embedding and ProtT5 modules) were frozen, and the meta-features were obtained by concatenating the outputs of the second-to-last layers of both modules. Accordingly, the input size of the meta-classifier was 144 (16 from the embedding module and 128 from the ProtT5 module). The meta-classifier is based on a simple feed-forward neural network (NN) architecture, chosen via 10-fold cross-validation experiments with different combinations of hyperparameters (e.g., number of hidden layers, number of neurons per layer, regularization parameters) using grid search. The hyperparameters used in the meta-classifier are provided in Supplementary Table S4. Furthermore, underfitting and overfitting in each module were carefully controlled using early stopping.
The final architecture of the meta-classifier is simple, consisting of two hidden layers with ReLU activation and an output layer with softmax activation, as shown in Figure 3. Like the base modules, the meta-classifier was optimized using Adam with binary cross-entropy loss.
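Under the assumptions of the two module sketches above, the meta-classifier could be assembled as follows; the hidden-layer widths are placeholders, whereas the frozen base modules and the 144-dimensional concatenated input (16 + 128) follow the description above.

```python
from tensorflow.keras import layers, models

def build_meta_classifier(embedding_module, prott5_module, hidden_sizes=(64, 32)):
    """Stack the frozen base modules and train an ANN on their concatenated
    second-to-last layer outputs (16 + 128 = 144 features).
    Hidden-layer widths are placeholders; see Supplementary Table S4."""
    embedding_module.trainable = False
    prott5_module.trainable = False

    # Re-use the frozen base modules up to their penultimate (last hidden) layers.
    emb_features = models.Model(
        embedding_module.input,
        embedding_module.get_layer("embedding_penultimate").output)
    t5_features = models.Model(
        prott5_module.input,
        prott5_module.get_layer("prott5_penultimate").output)

    window_in = layers.Input(shape=(33,), dtype="int32")
    prott5_in = layers.Input(shape=(1024,))
    merged = layers.Concatenate()([emb_features(window_in), t5_features(prott5_in)])  # 144-d

    x = merged
    for units in hidden_sizes:
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)

    model = models.Model([window_in, prott5_in], outputs, name="meta_classifier")
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```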
2.4 Performance Evaluation
To evaluate the performance of the aforementioned models, we used the standard confusion matrix for binary classification, which consists of four components: True Positives (TP), the number of positive sites predicted as succinylated; True Negatives (TN), the number of negative sites predicted as non-succinylated; False Positives (FP), the number of negative sites predicted as succinylated; and False Negatives (FN), the number of positive sites predicted as non-succinylated. From these four components, evaluation metrics such as Accuracy (ACC), Matthews Correlation Coefficient (MCC), Sensitivity (Sn), Specificity (Sp), and geometric mean (g-mean) were calculated for each experiment (defined in Supplementary Table S1). We also used the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (PrAUC) to further evaluate the discriminating ability of the models.
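These metrics can be computed from predicted probabilities as in the sketch below; the g-mean is taken here as the geometric mean of sensitivity and specificity, which is an assumption since the exact definitions are given in Supplementary Table S1.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute ACC, MCC, Sn, Sp, g-mean, AUROC, and PrAUC from
    true binary labels and predicted positive-class probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Sn": sn,
        "Sp": sp,
        "g-mean": np.sqrt(sn * sp),          # assumed definition of g-mean
        "AUROC": roc_auc_score(y_true, y_prob),
        "PrAUC": average_precision_score(y_true, y_prob),
    }
```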
2.5 Further analysis of the proposed model
To further investigate the proposed model, we performed t-distributed stochastic neighbor embedding (t-SNE) visualization of the learned features and an ablation study.
t-SNE visualization of learned features
The feature vectors obtained from the final hidden layer can be projected into a lower-dimensional latent Cartesian space to visualize information available in the high-dimensional space. Towards this end, proteins or residues can be colored according to a given label; the clearer the distinction between the classes, the more readily that information can be read from the sequence representation. In this regard, we used the t-SNE algorithm (with a perplexity of 30 and a learning rate of 100), which robustly captures nonlinear signals within the data and thereby improves the visible boundary of separation. First, the features generated by the embedding module and the ProtT5 module were visualized in 2-dimensional scatter plots using t-SNE to elucidate their respective boundaries of separation between succinylated and non-succinylated sites. Then, the features learned by the ensemble model were projected using t-SNE to investigate whether the visible boundary of separation improved, which would indicate the usefulness of the features obtained from the pLM.
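A minimal sketch of this visualization with scikit-learn and matplotlib is shown below, using the perplexity and learning rate stated above; `features` and `labels` are placeholder arrays standing in for the penultimate-layer outputs and the site annotations.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project learned features (n_samples, n_features) to 2-D with t-SNE
    and color points by succinylation label (1 = succinylated, 0 = not)."""
    labels = np.asarray(labels)
    coords = TSNE(n_components=2, perplexity=30, learning_rate=100,
                  random_state=0).fit_transform(features)
    for label, color, name in [(1, "tab:red", "Succinylated"),
                               (0, "tab:blue", "Non-succinylated")]:
        mask = labels == label
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, c=color, label=name)
    plt.title(title)
    plt.legend()
    plt.show()
```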
Sensitivity analysis with respect to data size (Ablation study)
Sensitivity analysis is an intuitive technique for quantifying the performance of a model under different inputs or varying environments. It can be performed through what-if analyses that show how changes in the input or other configuration affect the outcome. In this study, we analyzed how model performance changes as the size of the training data gradually increases.
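One way to set up such an analysis is sketched below: the model is retrained on progressively larger stratified subsets of the training data and evaluated on the fixed independent test set; the fractions and the `train_and_evaluate` function are illustrative placeholders rather than the authors' exact protocol.

```python
from sklearn.model_selection import train_test_split

def data_size_sensitivity(X_train, y_train, X_test, y_test, train_and_evaluate,
                          fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=42):
    """Retrain on stratified subsets of the training data of increasing size
    and record test performance; `train_and_evaluate` is a user-supplied
    function returning a metric dict (e.g., MCC, AUROC)."""
    results = {}
    for frac in fractions:
        if frac < 1.0:
            X_sub, _, y_sub, _ = train_test_split(
                X_train, y_train, train_size=frac, stratify=y_train, random_state=seed)
        else:
            X_sub, y_sub = X_train, y_train
        results[frac] = train_and_evaluate(X_sub, y_sub, X_test, y_test)
    return results
```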