Model Evaluation Metrics
We used seven evaluation metrics to quantitatively evaluate the results of three types of model training (a sequence-processing model based on BiLSTM, a data-classification model based on CNN, and a transfer-learning model based on a pretrained ResNet). We used t-distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction together with five clustering metrics to qualitatively assess the training results of two of these models (the CNN-based classification model and the pretrained-ResNet transfer-learning model).
The seven quantitative evaluation metrics were accuracy, precision, recall, F1 score, the Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (AUC-ROC), and the area under the precision-recall curve (AUPRC). The formulas for the first five are given in equations (1) through (5).
$${\text{Accuracy=}}\frac{{{\text{TP+TN}}}}{{{\text{TP+TN+FP+FN}}}}$$
1
$${\text{Precision=}}\frac{{{\text{TP}}}}{{{\text{TP+FP}}}}$$
2
$${\text{Recall=}}\frac{{{\text{TP}}}}{{{\text{TP+FN}}}}$$
3
$${\text{F1-Score=}}\frac{{{\text{2*Precision*Recall}}}}{{{\text{Precision+Recall}}}}$$
4
$${\text{MCC=}}\frac{{{\text{TP*TN-FP*FN}}}}{{\sqrt {{\text{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}} }}$$
5
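As a concrete illustration, the five formula-based metrics above can be computed directly from confusion-matrix counts. The sketch below is a minimal, self-contained example; the function name and the counts are illustrative, not values from our experiments.

```python
import math

def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the five formula-based metrics (Eqs. 1-5) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Illustrative counts only
print(binary_metrics(tp=86, tn=80, fp=14, fn=20))
```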
The five clustering metrics are Homogeneity, Completeness, the Silhouette Coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index, as shown in formulas (6) to (10).
$${\text{Homogeneity=1-}}\frac{{{\text{H(C|K)}}}}{{{\text{H(C)}}}}$$
6
$${\text{Completeness=1-}}\frac{{{\text{H(K|C)}}}}{{{\text{H(K)}}}}$$
7
$${\text{Silhouette=}}\frac{{{\text{b-a}}}}{{{\text{max(a,b)}}}}$$
8
$${\text{Calinski-Harabasz=}}\frac{{{\text{Tr(}}{{\text{B}}_{\text{k}}}{\text{)}}}}{{{\text{Tr(}}{{\text{W}}_{\text{k}}}{\text{)}}}}{\text{*}}\frac{{{\text{N-k}}}}{{{\text{k-1}}}}$$
9
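Formula (10), the Davies-Bouldin index, takes its standard form, where k is the number of clusters, $s_i$ is the average distance between the points of cluster i and its centroid, and $d_{ij}$ is the distance between the centroids of clusters i and j:
$${\text{Davies-Bouldin=}}\frac{1}{k}\sum\limits_{i=1}^{k} {\mathop {\max }\limits_{j \ne i} \frac{{{s_i}+{s_j}}}{{{d_{ij}}}}} $$
10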
Model training process and evaluation results
In this study, each training classification process used datasets that included image data, text sequence data, and corresponding label data. We loaded these preprocessed data using customized dataset classes for each model and completed the final processing for model input, including necessary transformation operations. The entire dataset was divided into a training set and a test set, with 80% of the data used for training and the remaining 20% used for testing.
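A minimal sketch of this loading and splitting step is shown below; the dataset class, tensor shapes, and batch size are illustrative assumptions, not the exact classes used in this study.

```python
import torch
from torch.utils.data import Dataset, random_split, DataLoader

class PPIDataset(Dataset):
    """Hypothetical dataset wrapping preprocessed features and binary labels."""
    def __init__(self, features: torch.Tensor, labels: torch.Tensor):
        self.features = features   # e.g. image tensors or encoded sequences
        self.labels = labels       # 0 / 1 interaction-site labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Illustrative random data standing in for the preprocessed inputs
full = PPIDataset(torch.randn(1000, 3, 64, 64), torch.randint(0, 2, (1000,)))

# 80% / 20% train-test split, as used in this study
n_train = int(0.8 * len(full))
train_set, test_set = random_split(full, [n_train, len(full) - n_train])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32)
```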
The three model training processes used the same optimizer, Adam, with different loss functions: CrossEntropyLoss, BCEWithLogitsLoss, and BCELoss. Each process trained its model for 10 to 15 epochs; during each epoch we calculated the average loss over the whole training set and monitored performance on the test set.
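Continuing the sketch above, a minimal training loop under these settings could look as follows; the stand-in model and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()   # BCEWithLogitsLoss / BCELoss for the other models

for epoch in range(10):             # 10-15 epochs depending on the model
    model.train()
    running_loss = 0.0
    for x, y in train_loader:       # train_loader from the previous sketch
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * x.size(0)
    avg_loss = running_loss / len(train_loader.dataset)
    print(f"epoch {epoch + 1}: average training loss = {avg_loss:.4f}")
```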
At the end of each training cycle, model performance was assessed on the test set. Evaluation metrics included accuracy, precision, recall, F1 score, AUC-ROC, AUPRC, and MCC; together these indicators comprehensively reflect model performance in binary classification tasks. In addition to these numerical indicators, dimensionality reduction via t-SNE was employed to visualize the feature distributions learned by the models; the results are shown in Fig. 2, Fig. 3, and Fig. 4.
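The per-epoch evaluation and the t-SNE visualization can be sketched as follows, continuing the sketches above and using scikit-learn metric functions; taking the model's logits as the visualized features is an illustrative assumption (an intermediate feature layer could be used instead).

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score,
                             matthews_corrcoef)

model.eval()
probs, preds, labels, feats = [], [], [], []
with torch.no_grad():
    for x, y in test_loader:
        logits = model(x)
        p = torch.softmax(logits, dim=1)[:, 1]
        probs.append(p); preds.append(p > 0.5); labels.append(y)
        feats.append(logits)        # or an intermediate feature layer
probs = torch.cat(probs).numpy(); preds = torch.cat(preds).numpy()
labels = torch.cat(labels).numpy(); feats = torch.cat(feats).numpy()

print("ACC  ", accuracy_score(labels, preds))
print("PRE  ", precision_score(labels, preds))
print("REC  ", recall_score(labels, preds))
print("F1   ", f1_score(labels, preds))
print("AUC  ", roc_auc_score(labels, probs))
print("AUPRC", average_precision_score(labels, probs))
print("MCC  ", matthews_corrcoef(labels, preds))

# t-SNE projection of the learned features, colored by true label
emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5)
plt.title("t-SNE of learned features")
plt.show()
```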
AngPPIs
In this study, we utilized a modified pretrained ResNet-18 model. Specifically, we retained the original model weights and replaced the fully connected layer of ResNet-18, setting the output dimension to 2 to correspond directly to the two categories. The model was fine-tuned over 10 training epochs and evaluated on multiple performance metrics, including accuracy, precision, recall, F1 score, AUC-ROC, AUPRC, and MCC.
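The modification described above, keeping the pretrained weights and replacing the fully connected layer with a two-way output, can be sketched with torchvision as follows; the learning rate is an illustrative assumption and the exact fine-tuning details in this study may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-18 and keep its weights
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final fully connected layer: output dimension 2 for the two classes
resnet.fc = nn.Linear(resnet.fc.in_features, 2)

# All layers remain trainable; fine-tune for 10 epochs with Adam (see the loop above)
optimizer = torch.optim.Adam(resnet.parameters(), lr=1e-4)
```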
Looking at the training results, the model's performance improved gradually with each training cycle. In the first cycle, accuracy was 0.7547; by the tenth cycle it had increased to 0.8623. The same trend is reflected in precision and recall, which rose from 0.7555 and 0.7546 in cycle one to 0.8674 and 0.8623 in cycle ten, respectively.
The F1 score, as the harmonic mean of precision and recall, showed a similar growth trend, increasing from 0.7547 in cycle one to 0.8621 by cycle ten.
Furthermore, AUC-ROC and AUPRC, both important indicators of predictive ability, rose from initial values of 0.8174 and 0.7842 to 0.9265 and 0.9084, respectively.
These changes further confirm that the model's ability to separate positive from negative samples improved steadily.
The Matthews Correlation Coefficient (MCC), as a comprehensive indicator of the performance of binary classification models, has grown from 0.5101 in cycle 1 to 0.7299 in cycle 10. The growth of MCC directly reflects the improvement of model prediction balance across categories, indicating that while maintaining category sensitivity, the model also maintains high prediction accuracy.
Overall, these improvements show that the model is gradually adapting and optimizing for complex image data classification problems during training. Through fine-tuning the structure of the model and appropriate training strategies, we can effectively use deep learning technology to handle high-dimensional image data and achieve accurate classification predictions.
To further evaluate the clustering performance of our model, we focused on five key indicators: Homogeneity, Completeness, Silhouette Score, the Calinski-Harabasz index, and the Davies-Bouldin index. These indicators reflect the quality of the clustering results and the model's unsupervised ability to organize unlabeled data.
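The five clustering indicators can be computed with scikit-learn once cluster assignments have been obtained from the learned features, for example with k-means. This is a minimal sketch under that assumption; the feature matrix here is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (homogeneity_score, completeness_score,
                             silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
features = rng.normal(size=(400, 16))        # learned feature vectors (synthetic here)
true_labels = rng.integers(0, 2, size=400)   # ground-truth classes

cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

print("Homogeneity      ", homogeneity_score(true_labels, cluster_labels))
print("Completeness     ", completeness_score(true_labels, cluster_labels))
print("Silhouette       ", silhouette_score(features, cluster_labels))
print("Calinski-Harabasz", calinski_harabasz_score(features, cluster_labels))
print("Davies-Bouldin   ", davies_bouldin_score(features, cluster_labels))
```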
Homogeneity and completeness both increased gradually, from low values of 0.1874 and 0.1875 in the first cycle to 0.4341 and 0.4363 in the tenth cycle, respectively. This increase indicates that the generated clusters gradually align with the real class labels, i.e., the clustering results become more consistent within classes while remaining separated between classes.
The silhouette coefficient increased from 0.4553 in cycle one to 0.6146 in cycle ten, showing a significant improvement in cluster compactness and stability. A silhouette coefficient closer to one means that intra-cluster distances are small relative to inter-cluster distances, indicating better clustering quality.
The Calinski-Harabasz index grew from 4122.4921 at cycle one to 8992.1156 at cycle ten; this significant growth suggests an increasing ratio of between-cluster separation to within-cluster dispersion, indicating more clearly defined clusters.
The Davies-Bouldin index decreased during training from 0.7696 at cycle one to 0.5104 at cycle ten; a decreasing Davies-Bouldin index implies improved clustering quality, because lower values indicate better cluster separation.
The overall improvement across all these clustering metrics suggests that although the model was initially trained for a supervised learning task, it also performs well on unsupervised clustering tasks. This result highlights the potential adaptability of deep learning models, especially when advanced architectures such as ResNet-18 are used in multi-task learning scenarios.
Through both qualitative and quantitative performance analysis, we have not only verified the effectiveness of the model in classification tasks, but also demonstrated its ability to provide insights into data structures. These achievements provide strong support for the wider application of this model.
[INSERT Fig. 2 HERE]
DisPPIs
We next analyzed in greater depth the results obtained with the CNN-based model (CNNModel), demonstrating its performance under two different experimental settings.
With CNNModel, performance improved steadily over 15 training cycles. Accuracy increased from 0.6926 to 0.8788, while precision and recall also rose significantly, from 0.6642 and 0.7790 to 0.8411 and 0.9340, respectively. These changes show that the model parameters and network architecture gradually adapted to the data features during training.
F1 score, AUC-ROC and AUPRC all showed significant improvements, indicating that the model performed excellently in balancing recall rate and precision while maintaining good sensitivity to discrimination thresholds.
In particular, AUC-ROC increased from 0.7681 to 0.9462, demonstrating a strong capability to distinguish between the two categories.
In terms of clustering evaluation, we observed increases in the Homogeneity and Completeness indicators, showing that the model gradually improved cluster purity and completeness during training.
The silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index all indicated improved clustering quality.
Specifically, the silhouette coefficient rose from 0.9029 to 0.9367, the Calinski-Harabasz index increased from approximately 45280 to nearly 79026, and the Davies-Bouldin index decreased from 0.3297 to 0.2849; lower Davies-Bouldin values indicate better separation between clusters and higher density within clusters.
These results demonstrate that the CNNModel not only performs excellently in supervised learning tasks when dealing with datasets with complex internal structures, but also effectively reveals the underlying structure of data in unsupervised learning clustering tasks. Through deep feature learning and data extraction, the model shows potential for handling high-dimensional data and extracting useful information from it.
[INSERT Fig. 3 HERE]
SecPPIs
In this experiment, clustering indicators were not used for the BiLSTM model; evaluation focused on classification performance. Training was completed within 10 cycles, and the performance metrics observed included precision, recall, F1 score, accuracy, AUC-ROC, AUPRC, and MCC.
Initial precision and recall were not ideal. However, as training progressed, and especially by the 10th cycle, recall increased significantly to 0.6972; precision fluctuated but remained stable overall.
The F1 score, the harmonic mean of precision and recall, increased from an initial value of 0.1642 to 0.4851, reflecting the model's improvement in identifying positive-class samples.
Accuracy remained at a high level throughout, and AUC-ROC stayed stable above 0.9155, indicating that the model has good classification ability.
AUPRC and MCC also showed positive progress, with MCC increasing from an initial value of 0.1609 to 0.4446, showing enhanced prediction accuracy and consistency.
Through continuous training and adjustment, the BiLSTM model demonstrated distinct advantages in handling complex sequence data, with significant improvement in its performance.
Ultimately it displayed strong capabilities for handling sequential data, excelling particularly in recall and AUC-ROC.
[INSERT Fig. 4 HERE]
Model Performance
Comparing performance with different competitive methods
Since our dataset comes from DeepPPISP, to ensure fair data selection we compare methods on the same task set, as shown in Table 1; these data come from published research results[25]. For a more comprehensive performance comparison, we also include the best reported performance values of methods not evaluated on the DeepPPISP task set; these data come from the published literature[26].
[INSERT Table 1 HERE]
In this study, we demonstrated the significant advantages of our models by comparing the three models we defined, SecPPIs, AngPPIs, and DisPPIs, with a variety of mainstream models on the same task set. These models showed excellent performance on multiple key indicators, including accuracy (ACC), precision (PRE), recall (REC), F1 score, AUC-ROC, AUPRC, and MCC.
The SecPPIs model showed the highest accuracy among all models (0.868), far exceeding traditional methods such as RF_PPI and SCRIBER, whose accuracies were only 0.598 and 0.616, respectively. In addition, the AUC-ROC of DisPPIs is 0.924, significantly higher than that of most models; for example, SCRIBER's AUC-ROC is 0.635, underscoring DisPPIs' strength in separating the positive and negative classes.
Although SecPPIs has relatively low precision and recall compared with the others, it still shows strong reliability and stability in complex protein-protein interaction prediction tasks.
The AngPPIs model excels in accuracy, precision, recall, and F1 score, with all of these indicators at 0.839 or above. In particular, its precision and recall both reach 0.840, demonstrating the model's balance and efficiency in identifying and classifying positive and negative samples. Notably, its AUC-ROC reached 0.908 and its AUPRC 0.889; these figures not only surpass conventional models such as DLPred and TransformerPPIS but also rank among the top tier in the bioinformatics field.
The DisPPIs model exhibits the highest recall (0.890), indicating that it performs exceptionally well at recognizing positive-class samples without missing positives. In addition, its accuracy and AUC-ROC reach 0.840 and 0.924, respectively, further validating that it maintains high overall prediction accuracy while ensuring a high recall.
DisPPIs' MCC reaches an impressive 0.684, significantly higher than that of other models such as EnsemPPIS (0.277) and EGRET (even lower, at about 0.27), showing a clear improvement for DisPPIs in both statistical correlation and predictive quality.
Overall, our model not only outperforms existing technical indicators on individual metrics, but also excels in overall performance, marking a comprehensive improvement in model performance indicators. This result highlights the effectiveness and practicality of this method in predicting protein-protein interactions, while also providing new research tools and directions for the field of bioinformatics, especially with promising prospects for application in medical and biological research. Through further development and optimization, these models are expected to play a key role in a wider range of biological data analysis tasks.
Actual use of model prediction
In this study, our focus is on assessing the actual performance of deep learning models in predicting protein-protein interactions. By comparing the results predicted by the model with publicly published and verified experimental data, we can more accurately understand the effectiveness and predictive accuracy of the model.
Based on the published literature, we selected proteins recognized to have interacting sequences or structural domains and fitted them against the prediction results. Because our research concerns site prediction, mapping predictions onto sequences or structural domains requires expanding the predicted sites: in practice, each predicted site is extended by three residues upstream and downstream to form a short sequence segment. These segments are then sorted and merged by comparing each range with the end position of the last element in the merged list.
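A minimal sketch of this expansion-and-merge step is shown below; the ±3 window follows the description above, while the function and variable names are our own illustrative choices.

```python
def expand_and_merge(sites, flank=3, seq_len=None):
    """Expand each predicted site by `flank` residues on both sides, then merge
    overlapping or adjacent ranges by comparing each range with the end of the
    last merged element."""
    ranges = []
    for s in sorted(sites):
        start = max(0, s - flank)
        end = s + flank if seq_len is None else min(seq_len - 1, s + flank)
        ranges.append((start, end))

    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1] + 1:   # overlaps or touches last range
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Example: predicted interaction sites at residues 10, 12 and 40
print(expand_and_merge([10, 12, 40]))   # -> [(7, 15), (37, 43)]
```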
An interacting structural domain or sequence may include one or more interaction sites[27]; these interaction sites, together with other active sites within the domain, constitute functional structural domains[28]. In this paper, domains such as PB1, PX, and SH3 were selected for verification[29–31], as shown in Fig. 5.
[INSERT Fig. 5 HERE]
In this study, we selected nine proteins as examples to demonstrate the fit between actual protein interaction sites and our model's predicted results. Although the predictions generally align with the actual interaction sites, we observed that the predicted regions covered more than half of the actual sites but did not correspond with them completely, as shown in the left half of Fig. 5. This could increase the workload of subsequent research. We therefore removed entries involving more than two interacting proteins from the original dataset and retained only entries involving exactly two interacting proteins, in the hope of improving prediction accuracy. After this adjustment, the model's predicted regions shrank somewhat and site prediction accuracy improved significantly, as shown on the right side of Fig. 5. We speculate that when more than two proteins are included, multi-subunit interactions increase the number and complexity of interaction sites, which may make model predictions inaccurate or unstable. Compared with systems involving only two interacting proteins, multi-protein interaction systems introduce more variables and dynamic interactions, which might affect learning-algorithm performance.
Overall, whether judged by evaluation metrics or by practical application, our models exhibit outstanding comprehensive performance.
In fact, before we found the optimal solution, we experimented with different preprocessing methods and model architectures on the same data. Ultimately, we discovered that the key to higher-performing model training lies in balancing data preprocessing and model-architecture design: researchers who find more representative preprocessing methods and model architectures that fit the characteristics of the data often achieve better results.
Initially, we attempted to conduct experiments starting from the atomic distance matrix of the proteins. The way the data were cut differed from the final sequence-cutting method: we adopted a sub-matrix cutting approach and learned site features by feeding the sub-matrices into the models. After discussion, we decided to use a 16×16 sub-matrix size for cutting, as shown in Fig. 6. We first used a GNN architecture. This structure allowed us to take sub-matrices as inputs, treating the central distance value of each matrix as a graph node and the distances to the other sites as edge-feature embeddings in the graph neural network. We also found that using point (node) and edge features simultaneously helped the model learn and express information more effectively. However, the training results were not satisfactory, as shown in Fig. 6b. We believe that if a model's design does not suit the data structure, it will struggle to learn the data effectively; simply put, for good training results a model's structure must match the form of its data.
Therefore, we employed a variational autoencoder to split the model training process into distinct stages. The encoder maps the high-dimensional input data into a low-dimensional latent space through three fully connected layers, each followed by a ReLU activation, gradually reducing the feature dimension. Two further fully connected layers output the mean and log-variance of the latent distribution; these parameters are used in the reparameterization trick to control the distribution shape when generating new data. The decoder maps low-dimensional latent representations back to the dimensionality of the original input through four fully connected layers, each followed by a ReLU activation, gradually increasing the feature dimension, with the final layer followed by a Sigmoid activation that compresses output values into [0, 1]. Finally, a classifier consisting of a series of fully connected layers with batch normalization progressively reduces the feature dimension, ending in a single-dimension output via one final fully connected layer suitable for binary classification.
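The staged architecture described above can be sketched as follows; the layer widths are illustrative assumptions, but the structure (three encoder layers with ReLU, mean/log-variance heads, four decoder layers ending in Sigmoid, and a fully connected classifier with batch normalization) follows the description.

```python
import torch
import torch.nn as nn

class VAEClassifier(nn.Module):
    def __init__(self, in_dim=256, latent_dim=32):
        super().__init__()
        # Encoder: three FC layers with ReLU, progressively reducing dimensions
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 48), nn.ReLU(),
        )
        # Two FC layers producing the mean and log-variance of the latent space
        self.fc_mu = nn.Linear(48, latent_dim)
        self.fc_logvar = nn.Linear(48, latent_dim)
        # Decoder: four FC layers with ReLU, last layer followed by Sigmoid
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 48), nn.ReLU(),
            nn.Linear(48, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, in_dim), nn.Sigmoid(),
        )
        # Classifier: FC + BatchNorm layers, ending in a single-unit output
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.BatchNorm1d(16), nn.ReLU(),
            nn.Linear(16, 8), nn.BatchNorm1d(8), nn.ReLU(),
            nn.Linear(8, 1),
        )

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)   # reparameterization trick

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        logit = self.classifier(z)
        return recon, mu, logvar, logit

# Quick shape check with random input
vae = VAEClassifier()
recon, mu, logvar, logit = vae(torch.randn(4, 256))
```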
We also tried applying the ResNet architecture directly to the matrix data, but the results were unsatisfactory (again see Fig. 6b). We therefore conclude that ResNet performs better when the matrix data are first transformed into images.
After discussion, we decided to increase the feature dimensions and the complexity of the model architecture in an attempt to improve model performance. We selected four types of information: amino acid properties, secondary structure information, distance and angle, and spatial coordinates. The amino acid properties include charge, polarity, hydrophobicity, isoelectric point, molecular weight, and flexibility, with attribute values predefined for the 20 common amino acids. Secondary structure information was extracted from PDB files using the DSSP algorithm. Distance and angle refer to the Euclidean distances and angles calculated between amino acids. Spatial coordinates refer to the (average) spatial coordinates of amino acid residues, computed from the aforementioned attributes, sequence positions, and coordinates.
We divided these 16-dimensional features into node features (properties and location information of the amino acids) and edge indices and attributes (distance and direction information). These were input into a Graph Attention Network (GAT), together with the node classification labels, in order to capture both node and edge features.
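A sketch of feeding node features, edge indices, and edge attributes into a GAT with PyTorch Geometric is shown below; the feature dimensions and toy graph are illustrative assumptions, and GATConv's edge_dim argument is used here so the distance/direction edge attributes participate in attention.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class ResidueGAT(nn.Module):
    """Two-layer GAT over a protein graph: nodes are residues, edges carry
    distance/direction attributes (dimensions here are illustrative)."""
    def __init__(self, node_dim=10, edge_dim=6, hidden=32, classes=2):
        super().__init__()
        self.gat1 = GATConv(node_dim, hidden, heads=4, edge_dim=edge_dim)
        self.gat2 = GATConv(hidden * 4, classes, heads=1, edge_dim=edge_dim)

    def forward(self, x, edge_index, edge_attr):
        h = torch.relu(self.gat1(x, edge_index, edge_attr))
        return self.gat2(h, edge_index, edge_attr)   # per-node class logits

# Toy graph: 5 residues, 4 edges
x = torch.randn(5, 10)                               # node features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
edge_attr = torch.randn(4, 6)                        # distance / direction features
logits = ResidueGAT()(x, edge_index, edge_attr)      # shape (5, 2)
```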
Interestingly, after applying a cross-entropy loss function with category weights (2), sample attention (2), and combining the per-sample outputs of the loss function (sample loss synthesis), the training results improved, as shown in Fig. 6b.
However, the effect was still not sufficient, so we changed our strategy and focused more on the relevance of the data itself.
Tree-based models such as decision trees, random forests, and gradient-boosted trees (e.g., XGBoost, LightGBM) are widely used for feature selection because they provide intuitive feature importance scores. These models predict target variables by constructing a series of decision rules, during which they can evaluate the contribution each feature makes to model prediction performance[32]. The feature importance scores of tree-based models quantify the contribution each feature makes when building the decision trees, usually calculated from the frequency with which a feature is used to split nodes together with the improvement brought about by each split[32].
The input data required for a Random Forest to evaluate feature importance is mainly a matrix of shape sample × feature[33]. In graph data, the nodes (amino acids) are interconnected. To adapt the data to a Random Forest, we treat each node (and its features) as a sample and the node's label as the sample's label. Thus, we create a feature matrix X in which each row represents an amino acid's features, and a vector y containing the label (0 or 1) corresponding to each amino acid. We then use a Random Forest to evaluate the importance of this 16-dimensional information (which includes charge, hydrophobicity, polarity, relative molecular mass, rigidity/flexibility, isoelectric point, type of secondary structure occupied by the amino acid, relative position in sequence and space, inter-residue distance information, and angle information), as shown in Fig. 6a.
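A minimal sketch of this node-as-sample importance evaluation with scikit-learn follows; the feature names and the synthetic matrix are illustrative stand-ins for the 16-dimensional information described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "charge", "polarity", "hydrophobicity", "isoelectric_point",
    "molecular_weight", "flexibility", "secondary_structure",
    "seq_position", "coord_x", "coord_y", "coord_z",
    "distance_1", "distance_2", "angle_1", "angle_2", "angle_3",
]  # 16 illustrative names; the exact composition follows the text above

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))      # one row per residue (node)
y = rng.integers(0, 2, size=500)    # interaction-site label per residue

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, rf.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(f"{name:20s} {score:.4f}")
```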
Based on this evaluation, we discarded the redundant feature data and retained only the inter-residue distance information, angle information, and secondary structure information.
When we tried a classic CNN architecture to process the sub-matrix data, model performance began to improve, as shown in Fig. 6b. After experimenting with different model architectures, we made a new attempt at preprocessing: expanding the range upwards and downwards from the centre point and selecting the resulting numerical series of feature data as input to a Transformer architecture; this trial maintained relatively high performance, as shown in Fig. 6b.
We then completed an exploration of RNN, Transformer, Attention, and CNN frameworks on our dataset (for example, the already sliced distance-matrix dataset). The results are displayed in Fig. 6b.
In summary, to ensure the reliability of the experiment and the accuracy of model weight-file selection, we used the CNN and BiLSTM architectures and the pretrained ResNet model for data training. We recorded the train loss and test loss during the training process, as shown in Fig. 6c. The results show that the train and test losses of both the SecPPIs and DisPPIs models present a steady downward trend. However, while the AngPPIs model's train loss continues to decrease, its test loss stops declining steadily after the fourth epoch, suggesting that overfitting begins at this point. This phenomenon may reflect limitations of the pretrained-model approach. Although we tried adding Dropout layers to alleviate overfitting, significant overfitting still occurred. Therefore, during AngPPIs training we manually adopted an early-stopping strategy and chose the weights from epoch four for prediction. For the DisPPIs and SecPPIs models, we selected the weights from epochs fourteen and ten, respectively, for prediction.
Compared with previously published research, our study used balanced class sizes, ensuring consistent sample quantities across the different classification sites. Through experimentation we found that when class sizes are unbalanced, model predictions tend towards the majority category, which affects the accuracy of feature learning. Consequently, when evaluating metrics such as ROC, a model biased towards one category may show high accuracy or ROC values while other indicators perform poorly.
Given this, future work should pay attention to class-balancing techniques before training, which could effectively improve the generalization ability and predictive accuracy of these models.
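One common way to balance class quantities during training is weighted sampling; the sketch below is a minimal PyTorch example, with a synthetic imbalanced dataset standing in for the real one.

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy imbalanced dataset: far more negative than positive residues
X = torch.randn(1000, 16)
y = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
dataset = TensorDataset(X, y)

# Weight each sample inversely to its class frequency
class_counts = torch.bincount(y).float()
sample_weights = (1.0 / class_counts)[y]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
# Batches drawn from `loader` now contain roughly equal numbers of both classes.
```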
[INSERT Fig. 6 HERE]
Currently, the application of virtual experimental methods is still in the model-development stage and has not been widely applied to real-world scenarios. These methods start from different protein-feature perspectives, suggesting that more innovative feature dimensions may emerge in the future and deepen our understanding of the mechanisms of protein action at the microscopic level. On the one hand, the diverse biological functions of proteins can be mapped through different feature dimensions, meaning that the research scope is not limited to exploring their interactions but could also extend to aspects such as protein activity, drug resistance, and catalytic properties. On the other hand, the development and application of these research methods are cornerstones for driving innovation in key medical fields such as targeted drug development, cancer treatment, and neurotherapy. As these methods are applied more widely in practice, we anticipate that new experimental techniques and strategies will emerge, bringing revolutionary progress to biomedical research and treatment.