Model Evaluation Metrics
We used seven evaluation metrics to quantitatively evaluate the results of three types of model training (a sequence-processing model based on BiLSTM, a data-classification model based on CNN, and a transfer-learning model based on a pretrained ResNet). We used t-distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction together with five clustering metrics to qualitatively assess the training results of two of these models (the CNN-based classification model and the pretrained-ResNet transfer-learning model).
The seven quantitative evaluation metrics were accuracy, precision, recall, F1 score, the Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (AUC-ROC), and the area under the precision-recall curve (AUPRC). The formulas for the first five are given in equations (1) through (5).
$${\text{Accuracy=}}\frac{{{\text{TP+TN}}}}{{{\text{TP+TN+FP+FN}}}}$$
1
$${\text{Precision=}}\frac{{{\text{TP}}}}{{{\text{TP+FP}}}}$$
2
$${\text{Recall=}}\frac{{{\text{TP}}}}{{{\text{TP+FN}}}}$$
3
$${\text{F1-Score=}}\frac{{{\text{2*Precision*Recall}}}}{{{\text{Precision+Recall}}}}$$
4
$${\text{MCC=}}\frac{{{\text{TP*TN-FP*FN}}}}{{\sqrt {{\text{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}} }}$$
5
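As a concrete illustration, the five formula-based metrics above can be computed directly from confusion-matrix counts. The sketch below is a minimal, self-contained example; the function name and the counts are illustrative, not values from our experiments.

```python
import math

def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the five formula-based metrics (Eqs. 1-5) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Illustrative counts only
print(binary_metrics(tp=86, tn=80, fp=14, fn=20))
```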
The five clustering metrics are Homogeneity, Completeness, the Silhouette Coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index, as shown in formulas (6) to (10).
$${\text{Homogeneity=1-}}\frac{{{\text{H(C|K)}}}}{{{\text{H(C)}}}}$$
6
$${\text{Completeness=1-}}\frac{{{\text{H(K|C)}}}}{{{\text{H(K)}}}}$$
7
$${\text{Silhouette=}}\frac{{{\text{b-a}}}}{{{\text{max(a,b)}}}}$$
8
$${\text{Calinski-Harabasz=}}\frac{{{\text{Tr(}}{{\text{B}}_{\text{k}}}{\text{)}}}}{{{\text{Tr(}}{{\text{W}}_{\text{k}}}{\text{)}}}}{\text{*}}\frac{{{\text{N-k}}}}{{{\text{k-1}}}}$$
9
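Formula (10), the Davies-Bouldin index, takes its standard form, where k is the number of clusters, $s_i$ is the average distance between the points of cluster i and its centroid, and $d_{ij}$ is the distance between the centroids of clusters i and j:
$${\text{Davies-Bouldin=}}\frac{1}{k}\sum\limits_{i=1}^{k} {\mathop {\max }\limits_{j \ne i} \frac{{{s_i}+{s_j}}}{{{d_{ij}}}}} $$
10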
Model training process and evaluation results
In this study, each training classification process used datasets that included image data, text sequence data, and corresponding label data. We loaded these preprocessed data using customized dataset classes for each model and completed the final processing for model input, including necessary transformation operations. The entire dataset was divided into a training set and a test set, with 80% of the data used for training and the remaining 20% used for testing.
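A minimal sketch of this loading and splitting step is shown below; the dataset class, tensor shapes, and batch size are illustrative assumptions, not the exact classes used in this study.

```python
import torch
from torch.utils.data import Dataset, random_split, DataLoader

class PPIDataset(Dataset):
    """Hypothetical dataset wrapping preprocessed features and binary labels."""
    def __init__(self, features: torch.Tensor, labels: torch.Tensor):
        self.features = features   # e.g. image tensors or encoded sequences
        self.labels = labels       # 0 / 1 interaction-site labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Illustrative random data standing in for the preprocessed inputs
full = PPIDataset(torch.randn(1000, 3, 64, 64), torch.randint(0, 2, (1000,)))

# 80% / 20% train-test split, as used in this study
n_train = int(0.8 * len(full))
train_set, test_set = random_split(full, [n_train, len(full) - n_train])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32)
```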
The three model training processes used the same optimizer, Adam, with different loss functions: CrossEntropyLoss, BCEWithLogitsLoss, and BCELoss. Each process trained its model for 10 to 15 epochs; during each epoch we calculated the average loss over the whole training set and monitored performance on the test set.
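Continuing the sketch above, a minimal training loop under these settings could look as follows; the stand-in model and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()   # BCEWithLogitsLoss / BCELoss for the other models

for epoch in range(10):             # 10-15 epochs depending on the model
    model.train()
    running_loss = 0.0
    for x, y in train_loader:       # train_loader from the previous sketch
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * x.size(0)
    avg_loss = running_loss / len(train_loader.dataset)
    print(f"epoch {epoch + 1}: average training loss = {avg_loss:.4f}")
```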
At the end of each training cycle, model performance was assessed on the test set. Evaluation metrics included accuracy, precision, recall, F1 score, AUC-ROC, AUPRC, and MCC; together these indicators comprehensively reflect model performance in binary classification tasks. In addition to these numerical indicators, dimensionality reduction via t-SNE was employed to visualize the feature distributions learned by the models; the results are shown in Fig. 2, Fig. 3, and Fig. 4.
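The per-epoch evaluation and the t-SNE visualization can be sketched as follows, continuing the sketches above and using scikit-learn metric functions; taking the model's logits as the visualized features is an illustrative assumption (an intermediate feature layer could be used instead).

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score,
                             matthews_corrcoef)

model.eval()
probs, preds, labels, feats = [], [], [], []
with torch.no_grad():
    for x, y in test_loader:
        logits = model(x)
        p = torch.softmax(logits, dim=1)[:, 1]
        probs.append(p); preds.append(p > 0.5); labels.append(y)
        feats.append(logits)        # or an intermediate feature layer
probs = torch.cat(probs).numpy(); preds = torch.cat(preds).numpy()
labels = torch.cat(labels).numpy(); feats = torch.cat(feats).numpy()

print("ACC  ", accuracy_score(labels, preds))
print("PRE  ", precision_score(labels, preds))
print("REC  ", recall_score(labels, preds))
print("F1   ", f1_score(labels, preds))
print("AUC  ", roc_auc_score(labels, probs))
print("AUPRC", average_precision_score(labels, probs))
print("MCC  ", matthews_corrcoef(labels, preds))

# t-SNE projection of the learned features, colored by true label
emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5)
plt.title("t-SNE of learned features")
plt.show()
```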
AngPPIs
In this study, we utilized a modified pretrained ResNet-18 model. Specifically, we retained the original model weights and replaced the fully connected layer of ResNet-18, setting the output dimension to 2 to correspond directly to the two categories. The model was fine-tuned over 10 training epochs and evaluated on multiple performance metrics, including accuracy, precision, recall, F1 score, AUC-ROC, AUPRC, and MCC.
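The modification described above, keeping the pretrained weights and replacing the fully connected layer with a two-way output, can be sketched with torchvision as follows; the learning rate is an illustrative assumption and the exact fine-tuning details in this study may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-18 and keep its weights
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final fully connected layer: output dimension 2 for the two classes
resnet.fc = nn.Linear(resnet.fc.in_features, 2)

# All layers remain trainable; fine-tune for 10 epochs with Adam (see the loop above)
optimizer = torch.optim.Adam(resnet.parameters(), lr=1e-4)
```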
Looking at the training results, the model's performance improved gradually with each training cycle. In the first cycle, accuracy was 0.7547; by the tenth cycle it had increased to 0.8623. The same trend is reflected in precision and recall, which rose from 0.7555 and 0.7546 in cycle one to 0.8674 and 0.8623 in cycle ten, respectively.
The F1 score, as the harmonic mean of precision and recall, showed a similar growth trend, increasing from 0.7547 in cycle one to 0.8621 by cycle ten.
Furthermore, AUC-ROC and AUPRC, both important indicators of predictive ability, rose from initial values of 0.8174 and 0.7842 to 0.9265 and 0.9084, respectively.
These changes further confirm that the model's ability to separate positive from negative samples improved steadily.
The Matthews Correlation Coefficient (MCC), as a comprehensive indicator of the performance of binary classification models, has grown from 0.5101 in cycle 1 to 0.7299 in cycle 10. The growth of MCC directly reflects the improvement of model prediction balance across categories, indicating that while maintaining category sensitivity, the model also maintains high prediction accuracy.
Overall, these improvements show that the model is gradually adapting and optimizing for complex image data classification problems during training. Through fine-tuning the structure of the model and appropriate training strategies, we can effectively use deep learning technology to handle high-dimensional image data and achieve accurate classification predictions.
To further evaluate the clustering performance of our model, we focused on five key indicators: Homogeneity, Completeness, Silhouette Score, the Calinski-Harabasz index, and the Davies-Bouldin index. These indicators reflect the quality of the clustering results and the model's unsupervised ability to organize unlabeled data.
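The five clustering indicators can be computed with scikit-learn once cluster assignments have been obtained from the learned features, for example with k-means. This is a minimal sketch under that assumption; the feature matrix here is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (homogeneity_score, completeness_score,
                             silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
features = rng.normal(size=(400, 16))        # learned feature vectors (synthetic here)
true_labels = rng.integers(0, 2, size=400)   # ground-truth classes

cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

print("Homogeneity      ", homogeneity_score(true_labels, cluster_labels))
print("Completeness     ", completeness_score(true_labels, cluster_labels))
print("Silhouette       ", silhouette_score(features, cluster_labels))
print("Calinski-Harabasz", calinski_harabasz_score(features, cluster_labels))
print("Davies-Bouldin   ", davies_bouldin_score(features, cluster_labels))
```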
Homogeneity and completeness both increased gradually, from low values of 0.1874 and 0.1875 in the first cycle to 0.4341 and 0.4363 in the tenth cycle, respectively. This increase indicates that the generated clusters gradually align with the real class labels, i.e., the clustering results become more consistent within classes while remaining separated between classes.
The silhouette coefficient increased from 0.4553 in cycle one to 0.6146 in cycle ten, showing a significant improvement in cluster compactness and stability. A silhouette coefficient closer to one means that intra-cluster distances are small relative to inter-cluster distances, indicating better clustering quality.
The Calinski-Harabasz index grew from 4122.4921 at cycle one to 8992.1156 at cycle ten; this significant growth suggests an increasing ratio of between-cluster separation to within-cluster dispersion, indicating more clearly defined clusters.
The Davies-Bouldin index decreased during training from 0.7696 at cycle one to 0.5104 at cycle ten; a decreasing Davies-Bouldin index implies improved clustering quality, because lower values indicate better cluster separation.
The overall improvement across all these clustering metrics suggests that although the model was initially trained for a supervised learning task, it also performs well on unsupervised clustering tasks. This result highlights the potential adaptability of deep learning models, especially when advanced architectures such as ResNet-18 are used in multi-task learning scenarios.
Through both qualitative and quantitative performance analysis, we have not only verified the effectiveness of the model in classification tasks, but also demonstrated its ability to provide insights into data structures. These achievements provide strong support for the wider application of this model.
[INSERT Fig. 2 HERE]
DisPPIs
We next analyzed in greater depth the results obtained with the CNN-based model (CNNModel), demonstrating its performance under two different experimental settings.
With CNNModel, performance improved steadily over 15 training cycles. Accuracy increased from 0.6926 to 0.8788, while precision and recall also rose significantly, from 0.6642 and 0.7790 to 0.8411 and 0.9340, respectively. These changes show that the model parameters and network architecture gradually adapted to the data features during training.
F1 score, AUC-ROC and AUPRC all showed significant improvements, indicating that the model performed excellently in balancing recall rate and precision while maintaining good sensitivity to discrimination thresholds.
In particular, AUC-ROC increased from 0.7681 to 0.9462, demonstrating a strong capability to distinguish between the two categories.
In terms of clustering evaluation, we observed increases in the Homogeneity and Completeness indicators, showing that the model gradually improved cluster purity and completeness during training.
The silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index all indicated improved clustering quality.
Specifically, the silhouette coefficient rose from 0.9029 to 0.9367, the Calinski-Harabasz index increased from approximately 45280 to nearly 79026, and the Davies-Bouldin index decreased from 0.3297 to 0.2849; lower Davies-Bouldin values indicate better separation between clusters and higher density within clusters.
These results demonstrate that the CNNModel not only performs excellently in supervised learning tasks when dealing with datasets with complex internal structures, but also effectively reveals the underlying structure of data in unsupervised learning clustering tasks. Through deep feature learning and data extraction, the model shows potential for handling high-dimensional data and extracting useful information from it.
[INSERT Fig. 3 HERE]
SecPPIs
In this experiment, clustering indicators were not used for the BiLSTM model; evaluation focused on classification performance. Training was completed within 10 cycles, and the performance metrics observed included precision, recall, F1 score, accuracy, AUC-ROC, AUPRC, and MCC.
Initial precision and recall were not ideal. However, as training progressed, and especially by the 10th cycle, recall increased significantly to 0.6972; precision fluctuated but remained stable overall.
The F1 score, the harmonic mean of precision and recall, increased from an initial value of 0.1642 to 0.4851, reflecting the model's improvement in identifying positive-class samples.
Accuracy remained at a high level throughout, and AUC-ROC stayed stable above 0.9155, indicating that the model has good classification ability.
AUPRC and MCC also showed positive progress, with MCC increasing from an initial value of 0.1609 to 0.4446, showing enhanced prediction accuracy and consistency.
Through continuous training and adjustment, the BiLSTM model demonstrated distinct advantages in handling complex sequence data, with significant improvement in its performance.
Ultimately it displayed strong capabilities for handling sequential data, excelling particularly in recall and AUC-ROC.
[INSERT Fig. 4 HERE]
Model Performance
Comparing performance with different competitive methods
Since our dataset comes from DeepPPISP, to ensure fair data selection we compare methods on the same task set, as shown in Table 1; these data come from published research results[25]. For a more comprehensive performance comparison, we also include the best reported performance values of methods not evaluated on the DeepPPISP task set; these data come from the published literature[26].
[INSERT Table 1 HERE]
In this study, we demonstrated the significant advantages of our models by comparing the three models we defined, SecPPIs, AngPPIs, and DisPPIs, with a variety of mainstream models on the same task set. These models showed excellent performance on multiple key indicators, including accuracy (ACC), precision (PRE), recall (REC), F1 score, AUC-ROC, AUPRC, and MCC.
The SecPPIs model showed the highest accuracy among all models (0.868), far exceeding traditional methods such as RF_PPI and SCRIBER, whose accuracies were only 0.598 and 0.616, respectively. In addition, the AUC-ROC of DisPPIs is 0.924, significantly higher than that of most models; for example, SCRIBER's AUC-ROC is 0.635, underscoring DisPPIs' strength in separating the positive and negative classes.
Although SecPPIs has relatively low precision and recall compared with the others, it still shows strong reliability and stability in complex protein-protein interaction prediction tasks.
The AngPPIs model excels in accuracy, precision, recall, and F1 score, with all of these indicators at 0.839 or above. In particular, its precision and recall both reach 0.840, demonstrating the model's balance and efficiency in identifying and classifying positive and negative samples. Notably, its AUC-ROC reached 0.908 and its AUPRC 0.889; these figures not only surpass conventional models such as DLPred and TransformerPPIS but also rank among the top tier in the bioinformatics field.
The DisPPIs model exhibits the highest recall (0.890), indicating that it performs exceptionally well at recognizing positive-class samples without missing positives. In addition, its accuracy and AUC-ROC reach 0.840 and 0.924, respectively, further validating that it maintains high overall prediction accuracy while ensuring a high recall.
DisPPIs' MCC reaches an impressive 0.684, significantly higher than that of other models such as EnsemPPIS (0.277) and EGRET (even lower, at about 0.27), showing a clear improvement for DisPPIs in both statistical correlation and predictive quality.
Overall, our model not only outperforms existing technical indicators on individual metrics, but also excels in overall performance, marking a comprehensive improvement in model performance indicators. This result highlights the effectiveness and practicality of this method in predicting protein-protein interactions, while also providing new research tools and directions for the field of bioinformatics, especially with promising prospects for application in medical and biological research. Through further development and optimization, these models are expected to play a key role in a wider range of biological data analysis tasks.
Actual use of model prediction
In this study, our focus is on assessing the actual performance of deep learning models in predicting protein-protein interactions. By comparing the results predicted by the model with publicly published and verified experimental data, we can more accurately understand the effectiveness and predictive accuracy of the model.
Based on the published literature, we selected proteins recognized to have interacting sequences or structural domains and fitted them against the prediction results. Because our research concerns site prediction, mapping predictions onto sequences or structural domains requires expanding the predicted sites: in practice, each predicted site is extended by three residues upstream and downstream to form a short sequence segment. These segments are then sorted and merged by comparing each range with the end position of the last element in the merged list.
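A minimal sketch of this expansion-and-merge step is shown below; the ±3 window follows the description above, while the function and variable names are our own illustrative choices.

```python
def expand_and_merge(sites, flank=3, seq_len=None):
    """Expand each predicted site by `flank` residues on both sides, then merge
    overlapping or adjacent ranges by comparing each range with the end of the
    last merged element."""
    ranges = []
    for s in sorted(sites):
        start = max(0, s - flank)
        end = s + flank if seq_len is None else min(seq_len - 1, s + flank)
        ranges.append((start, end))

    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1] + 1:   # overlaps or touches last range
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Example: predicted interaction sites at residues 10, 12 and 40
print(expand_and_merge([10, 12, 40]))   # -> [(7, 15), (37, 43)]
```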
An interacting structural domain or sequence may include one or more interaction sites[27]; these interaction sites, together with other active sites within the domain, constitute functional structural domains[28]. In this paper, domains such as PB1, PX, and SH3 were selected for verification[29–31], as shown in Fig. 5.
[INSERT Fig. 5 HERE]
In this study, we selected nine proteins as examples to demonstrate the fit between actual protein interaction sites and our model's predicted results. Although the predictions generally align with the actual interaction sites, we observed that the predicted regions covered more than half of the actual sites but did not correspond with them completely, as shown in the left half of Fig. 5. This could increase the workload of subsequent research. We therefore removed entries involving more than two interacting proteins from the original dataset and retained only entries involving exactly two interacting proteins, in the hope of improving prediction accuracy. After this adjustment, the model's predicted regions shrank somewhat and site prediction accuracy improved significantly, as shown on the right side of Fig. 5. We speculate that when more than two proteins are included, multi-subunit interactions increase the number and complexity of interaction sites, which may make model predictions inaccurate or unstable. Compared with systems involving only two interacting proteins, multi-protein interaction systems introduce more variables and dynamic interactions, which might affect learning-algorithm performance.
Overall, whether judged by evaluation metrics or by practical application, our models exhibit outstanding comprehensive performance.
In fact, before we found the optimal solution, we experimented with different preprocessing methods and model architectures on the same data. Ultimately, we discovered that the key to higher-performing model training lies in balancing data preprocessing and model-architecture design: researchers who find more representative preprocessing methods and model architectures that fit the characteristics of the data often achieve better results.
Initially, we attempted to conduct experiments starting from the atomic distance matrix of the proteins. The way the data were cut differed from the final sequence-cutting method: we adopted a sub-matrix cutting approach and learned site features by feeding the sub-matrices into the models. After discussion, we decided to use a 16×16 sub-matrix size for cutting, as shown in Fig. 6. We first used a GNN architecture. This structure allowed us to take sub-matrices as inputs, treating the central distance value of each matrix as a graph node and the distances to the other sites as edge-feature embeddings in the graph neural network. We also found that using point (node) and edge features simultaneously helped the model learn and express information more effectively. However, the training results were not satisfactory, as shown in Fig. 6b. We believe that if a model's design does not suit the data structure, it will struggle to learn the data effectively; simply put, for good training results a model's structure must match the form of its data.
Therefore, we employed a variational autoencoder to split the model training process into distinct stages. The encoder maps the high-dimensional input data into a low-dimensional latent space through three fully connected layers, each followed by a ReLU activation, gradually reducing the feature dimension. Two further fully connected layers output the mean and log-variance of the latent distribution; these parameters are used in the reparameterization trick to control the distribution shape when generating new data. The decoder maps low-dimensional latent representations back to the dimensionality of the original input through four fully connected layers, each followed by a ReLU activation, gradually increasing the feature dimension, with the final layer followed by a Sigmoid activation that compresses output values into [0, 1]. Finally, a classifier consisting of a series of fully connected layers with batch normalization progressively reduces the feature dimension, ending in a single-dimension output via one final fully connected layer suitable for binary classification.
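The staged architecture described above can be sketched as follows; the layer widths are illustrative assumptions, but the structure (three encoder layers with ReLU, mean/log-variance heads, four decoder layers ending in Sigmoid, and a fully connected classifier with batch normalization) follows the description.

```python
import torch
import torch.nn as nn

class VAEClassifier(nn.Module):
    def __init__(self, in_dim=256, latent_dim=32):
        super().__init__()
        # Encoder: three FC layers with ReLU, progressively reducing dimensions
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 48), nn.ReLU(),
        )
        # Two FC layers producing the mean and log-variance of the latent space
        self.fc_mu = nn.Linear(48, latent_dim)
        self.fc_logvar = nn.Linear(48, latent_dim)
        # Decoder: four FC layers with ReLU, last layer followed by Sigmoid
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 48), nn.ReLU(),
            nn.Linear(48, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, in_dim), nn.Sigmoid(),
        )
        # Classifier: FC + BatchNorm layers, ending in a single-unit output
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.BatchNorm1d(16), nn.ReLU(),
            nn.Linear(16, 8), nn.BatchNorm1d(8), nn.ReLU(),
            nn.Linear(8, 1),
        )

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)   # reparameterization trick

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        logit = self.classifier(z)
        return recon, mu, logvar, logit

# Quick shape check with random input
vae = VAEClassifier()
recon, mu, logvar, logit = vae(torch.randn(4, 256))
```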
We also tried applying the ResNet architecture directly to the matrix data, but the results were unsatisfactory (again see Fig. 6b). We therefore conclude that ResNet performs better when the matrix data are first transformed into images.
After discussion, we decided to increase the feature dimensions and the complexity of the model architecture in an attempt to improve model performance. We selected four types of information: amino acid properties, secondary structure information, distance and angle, and spatial coordinates. The amino acid properties include charge, polarity, hydrophobicity, isoelectric point, molecular weight, and flexibility, with attribute values predefined for the 20 common amino acids. Secondary structure information was extracted from PDB files using the DSSP algorithm. Distance and angle refer to the Euclidean distances and angles calculated between amino acids. Spatial coordinates refer to the (average) spatial coordinates of amino acid residues, computed from the aforementioned attributes, sequence positions, and coordinates.
We divided these 16-dimensional features into node features (properties and location information of the amino acids) and edge indices and attributes (distance and direction information). These were input into a Graph Attention Network (GAT), together with the node classification labels, in order to capture both node and edge features.
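A sketch of feeding node features, edge indices, and edge attributes into a GAT with PyTorch Geometric is shown below; the feature dimensions and toy graph are illustrative assumptions, and GATConv's edge_dim argument is used here so the distance/direction edge attributes participate in attention.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class ResidueGAT(nn.Module):
    """Two-layer GAT over a protein graph: nodes are residues, edges carry
    distance/direction attributes (dimensions here are illustrative)."""
    def __init__(self, node_dim=10, edge_dim=6, hidden=32, classes=2):
        super().__init__()
        self.gat1 = GATConv(node_dim, hidden, heads=4, edge_dim=edge_dim)
        self.gat2 = GATConv(hidden * 4, classes, heads=1, edge_dim=edge_dim)

    def forward(self, x, edge_index, edge_attr):
        h = torch.relu(self.gat1(x, edge_index, edge_attr))
        return self.gat2(h, edge_index, edge_attr)   # per-node class logits

# Toy graph: 5 residues, 4 edges
x = torch.randn(5, 10)                               # node features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
edge_attr = torch.randn(4, 6)                        # distance / direction features
logits = ResidueGAT()(x, edge_index, edge_attr)      # shape (5, 2)
```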
Interestingly, after applying a cross-entropy loss function with category weights (2), sample attention (2), and combining the per-sample outputs of the loss function (sample loss synthesis), the training results improved, as shown in Fig. 6b.
However, the effect was still not sufficient, so we changed our strategy and focused more on the relevance of the data itself.
Tree-based models such as decision trees, random forests, and gradient-boosted trees (e.g., XGBoost, LightGBM) are widely used for feature selection because they provide intuitive feature importance scores. These models predict target variables by constructing a series of decision rules, during which they can evaluate the contribution each feature makes to model prediction performance[32]. The feature importance scores of tree-based models quantify the contribution each feature makes when building the decision trees, usually calculated from the frequency with which a feature is used to split nodes together with the improvement brought about by each split[32].
The input data required for a Random Forest to evaluate feature importance is mainly a matrix of shape sample × feature[33]. In graph data, the nodes (amino acids) are interconnected. To adapt the data to a Random Forest, we treat each node (and its features) as a sample and the node's label as the sample's label. Thus, we create a feature matrix X in which each row represents an amino acid's features, and a vector y containing the label (0 or 1) corresponding to each amino acid. We then use a Random Forest to evaluate the importance of this 16-dimensional information (which includes charge, hydrophobicity, polarity, relative molecular mass, rigidity/flexibility, isoelectric point, type of secondary structure occupied by the amino acid, relative position in sequence and space, inter-residue distance information, and angle information), as shown in Fig. 6a.
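A minimal sketch of this node-as-sample importance evaluation with scikit-learn follows; the feature names and the synthetic matrix are illustrative stand-ins for the 16-dimensional information described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "charge", "polarity", "hydrophobicity", "isoelectric_point",
    "molecular_weight", "flexibility", "secondary_structure",
    "seq_position", "coord_x", "coord_y", "coord_z",
    "distance_1", "distance_2", "angle_1", "angle_2", "angle_3",
]  # 16 illustrative names; the exact composition follows the text above

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))      # one row per residue (node)
y = rng.integers(0, 2, size=500)    # interaction-site label per residue

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, rf.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(f"{name:20s} {score:.4f}")
```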
Based on this evaluation, we discarded the redundant feature data and retained only the inter-residue distance information, angle information, and secondary structure information.
When we tried a classic CNN architecture to process the sub-matrix data, model performance began to improve, as shown in Fig. 6b. After experimenting with different model architectures, we made a new attempt at preprocessing: expanding the range upwards and downwards from the centre point and selecting the resulting numerical series of feature data as input to a Transformer architecture; this trial maintained relatively high performance, as shown in Fig. 6b.
We then completed an exploration of RNN, Transformer, Attention, and CNN frameworks on our dataset (for example, the already sliced distance-matrix dataset). The results are displayed in Fig. 6b.
In summary, to ensure the reliability of the experiment and the accuracy of model weight-file selection, we used the CNN and BiLSTM architectures and the pretrained ResNet model for data training. We recorded the train loss and test loss during the training process, as shown in Fig. 6c. The results show that the train and test losses of both the SecPPIs and DisPPIs models present a steady downward trend. However, while the AngPPIs model's train loss continues to decrease, its test loss stops declining steadily after the fourth epoch, suggesting that overfitting begins at this point. This phenomenon may reflect limitations of the pretrained-model approach. Although we tried adding Dropout layers to alleviate overfitting, significant overfitting still occurred. Therefore, during AngPPIs training we manually adopted an early-stopping strategy and chose the weights from epoch four for prediction. For the DisPPIs and SecPPIs models, we selected the weights from epochs fourteen and ten, respectively, for prediction.
Compared with previously published research, our study used balanced class sizes, ensuring consistent sample quantities across the different classification sites. Through experimentation we found that when class sizes are unbalanced, model predictions tend towards the majority category, which affects the accuracy of feature learning. Consequently, when evaluating metrics such as ROC, a model biased towards one category may show high accuracy or ROC values while other indicators perform poorly.
Given this, future work should pay attention to class-balancing techniques before training, which could effectively improve the generalization ability and predictive accuracy of these models.
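One common way to balance class quantities during training is weighted sampling; the sketch below is a minimal PyTorch example, with a synthetic imbalanced dataset standing in for the real one.

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy imbalanced dataset: far more negative than positive residues
X = torch.randn(1000, 16)
y = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
dataset = TensorDataset(X, y)

# Weight each sample inversely to its class frequency
class_counts = torch.bincount(y).float()
sample_weights = (1.0 / class_counts)[y]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
# Batches drawn from `loader` now contain roughly equal numbers of both classes.
```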
[INSERT Fig. 6 HERE]
Currently, the application of virtual experimental methods is still in the model-development stage and has not been widely applied to real-world scenarios. These methods start from different protein-feature perspectives, suggesting that more innovative feature dimensions may emerge in the future and deepen our understanding of the mechanisms of protein action at the microscopic level. On the one hand, the diverse biological functions of proteins can be mapped through different feature dimensions, meaning that the research scope is not limited to exploring their interactions but could also extend to aspects such as protein activity, drug resistance, and catalytic properties. On the other hand, the development and application of these research methods are cornerstones for driving innovation in key medical fields such as targeted drug development, cancer treatment, and neurotherapy. As these methods are applied more widely in practice, we anticipate that new experimental techniques and strategies will emerge, bringing revolutionary progress to biomedical research and treatment.