In this section, we describe the workflow for building training and testing datasets for a machine learning (ML)-based model that assesses protein-protein interfaces. The derived score (PI-score) is then applied to assess the quality of interfaces in models submitted for CASP13 targets, EM model challenge targets, and PDB entries associated with EMDB, and we discuss examples from each of these datasets. We also compare the performance of PI-score with density-based scores and statistical interface potentials.
Building the Dataset
A total of 3,926 high-resolution complexes obtained from the PDB32 were subjected to an in-house pipeline that assigns interfaces using a distance-based threshold (Methods). To avoid over-representation of similar interfaces in the dataset, structurally similar interfaces within a quaternary structure were filtered out using the interface-similarity score calculated with iAlign33, resulting in 2,858 interfaces from 2,314 complexes. Various interface features, namely the number of interface residues, contact pairs, surface area, shape complementarity, the number of hydrophobic, charged, polar and conserved residues at the interface, and other interface properties evaluated by PISA, were computed for the dataset. These features were successfully calculated for 2,406 interfaces, which form “positive dataset1” (PD1, see Methods).
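As an illustration of a distance-based interface assignment, the sketch below flags residue pairs across two chains whose representative atoms fall within a cut-off. The function name and the 5 Å default are illustrative assumptions, not the exact in-house pipeline:

```python
import numpy as np

def interface_residues(coords_a, coords_b, cutoff=5.0):
    """Return (i, j) index pairs of residues from chains A and B whose
    representative atoms lie within `cutoff` angstroms of each other.
    The 5 A default is illustrative; the paper's pipeline is in-house."""
    diff = coords_a[:, None, :] - coords_b[None, :, :]   # pairwise vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))             # pairwise distances
    i, j = np.nonzero(dist <= cutoff)
    return list(zip(i.tolist(), j.tolist()))

# Toy example: two residues of chain A, two of chain B
a = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
b = np.array([[3.0, 0.0, 0.0], [30.0, 0.0, 0.0]])
print(interface_residues(a, b))  # only the first pair lies within 5 A
```

In practice all heavy atoms (rather than one representative atom per residue) would typically be compared, but the thresholding logic is the same.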
To train the ML classifiers of our choice on data closer in quality to models fitted on cryo-EM maps (especially at intermediate-to-low resolutions), noise was added to PD1 by slightly perturbing the relative positions and orientations of the interacting subunits. This was done using a protein-protein docking method34 and then selecting poses with a high fraction of aligned interface residues relative to the native complex (fNal) and a low interface RMSD (iRMSD) (see Methods for cut-offs). This set, which contains 3,743 interfaces, is referred to as “positive dataset2” (PD2).
A “negative dataset” (ND), containing 3,578 interfaces, was also derived using docking and includes complexes in which the interfaces are structurally different, i.e., ‘far’ from native interfaces (low fNal and high iRMSD, see Methods for cut-offs). The schematic of the procedure to collate the datasets and workflow is summarised in Fig. 1.
Ranking of Interface Features
Various interface features (listed in Methods) were computed for the above-described datasets. As the number of derived features (12) was manageable and computationally inexpensive, we used all of them to train our classifiers. To identify the top-ranking (most influential) interface features, we applied several feature-ranking methods, namely Ridge, Random Forest, Recursive Feature Elimination, Linear Regression and Lasso. This exploration showed that the top-ranking features were shape complementarity, the number of polar interface residues, the number of charged interface residues, and interface solvation energy (Fig. 2a).
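Such a feature ranking can be reproduced along these lines with Scikit-learn. This is a sketch on synthetic data: the dataset is a stand-in for the 12 interface features, and the logistic-regression estimator inside RFE is an assumed choice, not necessarily the estimators used here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 12 interface features
X, y = make_classification(n_samples=500, n_features=12,
                           n_informative=4, random_state=0)

# Random Forest impurity-based importances, highest first
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_rank = rf.feature_importances_.argsort()[::-1]

# Recursive Feature Elimination keeping the 4 strongest features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
top = [i for i, keep in enumerate(rfe.support_) if keep]

print("RF ranking:", rf_rank[:4].tolist(), "RFE top:", top)
```

Agreement between several such rankings (as reported here across Ridge, RF, RFE, Linear Regression and Lasso) gives more confidence that the selected features are genuinely informative.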
Training the Classifier and Cross-Validation
To develop a better understanding of the performance, we evaluated three ML classifiers, namely a Support Vector Machine (SVM), a Random Forest (RF), and a plain (“vanilla”) Neural Network (NN), i.e., a multi-layer perceptron (MLP), using the Scikit-learn Python package35.
We used the following combination of datasets described above to train two high-level models, namely, Model A and Model B (referred to as models henceforth), using the interface features (Methods) (Fig. 1). Each of these models relies on three different classifiers, described above (SVM, RF and NN).
Model A: positive and negative datasets derived using docking (PD2 and ND, respectively). Model B: the positive dataset comprises both high-resolution complexes and computationally derived docked complexes (PD1 + PD2), with ND as the negative dataset.
While training and testing both models (Model A and Model B), we used a stratified shuffle split with ten splits (and a test size of 30%) for cross-validation, to minimise classifier bias, which can easily become an issue with unbalanced datasets. In both scenarios, performance, measured using standard classifier metrics, namely accuracy, precision, recall, F1 and the Matthews correlation coefficient (MCC), was comparable between the three methods (Fig. 2b and Fig. 2c).
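The cross-validation scheme described above can be set up in Scikit-learn as follows. The synthetic data and the RBF-kernel SVM here are placeholders for the interface features and the tuned classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, matthews_corrcoef

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Ten stratified splits with a 30% held-out test set, as in the text;
# stratification preserves the class ratio in every split.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
accs, mccs = [], []
for train, test in sss.split(X, y):
    clf = SVC(kernel="rbf").fit(X[train], y[train])
    pred = clf.predict(X[test])
    accs.append(accuracy_score(y[test], pred))
    mccs.append(matthews_corrcoef(y[test], pred))

print(f"mean accuracy {sum(accs) / len(accs):.2f}, "
      f"mean MCC {sum(mccs) / len(mccs):.2f}")
```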
Among these, the SVM-trained model offered the best validation accuracy of 86% (Model B) and was selected to assess the quality of protein-protein interfaces modelled in cryo-EM maps. The SVM classifier finds a hyperplane that maximises the margin between the two classes, and we used the signed distance of a given data point from this hyperplane as the machine learning-based score (PI-score) for a given prediction (interface). The farther a point (interface) lies from the hyperplane (more negative or more positive), the more confident the prediction of the SVM model36. We assessed the performance of the PI-score at different thresholds by analysing the number of false positives (FP) and false negatives (FN). For the ten test sets (30%) obtained using the stratified shuffle split (for cross-validation), the fractions of FP (41%) and FN (42%) were highest in the PI-score ranges of (0 to 0.5] and (-0.5 to 0], respectively (Fig. 2d). We also estimated performance measures in different PI-score bins and observed that PI-scores > 1 and < -1 (for the positive and negative class labels, respectively) were more reliable, based on the low false positive rate (FPR) and high true positive rate (TPR) in the respective bins (Table 1).
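A PI-score-like quantity can be sketched with Scikit-learn's `decision_function`: for a linear kernel it returns the unnormalised signed margin, and dividing by the weight-vector norm yields the geometric distance from the hyperplane. The data here are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=12, random_state=1)
clf = SVC(kernel="linear").fit(X, y)

# decision_function returns the unnormalised signed margin w.x + b;
# dividing by ||w|| converts it into a geometric distance from the
# separating hyperplane (sign = predicted class, magnitude = confidence).
w_norm = np.linalg.norm(clf.coef_)
pi_like = clf.decision_function(X[:5]) / w_norm
print(pi_like)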
Table 1
Performance in different bins of the scores using the SVM machine learning-based classifier.
Scores’ bins        | TPR (True Positive Rate) | FPR (False Positive Rate) | Precision | Specificity
(-/+) [0.0 to 0.5)  | 0.15                     | 0.76                      | 0.15      | 0.24
(-/+) [0.5 to 1.0)  | 0.29                     | 0.58                      | 0.29      | 0.41
(-/+) [1.0 to 1.5)  | 0.61                     | 0.39                      | 0.63      | 0.61
(-/+) [1.5 to 2.0)  | 0.78                     | 0.21                      | 0.78      | 0.78
(-/+) [2.0 to 2.5)  | 0.93                     | 0.11                      | 0.93      | 0.99
>=2.5 and <=-2.5    | 1.0                      | 0.18                      | 0.97      | 0.81
The TPR, FPR, precision and specificity values are averaged over the test datasets obtained from the ten-fold cross-validation. For each symmetric bin [x to y): true positives are complexes from the positive datasets (PD1 or PD2) predicted positive with a score in [x to y), and true negatives are complexes from the negative dataset (ND) predicted negative with a score in (-y to -x]. False positives are ND complexes predicted positive with a score in [x to y), and false negatives are PD1/PD2 complexes predicted negative with a score in (-y to -x]. For the last bin, the corresponding ranges are >= 2.5 and <= -2.5.
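The binning scheme above can be expressed as a short function. This is a minimal sketch; the function name and the toy data are illustrative, not from the original pipeline:

```python
def bin_rates(scores, labels, lo, hi):
    """TPR and FPR for predictions whose |PI-score| falls in [lo, hi).
    labels: 1 for native-like (positive class), 0 otherwise.
    A positive score predicts the positive class, a negative score
    the negative class, as in the symmetric bins of Table 1."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        if not (lo <= abs(s) < hi):
            continue  # this prediction belongs to a different bin
        pred = 1 if s > 0 else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Toy example: two positives scored 0.3 and -0.3, one negative scored 0.7
print(bin_rates([0.3, -0.3, 0.7], [1, 1, 0], 0.0, 0.5))  # -> (0.5, 0.0)
```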
Application to CASP Targets (High Resolution Targets)
We applied the above-trained models to predict the quality of protein-protein interfaces in cryo-EM targets from the CASP13 competition37. Three of the targets (T1020o, T0995o and T0984o) were classified as ‘easy targets’, with many high-accuracy models deposited by the participating groups, and were also evaluated for goodness-of-fit to the experimental cryo-EM maps38. For each of these targets, a pool of submitted models, an experimentally solved structure (target) and a density-based goodness-of-fit score are available from the CASP13 website. These targets therefore form an ideal dataset for assessing the performance of PI-score.
We used the CASP multimeric scores (https://predictioncenter.org/casp13/multimer_results.cgi), namely F1, Jaccard index, lDDT(oligo) and GDT(o) (see Methods: comparison with CASP13 oligomeric scores), to define true positives (TP), true negatives (TN), FP and FN for CASP targets. A model was treated as TP if any of the four CASP13 multimeric scores was >= 0.5 and it was scored positive by our classifier, and as TN if none of the CASP13 scores was >= 0.5 and it was scored negative by the classifier. Models scored positive by our classifier but with none of the four CASP13 scores >= 0.5 were defined as FP, and models scored negative by the classifier but with at least one CASP13 score >= 0.5 were defined as FN.
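Under the standard confusion-matrix convention (classifier prediction versus the CASP13 consensus at a 0.5 threshold), this labelling can be sketched as follows; the function name is hypothetical:

```python
def confusion_label(casp_scores, pi_positive, thr=0.5):
    """Label one model given its four CASP13 multimeric scores
    (F1, Jaccard index, lDDT(oligo), GDT(o)) and whether the
    classifier scored it positive. A model counts as CASP-supported
    if any of the four scores reaches the threshold."""
    casp_good = any(s >= thr for s in casp_scores)
    if casp_good and pi_positive:
        return "TP"
    if not casp_good and not pi_positive:
        return "TN"
    if not casp_good and pi_positive:
        return "FP"
    return "FN"  # CASP-supported, but classifier scored it negative

print(confusion_label([0.6, 0.2, 0.1, 0.3], True))   # -> TP
print(confusion_label([0.1, 0.1, 0.1, 0.1], False))  # -> TN
```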
T1020o: 3.3 Å resolution homo-trimer structure of an S-type anion channel from Brachypodium distachyon
Using our SVM classifier, nine of the 111 assessed submitted models (containing 329 interfaces) were predicted to have at least one ‘negative’ interface (negative PI-score) in the complex. These nine models were also scored low by the CASP multimeric assessment scores39 (Supplementary Table S1). In a more systematic comparison of PI-scores against the oligomeric assembly assessment scores from CASP1339, we achieve 82% accuracy for this target (see Methods: comparison with CASP13 oligomeric scores).
All interfaces in the target structure (Fig. 3a) and in the top-ranked model based on the cross-correlation of the model with the cryo-EM density (CCC) (TS004_2o, Fig. 3b) have positive PI-scores. Of the nine negatively scoring models, TS008_4o and TS135_3o had negative PI-scores for all three interfaces (Fig. 3c and Fig. 3d, respectively). When these models are compared to the target structure, all three interfaces have high iRMSD and a low fraction of aligned native residues (average iRMSD of 2.93 Å and 3.33 Å for TS008_4o and TS135_3o, respectively; Table 2). For model TS208_1o, two of the interfaces (formed by chain pairs AC and BC) have negative PI-scores (Fig. 3e); the PI-score was not calculated for the third interface, as the number of interface residues (nine and eight for chains A and B, respectively) was below our cut-off for defining an interface (Methods).
Table 2
Assessment of interfaces in the models of CASP13 cryo-EM target T1020o. The model and equivalent target chains forming the interface are listed along with the interface RMSD (iRMSD), the fraction of aligned native interface residues (fNal) and the class predicted using our model.
Model ID | Model interface | Target interface | iRMSD (Å), fNal | Predicted class                               | Score
TS004_2o | AB              | AB               | 2.2, 0.81       | Positive                                      | 2.6
TS004_2o | BC              | BC               | 2.5, 0.75       | Positive                                      | 2.6
TS004_2o | AC              | AC               | 2.1, 0.82       | Positive                                      | 2.7
TS008_4o | AB              | AC               | 3.16, 0.42      | Negative                                      | -1.5
TS008_4o | BC              | BC               | 2.82, 0.48      | Negative                                      | -1.5
TS008_4o | AC              | AB               | 2.81, 0.48      | Negative                                      | -1.5
TS135_3o | AB              | BC               | 3.08, 0.56      | Negative                                      | -1.6
TS135_3o | BC              | AB               | 3.4, 0.52       | Negative                                      | -1.6
TS135_3o | AC              | AC               | 3.51, 0.6       | Negative                                      | -1.6
TS208_1o | AB              | BC               | 2.6, 0.63       | Not ranked (only 9 and 8 interface residues)  | NA
TS208_1o | BC              | AC               | 2.5, 0.54       | Negative                                      | -0.2
TS208_1o | AC              | AB               | 2.6, 0.52       | Negative                                      | -0.39
|
For the model TS208_1o (CCC = 0.42; target structure CCC = 0.77), we generated density maps at resolutions lower than the target map (5, 8, 10 and 12 Å) using the ‘low pass filter’ utility in the CCP-EM suite (https://www.ccpem.ac.uk/). Since the CCC has no defined absolute cut-off for differentiating between good and bad fits at a given resolution, it is difficult to identify ‘target-like’ models with it alone. PI-score, by contrast, is a density-independent metric and can be very useful for distinguishing ‘target/native-like’ interfaces in the modelled complex(es) (Supplementary Table S2).
T0995o: A 3.15 Å resolution homo-octamer (A8) of cyanide dehydratase
The PI-score for the target structure was positive for the dimer interface (Supplementary Figure S1a), which is repeated to form the octameric complex. We calculated PI-scores for 657 interfaces in the 118 CASP13 models for this target and assessed the quality of the dimer interface between all subunits; 123 interfaces in 37 models had negative PI-scores. The top-ranked model (after the target) in terms of CCC was TS008_2o (Supplementary Figure S1b), which has a positive PI-score for the equivalent dimer interface (iRMSD = 1.55 Å). Examples of models with negative PI-scores are TS117_1o (iRMSD = 4 Å, Supplementary Figure S1c) and TS008_5o (iRMSD = 2.76 Å, Supplementary Figure S1d). The models whose interfaces received negative PI-scores from our classifier were also scored low by the CASP13 multimeric scores (Supplementary Table S1).
Comparing PI-scores with the CASP13 multimeric scores, we achieve an accuracy of 67% for this target. This target has a higher stoichiometry and more interfaces than T1020o, so a lower accuracy against the CASP13 assembly scores is expected: those scores are calculated per complex, whereas our classifier scores each interface, and hence this is not a direct comparison.
PI-scores for the assessed interfaces in models for CASP13 cryo-EM targets are provided in Supplementary Table S3.
T0984o: A 3.4 Å dimer of a calcium channel
145 models were assessed, and all were observed to have a positive PI-score for the interface (Supplementary Table S3).
Given the nature of CASP experiments, in which the participating groups model the complexes without knowledge of the cryo-EM map, protein-protein interface assessments such as PI-score provide a complementary evaluation and are crucial for gaining insights into model quality.
Application to EM Model Challenge
Next, we calculated PI-scores for the models submitted for the targets from two EM validation challenges (https://challenges.emdataresource.org/), namely, 2016 EM model challenge and 2019 model metrics challenge (Supplementary Table S4).
Target T0002 (from the 2016 model challenge) is a 3.3 Å resolution cryo-EM map of the 20S proteasome (EMD-5623). We assessed the ten submitted models (with 175 interfaces) and the interfaces in the target structure (PDB ID: 3J9I). In three of the models (EM164_1, EM189_1 and EM189_2), at least one interface obtained a negative PI-score.
As an example, we chose model EM164_1, for which most of the interfaces in the alpha and beta subunits were scored negative (Supplementary Table S4). In the alpha ring, the interface between two subunits in the model (chains F and C, shown in red and green in Fig. 4a) was scored negative by our classifier (PI-score: -2.27). The interface conformation differs slightly from that of the target structure (iRMSD = 0.86, fNal = 0.54). This interface is loosely packed (Fig. 4) and smaller than the equivalent interface in the target structure (23 interface residues in the model versus 37 in the target structure). Because of its small size the iRMSD is low, and it is therefore not a good indicator of the quality of the modelled interface in this case. Owing to the offset in the modelled interface, the shape complementarity at the interface drops significantly, to 0.32 as opposed to 0.73 for the interface in the target structure. We also checked the multimeric scores from the CASP assessment: EM164_1 scores highly for QS-global (0.88) and lDDT (0.98), but these scores reflect the quality of the multimeric structure as a whole rather than of each interface. Other interface assessment scores (from CASP13), such as F1 and the Jaccard index, which are calculated per interface, are not reported on the CASP website for this model.
Structurally equivalent subunits in the target structure (chains P and Q) have a CCC of 0.85, whereas the model subunits (chains F and C) have a CCC of 0.73 (Supplementary Table S5); the local score (SMOC) averaged over interface residues is 0.73 and 0.23 for the target and EM164_1, respectively. Our model correctly predicted this interface as ‘negative’, as reflected by the loose packing at the interface and the lower local density-based score for the modelled interface.
Further, we calculated the density-based scores (global and SMOC) at different resolutions (maps simulated using the low pass filter utility in CCP-EM; Supplementary Table S5). The scores assessing the fit of the model (with the interface offset) become comparable at resolutions worse than 5 Å. Therefore, especially at intermediate-to-low resolution, our proposed density-independent PI-score can be a crucial model validation tool.
The interfaces between the beta subunits of EM164_1 were also scored negative by our classifier. This is reflected by the presence of steric clashes at the interface (blue and purple in Fig. 4). The clashes at the interface result in a lower shape complementarity score for the interface in the model (0.28), as opposed to a higher score (0.62) for the equivalent interface in the target structure. The target subunits (chains X and Y) have a CCC of 0.85, whereas the model subunits (chains n and d) have a CCC of 0.65. This model interface also has a much lower SMOC score than the equivalent interface in the target at all resolutions (Supplementary Table S5).
More recently, with the opening of the 2019 model metrics challenge, we applied our score to its only multimeric target, T0104 (horse liver alcohol dehydrogenase, 2.9 Å, dimer). We assessed the reference structure (PDB ID: 6NBB) and 17 submitted models using PI-score. Two models (T0104EM060_1, PI-score -0.31; and T0104EM060_2, PI-score 0.13) were scored low (Supplementary Table S4), in agreement with the CASP multimeric scores (QS and lDDT scores, https://challenges.emdataresource.org/?q=model-metrics-challenge-2019).
Application to Fitted Entries in EMDB
We divided this dataset into three sets: better than 4 Å (‘high resolution’), 4–8 Å (‘intermediate resolution’) and 8–12 Å (‘low resolution’). As the performance of PI-score on high-resolution complexes has been described above using the CASP and EM model challenge targets, in this section we focus on selected examples from intermediate- and low-resolution cryo-EM maps. The fitted models were also compared with the interfaces in the corresponding crystal structures.
For completeness, we also provide the PI-scores of our SVM model for the interfaces fitted at high resolution (better than 4 Å) in Supplementary Table S6.
Intermediate resolution (4–8Å)
Chikungunya virus: a cryo-EM map resolved at 5 Å (EMD-5577) with a fitted model is available for the virus (PDB: 3J2W, shown in green and red in Fig. 5a, with interface residues shown as grey circles). The envelope1-envelope2 (E1-E2) heterodimer was observed to have a negative PI-score (-1.67). The available crystal structure (PDB: 3N44, 2.35 Å) for the E1-E2 subcomplex (chains B and F, coloured gold with interface residues as grey spheres, Fig. 5a) is scored positive (PI-score: 1.67). The interface between E1 and E2 is slightly shifted compared with the crystal structure (Supplementary Table S7).
We also calculated the density-based scores (global and local) for the E1-E2 subcomplex to assess the fitted model and the crystal structure. The E1-E2 subcomplexes from both the fitted model and the crystal structure have a CCC of 0.64 and an average SMOC over interface residues of 0.72, and hence are indistinguishable by these scores. The per-residue SMOC score is plotted in Fig. 5a for both chains, with the average SMOC score per chain shown as a blue dashed line. Interestingly, the interface residues (grey circles) score higher than the per-chain average, especially for chain B. Therefore, at intermediate resolution, interface-based scores such as PI-score can prove useful for detecting offsets in modelled protein-protein interfaces that are indistinguishable by density-based scores.
PI-scores for the interfaces in the fitted models at the intermediate resolution range derived from EMDB are available as Supplementary Table S8.
Low resolution set (8–12Å)
Transcription initiation factor TFIID complex: a cryo-EM map resolved at 9.8 Å resolution (EMD-9302) with a fitted model is available for the structure of the TFIID complex (PDB: 6MZD, shown in green and red in Fig. 5b, with interface residues shown as grey circles). The interface between subunits 9 and 5 in the fitted model (chain pair LF) was scored negative (PI-score: -1.99). This interface is shifted when compared with the corresponding crystal structure (Supplementary Table S7) at 2.5 Å (PDB: 6F3T, chains F and A, shown in gold with interface residues marked as grey circles, Fig. 5b).
Next, we calculated the density-based scores (global and local) to assess the fitted model and the crystal structure. After local optimisation of the fit in the map, the CCC of the fitted model is 0.54 and that of the crystal structure is 0.63, whereas the average SMOC score over interface residues is 0.84 and 0.87 for the fitted model and the crystal structure, respectively (Fig. 5b). In this example, we see again (this time with low-resolution maps) that PI-score can provide additional complementary assessment when density-based scores alone are not sufficient to identify offsets in the modelled interfaces.
PI-scores for the interfaces in the fitted models at the low-resolution range derived from EMDB are available as Supplementary Table S9.
Application to SARS-CoV-2 Cryo-EM Derived Complexes
We also assessed the fitted models in the SARS-CoV-2 cryo-EM maps using PI-score. 108 fitted models were downloaded from EMDB [https://www.ebi.ac.uk/pdbe/emdb/searchResults.html/?EMDBSearch&q=text:(ncov%20OR%20SARS-CoV-2)]. Of the 108 models, we were able to successfully compute interface features and PI-scores for 55 complexes (149 interfaces). Of these 149 interfaces, 12 were observed to have negative PI-scores (Supplementary Table S10), 11 of them being antibody-antibody or protein-antibody interfaces. As our machine learning classifier is not trained on such interfaces (which are reported to have shape complementarity different from that of other protein-protein interfaces17), we decided not to investigate these cases further.
However, the interface between the small subunits (S28-S5) of a human 40S ribosome bound to the SARS-CoV-2 protein nsp1 (blue spheres, Supplementary Figure S2a) (EMD-11301, PDB ID: 6ZMT) obtained a negative PI-score of -0.04 (Supplementary Figure S2b). We next inspected this complex using the validation suite in CCP-EM (Joseph A.P. et al., under preparation). The sub-complex S28-S5 was found to have a clashscore of 7.20, with severe clashes reported at the interface. We used ‘real space refine zone’ and ‘auto fit rotamer’, with backrub rotamers switched on, to fix the steric clashes at the interface in Coot40. Upon re-refinement in Coot, the clashscore drops to 6.20 and the PI-score improves to 0.25 (Supplementary Figure S2c). The improvement in the PI-score is most likely due to resolving the clashes between the interface residue pairs R63-A138, V55-S34 and L59-R122 (Supplementary Figure S2b) from chains d and K, respectively.
Comparison With Protein-protein Interface Statistical Potentials
Next, we compared PI-score with existing protein-protein interface statistical potentials (PIE41 and PISA42) commonly used in protein-protein docking. PIE and PISA provide residue-level and atomic-level potentials, respectively, and we also used a combination of the two (0.1*PISA + (-0.8)*PIE + PISA*PIE), which is reported to perform better at identifying “native-like” complexes42. We used the 30% randomly selected test dataset from the entire set (PD1 + PD2 + ND) to calculate the combined PIE-PISA statistical potential for the interfaces. Different weights for the SVM-based score and the statistical potentials were tried, ranging from 0 to 1 in increments of 0.1. For this dataset, the combined statistical-potential score alone was unable to discriminate between complexes from the ND and PD2 datasets, which are both derived using docking, whereas the PI-score separates these complexes better (Supplementary Figure S3).
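Given PIE and PISA values for an interface, the combined statistical potential follows directly from the formula above; a minimal sketch with a hypothetical function name:

```python
def combined_potential(pie, pisa):
    """Combined PIE-PISA statistical potential reported to better
    identify native-like complexes: 0.1*PISA - 0.8*PIE + PISA*PIE."""
    return 0.1 * pisa + (-0.8) * pie + pisa * pie

# Example with arbitrary potential values (not real interface data):
# 0.1*2.0 - 0.8*1.0 + 2.0*1.0 = 1.4
print(combined_potential(1.0, 2.0))
```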