Structure prediction of high confidence human protein interactions
We selected experimentally determined human interactions from the Human Reference Interactome (HuRI) and the Human Protein Complex Map (hu.MAP 2.0). HuRI comprises interactions determined by yeast two-hybrid screening (Luck et al, 2020) from which we modelled 55586 pairs. From hu.MAP we selected 10,207 high-quality (confidence score ≥0.5) protein-protein interactions (PPIs), which were derived by integration of affinity purification, co-fractionation and proximity ligation assays (Drew et al, 2021). While HuRI is more likely to be enriched for direct protein interactions, including potentially transient partners, the hu.MAP set is more likely to reflect stable protein interactions, including members of the same complex that may not necessarily interact directly. The overlap between the two datasets is small (309 pairs), and a comparison with two large scale compendiums of structural models ((Mosca et al, 2012), Methods) indicates that 62,019 of the combined pairs do not have experimental models or can be modelled easily by homology, suggesting a significant potential gain in structural knowledge.
We predicted the structure of 65,484 non-redundant pairs using the FoldDock pipeline (Bryant et al, 2021), based on AlphaFold2 (Jumper et al, 2021). We have previously shown that larger interface size and higher predicted lDDT (plDDT) scores from AlphaFold2 in the interfaces of the predicted complexes are associated with more reliable predictions (Bryant et al, 2021). As in the FoldDock pipeline, we combined these two metrics into a single score, which can be used to predict the DockQ score of a complex, dubbed pDockQ (Methods) that can rank models by confidence. We tested the overlap performance and ranking by pDockQ score by comparing the predicted models with experimental models. Across 1,465 comparisons, 742 (50%) of predicted complexes were deemed to be well modelled (DockQ>0.23). For predictions with pDockQ>0.23, we estimate that 70% (671 out of 955) are well modelled and for pDockQ>0.5, 80% (521 out of 651) are considered well modelled. However, it is worth noting that this estimated performance applies to cases where we expect that two proteins interact via direct contact in a single conformation.
We show in Fig. 1A the distribution of pDockQ for the predicted interactions and a set of predicted structures for 2,000 random pairs of proteins. The pDockQ of known interacting proteins tends to be higher than for the random set with the predictions for hu.MAP showing on average higher confidence than for the HuRI set. Additionally, when selecting hu.MAP interactions also supported by yeast-two-hybrid (Y2H) or cross-link data (cross linking) results in much higher confidence values (Fig. 1A). This suggests that high confidence models are enriched for protein interactions supported by the two types of methods associated with high affinity and direct interactions. Based on the benchmarking results, we selected 3,137 structures (Fig. 1B) as high confidence models based on a cutoff of pDockQ>0.5, which would indicate around 80% of correct models in comparison with the experimental models. Only 0.3% of the random set of models would be considered a confident prediction at this cut-off. In Fig. 1C we show examples of predicted structures aligned to experimental or homology models, showing how the predictions and the confidence score relate to the observed alignments. For the majority of these cases, even with lower confidence values, the interaction interface is generally in good agreement, except for the PSMC2-PSMD11 interaction, which has the lowest confidence score of the illustrated models.
A list of protein interactions with predicted structural models, confidence metrics and annotations are provided in Table S1 and all models are available as described in the data availability section.
Cross-linking support for predicted complex structures
Chemical cross-linking followed by mass spectrometry is an approach that can be used to identify reactive residues (usually lysines) that are in proximity, as constrained by the geometry of the cross-link agent used. The identification of such residues across a pair of proteins can help define the likely protein interface. To determine if the predicted complex structures agree with such orthogonal spatial constraints, we obtained a compilation of cross-links for pairs of residues across 528 protein pairs with predicted models (Fig. 2A, Table S1, see Methods). Of these, 51% of models have one or more cross-links at a distance below the expected maximal distance possible (Fig. 2A, Methods). Restricting the predicted models to higher confidence by the pDockQ score increases the fraction of complexes with cross-links within the maximal distance possible, reaching 75% for pDockQ scores greater than 0.5 (Fig. 2A). This result is in line with the benchmark results above, indicating that most models are likely to be correct at a pDockQ cut-off above 0.5. Additionally, predicted structures with pDockQ>0.23 are also likely to have many correct models as judged by the fraction supported by cross-linking.
In total, we have identified 479 cross-links providing supporting evidence for 171 predicted complex structures with pDockQ>0.5. Out of these, 41 correspond to complex structures with no experimental structure or homology models, from which we selected some to illustrate in Fig. 2B-E. Fig. 2B shows the AF2 model for the full length of the ERLIN1/ERLIN2 complex, which mediates the endoplasmic reticulum-associated degradation (ERAD) of inositol 1,4,5-trisphosphate receptors (IP3Rs). AlphaFold2 predicts a globular domain (1-190) followed by an extended helical region with a kink around amino acid position 280. Unlike the model in Interactome3D, the paralogous proteins are stacked side-by-side with the hydrophobic face of the helices buried and the hydrophilic face (mainly Lys) exposed to solvent. A cross-link between the C-terminal residues K275 (ERLIN1) and K287 is predicted to bridge a distance of 18 Å, supporting the predicted model. In Fig. 2C we show the model for proteins IMMT and CHCHD3, components of the mitochondrial inner membrane MICOS complex. AlphaFold2 predicts a globular helical domain at the C-terminal end of IMMT (550-750) to interact with the C-terminal end of CHCHD3 (150-225). This is supported by data of three cross-links between; K173 (CHCD3) and K565 (IMMT), and K203 (CHCD3) to both K714 and K726 of IMMT. Fig. 2D shows the complex of tRNA-guanine-N(7)-methyltransferase (METTL) with its non-catalytic subunit (WDR4). The structure of WDR4 has not yet been solved experimentally but contains WD40 repeats, which are expected to form a β-propeller domain, as predicted here. The METTL domain is predicted to interact with the side of the WDR40, away from the ligand-binding pore. This orientation is supported by a cross-link between K122 (WDR4) and K143 (METTL) (18 Å). Finally, in Fig. 2E we show the predicted complex structure for the heterogeneous nuclear ribonucleoprotein C (HNRNPC) and the RNA-binding protein, RALY. Two regions in both proteins are predicted with high confidence (plDDT>70), with the lower confidence regions not shown. The N-terminal domain in HNRNPC (16-85) is predicted to interact with the N-terminal domain of RALY (1-100). A long helix in HNRNPC (185-233) is predicted to interact with a helix in RALY (169-228). This interhelix interface is supported by crossl-inking data for three pairs of lysines at either end of the helices (189→ 222; 229→179 and 232→ 183).
Disease-associated missense mutations at interfaces
Missense mutations associated with human diseases can alter protein function via diverse mechanisms, including disrupting protein stability, allosterically modulating enzyme activity, and altering protein-protein interactions. Structural models can lead to the identification of interface residues allowing for the rationalisation of possible mechanisms of such interface disease mutations. To determine the usefulness of the predicted structures for studying disease mutations, we compiled a set of mutations located at interface residues that were previously experimentally tested for the impact on the corresponding interaction (IMEx Consortium Curators et al, 2019). We then performed in silico predictions of changes in binding affinity upon mutations using FoldX (Delgado et al, 2019) and observed that mutations known to disrupt the interactions are predicted to have a strong destabilisation of binding compared to mutations known not to have an effect (Fig. 3A, Table S2). Higher confidence (plDDT) of the mutated residues led to more substantial discrimination between mutations known or not known to disrupt or not complex formation (Fig. 3A).
Having established the value of the predicted structures for modelling interface mutations, we mapped human disease (from ClinVar) and cancer mutations (from TCGA) to the interface residues defined by the set of high confidence protein complex predictions (pDockQ>0.5). The hu.MAP and HuRI confident predictions identified 280 interfaces carrying pathogenic mutations and 602 interfaces corresponding to the top 25% recurrently mutated interfaces in cancer, defined as the highest number of mutations per interface position (Fig. 3B, Methods). We illustrate in Fig. 3C examples of protein network clusters with interface disease mutations across a range of biological functions. For example, interface mutations in chromatin remodelling, including members of SWI/SNF complex (SMARCD1, SMARCD2, SMARCD3) and several transcription factors related to the development (e.g. TCF3, TCF4, LMO1 and LMO2). All of the disease mutation information is provided in Table S1.
We selected examples of interfaces with disease mutations and no previous experimental data or homology to available models (Fig. 3D-G). Fig. 3D shows the interface of WDR4-METTL1 that has supporting cross-link information described above. WDR4 has two annotated pathogenic variants at this interface, linked with Galloway-Mowat Syndrome 6, with the highlighted R170 participating in interactions with a negatively charged residue of METTL1. Fig. 3E shows an example of an interface with 32 recorded interface mutations in cancer for both proteins, including the highlighted arginines in LDOC1, which form electrostatic interactions with the opposite chain. TWIST1 has several annotated pathogenic mutations, including L149R and L159H, which are at residues buried in the interface (Fig. 3F). In particular, the leucine to arginine mutation, associated with the Saethre-Chotzen syndrome, would strongly disrupt packing. The R118G mutation would disrupt the interaction with residue F22 mainchain O in TCF4. In RAD51D we found the mutation R266C (Breast-ovarian cancer, familial) that interacts across the interface with XRCC2 (Fig. 3G), paralogous genes involved in the repair of DNA double-strand breaks by homologous recombination. Interestingly, we also found mutations at R239, to Trp/Gln/Gly, associated with Breast-ovarian cancer that interacts with Tyr119 in XRCC2 that itself is also annotated as having mutations linked to hereditary cancer-predisposing syndrome.
Phospho-regulation of protein complex interfaces
Protein phosphorylation can regulate protein interactions by modulating the binding affinity via the change in size and charge of the modified residue. Over 100,000 experimentally human phosphorylation sites have been determined to date (Ochoa et al, 2020; Lawrence et al, 2016), but only 5 to 10% of these have a known function (Hornbeck et al, 2015). Mapping phosphorylation site positions to models of protein interfaces can generate mechanistic hypotheses for the functional role of phosphorylation sites in controlling protein interactions. We used a recent characterisation of the human phosphoproteome (Ochoa et al, 2020) to identify 4145 unique phosphosites at interface residues of the set of confidently predicted structures. We noted that the average functional importance, defined by the functional score described by Ochoa and colleagues (Ochoa et al, 2020), was generally higher than random for phosphorylation sites at interfaces (Fig. 4A). Among the interface phosphorylation sites, we found some enrichment for targets of multiple kinases, including several tyrosine kinases (ERBB2, AXL, ABL2, FER) (Fig. 4B). This observation suggests that some interfaces in different protein pairs may be under coordinated regulation by specific kinases and conditions.
To identify potentially co-regulated interfaces, we collected measurements of changes in phosphorylation levels across a large panel of over 200 conditions (Ochoa et al, 2016). We retained 260 phosphosites that had a significant regulation in three conditions and then computed all-by-all pairwise correlations in phosphosite fold changes across conditions. We clustered these phosphosites by their profile of correlations (Fig. 4C), identifying 16 groups of co-regulated interface phosphorylation sites (Fig. 4C, Table S3). For each group of phosphosites, we identified the conditions where these have the strongest up- or down-regulation (Fig S1) and plotted a subset of conditions in Fig. 4D. We also performed a GO enrichment analysis for each group of co-regulated phosphosites, including both proteins of the modified interfaces, to search for common biological functions (Fig. 4E, Table S4). For example, we observed a cluster of interface phosphosites in proteins related to intermediate filaments (cluster 7) that show strong regulation patterns along the cell cycle, downregulated in S-phase and up-regulated in G1 and mitosis. Phosphosites in cluster 1 (cell cycle G1-S phase transition) show the opposite trends with up-regulation in late S-phase and down-regulation in G1 and mitosis. Some clusters show regulation under specific kinase inhibition which may provide novel hypotheses for kinase regulation of specific processes. For example, phosphosites in cluster 9 (regulation of chromosome assembly) tend to be up-regulated after inhibition of ROCK and up-regulation after inhibition of mTOR.
While not all phosphosites at interfaces are likely to regulate the binding affinity, this analysis provides hypotheses for the potentially coordinated regulation of multiple proteins by tuning their interactions after specific perturbations. We provide the complete list of interface phosphosites, known kinase regulators and condition-specific regulation in Table S1.
Higher-order assemblies of protein complexes from binary interactions
Proteins interact with multiple partners either simultaneously, as part of larger protein complexes, or separated in time and space. This is also reflected in our structurally characterised network, where proteins can be found in groups as illustrated in a global network view of the interactions with confident models (Fig 5, central network, Fig S2 and Data S1). One key benefit of structurally characterising an interaction network is the identification of shared interfaces for multiple interactors. As an example, we highlight GDI1 (Rab GDP dissociation inhibitor alpha) that interacts with multiple Rab proteins regulating their activity by inhibiting the dissociation of GDP. The predicted complex structures for these interactions shows how these share the same interface and therefore cannot co-occur. Other clusters in the network suggest that the proteins form larger protein complex assemblies with many-to-many interactions. As the use of AlphaFold2 for predicting larger complex assemblies can be limited by computational requirements, we tested whether the structures for pairs of proteins could be iteratively structurally aligned. We tested this procedure on a small set of complexes covered in this network, with known structures and the number of subunits ranging from 5 (RFC complex, TFIIH core complex) to 14 (20S proteasome). We then aligned an experimentally determined structure with the predicted models (Fig. 5, grey - experimental model). These examples showcase the potential and also limitations of this procedure.
The TFIIH core complex is composed of 5 subunits with 1-to-1 stoichiometry. All subunits can be modelled with the final complex generally agreeing (Fig. 5) with a cryoEM structure for these subunits (6NMI). The most significant difference to the cryoEM model is the relative positioning of the ERCC3 subunit. The exact final model obtained can vary depending on the aligned pairs with multiple possible final conformations (Fig S3). Fig. 5 illustrates the conformation that best matches the cryoEM model in 6NMI. For example, for the TFIIH core complex, there is a predicted model where the complex adopts a more open conformation (as seen in 5OQJ) and alternative predicted placements of the GTF2H1 subunit.
The RFC complex is also composed of 5 subunits with 1-to-1 stoichiometry. One iterative alignment of pairwise interactions builds a model that includes all five subunits organised similarly to that observed in the 6VVO cryoEM structure (Fig. 5). In this predicted model, the subunits RFC2/5/4/3 match the experimentally observed model well, but there are apparent deviations introduced by compounding errors in alignment by this iterative process. Individual subunits in the cryoEM structure can be aligned to each of the model subunits well, but then the alignment of the rest of the model is progressively worse the further away the subunits are positioned from the aligned subunit. The RFC1 subunit is individually not well predicted, showing a considerable difference between the cryoEM and AlphaFold2 models. Some of the modelled pairs highlighted additional issues. For example, the RFC3 - RFC5 interaction pair is predicted with high confidence, while in fact, these do not share a direct contact in the experimental structure. AlphaFold2 places RFC3 at the RFC5-RFC4 interface, likely due to the structural similarity between RFC3 and RFC4.
Encouraged by the examples tested, we defined an automatic procedure to generate larger models by iterative alignment of pairs (Methods). We start building all possible dimers in a complex, then sort them by pDockQ, and start building from the first ranked dimers. Next, we add the highest-ranked dimer, which shares one subunit with the complex if it does not overlap; this is repeated for all dimers until the complex is complete or no additional proteins can be added. We tested this on the 20S proteasome, a particularly challenging example with stoichiometries different from 1-to-1 and homologous subunits. This automatic procedure could build a model containing all 14 subunits (half of the proteasome) that are mostly placed in agreement within the experimental model (Fig. 5). However, the exact order of the chains is incorrect, i.e. at each location an incorrect protein is placed, highlighting that AF2 cannot distinguish which two proteins interact from a set of homologous proteins.
In summary, we find that it is possible to iteratively align structures of pairs of interacting proteins to build larger assemblies but identified issues that limit this procedure at the moment.