Construction of protein point cloud.
Compared with a naive point cloud, which is unordered and homogeneous, our proposed protein point cloud consists of ordered and heterogeneous points extracted from the raw protein structure. Specifically, each point corresponds to the alpha C atom of an amino acid. In addition to the 3-dimensional coordinates of each point (x, y, z), the type of residue each point belongs to (R) and the position of that residue in the protein sequence of length L (P) are attached as point features.
Each point in the protein point cloud is defined as [x, y, z, R, P], where R ∈ {G, A, V, L, I, S, T, C, M, D, E, N, Q, R, K, F, Y, W, P, H} and P ∈ {1, 2, 3, ..., L}.
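As an illustration of this definition, the following Python sketch extracts such a point cloud from a structure file using Biopython; the parsing library, the chain selection, and the handling of residues lacking an alpha C atom are illustrative assumptions rather than reported details.

# Sketch: build the [x, y, z, R, P] point cloud from a PDB file (Biopython assumed).
from Bio.PDB import PDBParser

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def build_point_cloud(pdb_path, chain_id="A"):
    structure = PDBParser(QUIET=True).get_structure("prot", pdb_path)
    points, pos = [], 0
    for residue in structure[0][chain_id].get_residues():
        name = residue.get_resname()
        if name not in THREE_TO_ONE or "CA" not in residue:
            continue  # skip water, ligands and residues without an alpha C atom
        pos += 1      # P: position in the protein sequence of length L
        x, y, z = residue["CA"].get_coord()
        points.append((float(x), float(y), float(z), THREE_TO_ONE[name], pos))
    return points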
Architecture of the multimodal deep representation learning model.
To decipher protein functions at the residue level, we developed a multimodal protein representation learning model (~659.3 million parameters). It applies an encoder-decoder architecture to simultaneously learn sequence context and structural constraints from millions of proteins (see Fig. 1). For a protein of length L, the encoder takes the masked sequence and the masked protein point cloud as input and generates a K-dimensional feature vector for each amino acid. The latent representations (L × K) are then fed into the decoder to complete the missing elements of the corrupted sequence and protein point cloud. K is set to 1280 during training and inference.
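The following PyTorch sketch illustrates this encoder-decoder data flow at a schematic level; the vocabulary size and every sub-module are simplified stand-ins (for example, a linear layer replaces the SE(3)-Transformer structure embedding described below), not the released implementation.

import torch.nn as nn

class MultimodalSketch(nn.Module):
    """Schematic data flow only: masked sequence + masked point cloud -> L x K latents -> decoders."""
    def __init__(self, vocab_size=33, k=1280):
        super().__init__()
        self.seq_embed = nn.Embedding(vocab_size, k)           # sequence embedding module
        self.struct_embed = nn.Linear(3, k)                     # stand-in for the structure embedding module
        layer = nn.TransformerEncoderLayer(d_model=k, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=33)   # 33-layer transformer encoder module
        self.seq_decoder = nn.Linear(k, vocab_size)             # completes masked tokens
        self.struct_decoder = nn.Sequential(nn.Linear(k, k), nn.ReLU(), nn.Linear(k, 3))  # completes masked coordinates

    def forward(self, masked_tokens, masked_coords):
        # masked_tokens: (batch, L) token ids; masked_coords: (batch, L, 3) point coordinates
        h = self.encoder(self.seq_embed(masked_tokens) + self.struct_embed(masked_coords))
        return self.seq_decoder(h), self.struct_decoder(h)      # (batch, L, vocab), (batch, L, 3)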
The sequence embedding module, the transformer encoder module and the sequence decoder module apply networks similar to those of current protein language models27,28. Specifically, the transformer encoder module is a 33-layer stacked Transformer, and each layer consists of one layer normalization block, one 8-head attention block, and one feed-forward network.
The global features of a protein tertiary structure should be invariant to arbitrary input poses, which means 3D translations and rotations of the input protein structure should not affect the output. To guarantee such invariance, we chose the NVIDIA-optimized version of the SE(3)-Transformer64 as the structure embedding module, which contains one 8-head attention block interspersed with one normalization module, one TFN (tensor field network) layer, and one max pooling layer. We used a 1-layer SE(3)-Transformer for large-scale training. The structure decoder module is a multilayer perceptron network.
To capture the structural signatures of a protein, the structure embedding module first computes the K nearest neighbors of each point as well as their relative positions. Next, an equivariant weight matrix is built upon the Clebsch-Gordan coefficients and spherical harmonics to guarantee the equivariance of point features under 3D transformations. Third, the attention mechanism is applied to pass features between adjacent points. Finally, point features are aggregated and pooled to output the final structural signatures.
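The first step, gathering the K nearest neighbors and their relative positions, can be sketched as follows (the value of K and the use of torch.cdist are illustrative assumptions; the equivariant weights and attention are provided by the SE(3)-Transformer itself):

import torch

def knn_neighborhoods(coords, k=16):
    # coords: (N, 3) alpha C coordinates of the protein point cloud
    dists = torch.cdist(coords, coords)                         # (N, N) pairwise distances
    knn_idx = dists.topk(k + 1, largest=False).indices[:, 1:]   # drop self, keep k neighbors
    rel_pos = coords[knn_idx] - coords[:, None, :]              # (N, k, 3) relative positions
    return knn_idx, rel_pos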
Model training.
We used proteins from the AlphaFold protein structure database as the self-supervised training dataset. It contains ~200 million structures predicted by AlphaFold2. We removed proteins shorter than 64 amino acids and those with an average pLDDT score lower than 70. We randomly selected ~0.5 million proteins for validation. The final training dataset contains ~160 million proteins. Both the amino acid sequence and the protein point cloud were extracted from the raw protein structure for multimodal training. Since ~95.88% of UniParc sequences contain fewer than 1024 amino acids, we set the context size to 1024. For proteins longer than 1024 amino acids, we sampled the start position of the crop from a uniform distribution over [1, n - x + 1], where n is the protein length minus 1024 and x is sampled from a uniform distribution over [0, n]. For proteins shorter than 1024 amino acids, padding tokens were appended to the sequence, and random alpha C atoms selected from the raw structure were appended to the extracted protein point cloud.
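The cropping and padding rule above can be transcribed literally as follows; the pad token name is a placeholder, and the point cloud is assumed to be stored in sequence order.

import random

CONTEXT = 1024
PAD_TOKEN = "<pad>"   # placeholder; the actual padding symbol depends on the tokenizer

def crop_or_pad(sequence, point_cloud):
    L = len(sequence)
    if L > CONTEXT:
        n = L - CONTEXT
        x = random.randint(0, n)                 # x ~ U[0, n]
        start = random.randint(1, n - x + 1)     # start position ~ U[1, n - x + 1]
        return (sequence[start - 1:start - 1 + CONTEXT],
                point_cloud[start - 1:start - 1 + CONTEXT])
    pad = CONTEXT - L
    return (sequence + [PAD_TOKEN] * pad,
            point_cloud + random.choices(point_cloud, k=pad))   # random alpha C atoms as padding points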
The extracted amino acid sequence and protein point cloud were then corrupted and recovered by the proposed multimodal model during training. To mask the protein sequence, we randomly sampled 15% of tokens from the tokenized sequence; each of them was replaced with a special mask token with 80% probability, a randomly chosen alternative amino acid token with 10% probability, and the original input token (i.e., no change) with 10% probability. To mask the protein point cloud, we calculated the central point of the protein and chose the 256 points nearest to it. We masked the coordinates of these points and trained the proposed multimodal network to recover them.
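A minimal sketch of both corruption steps is given below; the mask token id, the treatment of masked coordinates (zeroing), and the tensor layout are assumptions.

import torch

MASK_ID = 32            # placeholder mask-token id
NUM_AA = 20

def mask_sequence(tokens, mask_prob=0.15):
    # BERT-style corruption: 80% mask token, 10% random amino acid, 10% unchanged
    tokens = tokens.clone()
    selected = torch.rand(tokens.shape) < mask_prob
    action = torch.rand(tokens.shape)
    tokens[selected & (action < 0.8)] = MASK_ID
    swap = selected & (action >= 0.8) & (action < 0.9)
    tokens[swap] = torch.randint(0, NUM_AA, tokens.shape)[swap]
    return tokens, selected                      # 'selected' marks the positions to recover

def mask_point_cloud(coords, n_mask=256):
    # mask the 256 points nearest to the central point of the protein
    center = coords.mean(dim=0, keepdim=True)
    nearest = torch.cdist(coords, center).squeeze(1).topk(min(n_mask, len(coords)), largest=False).indices
    corrupted = coords.clone()
    corrupted[nearest] = 0.0                     # zeroing as a stand-in for "masked" coordinates
    return corrupted, nearest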
The loss function was the sum of a categorical Cross-Entropy (CE) loss and a permutation-invariant Chamfer Distance (CD) loss65. In particular, the CE loss measures the differences between the model's predictions and the true tokens at masked positions of the amino acid sequence. The CD loss quantifies the completion results by calculating the average nearest squared distance between the recovered protein point cloud and the ground truth. By minimizing the CE loss and the CD loss, our proposed model learns high-order representations of a protein in a self-supervised manner.
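For concreteness, the combined loss can be sketched as below; equal weighting of the two terms is an assumption.

import torch
import torch.nn.functional as F

def chamfer_distance(pred, target):
    # pred: (N, 3) recovered points, target: (M, 3) ground-truth points
    d = torch.cdist(pred, target) ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def multimodal_loss(seq_logits, true_tokens, masked_pos, pred_points, true_points):
    ce = F.cross_entropy(seq_logits[masked_pos], true_tokens[masked_pos])   # masked-token prediction
    cd = chamfer_distance(pred_points, true_points)                         # point-cloud completion
    return ce + cd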
All layers except the transformer encoder module are initialized from a zero-centered normal distribution with a standard deviation of 0.02. The transformer encoder module is initialized with the parameters of ESM1b27. We trained the multimodal deep representation learning model for 380K steps using the Adam optimizer (β1 = 0.9, β2 = 0.999) at an initial learning rate of 1e-4 with a batch size of 480. The learning rate increases linearly during a warm-up period of 10,000 steps and afterwards follows an inverse square root decay schedule. The training took about 1 month on 120 NVIDIA A100 GPUs.
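The learning-rate schedule described above corresponds to the following rule (a direct transcription, not released code):

def learning_rate(step, base_lr=1e-4, warmup=10_000):
    if step < warmup:
        return base_lr * step / warmup            # linear warm-up
    return base_lr * (warmup / step) ** 0.5       # inverse square root decay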
Benchmarking multimodal representations with function-related datasets.
We used 15 function-related datasets (see Supplementary Table 6) to benchmark the performance of our multimodal deep representation learning model against a comprehensive suite of baselines. For each dataset, protein representations are either fed into a Multi-Layer Perceptron (MLP) or integrated into a customized model to make final predictions. Depending on the downstream network, two types of representations are used: the residue-level representation of the protein (protein length × 1,280) and the molecular-level representation averaged across the length of the protein (1,280). Details are introduced as follows.
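The two kinds of representations are related by a simple average over the sequence dimension; how padded positions are handled here is an assumption.

import torch

def molecular_representation(residue_repr, valid_mask=None):
    # residue_repr: (L, 1280) residue-level representation of one protein
    if valid_mask is not None:
        residue_repr = residue_repr[valid_mask]   # drop padded positions (assumption)
    return residue_repr.mean(dim=0)               # (1280,) molecular-level representation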
EC annotation tasks. The Enzyme Commission (EC) number66 is a commonly used classification scheme that specifies the catalytic function of an enzyme by four digits. Three diverse datasets are used for benchmarking. The EC-PDB dataset, constructed by Gligorijević et al.43, consists of non-redundant proteins retrieved from the PDB. Proteins in the test set have corresponding experimentally determined PDB structures and at least one experimentally determined annotation. The EC-New-392 dataset was constructed by Yu et al.67 and consists of 392 proteins covering 177 different EC numbers from Swiss-Prot. The EC-Price-149 dataset is a collection of 149 proteins validated by experiments described by Price et al.68. The training set for both EC-New-392 and EC-Price-149 is a collection of ~220K proteins from Swiss-Prot that covers 5,242 unique EC numbers.
EC reaction classification task. Hermosilla et al.69 constructed this dataset, which classifies 37,428 proteins by their enzyme-catalyzed reaction according to 384 EC numbers. The entire dataset is split into training, validation and test sets, each of which covers the full set of EC numbers. In addition, these proteins were clustered by sequence similarity, and all protein chains belonging to the same cluster were assigned to the same set.
GO annotation tasks. Gene Ontology (GO) annotations capture statements about how a gene functions at the molecular level (MF), where in the cell it functions (CC), and which biological processes it is involved in (BP)70,71. The GO-MF, GO-CC and GO-BP datasets use the same partitioning scheme described in Gligorijević et al.43, which splits ~36K non-redundant PDB chains into training, validation, and test sets. Only GO terms with at least 50 and no more than 5,000 training samples were selected. Each protein in the test set contains at least one experimentally confirmed GO term in each branch of GO. The entire dataset covers 489, 320 and 1,943 GO terms in MF, CC and BP, respectively.
Cross-species Protein-Protein Interaction (PPI) prediction tasks. We used the D-SCRIPT dataset72 built from the STRING database, which contains PPIs across multiple species, including Human (15,755 proteins), Mouse (17,252 proteins), Fly (11,306 proteins), and E. coli (4,412 proteins). The Human split contains about 38 thousand PPIs as the training set and 25 thousand as the test set. All PPIs of the other species are used as test sets. Each test species has 5,000 positive PPIs, except E. coli, which has only 2,000. Negative samples are generated by randomly pairing proteins from the non-redundant set, at ten times the number of positives, to reflect the intuition that true PPIs are rare. A model trained on human PPIs is used to predict PPIs in the other species. We sourced all protein structures from the AlphaFold protein structure database.
Virus-human PPI prediction tasks. We used three datasets built by Dong et al.73 to evaluate our model on virus-human PPI prediction. The three datasets contain PPIs between human proteins and virus proteins, including those of Ebola and H1N1. Each dataset contains positive and negative interactions between thousands of human proteins and hundreds of virus proteins (see Supplementary Table 7). Structures of human proteins were retrieved from the AlphaFold protein structure database. We predicted the structures of virus proteins with ESMFold.
Multi-class PPI prediction tasks. We exploited two datasets, SHS148K and STRING, built by GNN-PPI74, which contain 44,488 and 593,397 multilabel PPIs, respectively. These PPIs are divided into 7 types: activation, inhibition, reaction, binding, expression, catalysis, and post-translational modification (ptmod). Each pair of interacting proteins carries at least one of these labels. We sourced all protein structures (5,189 proteins in SHS148K, 15,335 proteins in STRING) from the AlphaFold protein structure database.
For EC-PDB, EC-Reaction, GO-BP, GO-MF, GO-CC, PPI-Mouse, PPI-Fly, and PPI-E. coli, we constructed an MLP classifier for each dataset, as described in Zhang et al.33, to decode the representations generated by different methods (see Supplementary Table 1 and Supplementary Table 2). For PPI-SHS148K and PPI-STRING, we constructed an 8-layer stacked transformer with a hidden size of 256. While the CNN, ResNet, LSTM, and Transformer baselines were initialized randomly and trained end to end75,76, the parameters of the other pre-trained models were frozen during training. In particular, pre-trained parameters of UniRep26, ESM1b27 and ProtT528 were downloaded and used. Models and results of other pre-trained models were obtained from the corresponding publications33,69,74,77-81.
For the rest of the datasets, our proposed model acts as a plug-in model that generates latent representations for proteins, and existing customized models are used for further prediction. Specifically, we used CLEAN67, a contrastive learning-enabled enzyme annotation model, for EC-New-392 and EC-Price-149. We replaced the raw input of CLEAN with representations generated by our proposed model and kept the model unchanged. For PPI-Denovo, PPI-EBOLA and PPI-H1N1, we used multimodal representations as the input of the graph model proposed by Dong et al.73 and kept all hyper-parameters unchanged. Results of other methods were obtained from the corresponding publications.
We probed the robustness of our proposed model in learning sequence-function and structure-function relationships from proteins with low sequence/structure similarity on downstream tasks. Specifically, BLAST was used to align the protein sequences of the test set to the training sequences and to compute identity scores. We used TM-score82 to assess the topological similarity between protein structures of the test set and the training set. Five similarity cutoffs were used to partition each test set into multiple groups (see Supplementary Table 8 and Supplementary Table 9).
Multi-scale structural signatures benchmarking.
We used three benchmarks to investigate whether multi-scale structural signatures (global, local and micro) could be captured by the proposed multimodal network.
Global structure signatures. The Structural Classification of Proteins-extended (SCOPe) database organizes protein domains into multiple hierarchies, including Family, Superfamily and Fold83. In particular, the basis of classification for Folds is purely structural. As described in Xia et al.45, we used the 40% identity filtered subset of SCOPe v2.07 as the benchmark set. It contains 13,265 domains that can be classified into seven classes (see Supplementary Table 10). We constructed a five-layer MLP (batch size 24, learning rate 3e-5, dropout ratio 0.2, Adam optimizer) as the decoder to classify multimodal representations into a specific fold class. We reported the F1 score and accuracy of the 5-fold cross-validation results on the entire dataset. Leading structure-based methods were used as baselines, including GraSR and DeepFold. GraSR uses a contrastive learning framework to capture protein features from a graph representation of the protein structure45. DeepFold extracts structural motif features from protein contact maps via a deep convolutional neural network44. SGM36 and SSEF38 are classical structural classification tools that encode protein structures using 30 global backbone topological measures or frequencies of 1,500 secondary structure triplets, respectively. The performance of these methods was obtained from Xia et al.45 and Liu et al.44.
Local structure signatures. To benchmark the ability of the proposed multimodal network to capture local structural signatures, we used the dataset constructed by Klausen et al.36 as the training and validation set. It contains 10,837 crystal structures obtained from the PDB, filtered at a 25% identity threshold and a 2.5 Å resolution threshold. Among these structures, 10,337 were used as the training set, and 500 randomly selected structures were left out as the validation set. CB51338, CASP1238 and TS11584 were used as test sets, which contain 507, 21 and 115 non-redundant structures, respectively. Each amino acid of each structure was mapped to a secondary structure label. In particular, both 3-class (Q3) and 8-class (Q8) secondary structure labels were calculated as described in Klausen et al.85. Our proposed model acts as an encoder that generates residue-level representations, which are then fed into an MLP classifier as described in Rao et al.76. We evaluated its accuracy on both Q3 and Q8. Pre-trained ESM1b27 was used as one of the leading baselines. Results of other methods were obtained from Rao et al.76.
We also constructed two small-scale datasets to further benchmark multimodal representations under conditions of scarce training samples. The SCOP-100 dataset (see Supplementary Fig. 5c) consists of 100 single-domain proteins randomly selected from the SCOP database83. For each protein, we obtained its structure and per-position secondary structure annotations from the PDB55. In addition to the SCOP-100 dataset, we constructed the CATH-100 dataset by randomly selecting 100 proteins from the dataset collected by Zhou et al.86. Each protein in CATH-100 has an experimentally determined 3D structure as well as an annotated B-factor and Solvent Accessible Surface Area (SASA) for each position. SCOP-100 is a 3-class classification task, whereas CATH-100 corresponds to two regression tasks. We compared the performance of our proposed multimodal network with several leading methods, including UniRep, ESM1b and ProtT5. For each task, we constructed a random forest model to make predictions on the basis of the representations generated by each of these methods. We reported the performance of each model under 5-fold cross-validation on the entire dataset. We used the random forest model trained on SCOP-100 for subsequent secondary structure benchmarking and visualization (see Fig. 3b and 3c; Supplementary Fig. 7; Supplementary Fig. 11a).
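A sketch of this random forest evaluation is given below; the file names, number of trees and scoring metric are placeholders rather than reported settings.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.load("representations.npy")    # (n_samples, 1280) representations from one encoder (hypothetical file)
y = np.load("labels.npy")             # corresponding task labels (hypothetical file)
model = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")   # 5-fold cross-validation
print(scores.mean(), scores.std())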
Micro structure signatures. We evaluated the micro-structure perception ability of our proposed multimodal network by quantifying its attention on functional sites. During training and inference, the network applies the attention mechanism87 to probe sequence context and structural constraints. Calculating the attention scores between residues of a protein allows us to identify the key positions the model focuses on. We randomly selected 10,000 proteins from Swiss-Prot and filtered out proteins without functional site annotations. We then used the 80% identity filtered subset, which contains 1,325 proteins. We also constructed the CLEAN dataset, which consists of 113 proteins with functional site annotations from EC-Price-149 and EC-New-392. During evaluation, we ranked all residues by their attention scores. Examples of attention visualization are shown in Fig. 2d, Supplementary Fig. 6c and Supplementary Fig. 9.
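A sketch of how per-residue scores can be extracted from the attention maps is shown below; the aggregation (averaging over layers, heads and query positions) is an assumption, as the exact reduction is not specified here.

import torch

def rank_residues_by_attention(attn_maps):
    # attn_maps: (layers, heads, L, L) attention weights from the transformer encoder
    scores = attn_maps.mean(dim=(0, 1, 2))               # one attention score per residue (assumed reduction)
    return torch.argsort(scores, descending=True)        # residue indices, most attended first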
DeepFRI43, HEAL47 and a random approach (denoted as Random) were used as baselines. Specifically, DeepFRI is a graph convolutional network that employs a pre-trained protein language model to extract sequence features and further constructs a residue graph to predict protein functions. In addition, DeepFRI can identify the contribution of each residue to the predicted function. We utilized the official model checkpoints of DeepFRI and used the Molecular Function branch for evaluation. HEAL is a deep learning model for protein function prediction that captures structural features via a hierarchical graph transformer. We downloaded the pre-trained parameters of HEAL and applied the gradient-weighted Class Activation Map (grad-CAM)88 to rank the activation score of each residue. Random is a baseline strategy that randomly ranks the importance of each residue; we reported its average performance across five runs. Top-1 Hit Ratio (Top-1 HR), Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR) were used as evaluation metrics. These models were also used in subsequent evaluations (see Fig. 1f, Fig. 2c and Supplementary Fig. 6a).
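For reference, the three metrics can be computed per protein as sketched below (binary relevance against the annotated functional sites; the exact formulations used for reporting may differ) and then averaged over the dataset.

import numpy as np

def top1_hit_ratio(ranked, true_sites):
    return float(ranked[0] in true_sites)                 # is the top-ranked residue a functional site?

def mean_reciprocal_rank(ranked, true_sites):
    for i, r in enumerate(ranked, start=1):
        if r in true_sites:
            return 1.0 / i                                # reciprocal rank of the first hit
    return 0.0

def ndcg(ranked, true_sites):
    gains = np.array([1.0 if r in true_sites else 0.0 for r in ranked])
    discounts = 1.0 / np.log2(np.arange(2, len(ranked) + 2))
    ideal = np.sort(gains)[::-1] @ discounts
    return float((gains @ discounts) / ideal) if ideal > 0 else 0.0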
Zero-shot protein fitness modeling.
ProMEP quantifies the log-likelihood of protein variants under the constraints of both sequence and structure, as shown in the equation below. To model the fitness of a protein, the wild-type sequence S and the structural constraints C are fed into ProMEP, which in turn outputs a sequence of log probabilities. We calculated the conditional probabilities of the mutated amino acid mt and the wild-type amino acid wt at each mutated position t. The sum over all mutated positions T gives the final fitness score of a protein variant.
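Consistent with this description, the score takes the following form (a reconstruction of the referenced equation; the notation in the original may differ):

\mathrm{Fitness}(\mathrm{variant}) \;=\; \sum_{t \in T} \Big[ \log p\big(x_t = m_t \mid S, C\big) \;-\; \log p\big(x_t = w_t \mid S, C\big) \Big],

where S is the wild-type sequence, C denotes the structural constraints, and T is the set of mutated positions.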
We sourced three representative DMS datasets to investigate ProMEP's ability to model protein fitness in a zero-shot manner. Specifically, the REV_HV1V2 dataset contains 2,147 single mutations with measured replication of the REV protein derived from the HIV-1 virus48. The KKA2_KLEPN dataset contains 4,960 single mutations with measured growth of APH(3')II derived from Klebsiella pneumoniae16. SPG1_STRSG is a large dataset that contains 1,045 single mutations and 535,917 double mutations with measured binding fitness of protein GB1 derived from Streptococcus sp. group G10. The 20 DMS datasets used for the generalization test were sourced from Notin et al.30 (see Supplementary Table 3).
We used ESMFold to predict the structure of the wild-type protein in these datasets. For each wild-type protein, we collected ~300 homologous sequences from the NR database with sequence identity lower than 80% and predicted their structures via ESMFold. We then used these homologous samples to fine-tune ProMEP for 3 epochs with a learning rate of 1e-4. The fine-tuning procedure enables ProMEP to learn a better understanding of sequence and structural constraints from homologs sampled through evolution. ESM2-15B, the largest open-sourced protein language model (~22 times as many parameters as ProMEP), was used as a leading baseline. Results of other baseline methods were obtained from the ProteinGym benchmark30.
ProMEP-guided protein engineering of TnpB.
We fine-tuned ProMEP for 3 epochs with 300-500 homologous proteins retrieved from the NR database with a sequence identity threshold of 80%. All structures of homologous sequences were predicted by ESMFold.
We began with protein variants carrying single mutations. Specifically, we constructed a virtual saturation mutagenesis library that only contains single variants (7,752 variants). We then ranked all variants by the calculated fitness score and analyzed the enrichment of mutants among the top 5% of variants (387 variants). Since arginine-based mutants were significantly enriched in the top 5% of variants, we chose the top 10 and bottom 10 arginine-based variants from the entire ranked list for further evaluation. We also randomly selected 10 arginine-based mutants as a control set.
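The virtual library construction and ranking can be sketched as follows; the 7,752 single variants correspond to 19 substitutions at each position, and fitness_fn is a placeholder for the ProMEP scoring described above.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def saturation_library(wild_type):
    # all single-substitution variants: len(wild_type) x 19 entries
    return [(pos + 1, wt, aa)
            for pos, wt in enumerate(wild_type)
            for aa in AMINO_ACIDS if aa != wt]

def rank_by_fitness(variants, fitness_fn):
    # fitness_fn: callable returning the ProMEP fitness score of one variant (placeholder)
    return sorted(variants, key=fitness_fn, reverse=True)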
To generate variants with multiple mutations, we constructed four virtual arginine-based mutagenesis libraries: double arginine-based mutants containing S72R (371 mutants), triple arginine-based mutants containing S72R (68,635 mutants), all double arginine-based mutants (69,006 mutants), and all triple arginine-based mutants (8,710,740 mutants). Again, we calculated the fitness score of each variant in the four virtual arginine-based mutagenesis libraries. According to the experimental data from the top-10 arginine-based single mutants, we filtered out mutants that contain neutral or negative mutations (Y388R, S217R, L398R, T405R, L406R, K44R and H403R) from the top-ranked variants. The top 10 ranked double-mutant and triple-mutant variants from each mutagenesis library were selected for further evaluation. P values were derived by a two-tailed Student's t-test. All statistical analyses were performed on n = 3 biologically independent experiments.
Plasmid vector construction.
The TnpB gene was codon-optimized for expression in human cells, and the optimized sequence was synthesized for vector construction by Sangon Biotech. We inserted the optimized sequence into the pST1374 vector, which contains a CMV promoter and a nuclear localization signal. All reRNA plasmids were cloned using T4 DNA Ligase (New England Biolabs). Oligos for targeting spacers were annealed and ligated into BsaI (New England BioLabs)-digested PGL3-U6 backbone vectors. The spacer sequences used in the study are shown in Supplementary Table 9. All final constructs were validated by Sanger sequencing.
TnpB engineering.
The construction of TnpB mutants was achieved through site-directed mutagenesis. PCR amplifications were performed using Phanta Max Super-Fidelity DNA Polymerase (Vazyme). Following digestion with DpnI (New England BioLabs), the PCR products were then ligated using 2X MultiF Seamless Assembly Mix (ABclonal). Ligated products were transformed into DH5α E. coli cells. The success of the mutations was confirmed via Sanger sequencing. The modified plasmid vectors were purified using a TIANpure Midi Plasmid Kit (TIANGEN).
Cell culture and transfection.
HEK293T cells were maintained in Dulbecco's modified Eagle medium (Gibco) supplemented with 10% fetal bovine serum (Gemini) and 1% penicillin–streptomycin (Gibco) in an incubator (37 °C, 5% CO2). For indel analysis, HEK293T cells were transfected at 80% confluency at a density of approximately 1 × 10^5 cells per well of a 24-well plate. Transfection was conducted following the manufacturer's manual with 2 μl of ExFect Transfection Reagent (Vazyme) and 1 μg of plasmids (0.5 μg of reRNA plasmid + 0.5 μg of TnpB plasmid).
DNA extraction and deep sequencing.
The transfected cells described above were washed with PBS (Gibco), and DNA was extracted using QuickExtract DNA Extraction Solution (Lucigen). Samples were incubated at 65°C for 60 minutes and heat-inactivated at 98°C for 3 minutes. The lysed products were used as templates for the first round of PCR (PCR1). PCR1 was conducted with barcoded primers (see Supplementary Table 12) to amplify the genomic region of interest using Phanta Max Super-Fidelity DNA Polymerase (Vazyme). PCR1 was performed under the following cycle conditions: 98°C for 3 min, [98°C 15 s, 60°C 15 s, 72°C 30 s]x29, 72°C 3 min. Following confirmation of successful PCR1 amplification by gel electrophoresis, the PCR1 products were pooled in equimolar amounts and purified, in preparation for the second round of PCR (PCR2). The PCR2 products were amplified using index primers (Vazyme) and purified with the FastPure Gel DNA Extraction Mini Kit (Vazyme) for sequencing on the Illumina NovaSeq platform. PCR2 was performed under the following cycle conditions: 98°C for 45 s, [98°C 15 s, 60°C 15 s, 72°C 30 s]x6, 72°C 3 min. Indels were analysed using CRISPResso2 with the following parameters: a minimum of 80% homology for alignment to the amplicon sequence, a quantification window of 20 bp, and ignoring substitutions to avoid false positives.