Construction of protein point cloud.
Compared with a naive point cloud, which is unordered and homogeneous, our proposed protein point cloud consists of ordered and heterogeneous points extracted from the raw protein structure. Specifically, each point corresponds to the alpha C atom of an amino acid. In addition to the 3-dimensional coordinates of each point (x, y, z), the type of residue each point belongs to (R) and the position of that residue in the protein sequence of length L (P) are attached as point features.
Each point in the protein point cloud is defined as [x, y, z, R, P], where R ∈ {G, A, V, L, I, S, T, C, M, D, E, N, Q, R, K, F, Y, W, P, H} and P ∈ {1, 2, 3, ..., L}.
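As an illustration of this definition, the following Python sketch extracts such a point cloud from a structure file using Biopython; the parsing library, the chain selection, and the handling of residues lacking an alpha C atom are illustrative assumptions rather than reported details.

# Sketch: build the [x, y, z, R, P] point cloud from a PDB file (Biopython assumed).
from Bio.PDB import PDBParser

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def build_point_cloud(pdb_path, chain_id="A"):
    structure = PDBParser(QUIET=True).get_structure("prot", pdb_path)
    points, pos = [], 0
    for residue in structure[0][chain_id].get_residues():
        name = residue.get_resname()
        if name not in THREE_TO_ONE or "CA" not in residue:
            continue  # skip water, ligands and residues without an alpha C atom
        pos += 1      # P: position in the protein sequence of length L
        x, y, z = residue["CA"].get_coord()
        points.append((float(x), float(y), float(z), THREE_TO_ONE[name], pos))
    return points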
Architecture of the multimodal deep representation learning model.
To decipher protein functions at the residue level, we developed a multimodal protein representation learning model (~659.3 million parameters). It applies an encoder-decoder architecture to simultaneously learn sequence context and structural constraints from millions of proteins (see Fig. 1). For a protein of length L, the encoder takes the masked sequence and the masked protein point cloud as input and generates a K-dimensional feature vector for each amino acid. The latent representations (L × K) are then fed into the decoder to complete the missing elements of the corrupted sequence and protein point cloud. K is set to 1280 during training and inference.
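The following PyTorch sketch illustrates this encoder-decoder data flow at a schematic level; the vocabulary size and every sub-module are simplified stand-ins (for example, a linear layer replaces the SE(3)-Transformer structure embedding described below), not the released implementation.

import torch.nn as nn

class MultimodalSketch(nn.Module):
    """Schematic data flow only: masked sequence + masked point cloud -> L x K latents -> decoders."""
    def __init__(self, vocab_size=33, k=1280):
        super().__init__()
        self.seq_embed = nn.Embedding(vocab_size, k)           # sequence embedding module
        self.struct_embed = nn.Linear(3, k)                     # stand-in for the structure embedding module
        layer = nn.TransformerEncoderLayer(d_model=k, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=33)   # 33-layer transformer encoder module
        self.seq_decoder = nn.Linear(k, vocab_size)             # completes masked tokens
        self.struct_decoder = nn.Sequential(nn.Linear(k, k), nn.ReLU(), nn.Linear(k, 3))  # completes masked coordinates

    def forward(self, masked_tokens, masked_coords):
        # masked_tokens: (batch, L) token ids; masked_coords: (batch, L, 3) point coordinates
        h = self.encoder(self.seq_embed(masked_tokens) + self.struct_embed(masked_coords))
        return self.seq_decoder(h), self.struct_decoder(h)      # (batch, L, vocab), (batch, L, 3)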
The sequence embedding module, the transformer encoder module and the sequence decoder module apply networks similar to those of current protein language models27,28. Specifically, the transformer encoder module is a 33-layer stacked Transformer, and each layer consists of one layer normalization block, one 8-head attention block, and one feed-forward network.
The global features of a protein tertiary structure should be invariant to arbitrary input poses, which means 3D translations and rotations of the input protein structure should not affect the output. To guarantee such invariance, we chose the NVIDIA-optimized version of the SE(3)-Transformer64 as the structure embedding module, which contains one 8-head attention block interspersed with one normalization module, one TFN (tensor field network) layer, and one max pooling layer. We used a 1-layer SE(3)-Transformer for large-scale training. The structure decoder module is a multilayer perceptron network.
To capture the structural signatures of a protein, the structure embedding module first computes the K nearest neighbors of each point as well as their relative positions. Next, an equivariant weight matrix is built upon the Clebsch-Gordan coefficients and spherical harmonics to guarantee the equivariance of point features under 3D transformations. Third, the attention mechanism is applied to pass features between adjacent points. Finally, point features are aggregated and pooled to output the final structural signatures.
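The first step, gathering the K nearest neighbors and their relative positions, can be sketched as follows (the value of K and the use of torch.cdist are illustrative assumptions; the equivariant weights and attention are provided by the SE(3)-Transformer itself):

import torch

def knn_neighborhoods(coords, k=16):
    # coords: (N, 3) alpha C coordinates of the protein point cloud
    dists = torch.cdist(coords, coords)                         # (N, N) pairwise distances
    knn_idx = dists.topk(k + 1, largest=False).indices[:, 1:]   # drop self, keep k neighbors
    rel_pos = coords[knn_idx] - coords[:, None, :]              # (N, k, 3) relative positions
    return knn_idx, rel_pos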
Model training.
We used proteins from the AlphaFold protein structure database as the self-supervised training dataset. It contains ~200 million structures predicted by AlphaFold2. We removed proteins shorter than 64 amino acids and those with an average pLDDT score lower than 70. We randomly selected ~0.5 million proteins for validation. The final training dataset contains ~160 million proteins. Both the amino acid sequence and the protein point cloud were extracted from the raw protein structure for multimodal training. Since ~95.88% of UniParc sequences contain fewer than 1024 amino acids, we set the context size to 1024. For proteins longer than 1024 amino acids, we sampled the start position of the crop from a uniform distribution over [1, n - x + 1], where n is the protein length minus 1024 and x is sampled from a uniform distribution over [0, n]. For proteins shorter than 1024 amino acids, padding tokens were appended to the sequence, and random alpha C atoms selected from the raw structure were appended to the extracted protein point cloud.
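The cropping and padding rule above can be transcribed literally as follows; the pad token name is a placeholder, and the point cloud is assumed to be stored in sequence order.

import random

CONTEXT = 1024
PAD_TOKEN = "<pad>"   # placeholder; the actual padding symbol depends on the tokenizer

def crop_or_pad(sequence, point_cloud):
    L = len(sequence)
    if L > CONTEXT:
        n = L - CONTEXT
        x = random.randint(0, n)                 # x ~ U[0, n]
        start = random.randint(1, n - x + 1)     # start position ~ U[1, n - x + 1]
        return (sequence[start - 1:start - 1 + CONTEXT],
                point_cloud[start - 1:start - 1 + CONTEXT])
    pad = CONTEXT - L
    return (sequence + [PAD_TOKEN] * pad,
            point_cloud + random.choices(point_cloud, k=pad))   # random alpha C atoms as padding points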
The extracted amino acid sequence and protein point cloud were then corrupted and recovered by the proposed multimodal model during training. To mask the protein sequence, we randomly sampled 15% of tokens from the tokenized sequence; each of them was replaced with a special mask token with 80% probability, a randomly chosen alternative amino acid token with 10% probability, and the original input token (i.e., no change) with 10% probability. To mask the protein point cloud, we calculated the central point of the protein and chose the 256 points nearest to it. We masked the coordinates of these points and trained the proposed multimodal network to recover them.
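A minimal sketch of both corruption steps is given below; the mask token id, the treatment of masked coordinates (zeroing), and the tensor layout are assumptions.

import torch

MASK_ID = 32            # placeholder mask-token id
NUM_AA = 20

def mask_sequence(tokens, mask_prob=0.15):
    # BERT-style corruption: 80% mask token, 10% random amino acid, 10% unchanged
    tokens = tokens.clone()
    selected = torch.rand(tokens.shape) < mask_prob
    action = torch.rand(tokens.shape)
    tokens[selected & (action < 0.8)] = MASK_ID
    swap = selected & (action >= 0.8) & (action < 0.9)
    tokens[swap] = torch.randint(0, NUM_AA, tokens.shape)[swap]
    return tokens, selected                      # 'selected' marks the positions to recover

def mask_point_cloud(coords, n_mask=256):
    # mask the 256 points nearest to the central point of the protein
    center = coords.mean(dim=0, keepdim=True)
    nearest = torch.cdist(coords, center).squeeze(1).topk(min(n_mask, len(coords)), largest=False).indices
    corrupted = coords.clone()
    corrupted[nearest] = 0.0                     # zeroing as a stand-in for "masked" coordinates
    return corrupted, nearest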
The loss function was the sum of a categorical Cross-Entropy (CE) loss and a permutation-invariant Chamfer Distance (CD) loss65. In particular, the CE loss measures the differences between the model's predictions and the true tokens at masked positions of the amino acid sequence. The CD loss quantifies the completion results by calculating the average nearest squared distance between the recovered protein point cloud and the ground truth. By minimizing the CE loss and the CD loss, our proposed model learns high-order representations of a protein in a self-supervised manner.
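For concreteness, the combined loss can be sketched as below; equal weighting of the two terms is an assumption.

import torch
import torch.nn.functional as F

def chamfer_distance(pred, target):
    # pred: (N, 3) recovered points, target: (M, 3) ground-truth points
    d = torch.cdist(pred, target) ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def multimodal_loss(seq_logits, true_tokens, masked_pos, pred_points, true_points):
    ce = F.cross_entropy(seq_logits[masked_pos], true_tokens[masked_pos])   # masked-token prediction
    cd = chamfer_distance(pred_points, true_points)                         # point-cloud completion
    return ce + cd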
All layers except the transformer encoder module are initialized from a zero-centered normal distribution with a standard deviation of 0.02. The transformer encoder module is initialized with the parameters of ESM1b27. We trained the multimodal deep representation learning model for 380K steps using the Adam optimizer (β1 = 0.9, β2 = 0.999) at an initial learning rate of 1e-4 with a batch size of 480. The learning rate increases linearly during a warm-up period of 10,000 steps and afterwards follows an inverse square root decay schedule. The training took about 1 month on 120 NVIDIA A100 GPUs.
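The learning-rate schedule described above corresponds to the following rule (a direct transcription, not released code):

def learning_rate(step, base_lr=1e-4, warmup=10_000):
    if step < warmup:
        return base_lr * step / warmup            # linear warm-up
    return base_lr * (warmup / step) ** 0.5       # inverse square root decay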
Benchmarking multimodal representations with function-related datasets.
We used 15 function-related datasets (see Supplementary Table 6) to benchmark the performance of our multimodal deep representation learning model against a comprehensive suite of baselines. For each dataset, protein representations are either fed into a Multi-Layer Perceptron (MLP) or integrated into a customized model to make final predictions. Depending on the downstream network, two types of representations are used: the residue-level representation of the protein (protein length × 1,280) and the molecular-level representation averaged across the length of the protein (1,280). Details are introduced as follows.
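The two kinds of representations are related by a simple average over the sequence dimension; how padded positions are handled here is an assumption.

import torch

def molecular_representation(residue_repr, valid_mask=None):
    # residue_repr: (L, 1280) residue-level representation of one protein
    if valid_mask is not None:
        residue_repr = residue_repr[valid_mask]   # drop padded positions (assumption)
    return residue_repr.mean(dim=0)               # (1280,) molecular-level representation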
EC annotation tasks. The Enzyme Commission (EC) number66 is a commonly used classification scheme that specifies the catalytic function of an enzyme by four digits. Three diverse datasets are used for benchmarking. The EC-PDB dataset, constructed by Gligorijević et al.43, consists of non-redundant proteins retrieved from the PDB. Proteins in the test set have corresponding experimentally determined PDB structures and at least one experimentally determined annotation. The EC-New-392 dataset was constructed by Yu et al.67 and consists of 392 proteins covering 177 different EC numbers from Swiss-Prot. The EC-Price-149 dataset is a collection of 149 proteins validated by experiments described by Price et al.68. The training set for both EC-New-392 and EC-Price-149 is a collection of ~220K proteins from Swiss-Prot that covers 5,242 unique EC numbers.
EC reaction classification task. Hermosilla et al.69 constructed this dataset, which classifies 37,428 proteins by their enzyme-catalyzed reaction according to 384 EC numbers. The entire dataset is split into training, validation and test sets, each of which covers the full set of EC numbers. In addition, these proteins were clustered by sequence similarity, and all protein chains belonging to the same cluster were assigned to the same set.
GO annotation tasks. Gene Ontology (GO) annotations capture statements about how a gene functions at the molecular level (MF), where in the cell it functions (CC), and which biological processes it is involved in (BP)70,71. The GO-MF, GO-CC and GO-BP datasets use the same partitioning scheme described in Gligorijević et al.43, which splits ~36K non-redundant PDB chains into training, validation, and test sets. Only GO terms with at least 50 and no more than 5,000 training samples were selected. Each protein in the test set contains at least one experimentally confirmed GO term in each branch of GO. The entire dataset covers 489, 320 and 1,943 GO terms in MF, CC and BP, respectively.
Cross-species Protein-Protein Interaction (PPI) prediction tasks. We used the D-SCRIPT dataset72 built from the STRING database, which contains PPIs across multiple species, including Human (15,755 proteins), Mouse (17,252 proteins), Fly (11,306 proteins), and E. coli (4,412 proteins). The Human split contains about 38 thousand PPIs as the training set and 25 thousand as the test set. All PPIs of the other species are used as test sets. Each test species has 5,000 positive PPIs, except E. coli, which has only 2,000. Negative samples are generated by randomly pairing proteins from the non-redundant set, at ten times the number of positives, to reflect the intuition that true PPIs are rare. A model trained on human PPIs is used to predict PPIs in the other species. We sourced all protein structures from the AlphaFold protein structure database.
Virus-human PPI prediction tasks. We used three datasets built by Dong et al.73 to evaluate our model on virus-human PPI prediction. The three datasets contain PPIs between human proteins and virus proteins, including those of Ebola and H1N1. Each dataset contains positive and negative interactions between thousands of human proteins and hundreds of virus proteins (see Supplementary Table 7). Structures of human proteins were retrieved from the AlphaFold protein structure database. We predicted the structures of virus proteins with ESMFold.
Multi-class PPI prediction tasks. We exploited two datasets, SHS148K and STRING, built by GNN-PPI74, which contain 44,488 and 593,397 multilabel PPIs, respectively. These PPIs are divided into 7 types: activation, inhibition, reaction, binding, expression, catalysis, and post-translational modification (ptmod). Each pair of interacting proteins carries at least one of these labels. We sourced all protein structures (5,189 proteins in SHS148K, 15,335 proteins in STRING) from the AlphaFold protein structure database.
For EC-PDB, EC-Reaction, GO-BP, GO-MF, GO-CC, PPI-Mouse, PPI-Fly, and PPI-E. coli, we constructed an MLP classifier for each dataset, as described in Zhang et al.33, to decode the representations generated by different methods (see Supplementary Table 1 and Supplementary Table 2). For PPI-SHS148K and PPI-STRING, we constructed an 8-layer stacked transformer with a hidden size of 256. While the CNN, ResNet, LSTM, and Transformer baselines were initialized randomly and trained end to end75,76, the parameters of the other pre-trained models were frozen during training. In particular, pre-trained parameters of UniRep26, ESM1b27 and ProtT528 were downloaded and used. Models and results of other pre-trained models were obtained from the corresponding publications33,69,74,77-81.
For the rest of the datasets, our proposed model acts as a plug-in model that generates latent representations for proteins, and existing customized models are used for further prediction. Specifically, we used CLEAN67, a contrastive learning-enabled enzyme annotation model, for EC-New-392 and EC-Price-149. We replaced the raw input of CLEAN with representations generated by our proposed model and kept the model unchanged. For PPI-Denovo, PPI-EBOLA and PPI-H1N1, we used multimodal representations as the input of the graph model proposed by Dong et al.73 and kept all hyper-parameters unchanged. Results of other methods were obtained from the corresponding publications.
We probed the robustness of our proposed model in learning sequence-function and structure-function relationships from proteins with low sequence/structure similarity on downstream tasks. Specifically, BLAST was used to align the protein sequences of the test set to the training sequences and to compute identity scores. We used TM-score82 to assess the topological similarity between protein structures of the test set and the training set. Five similarity cutoffs were used to partition each test set into multiple groups (see Supplementary Table 8 and Supplementary Table 9).
Multi-scale structural signatures benchmarking.
We used three benchmarks to investigate whether multi-scale structural signatures (global, local and micro) could be captured by the proposed multimodal network.
Global structure signatures. The Structural Classification of Proteins-extended (SCOPe) database organizes protein domains into multiple hierarchies, including Family, Superfamily and Fold83. In particular, the basis of classification for Folds is purely structural. As described in Xia et al.45, we used the 40% identity filtered subset of SCOPe v2.07 as the benchmark set. It contains 13,265 domains that can be classified into seven classes (see Supplementary Table 10). We constructed a five-layer MLP (batch size 24, learning rate 3e-5, dropout ratio 0.2, Adam optimizer) as the decoder to classify multimodal representations into a specific fold class. We reported the F1 score and accuracy of the 5-fold cross-validation results on the entire dataset. Leading structure-based methods were used as baselines, including GraSR and DeepFold. GraSR uses a contrastive learning framework to capture protein features from a graph representation of the protein structure45. DeepFold extracts structural motif features from protein contact maps via a deep convolutional neural network44. SGM36 and SSEF38 are classical structural classification tools that encode protein structures using 30 global backbone topological measures or frequencies of 1,500 secondary structure triplets, respectively. The performance of these methods was obtained from Xia et al.45 and Liu et al.44.
Local structure signatures. To benchmark the ability of the proposed multimodal network to capture local structural signatures, we used the dataset constructed by Klausen et al.36 as the training and validation set. It contains 10,837 crystal structures obtained from the PDB, filtered at a 25% identity threshold and a 2.5 Å resolution threshold. Among these structures, 10,337 were used as the training set, and 500 randomly selected structures were left out as the validation set. CB51338, CASP1238 and TS11584 were used as test sets, which contain 507, 21 and 115 non-redundant structures, respectively. Each amino acid of each structure was mapped to a secondary structure label. In particular, both 3-class (Q3) and 8-class (Q8) secondary structure labels were calculated as described in Klausen et al.85. Our proposed model acts as an encoder that generates residue-level representations, which are then fed into an MLP classifier as described in Rao et al.76. We evaluated its accuracy on both Q3 and Q8. Pre-trained ESM1b27 was used as one of the leading baselines. Results of other methods were obtained from Rao et al.76.
We also constructed two small-scale datasets to further benchmark multimodal representations under conditions of scarce training samples. The SCOP-100 dataset (see Supplementary Fig. 5c) consists of 100 single-domain proteins randomly selected from the SCOP database83. For each protein, we obtained its structure and per-position secondary structure annotations from the PDB55. In addition to the SCOP-100 dataset, we constructed the CATH-100 dataset by randomly selecting 100 proteins from the dataset collected by Zhou et al.86. Each protein in CATH-100 has an experimentally determined 3D structure as well as an annotated B-factor and Solvent Accessible Surface Area (SASA) for each position. SCOP-100 is a 3-class classification task, whereas CATH-100 corresponds to two regression tasks. We compared the performance of our proposed multimodal network with several leading methods, including UniRep, ESM1b and ProtT5. For each task, we constructed a random forest model to make predictions on the basis of the representations generated by each of these methods. We reported the performance of each model under 5-fold cross-validation on the entire dataset. We used the random forest model trained on SCOP-100 for subsequent secondary structure benchmarking and visualization (see Fig. 3b and 3c; Supplementary Fig. 7; Supplementary Fig. 11a).
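A sketch of this random forest evaluation is given below; the file names, number of trees and scoring metric are placeholders rather than reported settings.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.load("representations.npy")    # (n_samples, 1280) representations from one encoder (hypothetical file)
y = np.load("labels.npy")             # corresponding task labels (hypothetical file)
model = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")   # 5-fold cross-validation
print(scores.mean(), scores.std())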
Micro structure signatures. We evaluated the micro-structure perception ability of our proposed multimodal network by quantifying its attention on functional sites. During training and inference, the network applies the attention mechanism87 to probe sequence context and structural constraints. Calculating the attention scores between residues of a protein allows us to identify the key positions the model focuses on. We randomly selected 10,000 proteins from Swiss-Prot and filtered out proteins without functional site annotations. We then used the 80% identity filtered subset, which contains 1,325 proteins. We also constructed the CLEAN dataset, which consists of 113 proteins with functional site annotations from EC-Price-149 and EC-New-392. During evaluation, we ranked all residues by their attention scores. Examples of attention visualization are shown in Fig. 2d, Supplementary Fig. 6c and Supplementary Fig. 9.
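A sketch of how per-residue scores can be extracted from the attention maps is shown below; the aggregation (averaging over layers, heads and query positions) is an assumption, as the exact reduction is not specified here.

import torch

def rank_residues_by_attention(attn_maps):
    # attn_maps: (layers, heads, L, L) attention weights from the transformer encoder
    scores = attn_maps.mean(dim=(0, 1, 2))               # one attention score per residue (assumed reduction)
    return torch.argsort(scores, descending=True)        # residue indices, most attended first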
DeepFRI43, HEAL47 and a random approach (denoted as Random) were used as baselines. Specifically, DeepFRI is a graph convolutional network that employs a pre-trained protein language model to extract sequence features and further constructs a residue graph to predict protein functions. In addition, DeepFRI can identify the contribution of each residue to the predicted function. We utilized the official model checkpoints of DeepFRI and used the Molecular Function branch for evaluation. HEAL is a deep learning model for protein function prediction that captures structural features via a hierarchical graph transformer. We downloaded the pre-trained parameters of HEAL and applied the gradient-weighted Class Activation Map (grad-CAM)88 to rank the activation score of each residue. Random is a baseline strategy that randomly ranks the importance of each residue; we reported its average performance across five runs. Top-1 Hit Ratio (Top-1 HR), Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR) were used as evaluation metrics. These models were also used in subsequent evaluations (see Fig. 1f, Fig. 2c and Supplementary Fig. 6a).
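For reference, the three metrics can be computed per protein as sketched below (binary relevance against the annotated functional sites; the exact formulations used for reporting may differ) and then averaged over the dataset.

import numpy as np

def top1_hit_ratio(ranked, true_sites):
    return float(ranked[0] in true_sites)                 # is the top-ranked residue a functional site?

def mean_reciprocal_rank(ranked, true_sites):
    for i, r in enumerate(ranked, start=1):
        if r in true_sites:
            return 1.0 / i                                # reciprocal rank of the first hit
    return 0.0

def ndcg(ranked, true_sites):
    gains = np.array([1.0 if r in true_sites else 0.0 for r in ranked])
    discounts = 1.0 / np.log2(np.arange(2, len(ranked) + 2))
    ideal = np.sort(gains)[::-1] @ discounts
    return float((gains @ discounts) / ideal) if ideal > 0 else 0.0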
Zero-shot protein fitness modeling.
ProMEP quantifies the log-likelihood of protein variants under the constraints of both sequence and structure, as shown in the equation below. To model the fitness of a protein, the wild-type sequence S and the structural constraints C are fed into ProMEP, which in turn outputs a sequence of log probabilities. We calculated the conditional probabilities of the mutated amino acid mt and the wild-type amino acid wt at each mutated position t. The sum over all mutated positions T gives the final fitness score of a protein variant.
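Consistent with this description, the score takes the following form (a reconstruction of the referenced equation; the notation in the original may differ):

\mathrm{Fitness}(\mathrm{variant}) \;=\; \sum_{t \in T} \Big[ \log p\big(x_t = m_t \mid S, C\big) \;-\; \log p\big(x_t = w_t \mid S, C\big) \Big],

where S is the wild-type sequence, C denotes the structural constraints, and T is the set of mutated positions.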
We sourced three representative DMS datasets to investigate ProMEP's ability to model protein fitness in a zero-shot manner. Specifically, the REV_HV1V2 dataset contains 2,147 single mutations with measured replication of the REV protein derived from the HIV-1 virus48. The KKA2_KLEPN dataset contains 4,960 single mutations with measured growth of APH(3')II derived from Klebsiella pneumoniae16. SPG1_STRSG is a large dataset that contains 1,045 single mutations and 535,917 double mutations with measured binding fitness of protein GB1 derived from Streptococcus sp. group G10. The 20 DMS datasets used for the generalization test were sourced from Notin et al.30 (see Supplementary Table 3).
We used ESMFold to predict the structure of the wild-type protein in these datasets. For each wild-type protein, we collected ~300 homologous sequences from the NR database with sequence identity lower than 80% and predicted their structures via ESMFold. We then used these homologous samples to fine-tune ProMEP for 3 epochs with a learning rate of 1e-4. The fine-tuning procedure enables ProMEP to learn a better understanding of sequence and structural constraints from homologs sampled through evolution. ESM2-15B, the largest open-sourced protein language model (~22 times as many parameters as ProMEP), was used as a leading baseline. Results of other baseline methods were obtained from the ProteinGym benchmark30.
ProMEP-guided protein engineering of TnpB.
We fine-tuned ProMEP for 3 epochs with 300-500 homologous proteins retrieved from the NR database with a sequence identity threshold of 80%. All structures of homologous sequences were predicted by ESMFold.
We began with protein variants carrying single mutations. Specifically, we constructed a virtual saturation mutagenesis library that only contains single variants (7,752 variants). We then ranked all variants by the calculated fitness score and analyzed the enrichment of mutants among the top 5% of variants (387 variants). Since arginine-based mutants were significantly enriched in the top 5% of variants, we chose the top 10 and bottom 10 arginine-based variants from the entire ranked list for further evaluation. We also randomly selected 10 arginine-based mutants as a control set.
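The virtual library construction and ranking can be sketched as follows; the 7,752 single variants correspond to 19 substitutions at each position, and fitness_fn is a placeholder for the ProMEP scoring described above.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def saturation_library(wild_type):
    # all single-substitution variants: len(wild_type) x 19 entries
    return [(pos + 1, wt, aa)
            for pos, wt in enumerate(wild_type)
            for aa in AMINO_ACIDS if aa != wt]

def rank_by_fitness(variants, fitness_fn):
    # fitness_fn: callable returning the ProMEP fitness score of one variant (placeholder)
    return sorted(variants, key=fitness_fn, reverse=True)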
To generate variants with multiple mutations, we constructed four virtual arginine-based mutagenesis libraries: double arginine-based mutants containing S72R (371 mutants), triple arginine-based mutants containing S72R (68,635 mutants), all double arginine-based mutants (69,006 mutants), and all triple arginine-based mutants (8,710,740 mutants). Again, we calculated the fitness score of each variant in the four virtual arginine-based mutagenesis libraries. According to the experimental data from the top-10 arginine-based single mutants, we filtered out mutants that contain neutral or negative mutations (Y388R, S217R, L398R, T405R, L406R, K44R and H403R) from the top-ranked variants. The top 10 ranked double-mutant and triple-mutant variants from each mutagenesis library were selected for further evaluation. P values were derived by a two-tailed Student's t-test. All statistical analyses were performed on n = 3 biologically independent experiments.
Plasmid vector construction.
The TnpB gene was codon-optimized for expression in human cells, and the optimized sequence was synthesized for vector construction by Sangon Biotech. We inserted the optimized sequence into the pST1374 vector, which contains a CMV promoter and a nuclear localization signal. All reRNA plasmids were cloned using T4 DNA Ligase (New England Biolabs). Oligos for targeting spacers were annealed and ligated into BsaI (New England BioLabs)-digested PGL3-U6 backbone vectors. The spacer sequences used in the study are shown in Supplementary Table 9. All final constructs were validated by Sanger sequencing.
TnpB engineering.
The construction of TnpB mutants was achieved through site-directed mutagenesis. PCR amplifications were performed using Phanta Max Super-Fidelity DNA Polymerase (Vazyme). Following digestion with DpnI (New England BioLabs), the PCR products were then ligated using 2X MultiF Seamless Assembly Mix (ABclonal). Ligated products were transformed into DH5α E. coli cells. The success of the mutations was confirmed via Sanger sequencing. The modified plasmid vectors were purified using a TIANpure Midi Plasmid Kit (TIANGEN).
Cell culture and transfection.
HEK293T cells were maintained in Dulbecco's modified Eagle medium (Gibco) supplemented with 10% fetal bovine serum (Gemini) and 1% penicillin–streptomycin (Gibco) in an incubator (37 °C, 5% CO2). For indel analysis, HEK293T cells were transfected at 80% confluency at a density of approximately 1 × 10^5 cells per well of a 24-well plate. Transfection was conducted following the manufacturer's manual with 2 μl of ExFect Transfection Reagent (Vazyme) and 1 μg of plasmids (0.5 μg of reRNA plasmid + 0.5 μg of TnpB plasmid).
DNA extraction and deep sequencing.
The transfected cells described above were washed with PBS (Gibco), and DNA was extracted using QuickExtract DNA Extraction Solution (Lucigen). Samples were incubated at 65°C for 60 minutes and heat-inactivated at 98°C for 3 minutes. The lysed products were used as templates for the first round of PCR (PCR1). PCR1 was conducted with barcoded primers (see Supplementary Table 12) to amplify the genomic region of interest using Phanta Max Super-Fidelity DNA Polymerase (Vazyme). PCR1 was performed under the following cycle conditions: 98°C for 3 min, [98°C 15 s, 60°C 15 s, 72°C 30 s]x29, 72°C 3 min. Following confirmation of successful PCR1 amplification by gel electrophoresis, the PCR1 products were pooled in equimolar amounts and purified, in preparation for the second round of PCR (PCR2). The PCR2 products were amplified using index primers (Vazyme) and purified with the FastPure Gel DNA Extraction Mini Kit (Vazyme) for sequencing on the Illumina NovaSeq platform. PCR2 was performed under the following cycle conditions: 98°C for 45 s, [98°C 15 s, 60°C 15 s, 72°C 30 s]x6, 72°C 3 min. Indels were analysed using CRISPResso2 with the following parameters: a minimum of 80% homology for alignment to the amplicon sequence, a quantification window of 20 bp, and ignoring substitutions to avoid false positives.