Sequence Representations and Their Utility for Predicting Protein-protein Interactions

doi:10.21203/rs.3.rs-62896/v1

Research article

This work is licensed under a CC BY 4.0 License

Journal publication: published 01 Jan, 2021 in IEEE/ACM Transactions on Computational Biology and Bioinformatics

Version 1

Background: Protein-Protein Interactions (PPIs) are a crucial mechanism underpinning the function of the cell. Predicting the likely relationship between a pair of proteins is thus an important problem in bioinformatics, and a wide range of machine-learning-based methods have been proposed for this task. Their success depends heavily on how the feature vectors are constructed, with most methods using a set of physicochemical properties derived from the sequence. Few work directly with the sequence itself. Recent work on embedding sequences in a low-dimensional vector space has shown the utility of this approach for tasks such as protein classification and sequence search. In this paper, we extend these ideas to the PPI prediction task, making inferences from the pair instead of the individual sequences.

Methods: We propose a generic PPI prediction framework that consists of a representation learning module for feature construction and a binary classifier. To construct the feature vector for a protein pair, we concatenate the distributed representations (embeddings) learned for the sequences of the constituent proteins. Each protein pair is represented as a 200-dimensional feature vector. To learn the embedding of a sequence, we use two established methods, Seq2Vec and BioVec, and we also introduce a novel feature construction method, which we call SuperVecNW. The embeddings generated through SuperVecNW capture network information to some extent, along with the contextual information present in the sequences. Finally, we feed these feature vectors into a random forest classifier to predict protein pair interactions.
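The pair feature construction described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: `toy_embedding` is a hypothetical stand-in for a learned embedding (Seq2Vec, BioVec, or SuperVecNW), the sequences are made up, and the per-protein dimension of 100 is an assumption inferred from two equal-sized embeddings concatenating to 200.

```python
import random

EMBED_DIM = 100  # assumed per-protein size: two of these concatenate to 200


def toy_embedding(sequence, dim=EMBED_DIM):
    """Hypothetical stand-in for a learned sequence embedding.

    Returns a deterministic pseudo-random vector seeded by the sequence,
    so the same sequence always maps to the same vector.
    """
    rng = random.Random(sequence)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def pair_features(seq_a, seq_b, embed=toy_embedding):
    """Concatenate the two per-protein embeddings into one pair feature vector."""
    return embed(seq_a) + embed(seq_b)


# Made-up sequences, for illustration only.
protein_a = "MKTAYIAKQR"
protein_b = "GSHMLEDPVA"
features = pair_features(protein_a, protein_b)
print(len(features))
```

Note that plain concatenation is order-dependent, so the pairs (A, B) and (B, A) yield different vectors; the abstract does not say how pair order is handled. In the framework described above, vectors like `features`, together with interaction labels, would then be passed to the random forest classifier.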

Results: To show the efficacy of our proposed approach, we evaluate its performance on human and yeast PPI datasets, benchmarking against the established methods. Furthermore, we test our approach on three well-known networks: the one-core network (CD9), the multiple-core network (Ras-Raf-Mek-Erk-Elk-Srf pathway), and the cross-connection network (Wnt-related network), and demonstrate improved PPI prediction compared to the other methods.

Conclusions: Naive low-dimensional sequence embeddings provide better results on the protein-protein interaction prediction task than most of the alternative representations based on physicochemical properties. These methods require only modest computational effort due to their lower dimensionality. Advanced representation learning methods that enrich the sequence embeddings with meta information are expected to improve the results further.

Bioinformatics

Sequence embedding

Machine learning

Protein-Protein interactions