Background: Protein-Protein Interactions (PPIs) are a crucial mechanism underpinning the function of the cell. Predicting the likely relationship between a pair of proteins is thus an important problem in bioinformatics, and a wide range of machine-learning-based methods have been proposed for this task. Their success depends heavily on how the feature vectors are constructed, with most methods using a set of physicochemical properties derived from the sequence. Few work directly with the sequence itself. Recent work on embedding sequences in a low-dimensional vector space has shown the utility of this approach for tasks such as protein classification and sequence search. In this paper, we extend these ideas to the PPI prediction task, making inferences from the pair instead of the individual sequences.
Methods: We propose a generic PPI prediction framework that comprises a representation learning module for feature construction and a binary classifier. To construct the feature vector for a protein pair, we concatenate the distributed representations (embeddings) learned for the sequences of the constituent proteins. Each protein pair is represented as a 200-dimensional feature vector. To learn the embedding of a sequence, we use two established methods, Seq2Vec and BioVec, and we also introduce a novel feature construction method, which we call SuperVecNW. The embeddings generated through SuperVecNW capture network information to some extent, along with the contextual information present in the sequences. Finally, we feed these feature vectors into a random forest classifier to predict protein pair interactions.
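The pair-feature construction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the per-protein dimension of 100 is an assumption inferred from the 200-dimensional pair vector, and `toy_embedding` is a hypothetical stand-in for a learned embedding such as Seq2Vec, BioVec, or SuperVecNW. The example sequences are arbitrary strings, not real proteins.

```python
import random

# Assumption: each protein embedding has 100 dimensions, so that the
# concatenated pair vector has the 200 dimensions stated in the text.
EMBED_DIM = 100


def toy_embedding(sequence, dim=EMBED_DIM):
    """Hypothetical placeholder for a learned sequence embedding.

    Returns a deterministic pseudo-random vector seeded by the sequence
    string. A real pipeline would look up or infer a trained embedding.
    """
    rng = random.Random(sequence)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def pair_features(seq_a, seq_b, embed=toy_embedding):
    """Concatenate the embeddings of two sequences into one feature vector."""
    return embed(seq_a) + embed(seq_b)


# Arbitrary example sequences (illustrative only).
features = pair_features("MKTAYIAKQR", "GSHMLEDPVA")
print(len(features))
```

Vectors built this way would then be passed, together with interaction labels, to a binary classifier such as a random forest. Note that plain concatenation is order-sensitive: swapping the two proteins yields a different vector.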
Results: To show the efficacy of our proposed approach, we evaluate its performance on human and yeast PPI datasets, benchmarking against the established methods. Furthermore, we test our approach on three well-known networks, namely the one-core network (CD9), the multiple-core network (Ras-Raf-Mek-Erk-Elk-Srf pathway), and the cross-connection network (Wnt-related network), and demonstrate an improvement in predicting PPIs compared to the other methods.
Conclusions: Naive low-dimensional sequence embeddings provide better results on the protein-protein interaction prediction task than most of the alternative representations based on physicochemical properties. These methods require only modest computational effort owing to their lower dimensionality. Advanced representation learning methods that enrich the sequence embeddings with meta-information are expected to improve the results further.