Background: Protein-protein interaction (PPI) is vital for life processes, disease treatment, and new drug discovery. Computational prediction of PPI is widely accepted because it is inexpensive and efficient compared with wet-lab experiments. When a new protein arrives and one wants to determine whether it has any PPI relationship with existing proteins, current computational prediction methods usually compare the new protein with the existing proteins one by one, in a pairwise manner. This is time-consuming.
Results: We propose a more efficient model, the Deep Hash Learning Protein-and-Protein Interaction (DHL-PPI) model, to predict all-to-all PPI relationships in a database. First, DHL-PPI encodes a protein sequence into a binary hash code based on features extracted from the sequence using deep learning. This encoding scheme turns the PPI discrimination problem into a much simpler search problem. A protein with a binary code can be regarded as a number. In the prescreening stage of PPI prediction, the string-matching problem of searching a string against a database of M proteins can be turned into a much simpler problem: finding a number in a sorted array of length M. This prescreening narrows the whole database down to a much smaller candidate set for further confirmation. Finally, DHL-PPI uses the Hamming distance to determine the final PPI relationship.
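The search idea described above can be illustrated with a minimal Python sketch. It is not the DHL-PPI implementation: the hash codes are assumed to be already computed, and the function names, the prefix-based prescreen (`prefix_bits`), and the rule that a Hamming distance of at most `max_distance` indicates an interaction are all hypothetical choices made only for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integer-encoded codes differ."""
    return bin(a ^ b).count("1")


def build_index(codes: Dict[str, str]) -> Tuple[List[int], List[str]]:
    """Treat each binary code as a number and sort the database by it.

    Returns two parallel lists: the sorted integer keys and the protein names.
    """
    pairs = sorted((int(bits, 2), name) for name, bits in codes.items())
    return [key for key, _ in pairs], [name for _, name in pairs]


def predict(keys: List[int], names: List[str], query_bits: str,
            prefix_bits: int, max_distance: int) -> List[str]:
    """Prescreen by binary search, then confirm with the Hamming distance.

    Assumption (not from the source): candidates are the entries that share
    the first `prefix_bits` bits with the query. In a sorted array these
    entries form one contiguous range, found with two binary searches.
    """
    query = int(query_bits, 2)
    shift = len(query_bits) - prefix_bits
    low = (query >> shift) << shift      # smallest code with the same prefix
    high = low + (1 << shift) - 1        # largest code with the same prefix
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    # Confirmation step on the (much smaller) candidate set only.
    return [names[i] for i in range(start, end)
            if hamming(query, keys[i]) <= max_distance]


# Toy database of four hypothetical proteins with 8-bit codes.
database = {"P1": "10110010", "P2": "10110111",
            "P3": "01001100", "P4": "10111111"}
keys, names = build_index(database)
hits = predict(keys, names, "10110011", prefix_bits=4, max_distance=1)
```

In this toy run, the prescreen keeps P1, P2 and P4 (they share the 4-bit prefix `1011` with the query), and the Hamming check then retains only P1 and P2. Each lookup costs O(log M) for the binary searches plus the size of the candidate set, which is where the saving over a one-by-one pairwise comparison comes from.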
Conclusions: The experimental results confirm that DHL-PPI is feasible and effective. On a dataset with strictly negative PPI examples from four species, DHL-PPI is superior to or competitive with other state-of-the-art methods in terms of precision, recall, or F1 score. Furthermore, in the prediction stage, DHL-PPI decreases the usual time complexity from O(M²) to O(M log M) for predicting all-to-all interactions between any pairs among M proteins in a database. A protein database can be stored in the proposed encoding scheme, ready to be searched, which makes it a potentially novel encoding scheme for coping with the current problem of searching large databases.