Most novel viral human diseases, particularly those that have caused recent epidemics, are known to have originated in non-human animal hosts 1, 2, 3. Host expansion, the ability of a virus to cross species, is an essential step in the evolution of such viruses 3, 4, 5. COVID-19 is a recent example of a disease caused by a host expansion event that permitted SARS-CoV-2, a SARS-related coronavirus, to propagate from an as yet unknown non-human animal to humans 5. Alpha and beta coronaviruses affect a wide range of animals that interact with humans, including farm animals and camels, thus facilitating zoonotic transmission 6, 7. Moreover, all seven human coronaviruses belong to either the alpha or beta coronavirus genus 7. While several studies have confirmed bats and rodents as natural hosts for the alpha and beta coronaviruses affecting humans, there is evidence of intermediate hosts that facilitate evolutionary events, leading to strains that eventually propagate in humans 1, 6, 8. Determining which non-human animal viruses may infect humans remains a challenge.
Experimental evidence is still the gold standard for determining whether a virus can infect a host 9, 10. However, the complete host range of a virus is often unknown. Recent studies have used diverse in-silico techniques to predict viral hosts and host expansion events, including qualitative expert analysis 11, probabilistic models 12, and machine learning (ML) models 13, 14, 15, 16, 17.
The problem of host prediction is commonly addressed using similarity analysis of viral genomes, where similar genomes are more likely to share the same hosts 10, 18. Host prediction through genome similarity can be achieved by alignment-based or alignment-free approaches 17, 19. The computational cost of alignment-based approaches grows with the product of the lengths of the sequences being aligned 19, 20, and these approaches are sensitive to genome rearrangements 19, 20, 21. These observations suggest that alignment-free approaches may be preferred when datasets are very large or when sequences in the dataset are the product of recombination events. However, most alignment-free approaches disregard the relative position of the residues along the sequence 14.
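As an illustration of the alignment-free idea, the following minimal sketch compares two sequences through their k-mer count profiles. The function names, the choice of k = 3, the cosine similarity measure, and the toy sequences are assumptions made for this example only; they are not taken from the cited studies. The example also shows the limitation noted above: swapping two blocks of a sequence barely changes its k-mer profile, because the profile ignores where each k-mer occurs.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count the overlapping k-mers of a sequence (positions are discarded)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy sequences (made up for illustration): the second is the first with
# its two 10-residue blocks swapped, mimicking a rearrangement.
block_1, block_2 = "MKTAYIAKQR", "GSHMLEDPVW"
original = block_1 + block_2
rearranged = block_2 + block_1

similarity = cosine_similarity(kmer_counts(original), kmer_counts(rearranged))
print(f"{similarity:.3f}")  # only the k-mers spanning the junction differ
```

Building each profile takes time linear in the sequence length, in contrast to pairwise alignment, whose cost grows with the product of the two lengths.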
Some alignment-free studies aimed to predict the host of a specific species of virus 13, 14, while others 15, 16, 17 created models to uncover signals common to different viruses (e.g. Zika, influenza, coronavirus) affecting a large group of hosts such as Chordata (vertebrates and others) 15, 17. Although common signals between completely different families of viruses are useful for host prediction, these studies include only a limited number of representatives of each taxon across hosts and disregard the specific properties of the virus, preventing further mechanistic analysis of host expansion pathways.
In this work, we study the potential of alpha and beta coronaviruses to cause human infection. In particular, we aim to predict whether the spike (S) protein of a coronavirus binds a human receptor. The S protein decorates the exterior of the viral envelope and is key in host expansion, since its binding to the host receptor protein triggers the infection process 22, 23, 24. Starting with a collection of amino acid sequences from the S protein, we build a machine learning model that predicts binding to a human host receptor. We propose a skip-gram model, which uses a neural network to transform the data into vectors. These vectors encode the relationship between neighboring protein subsequences of length k (i.e. k-mers). A classifier uses these vectors to score each sequence according to its binding potential to a human receptor. We call this score the human-binding potential (h-BiP). We use a dataset consisting of 2,534 unique spike sequences from alpha and beta coronaviruses spanning all clades and variants (see Methods). The classifier is highly accurate, and its h-BiP score is highly correlated with sequence identity against human viruses. Moreover, the proposed h-BiP score also discriminates the binding potential in cases with similar sequence identity and detects binding in cases of low sequence identity. We identify two viruses, Bt133 25 and LYRa3 26, with high h-BiP values and as yet unknown human binding properties. Consistent with this finding, a phylogenetic analysis shows that Bt133 and LYRa3 are related to non-human viruses known to bind human receptors. Furthermore, a multiple sequence alignment of the receptor binding motifs (RBM) of Bt133 and of LYRa3 with their related viruses revealed that they conserve the contact residues with the human receptor. Molecular dynamics (MD) of the receptor binding domain (RBD) validates binding and identifies contact residues with human receptors.
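To make the notion of neighboring k-mers concrete, the sketch below shows how a skip-gram model is typically fed: the sequence is split into k-mers, and each k-mer is paired with the k-mers inside a small window around it. The overlapping tokenization, k = 3, the window size, and the function names are assumptions chosen for illustration and may differ from the settings described in Methods. Training the embedding network and the downstream classifier is not shown.

```python
from typing import List, Tuple


def tokenize_kmers(seq: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (target, context) pairs: each k-mer with its neighbors in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a k-mer is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Toy fragment used purely as an example input.
tokens = tokenize_kmers("MFVFLV", k=3)
print(tokens)
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

A skip-gram network is trained to predict the context k-mer from the target k-mer, so k-mers that occur in similar neighborhoods end up with similar vectors. Unlike plain k-mer counts, these pairs retain local order information.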
Finally, we test whether this model can be used for the surveillance of host expansion events. We emulate the conditions prior to SARS-CoV-2 emergence by excluding all SARS-CoV-2 sequences from the training set and find that the re-trained model predicts binding of the wild type of SARS-CoV-2 to a human receptor.
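The hold-out experiment described above can be sketched as a split by virus rather than a random split. The record layout, the field names, and the placeholder entries below are hypothetical and serve only to show the mechanics; no real sequences or labels from the dataset are reproduced.

```python
from typing import List, NamedTuple, Tuple


class SpikeRecord(NamedTuple):
    virus: str         # virus name used for grouping (hypothetical field)
    sequence: str      # spike amino acid sequence (placeholder strings here)
    binds_human: bool  # training label (hypothetical field)


def split_holdout(
    records: List[SpikeRecord], held_out_virus: str
) -> Tuple[List[SpikeRecord], List[SpikeRecord]]:
    """Remove every record of one virus from training and keep it for testing."""
    train = [r for r in records if r.virus != held_out_virus]
    held_out = [r for r in records if r.virus == held_out_virus]
    return train, held_out


# Placeholder records, not real data.
records = [
    SpikeRecord("SARS-CoV", "PLACEHOLDERSEQA", True),
    SpikeRecord("example-bat-CoV", "PLACEHOLDERSEQB", False),
    SpikeRecord("SARS-CoV-2", "PLACEHOLDERSEQC", True),
    SpikeRecord("SARS-CoV-2", "PLACEHOLDERSEQD", True),
]

train, held_out = split_holdout(records, "SARS-CoV-2")
print(len(train), len(held_out))
```

Splitting by virus ensures that the re-trained model sees no sequence of the held-out virus, which a random split of individual sequences would not guarantee.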