In late 2019, human cases of a new respiratory disease were detected in Wuhan, China1,2. The infectious agent was identified as a new RNA virus of the family Coronaviridae, named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)3. The virus spread worldwide over the following months, and cases have been reported in more than 200 countries. As of September 2022, over 600 million infections and more than 6.5 million deaths have been attributed to this virus (https://covid19.who.int/). SARS-CoV-2 is the third coronavirus capable of causing severe respiratory disease in humans to emerge in the last 18 years, after the severe acute respiratory syndrome coronavirus (SARS-CoV), detected in 2002 in China4, and the Middle East respiratory syndrome coronavirus (MERS-CoV), identified in 2012 in Saudi Arabia5.
SARS-CoV-2, along with SARS-CoV, is a member of the subgenus Sarbecovirus. It has a positive-sense single-stranded RNA genome of approximately 30,000 nucleotides that encodes four major structural proteins: spike, envelope, membrane, and nucleocapsid, the products of the S, E, M and N genes, respectively6. The spike protein has a central role in cell infection and pathogenesis, since it mediates the recognition of cellular receptors and the fusion of the viral and cell membranes, a process that ultimately leads to the entry of the virus into the cell7. The spike protein contains a receptor-binding domain (RBD), which gives the virus an affinity for angiotensin-converting enzyme 2 (ACE2), a human cell receptor also used by SARS-CoV for binding1,8. The RBD is the most variable part of the genome among coronaviruses and has an important role in determining the host range of each viral species9,10. The RBD of SARS-CoV-2 is characterized by six contact amino acid residues that are essential for binding to the ACE2 receptor9 and that lie in a region known as the variable loop11.
The evolutionary origin of SARS-CoV-2 remains unclear. Analyses of full genome sequences indicate that the closest known relatives of this virus are the bat Sarbecoviruses RmYN02 and RaTG132,12, which suggests an emergence in humans following a spillover from bats, either directly or through an intermediate host. However, Sarbecoviruses, like other coronaviruses, recombine frequently11,13, and examination of the SARS-CoV-2 RBD sequence has suggested more complex scenarios for its origin that involve genetic recombination. Indeed, while bat viruses are the closest relatives of SARS-CoV-2 when full genome sequences are considered, the RBDs of SARS-CoV-2 and GD410721, a Sarbecovirus isolated from Malayan pangolins (Manis javanica), are highly similar and share the six amino acid residues that confer the affinity for ACE2 receptors14,15.
Hence, three of the four main hypotheses for the origin of the variable loop of SARS-CoV-2 involve recombination (Fig. 1). First, it has been postulated that SARS-CoV-2 acquired the variable loop of the RBD through a recombination event with a pangolin Sarbecovirus (Fig. 1a). Previous studies have used tools for detecting recombination breakpoints and generating similarity plots to find signals of recombination events in the S gene involving SARS-CoV-2, RaTG13 and GD41072115–17. A second hypothesis, proposed by Boni et al.11, suggests that the common ancestor of SARS-CoV-2, GD410721 and RaTG13 was capable of recognizing human ACE2 receptors, and that part of the RBD of RaTG13 was later replaced through recombination with another Sarbecovirus lacking the ACE2-specific residues (Fig. 1b). The recent discovery of Sarbecoviruses in different species of cave bats (BANAL-52, -103, and -236) with an RBD almost identical to that of SARS-CoV-2 has motivated a third hypothesis, in which SARS-CoV-2 acquired the contact residues through recombination events among bat Sarbecoviruses18 (Fig. 1c). Finally, the affinity for human ACE2 receptors found in viruses that infect different hosts carrying the same type of receptor could also be the result of convergent evolution9 (Fig. 1d).
Given the potential importance of recombination in the evolutionary history of the S gene, the detection of recombination events has become an important task in evolutionary studies of SARS-CoV-2. Recombination among Sarbecoviruses has been explored previously with tools such as SimPlot19, RDP420, RDP521 and GARD22. These methods identify recombinant strains and recombination breakpoints, as well as the potential parental strains involved. However, they cannot estimate ancestral recombination graphs (ARGs), which represent the reticulate evolution induced by genetic recombination. This seriously limits the possibility of generating a comprehensive evolutionary picture and prevents phylogenetic uncertainty from being propagated into the estimation of recombination events23–25.
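The similarity plots produced by tools such as SimPlot rest on a sliding-window comparison between a query sequence and one or more reference sequences along an alignment. The following is a minimal sketch of that idea only, assuming pre-aligned sequences and simple per-window identity. It is not SimPlot's implementation, and the function name, parameters and toy sequences are hypothetical.

```python
def window_identity(query, reference, window=10, step=5):
    """Return (window_start, identity) pairs for two aligned sequences.

    Identity is the fraction of matching columns in each window,
    ignoring columns where either sequence has a gap ("-").
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = reference[start:start + window]
        # Keep only columns where neither sequence has a gap.
        pairs = [(a, b) for a, b in zip(q, r) if a != "-" and b != "-"]
        if not pairs:
            results.append((start, None))  # window contains only gaps
            continue
        matches = sum(a == b for a, b in pairs)
        results.append((start, matches / len(pairs)))
    return results


# Toy alignment (hypothetical data): the query matches ref_a in the
# first half and ref_b in the second half.
query = "ACGTACGTACGTACGTACGT"
ref_a = "ACGTACGTAC" + "T" * 10
ref_b = "T" * 10 + "GTACGTACGT"

for name, ref in (("ref_a", ref_a), ("ref_b", ref_b)):
    print(name, window_identity(query, ref, window=10, step=10))
```

A switch in which reference is most similar to the query between adjacent windows is the kind of signal that such plots are used to flag as a candidate recombination breakpoint. On its own it provides neither a date nor a measure of phylogenetic uncertainty.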
The reconstruction of ARGs from sequence data is a notoriously difficult task. Nevertheless, several software packages have been developed to tackle this challenge. One example is the Bayesian approach implemented in the BEAST2 package Bacter24. Bacter is an implementation of the ClonalOrigin model26, which can be used to estimate a special type of ARG referred to as an ancestral conversion graph (ACG). An ACG consists of a backbone bifurcating phylogeny representing the evolution of the major part of the genetic material (referred to as the "clonal frame"), together with recombination events involving donor and recipient lineages on the clonal frame (Supplementary Fig. S1 online). The method estimates recombination events within a dated phylogeny, together with measures of statistical support and uncertainty. This allows us to date individual recombination events, which in turn provides an estimate of the emergence time of recombinant lineages. A different Bayesian approach to recombination-aware phylogenetic analysis produces recombination networks, in which recombinant edges are also integrated within the phylogeny27. Unlike Bacter, however, this approach does not estimate the posterior support for the arrival point of the recombination on the recipient lineage. In this study we employed Bacter to detect recombination events within the RBD region of 39 Sarbecovirus genomes, with the goal of clarifying the origin of the amino acids located in the variable loop that give SARS-CoV-2 its high affinity for the ACE2 receptors of human cells.
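To make the structure of an ACG concrete, the sketch below represents a clonal frame as a child-to-parent mapping and attaches conversion events that link a donor lineage to a recipient lineage over a range of alignment sites. This is an illustration under simplifying assumptions, not Bacter's internal representation. The class names, field names, lineage labels and numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Conversion:
    donor: str      # clonal-frame lineage the segment comes from
    recipient: str  # clonal-frame lineage that receives the segment
    time: float     # age at which the segment arrives (arbitrary units)
    start: int      # first affected alignment site (0-based, inclusive)
    end: int        # last affected alignment site (inclusive)


@dataclass
class ConversionGraph:
    parent: Dict[str, str]  # clonal frame: child lineage -> parent lineage
    conversions: List[Conversion] = field(default_factory=list)

    def clonal_ancestors(self, lineage: str) -> List[str]:
        """Follow the clonal frame from a lineage up to the root."""
        path = []
        while lineage in self.parent:
            lineage = self.parent[lineage]
            path.append(lineage)
        return path

    def conversions_at(self, site: int) -> List[Conversion]:
        """Return the conversion events whose segment covers a site."""
        return [c for c in self.conversions if c.start <= site <= c.end]


# Toy example (hypothetical lineages): frame ((A, B), C) with one
# conversion that brings sites 100-199 from lineage C into lineage A.
graph = ConversionGraph(
    parent={"A": "AB", "B": "AB", "AB": "root", "C": "root"},
    conversions=[Conversion("C", "A", time=0.5, start=100, end=199)],
)

print(graph.clonal_ancestors("A"))
print(graph.conversions_at(150))
print(graph.conversions_at(50))
```

In this toy graph, sites outside the converted segment follow the clonal frame alone, whereas sites inside it trace back through the donor lineage. This is why a single alignment can support different local genealogies in different regions.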