What restriction sites (RS) sequence of the viral genome can say: generation of a restriction endonucleases barcoding map.
During the SARS-CoV epidemic outbreak in 2003, a method called reverse genetic to assemble a full-length cDNA of the SARS-CoV-Urbani strain, as a template for manipulation of the viral genome, was published to develop and test candidate vaccines and therapeutics9. This resulted in the so-called infectious clone icSARS-CoV containing atypical markers of the wild-type (WT) virus. In particular, several Bgl1 RSs were introduced into the icSARS-CoV cDNA, which can be recognized since mutation are included in the newly formed cDNA. Figure 1A shows the sequence alignment between the WT SARS-CoV-Urbani and the icSARS-CoV. We highlighted the sequence containing the Bgl1 RS used to produce icSARS-CoV.
The newly sequences introduced in the recombinant cDNA of SARS-CoV can be used as markers to follow possible virus laboratory spillage. We analysed natural sequences isolated from four different SARS-CoV (hCoV-19-Italy-Vr/hSARS-CoV-19-Wuhan/hCoV-19 Pangolin/Bat-Cov-raTG13) to look for Bgl1 RS ‘marker’ (GCCNNNN/NGGC). All the genomes did not contain these sites (Table S1). In particular, the hSARS-CoV-19-Wuhan and hCoV-19-Italy-Vr did not show these sites excluding a possible accidental virus spill-over. By analysing the sequences of Bat-CoV-raTG13 and the hCoV-19-Pangolin, we observed that only one sequence from SARS-CoV-Urbani (GCCAGCGTGGT) was found in SARS-CoV-2. This is expected since the first part of these two genomes show high similarities.
Another recombinant SARS-CoV was produced in 2007 which derived from fifteen passages of the SARS-CoV-Urbani in BALB/c mouse lungs, therefore it was named Mouse-Adapted (MA)-SARS-CoV10. The homology of MA-SARS-CoV compared with the original SARS-CoV-Urbani is 99.97% with only six distinct nucleotides, that cannot be used as markers of this recombinant virus since same mutations are naturally acquired by the WT-SARS-CoV, as demonstrated from the sequences of other isolated SARS-CoV10. Both the icSARS-CoV and the MA-SARS-CoV have become the most widely used recombinant viruses to study SARS-like viruses and no specific sequences were found in the hSARS-CoV-2.
In 2008, a consensus sequence called Bat-SCoV (FJ211859) was generated starting from four Bat-SCoVs genomes HKU3–1 (DQ022305), HKU3–2 (DQ084200), HKU3–3 (DQ084199), and RP3 (DQ071615)11. The full-length Bat-SCoV infectious clone, generated with the method described by Yount et al,9 include in the recombinant sequence specific markers such as the Bgl1 RSs. These RSs have specific nucleic base pairs in the “N” positions of the recombinant Bat-SCoV (see supplementary figure 5 of Becker et al. summarizing all the markers found11). We observed that these specific sequences were all absent (Figure 1B, Figure S1). This reinforces the theory that these recombinant viruses manipulated in the laboratory cannot be progenitors of the current SARS-CoV-2.
Other markers to identify the origin of SARS-Cov-2.
In 2008, Ren et al showed that SARS-like coronavirus (SL-CoVs) from horseshoe bat, which has a high similarity to SARS-CoV, differed in the N-terminus of the spike protein and particularly in the receptor binding RBS region12. Therefore, SL-CoVs were not able to infect hACE2 expressing cells, but only chimeric virus expressing the spike protein of the SARS-CoV were able to bind the hACE2 which is the functional receptor of SARS-CoV. The authors identified a specific region responsible for the virus entrance into hACE2-expressing cells consisting of a minimal region of less than 200 amino acids. Interestingly, this group showed that chimeric spike proteins, whereby different region of the SARS-CoV BJ01 (BJ01-S) spike were substituted into the spike of the bat SL-CoV (Rp3), were able to bind the hACE receptor. We generated in silico two of these chimeric spike (CS) sequences (the CS424-494 and the CS45-608), and then performed a multiple alignment to check similarities between other spikes identified after 2008, including the Bat-Cov-raTG13, the hCoV-19-Pangolin and the hSARS-CoV-2. The similarities of these two chimeric spikes is limited in the RBD of the spike (Figure 1C) and in the polybasic cleavage site (Figure S2). Thus, the recombinant spike as possible progenitors of the hSARS-CoV-2 spike sequence can be excluded.
Moreover, we performed a nucleotide blast sequence to find whether these recombinant spikes are found in the recent identified SARS-CoV-2 viruses. As shown in Figure S3-S5, we observed that, despite high similarities, many gaps (intended as single base mutations) are present between WT viruses and the recombinant spikes excluding possible manipulations.
The turning point arrived in 2013, when Xing-Yi and colleagues published an important paper showing that a WT bat SL-CoV was capable of using hACE2 as an entry receptor, dispelling the observation that no natural SL-SARS-CoV were able to use hACE2. Interestingly, the newly identified bat SL-CoV-WIV1 had high sequence similarity (99.9% identity) to two other identified WT bat coronaviruses, RsSHC014 and RS3367. This study suggested that direct bat-to-human infection is a possible scenario for some bat SL-CoVs. In 2015, Vineet et al made a recombinant virus between the spike of the bat coronavirus SHC014 and the mouse-adapted SARS-CoV backbone13 using the well establish reverse genetic approach9. According to this method, several Bgl1 RSs were included into the sequence (Table S2). Moreover, the sequences between the newly mutant SARS-CoV has a poor sequence similarity to hCoV-19-Italy-VR and the SARS-CoV-19-Wuhan (Figure 1D).
Unique restriction sequence sites: a novel approach to track the SARS-CoV-2 origin.
Exploiting the RS sequences as specific markers, we propose an alternative way to trace the SARS-CoV-2 origin. This approach consists in the generation of a RS map of SARS-CoV-2 and the other four related coronavirus genomes. Using the Serial Cloner Restriction Enzyme Library, we generated the RS barcoding map based on the frequency of finding specific RS sequences in the genome. First, we generated a RS barcoding map which was used as genetic fingerprinting of the specific sequence analysed and which easily highlights sequence differences between the genomes. The pattern of the barcode’s reconstruction demonstrated high similarity between the coronavirus isolated from the Bat-Cov-raTG13 and the Pangolin, suggesting a naturally evolution and adaptation of the virus. HIV, SARS-CoV and MERS-CoV were used as control (Figure 2A and Figure S6).
From the full restriction enzyme barcoding map, we identified in the spike (S) gene a sequence of 300 bp that can be used as barcode to identify the virus and differentiate from others (Figure 2B and S7). This approach is low-cost and does not require full sequencing of the virus genome and extended analyses conducted by bioinformaticians. Indeed, by using a standard PCR reaction to amplify the above mentioned 300bp spike gene, or simply by using real-time PRC products from swab test, and subsequent sequencing of this region, it is possible to generate an RS barcode that will give us a low-cost system to follow viral mutation and trace it over subsequent years. Moreover, this approach can easily be used to discriminate between false negative and false positive which are the reasons of important additional socio-economic disruptions14.
Using the data to generate the full barcode map we performed principal component analysis (PCA) to determine whether the observed frequency of RSs is related to the hierarchical distance of the genomes analysed. The PCA plot shows the top cluster on PC2 formed by the Pangolin and the Bat-Cov-raTG13 lies in close proximity to the hSARS-CoV-Wu (Figure 2C). Below, the cluster formed by the bat SARS-CoV related and the SARS-CoV Urbani. The HIV genome and the MERS were used as control and clearly show greater difference in sequence homology from SARS-CoV virus.
In addition, we focalised on informative RSs to perform a hierarchical clustering on the heatmap using Pearson correlation as distance metric (Figure 2D and S8). The heatmap confirm that hSARS-CoV-2 and Bat-Cov-raTG13 are closer than hCoV-19 Pangolin and MERS CoV.
Finally, the barcode map of the RSs confirmed the absence of unique sites giving another strong evidence that the SARS-CoV-19 is the product of a natural evolutionary process of single base insertions/deletions or recombination.
We then focused on the unique RS sequences used to modify the viral genome. In particular, we analysed shared sites between SARS-CoV-19, Bat-Cov-raTG13 and Pangolin-SARS-CoV-19. Only six RS sequences were shared between these genomes and their location does not suggest any genetic manipulation. In the Venn diagram shown in Figure 2E, there are 12-shared RS. However, they are only six if we consider that some of these enzymes recognize the same sequences. One example is the unique RS sequence recognized by Bsp68I, BtuMI, NruI, RruI found at 319bp on the Bat-Cov-raTG13 and shifted at 334bp on the SARS-CoV-19 and the Pangolin-SARS-CoV-19. This 15 bp shift is due to single base insertions (Figure 2F).
Another example is the unique RS sequence GAGCTC recognized by Ecl136II on the SARS-CoV-19 genome that is located at 15081bp, while on the Bat-Cov-raTG13 genome we found two of these sequences, one at 15080bp and the other one at 19768bp. The latter, if it were to be the result of genetic engineering, would be predicted to produce a gap of 6bp, while from the local alignment it is clear that a nucleotide substitution occurred from C to T forming the new site (Figure S9A).
Finally, the genomic location of these unique sites does not flank specific ORF. Indeed, engineered RSs are typically expected to be at the beginning and at the end of an ORF. Here, all the unique RSs are located inside the ORFs (Figure S9B), thus not easily editable by conventional genetic engineering.