The first SARS-CoV-2 genome was sequenced in January 2020. Since then, hundreds of thousands of virus genomes have been collected and sequenced. Comparative analysis of SARS-CoV-2 variants has provided for the identification of the routes of virus transmission 1–4, the selective pressure on different genes 5, and the discovery of new variants associated with higher infectivity 6–8. In many cases, genome analysis only included search for point mutations, but some deletions also have been identified, such as del69-70 one of the characteristic mutations of B.1.1.7 and Cluster 5 2,3. Moreover, recently, recurrent deletions have been shown to drive antibody escape 9. However, insertion sequences are mostly ignored, both during variant calling step and in the downstream analysis.
Although insufficiently studied, insertions appear to be crucial for beta-coronavirus evolution. Three insertions in the spike (S) glycoprotein and in the nucleoprotein (N) have been shown to differentiate highly pathogenic beta-coronaviruses (SARS-CoV-1, SARS-CoV-2 and MERS) from mildly pathogenic and non-pathogenic strains and suggested to be the key determinants of human coronaviruses pathogenicity 10. The best characterized insert in SARS-CoV-2 is the PRRA tetrapeptide that so far is unique to SARS-CoV-2 and introduces a polybasic furin cleavage site into the S protein, enhancing its binding to the receptor 11,12. Furthermore, the entire receptor-binding motif (RBM) domain of the S protein, most likely, was introduced into the SARS-CoV-2 genome via homologous recombination with coronaviruses from pangolins, which could have been a critical step in the evolution of SARS-CoV-2’s ability to infect humans 13–15. Similar frequent homologous recombination events among coronaviruses, and in particular in the sarbecovirus lineage, suggest that homologous recombination events is a common evolutionary mechanism that might have produced new coronavirus strains with changed properties on multiple occasions 15,16. In contrast, non-homologous recombination in RNA viruses appears to be rarely detected, and its molecular mechanisms remains poorly understood 17.
In infected cells, beta-coronaviruses produce 5 to 8 major subgenomic RNAs (sgRNAs) 18,19. Eight canonical sgRNAs are required for the expression of all encoded proteins of SARS-CoV-2. These sgRNAs are produced by joining the transcript of the 5’ end of the genome (TRS site) with the beginning of the transcripts of the respective open reading frames (ORFs) 20. In addition, SARS-CoV-2 has been reported to produce multiple noncanonical sgRNAs, some of which include the TRS at 5’ end, whereas others are TRS-independent 21,22.
Inserts in the SARS-CoV-2 genome are categorized in the CoV-GLUE database 23, and the preliminary results on systematic characterization of the structural variance and inserts in particular have been reported 24. Forty structural variants including three inserts, three nucleotides long each, were discovered and shown to occur in specific regions of the SARS-CoV-2 genome. These variants were further demonstrated to be enriched near the 5’ and 3’ breakpoints of the TRS-independent transcriptome. Additionally, indels have been shown to occur in arms of the folded SARS-CoV-2 genomic RNA 24. However, longer inserts that might have been introduced into the virus genome during SARS-CoV-2 evolution, to our knowledge, have not been systematically analyzed.
Here we report the comprehensive census of the inserts that during the evolution of SARS-CoV-2 over the course of the pandemic and show that at least some of these result from the virus evolution and not from experimental errors. These inserts are not randomly distributed along the genome, most being located in the 3’terminal half of the genome and co-localizing with 3’ breakpoints of non-canonical (nc) sgRNAs. We show that the long insertions occur either as a result of the formation of nc-sgRNAs or by duplication of adjacent sequences. We analyze in detail the inserts in the S glycoprotein and show that at least two of these are located in a close proximity to the antibody-binding site in the N-terminal domain (NTD), whereas others are also located in NTD loops and might lead to antibody escape, and/or T cell evasion.
Identification of inserts in SARS-CoV-2 genomes
To compile a reliable catalogue of inserts in SARS-CoV-2 genome, we analyzed all the 498224 sequences present in the GISAID multiple genome alignment (compiled on February 23, 2021). From this alignment, we extracted all sequences that contained insertions in comparison with the reference genome. After this initial filtering, insertions were identified in 4468 genomes, with 296 unique events detected in total.
To eliminate insertions resulting from sequencing errors, we performed several additional filtering steps. First, we retained for further analysis only those insertions that were multiples of three, and thus would not lead to frameshifts, resulting in the reduction of the dataset to 157 unique events in 1030 genomes ranging in length from 3 to 195 nucleotides (Supplementary Table 1).
We then screened the Sequence Read Archive (SRA) database for the corresponding raw read data. We were able to obtain raw reads for 48 inserts (Supplementary Table 1), and verified the insertions in 32 cases. All insertions except one that we were unable to validate with the raw data analysis were of the length 3 or 6 nucleotides. We removed those unconfirmed events from our dataset that resulted in 141 events. Among these inserts, 65 were three nucleotides in length and 22 were of length 6, whereas the rest were longer (Fig. 1a). We observed that inserts of lengths 3 and 6 had a distinct nucleotide composition with a substantial excess of uracil, at about 45%, whereas the composition of the longer inserts was similar to that of the SARS-CoV-2 genome average, with about 30% U (Fig. 1b). The similar trend is observed for inserts verified by read data, although the available data is insufficient to demonstrate the significance of this trend for the 6 nucleotide inserts (Supplementary Fig. 1). Thus, we split the collection of inserts into two categories, the short inserts of length 3 and 6 nucleotides, and the long inserts, which we analyzed separately.
We then checked whether inserts that were present in multiple genome sequences were monophyletic, that is, whether the genomes containing the same insertion formed a clade in the large phylogenetic tree containing more than 300,000 SARS-CoV-2 genomes (see Materials and Methods). Of the 37 short inserts identified in multiple genomes, 11 were found to be monophyletic, and thus, apparently, originating from the single evolutionary event (Supplementary Table 2, Supplementary Fig. 2). In 9 cases, identical insertions were observed in genomes submitted from the same laboratory, and mostly, on the same date, which implies that the genomes were sequenced and analyzed together, and makes it difficult to rule out a sequencing error. Interestingly, all 14 cases that can be confirmed by read data were not monophyletic. However, among the 18 long inserts that were found in multiple genomes, 13 were monophyletic, and only in five of these cases, sequences were from the same laboratory. What is more, all 4 long inserts present in multiple genomes and confirmed by read data were monophyletic (Supplementary Table 2, Supplementary Figs. 3).
As the result of all these checks, the inserts detected in SARS-CoV-2 genomes fell into the following categories: 87 short inserts, among which 21 were confirmed by read data; and 54 long (at least, 9 nucleotides) inserts. We additionally classified the long inserts into four groups, in the order of increasing confidence: 29 singletons, 5 non-monophyletic inserts observed in multiple genomes, 9 monophyletic inserts observed in multiple genomes, and 11 inserts (7 singletons and 4 monophyletic ones), for which the insertions were confirmed by the raw sequence data analysis. We thus concluded that the 21 short inserts confirmed by read data and 25 long inserts that were detected in multiple genomes (monophyletic and not) and/or confirmed by raw sequencing data represented the most reliable insertion events that are currently observable throughout the evolution of SARS-CoV-2 (Supplementary Table 3).
Insertions are non-uniformly distributed along the SARS-CoV-2 genome
We found that the insertions were not randomly distributed along the genome, with most occurring in the 3’-terminal third of the genome (Fig. 1c). Two, not necessarily mutually exclusive main hypothesis have been proposed on the origin of the short inserts (structural variants in the coronavirus genomes, namely, that they are associated with loops in the virus RNA structure or occur in the hotspots of template switch, at the breakpoints of TRS-independent transcripts 24. To distinguish between these two mechanisms, we compared the distribution of 141 inserts along the SARS-CoV-2 genome with the distributions of structured regions 25 and of template switch hotspots 22(Fig. 1d). We detected a strong association of the insertions with the template switch hotspots (r = 0.37, p-value = 2.3x10− 11). Almost 30% of the inserts occurred within 5 nucleotides of a template switch hotspot, whereas less than 10% are expected by chance (Fig. 1e). The observed pattern of inserts occurring in stems is the same as expected at random, indicating that inserts were not overrepresented in loops (Fig. 1f). Both these observations held when we included in the analysis not all the 141 inserts, but only the 46 highly confident ones (Supplementary Fig. 4). Thus, many inserts in the CoV-2 genomes are associated with template switch hotspots.
Short insertions in SARS-CoV-2 are generated by template sliding
The notable difference in nucleotide composition and different phyletic patterns of short and long inserts imply that the two types of insertions occur via different mechanisms. As pointed out above, the short insertions are rarely monophyletic, indicating that short U-rich sequences are inserted in the same position in the SARS-CoV-2 genome on multiple, independent occasions during virus evolution. Taken together, these observations suggest that such short insertions occur via template sliding (polymerase stuttering) on short runs of As or Us in the template (negative strand or positive strand, respectively) RNA 26–28 (Supplementary Fig. 5a). This could be either a biological phenomenon occurring during SARS-CoV-2 evolution, in case the errors are produced by stuttering of the coronavirus RdRP, or an artifact if the errors come from the reverse transcriptase or DNA polymerase that is used for RNA sequencing. It cannot be completely ruled out that these short inserts are a mix of biological and experimental polymerase errors. However, for the 19 inserts of length 3 that were confirmed by sequencing data analysis, we also detected the U enrichment. Those inserts were observed at high allele frequencies in the data (Supplementary Table 1), and thus, are unlikely to be experimental errors. Additionally, short inserts appear to be represented with the same frequency in SARS-CoV-2 genomes sequenced with different technologies, including Illumina MiSeq, NovoSeq and NextSeq and even Oxford Nanopore or IonTorrent (Supplementary Table 1). Furthermore, elevated rate of thymine insertion has not been reported as a common error of either Illumina or Oxford Nanopore technology 29–32. In contrast, production of longer transcripts and slow processing on polyU tracts has been demonstrated for nsp12 (RdRP) of SARS-CoV-1 33. Additionally, the RdRp complex of SARS-CoV lacking the proof-reading domain has been shown to misincorporate more nucleotides compared with other viral polymerases 34. Thus, a substantial contribution of sequencing errors to the origin of short inserts in SARS-CoV-2 genomes appears unlikely.
Long insertions in SARS-CoV-2 are caused by template switching and local duplications
For in-depth analysis of the long inserts, we selected only the 25 high-confidence ones (see above), which included 117 genomes and ranged in size from 9 to 27 nucleotides (Fig. 2, Supplementary Table 4).
Insertions were mostly observed in genome sequences from Europe (82) and US (25), and originated from different laboratories that employed different protocols. Furthermore, these events started to accumulate in early November 2020, and the median collection date of the genomes containing the long inserts is January, 9 2021. Seven of the 25 reliable long insertions are located in the S gene, which is significantly higher than expected by chance (Fisher exact test p-value = 0.0165). The excess of inserts in the S gene suggests that their spread in the virus population could be driven by positive selection for enhancement of the interaction of SARS-CoV-2 with the host cells that could be conferred by the inserts.
The length of these high-confidence inserts allowed us to search for matching sequences both in SARS-CoV-2 genomes and in other viruses. For 13 cases, we were unable to identify the probable origin of the insertion. For four inserts, we detected a local duplication that most likely gave rise to the insertion (Supplementary Table 4; Supplementary Fig. 5b). Three out of these four were found in multiple genomes and two of them were monophyletic although there was no raw read data for any of these genomes. In one more case, the insertion was a singleton, but was supported by raw data.
In 8 more cases, we detected significant matches in the SARS-CoV-2 genome, 6 in the coding strand and two in the complementary strand (Fig. 2a; Supplementary Table 4). Among these 8 insertions, five were monophyletic (2 confirmed by raw data), and two more were singletons supported by raw data. The apparent origin of inserts from distant parts of the SARS-CoV-2 genomes implies template switch (Supplementary Fig. 5c). We hypothesized that template switching occurs during the formation of the nc sgRNAs. To test this possibility, we compared the insert locations and the sites of the likely origin of the inserts with the available experimental data on the SARS-CoV-2 transcriptome 22. Hotspots of template switching are characterized by polymerase “jumping” from one location on the genome to another, which yields shorter sgRNAs. As mentioned above, inserts tend to occur close to template switch hotspots, so for the inserts with a traceable origin, we additionally checked whether their sites of origin occurred close to the site of RdRp “jumping”. Although the information on the SARS-CoV-2 transcriptome is limited, among the 8 cases we found that two insert sites were located within one end of the junction, whereas their corresponding sites of origin were within 100 nucleotides of the other side of the same junction (Fig. 2a). To assess the significance of this finding, we performed two permutation tests (see Material and Methods), in one of which the real insertion positions were matched with start sites chosen randomly, whereas in the second one, both types of sites were selected at random. Both tests showed that the co-localization of the inserts with template switch junctions was significant (Fig. 2b,c).
Thus, high-confidence long inserts in the SARS-CoV-2 genome apparently originated either by local duplication or by template switch which, at least in some cases, seemed to be associated with nc sgRNA synthesis. Notably, the PRRA insert, the furin cleavage site that is one of characteristic features of SARS-CoV-2, resembles the long inserts analyzed here. Although this insert has a high GC-content compared to the genomic average of SARS-CoV-2, it falls within the GC-content range of the long inserts (Supplementary Fig. 1b). Furthermore, this insert is located within 20 nucleotides of a template switch hotspot at position 22,582 22. Although we were unable to identify a statistically significant match that would allow us to map the origin of this insert to a particular location within the SARS-CoV-2 genome, it appears likely that this insert also originated by template switch, with subsequent substitutions erasing the similarity to the origin sequence.
Insertions in the S protein produce putative antibody escape variants
As indicated above, insertions are non-uniformly distributed along the SARS-CoV-2 genome (Fig. 1с). In particular, among the 25 long inserts identified with high confidence, 7 were located in the S protein, suggesting that these inserts could persist due to their adaptive value to the virus. Three of the 7 inserts in S were observed in multiple genomes that formed compact clades in the phylogenetic tree, and ins214TDR in position 22,204 was strongly supported by raw sequencing data. In four more cases, the inserts were found in single genomes, but again, were strongly supported by raw data, and reached allele frequency close to one in the raw sequences, so these are highly unlikely to be artifacts (Supplementary Table 1).
All 7 long inserts in the S protein were located in the N-terminal domain (NTD), and four of these occurred in the same genome position, 22,004 (Fig. 3). Compared to the receptor binding domain, the NTD initially attracted much less attention. Subsequently, however, multiple substitutions associated with variants of concern and observed in immunocompromised individuals with extended COVID-19 disease were identified in the NTD 2,35,36. To evaluate potential functional effects of the inserts in the NTD, we mapped them onto the protein structure. All these inserts occurred on the protein surface (Fig. 3), and two, ins15ATLRI and ins246DSWG, were located in an epitope that is recognized by antibodies obtained from convalescent plasma of recent COVID-19 patients 37. Furthermore, ins246DSWG is located in the loop that is responsible for the interaction with the 4A8 antibody and potentially other antibodies (Fig. 3a). Thus, at least these two insertions might be associated with the escape of SARS-CoV-2 variants from immune antibodies. The presence of multiple insertions in the same site, 22,004, suggests an important role of portion of the NTD in SARS-CoV-2 infection, especially, given that multiple deletion variants have been reported in the same region, 21971–22005 9. These insertions and ins98KAE are located in the neighboring loops, and given that the central region of the NTD has been shown to be essential for the virus interaction with CD4 + cells 38, could be associated with the escape from the T-cell immunity. Furthermore, recent evidence suggests that this region contains an additional epitope for antibody binding 39. Because these insertions were detected only in recent samples, it appears that the respective variants have to be further monitored.