High-frequency variants in Russia
We analyzed 4,487 SARS-CoV-2 sequences with known sampling dates obtained in Russia between February 25, 2020 - March 28, 2021. 2,842 of these samples are deposited to GISAID7, while the remaining 1,645 (all dating to February-November 2020) will be made available through another repository. All 968 analyzed samples from December 2020 - March 2021 are available on GISAID. The vast majority of samples over this period came from several genomic surveillance programs which were not targeted towards particular variants, although representation of Russia’s regions varied with time.
Throughout the pandemic, the SARS-CoV-2 diversity in Russia has been predominated by the B.1.1 Pango lineage which is frequent in Europe, as well as lineages descendant from it8 (Fig. 1). Three B.1.1-derived lineages with the highest prevalence in Russia in the beginning of 2021 were B.1.1.7, which was firstly introduced in Russia at the end of 2020, as well as two other lineages, B.1.1.317 and B.1.1.397, that appeared in Russia in April and July, 2020, respectively. B.1.1.317 was first sampled in Vietnam on March 27, 20209; within Russia, it was first sampled on April 5, 2020 in Moscow, spreading across the country throughout 2020 (Fig. 2). B.1.1.397 was first sampled in the Krasnoyarsk Region of Russia on July 22, 2020. By summer 2020, both B.1.1.317 and B.1.1.397 were frequent throughout Russia (Fig. 2).
To find the non-reference amino acid variants that gained in frequency in Russia, we selected the positions at which the mean frequency of the non-reference variant in Russian samples exceeded 5% (for the spike) or 10% (for other proteins) in February-March 2021. We found 21 such positions in spike and 21 such positions in other proteins. Among these changes, two (RdRp:P323L and S:D614G) were fixed early in the global evolution of SARS-CoV-2; other two (N:R203K and N:G204R) are the lineage-defining mutations of B.1.1.
The frequency dynamics of the derived variants at the remaining 38 positions is shown in Fig. 3. These include the mutations characterizing the B.1.1.7 variant which has been increasing in frequency in Russia since January 2021 (Fig. 1), as well as some of the other globally spreading mutations of concern or interest, including the E484K mutation in spike. However, at many of these sites, the non-reference variants were rare outside Russia (Fig. 3). Most of these variants showed similar temporal dynamics in Moscow and St. Petersburg regions, as well as in the European and Asian parts of Russia (Fig. S1), indicating that their increase in frequency is not a result of sampling bias.
We aimed to identify the high-frequency variants carrying these mutations. Many of these sites were highly homoplasic, and overall we found the resulting phylogenies not to be robust. Instead, we defined the most frequent variants composed of these mutations, independent of the alleles at other sites (Fig. 4).
We considered the allele combinations that were most frequent in Russia in February-March 2021, and noticed that they belonged to several nested sets. The most frequent combination (99 out of the 461 samples) carried the N:A211V mutation which is characteristic of the B.1.1.317 Pango lineage; the second most frequent combination carried the S:D138Y and ORF8:V62L combination of mutations which are characteristic of the B.1.1.397 lineage; the third combination carried the set of characteristic mutations of B.1.1.7.
Still, there was no one-to-one correspondence between the frequent combinations of mutations and Pango lineages. For example, while the most frequent variant carrying S:M153T was that also including N:M234I, S:D138Y and ORF8:V62L (column 2 in Fig. 4), and classified as Pango lineage B.1.1.397, the variants carrying S:M153T alone, the S:M153T + N:M234I combination and the S:M153T + N:M234I + S:N679K combination were also frequent (columns 4, 9 and 5 in Fig. 4 respectively) but were classified by PANGOLIN as other lineages (B.1.1, B.1.1.141, B.1.1.28 and others). Similarly, while B.1.1.317 is defined by the N:A211V mutation, the frequency of the variant carrying this mutation alone was relatively low (column 12 in Fig. 4), while most samples carrying it also carried 8 additional high-frequency mutations, including four potentially important changes in spike (Q675R + D138Y + S477N + A845S; column 1 in Fig. 4). The frequencies of such “non-canonical” combinations of mutations increased throughout 2020–2021 (Fig. 5).
Finally, we observe three high-frequency combinations of mutations, including the S:E484K mutation of concern as well as other mutations of interest (notably S:Δ140–142, S:Δ136–144 and nsp6:Δ106–108, also referred to as ORF1a:Δ3675–3677; columns 7, 8 and 10 in Fig. 4). These combinations have recently got B.1.1.524, AT.1 and B.1.1.525 Pango designations (for columns 7, 8 and 10 in Fig. 4, respectively).
Frequency dynamics of the variants prevalent in Russia
To study potential effects of individual mutations composing a variant on the frequency dynamics of this variant, we fit the logistic growth model for the 10 most-frequent combinations of mutations and for N:A211V (which was the 12th most-frequent combination), and compared the dynamics of nested combinations with each other (Figs. 6–8).
The variant carrying just the N:A211V change (largely coincident with the B.1.1.317 Pango lineage) has increased in frequency since the start of the epidemic in Russia. However, since fall 2020, it was being displaced by the variant with 8 additional mutations, including four in spike: Q675R + D138Y + S477N + A845S. When the logistic growth model was fit to the N:A211V variant alone, it demonstrated modest growth (Fig. 6A); however, its combination with S:Q675R + D138Y + S477N + A845S demonstrated a much more rapid frequency increase, with the estimated daily growth rate of 1.93% (95% CI: 1.8%‒2.06%; Fig. 6B). While this led to a frequency increase of the N:A211V mutation independent of the background (Fig. 6C), this suggests that the frequency increase was more likely driven by the S:Q675R + D138Y + S477N + A845S combination than the N:A211V change defining the B.1.1.317 Pango lineage.
By contrast, the frequency of the S:M153T mutation grew independently of the presence of other mutations from our list (Fig. 7). While subsequent mutations may add to the estimated growth rates, these rates were comparable when the S:M153T mutation occurred alone or in combination with N:M234I, N:M234I + S:N679K, or N:M234I + S:D138Y, and all these combinations were still frequent many months after they had originated (Fig. 5, 7).
Finally, the five remaining variants which also reached high frequency in February-March carried unnested, although partially overlapping sets of mutations. These included B.1.1.7 (Fig. 8A); a variant carrying the S:P681H mutation of interest in the absence of other high-frequency mutations (Fig. 8B); as well as three novel variants carrying the following combinations of mutations: (i) nsp6:Δ106–108 + S:P9L + S:Δ140–142 (which recently got the B.1.1.524 Pango designation; Fig. 8C); (ii) S:P9L + S:Δ136–144 + S:E484K (which recently got the AT.1 Pango designation; Fig. 8D); and (iii) nsp6:Δ106–108 + S:Δ144 + S:E484K (which recently got the B.1.1.525 Pango designation; Fig. 8E). Variants B.1.1.524, AT.1 and B.1.1.525 were only observed in 13–14 samples each, but are of interest because this constitutes an appreciable fraction of samples obtained in February-March (3.0%, 2.8% and 2.6% respectively), and also because they are composed of known mutations of interest or concern. The daily growth rate estimated for these variants by the logistic growth model was in the range of 2.44–7.18% (Fig. 8C-E).
The continued spread of some of these variants between February-April 2021 was confirmed by community-based PCR testing. To obtain independent frequency estimates, we made use of a PCR system sensitive to the presence of nsp6:Δ106–108 and S:Δ69–70 deletions (see Methods) to detect the B.1.1.7, B.1.1.524 and B.1.1.525 variants. Specifically, S:Δ69–70+ nsp6:Δ106–108+ samples correspond to B.1.1.7, while S:Δ69–70− nsp6:Δ106–108+ samples correspond to either the B.1.1.524 or the B.1.1.525 variant (Fig. 4). While the frequency estimates were highly uncertain (Table 1), they indicate that B.1.1.7, and one or both of variants B.1.1.524 and B.1.1.525, were wide-spread in April (Fig. 9, Table 1). A considerable fraction (59.6%) of PCR samples from February and March were included in our main analysis, as their sequences were in GISAID. However, the frequency increase was also observed in the 136 PCR samples for which no sequencing data was available (Fig. S2), providing independent validation of the NGS results. Similarly, it was observed when the PCR tests only for St. Petersburg were analyzed (Fig. S3, S4), indicating that the prevalence of these variants increased at least in this city as opposed to being an artefact of changing sampling between regions.
Table 1
Frequencies of (B.1.1.524 or B.1.1.525) and B.1.1.7 estimated from PCR data. The point estimate and the 95% confidence intervals (Wilson score intervals) are shown.
|
2021-Feb
|
2021-Mar
|
2021-Apr
|
nsp6:Δ106–108 + S:Δ69-70- (B.1.1.524 or B.1.1.525)
|
10.5%
(2.3%-31.4%)
|
6.9%
(4.1%-11.2%)
|
15.2%
(7.6%-28.2%)
|
nsp6:Δ106–108 + S:Δ69–70+ (B.1.1.7)
|
10.5%
(2.9%-31.4%)
|
13.7%
(9.7%-19.1%)
|
21.7%
(12.3%-35.6%)
|
|
2021-Feb
|
2021-Mar
|
2021-Apr
|
nsp6:Δ106–108 + S:Δ69-70- (B.1.1.524 or B.1.1.525)
|
10.5% (2.3%-31.4%)
|
6.9% (4.1%-11.2%)
|
15.2% (7.6%-28.2%)
|
nsp6:Δ106–108 + S:Δ69–70+ (B.1.1.7)
|
10.5% (2.9%-31.4%)
|
13.7% (9.7%-19.1%)
|
21.7% (12.3%-35.6%)
|
Mutational composition of the high-frequency variants
In this section, we discuss the mutations that constitute the variants that were spreading in Russia before April 2021.
B.1.1.317. This lineage is defined by the presence of the N:A211V mutation. Changes at nucleocapsid position 211 experienced both persistent (according to the FEL model of HyPhy10) and episodic (according to the MEME model11) positive selection both in the global12 and in the Russian dataset (p = 0.0396 for the MEME model and p = 0.0268 for the FEL model, the likelihood-ratio test), as well as a rapid increase in frequency of non-reference variants in the global dataset12. While the global frequency of 211V has remained low (< 0.4%), in Russia, it has reached 26.9% in February-March 2021. According to immunoinformatic analysis, site N:211 is included in one of the four regions of the nucleocapsid protein with the highest affinity to multiple MHC-I alleles13. Nevertheless, the frequency of the variant carrying the N:A211V mutation alone has declined since October 2020 (Fig. 5), suggesting that it is unlikely to confer transmission advantage against the background of other variants that were frequent in early 2021 (Fig. 6).
A subclade within B.1.1.317 that was spreading rapidly during the studied period carried the (Q675R + D138Y + S477N + A845S) combination of changes in spike. Two of these mutations are of interest. S:D138Y, first described as one of the lineage-defining mutations of the P.1 lineage14, is a change in the N-terminal domain (NTD) of spike. Site 138 is adjacent to the NTD antigenic supersite, and together with other NTD mutations of P.1, S:D138Y was suggested to be the cause of disruption of binding with mAb15915 which is one of the most potent inhibitory antibodies16. S:S477N is positioned in the receptor-binding motif (RBM) of the S-protein near the antibody binding site (Fig. 10) and was reported to promote resistance to multiple antibodies and plasma from convalescent patients 17,18. Additionally, S477N is thought to increase ACE2 binding19. It is one of the lineage-defining mutations of the B.1.160 (20A.EU2) variant of interest prevalent in Europe in Autumn 202020, and the only one among them to occur in the S-protein; it also defines one of the two subclades of the B.1.526 variant of interest that were spreading in the USA in the beginning of 202121,22. S:Q675R is located at the central part of S1 (Fig. 10), and S:A845S, in S2; the significance of these two mutations is unknown.
B.1.1.397+. S:M153T is a characteristic mutation of B.1.1.397, which also comprises several other mutations. However, the frequency of S:M153T in Russia also increased in the absence of these other mutations (Fig. 7). This increase has been ongoing since late spring 2020 (Fig. 5), and has been noticed in Russia23,24. S:M153T, however, has remained rare outside Russia. S:153 is the first position of the 6-amino acid insertion specific to SARS-CoV-2 and some closely related bat betacoronaviruses that was absent in SARS-CoV25. While the effect of S:M153T on antigenic properties is unknown, S:153 is a part of the N3 loop of the NTD. This loop is a part of the NTD antigenic supersite (Fig. 10; 26), and nearby residues, including S:152, were recently shown to bind highly neutralizing 4A8 antibody from a convalescent patient27. Besides Russia, S:M153T was prevalent in Kazakhstan28 which has a long border with Russia, suggesting common ancestry of this change in these countries.
The most frequent subclade within B.1.1.397+, and the one with some evidence for an independent increase in frequency (Fig. 7), was defined by the presence of two additional mutations of interest: S:D138Y discussed above in the context of the B.1.1.317 lineage (but acquired in the B.1.1.397 lineage independently); and N:M234I. Position N:234 is a part of a disordered linker domain of the nucleocapsid protein29. Outside B.1.1.397, the N:M234I change has also occurred independently in several lineages that attracted attention. It is among the lineage-defining mutations of the B.1.160 (20A.EU2) lineage as well as the B.1.526 lineage that increased in frequency in the USA at rates comparable to those of B.1.1.721. It was also one of the changes defining a lineage (preliminarily identified as B.1.x) which also contained S:N501Y and S:P681H and seemed to spread rapidly in the USA30. Independent emergence of N:M234I in several variants of interest may reflect its impact on at least one of multiple functions of the N protein31.
Other notable variants. The five other combinations of mutations observed at high frequencies in Russia in February-March 2021 were B.1.1.7, the best-known variant with increased transmissibility before the emergence of B.1.617.2 lineage; the variant carrying the S:P681H mutation alone; and three novel variants.
S:P681H is one of the nine spike changes that characterized B.1.1.7 lineage3; and S:P681R is one of lineage-defining mutations of B.1.617.2 lineage; however, changes in this site are absent from the two other lineages of concern, B.1.351 and P.1, indicating that it is not essential for increased transmissibility. The 681 position is adjacent to the furin cleavage site; this site is absent in non-human CoV, and is assumed to have contributed to pathogenicity in humans32. Changes at this position experienced both persistent and episodic positive selection12. P681H appeared to increase in frequency globally33, although it was hard to disentangle this increase from that of the other changes constituting B.1.1.7 lineage that was rapidly spreading. We find that the frequency of this mutation in Russia in the absence of other B.1.1.7 mutations did not increase (Fig. 8), indicating that it does not increase transmissibility by itself.
The three remaining high-frequency variants with evidence for rapid frequency increase in early 2021 carried combinations of the following high-frequency mutations: S:P9L, S:Δ140–142 (or S:Δ136–144), S:E484K, and nsp6:Δ106–108. The sets of mutations in these variants are in conflict (i.e., not nested within each other; Fig. 4), indicating that at least some of these mutations emerged in them independently. These mutations are of interest or concern. Specifically, S:E484K (present in AT.1 and B.1.1.525) is involved in several famous variants including the B.1.3514, P.1 5,34 and P.25,34 lineages, and has been shown by several groups to cause escape from neutralizing antibodies35–37. nsp6:Δ106–108 (also referred to as ORF1a:Δ3675–3677, and present in B.1.1.524 and B.1.1.525) is a part of three variants of concern (B.1.1.7, B.1.351, P.1). S:Δ140–142 (present in B.1.1.524), S:Δ144 (present in B.1.1.525) and S:Δ136–144 (present in AT.1) are distinct deletions at a recurrent deletion region of the spike glycoprotein which confer resistance to neutralizing antibodies38.