The mutational landscape of SARS-CoV-2 provides new insight into viral evolution and fitness

doi:10.21203/rs.3.rs-4578618/v1


This work is licensed under a CC BY 4.0 License

Version 1

Although vaccines and treatments have strengthened our ability to combat the COVID-19 pandemic, new variants of SARS-CoV-2 continue to emerge in human populations. Because the evolution of SARS-CoV-2 is driven by mutation, a better understanding of its mutation rate and spectrum could improve our ability to forecast the trajectory of the pandemic. Here, we used circular RNA consensus sequencing (CirSeq) to determine the mutation rate of six SARS-CoV-2 variants and performed a short-term evolution experiment to determine the impact of these mutations on viral fitness. Our analyses indicate that the SARS-CoV-2 genome mutates at a rate of ~3×10^−6 per base per round of infection and that the spectrum is dominated by C→U transitions. Moreover, we discovered that the mutation rate is significantly reduced in regions that form base-pairing interactions and that mutations that affect these secondary structures are especially harmful to viral fitness. These observations provide new insight into the parameters that guide viral evolution and highlight fundamental weaknesses of the virus that may be exploited for therapeutic purposes.

Biological sciences/Genetics/Mutation/Genomic instability

Biological sciences/Evolution/Evolutionary genetics

Biological sciences/Immunology/Infectious diseases/Viral infection

COVID-19 is a highly contagious respiratory disease that is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Although multiple vaccines and treatments have been developed to combat the COVID-19 pandemic, new variants of the SARS-CoV-2 virus continue to emerge in human populations. These variants renew the threat of COVID-19 to public health, extend the socio-economic impact of the pandemic, and highlight a growing need to understand the evolution of the SARS-CoV-2 virus. Two parameters that are key to viral evolution are the mutation rate and mutation spectrum. Together, these parameters dictate how many variants will emerge in the future and what type of mutations they are likely to carry. Their impact on the overall fitness of the virus will then decide whether these mutations are selected for or against. To understand the parameters that guide the evolution of the SARS-CoV-2 virus, we cultured 6 viral strains and used CirSeq to determine their mutation rate and spectrum. We then used a short-term evolution experiment to determine the impact of 3,603 of the most common mutations on SARS-CoV-2 fitness. Surprisingly, these experiments showed that in addition to nonsense and non-synonymous mutations, synonymous mutations have a substantial impact on viral fitness as well, especially if they disrupt secondary structures present in the viral genome. Remarkably, these structures also display reduced mutation rates, suggesting that the bases that are most essential to viral fitness are protected from mutation. Together, these observations highlight a strong evolutionary link between the structure of the SARS-CoV-2 genome, its mutation rate and viral fitness, and suggest that this relationship may be exploited to combat both existing and future variants of the SARS-CoV-2 virus.

Overall strategy

The mutation rates and spectra of RNA viruses (including SARS-CoV-2) are notoriously difficult to measure. For example, even though more than 10 million SARS-CoV-2 genomes have been documented across the globe (GISAID, https://gisaid.org/), this dataset only contains mutations that were successful enough to become major variants in patients. Accordingly, most studies of within-host genetic diversity are limited to variants with an allele frequency that exceeds 0.5%, while mutations that are detrimental to the virus are missed(1). In contrast, the negative correlation between mutation rate and genome size observed in viruses suggests that the spontaneous mutation frequency of the >30 kb SARS-CoV-2 genome is <1×10^−5 per base(2), which is significantly lower than the detection threshold of traditional sequencing methods(3). As a result, RNA-sequencing methods with improved fidelity are required to determine the mutation rate of SARS-CoV-2. With these considerations in mind, we used an ultra-accurate rolling-circle RNA consensus sequencing method termed CirSeq(4) to determine the mutation rate and spectrum of six SARS-CoV-2 variants. This method was previously used to determine the mutational landscape of other RNA viruses, including the polio virus(5), the Ebola virus(6), the Dengue virus(7) and the Zika virus(8). The improved accuracy of CirSeq relies on the circularization of short RNA fragments to synthesize long cDNA molecules that carry tandem repeats of the original RNA template. These tandem repeats can then be analyzed to generate a consensus sequence, which eliminates sequencing and reverse-transcription errors from the final sequencing results (fig. S1).
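The consensus step can be illustrated with a minimal sketch. It assumes the read has already been split into tandem copies of known length (the real CirSeq pipeline must also detect repeat boundaries and weigh base qualities); the function name, the `min_copies` parameter and the toy read are hypothetical. A true mutation is present in every copy and survives the vote, whereas an error that arose during reverse transcription or sequencing appears in a single copy and is removed.

```python
from collections import Counter

def consensus_from_repeats(read: str, repeat_len: int, min_copies: int = 3) -> str:
    """Majority-vote consensus across tandem copies of one circularized fragment.

    Simplifying assumption: copies are contiguous and exactly repeat_len long.
    """
    copies = [read[i:i + repeat_len]
              for i in range(0, len(read) - repeat_len + 1, repeat_len)]
    if len(copies) < min_copies:
        raise ValueError("not enough tandem copies for a consensus")
    consensus = []
    for column in zip(*copies):
        base, count = Counter(column).most_common(1)[0]
        # Keep the base only if a strict majority of copies agree; otherwise mask it.
        consensus.append(base if count * 2 > len(copies) else "N")
    return "".join(consensus)

# Toy read: three copies of a 4-base fragment; the second copy carries one error.
print(consensus_from_repeats("ACGUACGAACGU", repeat_len=4))
```

In this toy case the single discordant base in the second copy is outvoted, so the consensus matches the other two copies.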

To explore the mutational landscape of SARS-CoV-2, we cultured the virus in VeroE6 cells, a preferred cell line for COVID-19 research because of its susceptibility to infection, efficient viral replication and permissiveness to mutations(9). Accordingly, VeroE6 cells can support a higher degree of viral genetic diversity than other cell lines, which is useful for studies that examine viral evolution during prolonged culture conditions. In total, we cultured 6 major strains of the SARS-CoV-2 virus, including the USA-WA1/2020, Alpha and Delta strains (corresponding to clades 19B, 20I and 21J respectively), and the Beta, Gamma and Omicron strains. Although each strain was cultured in duplicate, the majority of our experiments were performed on the USA-WA1/2020, Alpha and Delta strains, which we cultured over 7 serial passages, while the Beta, Gamma and Omicron strains were profiled for a single passage (Table 1). Finally, because the VeroE6 cells were derived from the kidney of an African green monkey, we wanted to make sure that our measurements were not skewed by this unique biological environment. Thus, we also cultured the Delta strain for 1 passage in Calu-3 cells (a human lung adenocarcinoma cell line) and in primary human nasal epithelial cells that were grown at an air-liquid interface (Table 1). After each passage, we monitored the sequence of the SARS-CoV-2 genome by CirSeq to take a snapshot of its mutational landscape. Across all strains and conditions, we sequenced over 200 billion bases and identified more than three million mutations. Finally, we assigned the most common mutations a fitness value to determine if they are selected for or against by the SARS-CoV-2 virus and mapped these mutations onto the genome to determine the biological basis for selection.

The mutation rate and spectrum of the SARS-CoV-2 genome

After we profiled the mutational landscape of each strain, we used lethal and highly detrimental mutations to estimate their mutation rate. Because these mutations cannot be carried over between passages, they must be produced anew in each generation, so that their frequency is equal to the mutation rate(5). We identified lethal mutations with two strategies. First, we considered mutations to be lethal if they introduced a premature stop codon (PTC) in the open reading frame of the viral RNA-dependent RNA polymerase (RdRp), a protein that is essential for viral replication(10). Second, we screened >6 million SARS-CoV-2 genomes that were previously aligned by UShER(11, 12) and Ensembl(13) for mutations that were absent from these databases. Each of these genomes represents the sequence of the most common variant in a patient, which precludes mutations that are either lethal or highly detrimental for viral fitness. Consistent with this idea, we found that mutations that were absent in both of these databases were significantly depleted from our dataset (fig. S2). Using the mutations identified by these two strategies we then calculated the overall mutation rate of the SARS-CoV-2 genome and found that ~3×10^−6 mutations arise per nucleotide per viral passage. This rate was relatively consistent across variants and replicates (Fig. 2a-b) and showed that the mutation rate of SARS-CoV-2 is ~10-fold lower than that of the poliovirus(5) and ~5-fold lower than that of the Dengue virus(7), two other RNA-based viruses previously examined by CirSeq. The improved fidelity of the SARS-CoV-2 virus is most likely due to the proofreading capacity of its RdRp(10, 14), which is absent in the polio and Dengue virus. In this context, it is important to note that G→A and U→C mutations displayed the largest reduction in mutation rate compared to polio (39-fold and 37-fold respectively). Intriguingly, these substitutions also increase the most when the proofreading activity of eukaryotic RNA polymerase II is compromised(15–17), suggesting the existence of a universal set of rules that govern the proofreading capabilities of RNA polymerases in eukaryotes and viruses.
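The logic of estimating the mutation rate from lethal mutations can be sketched as follows. This is an illustration only: it assumes a simple pooled estimator (total mutant reads divided by total coverage at lethal sites), which may differ from the exact procedure used in our pipeline, and the function name and read counts are hypothetical.

```python
def mutation_rate_from_lethals(lethal_sites):
    """Pooled frequency of lethal mutations, used as a proxy for the mutation rate.

    lethal_sites: iterable of (mutant_read_count, coverage) pairs, one per
    lethal substitution. Lethal mutations cannot be inherited between passages,
    so every observation is assumed to be a newly arisen mutation.
    """
    sites = list(lethal_sites)
    mutant_reads = sum(m for m, _ in sites)
    coverage = sum(c for _, c in sites)
    if coverage == 0:
        raise ValueError("no coverage at lethal sites")
    return mutant_reads / coverage

# Hypothetical counts for three premature stop codons in the RdRp reading frame.
example_sites = [(3, 1_000_000), (2, 800_000), (4, 1_200_000)]
rate = mutation_rate_from_lethals(example_sites)
print(f"{rate:.1e} mutations per nucleotide per passage")
```

Pooling across sites, rather than averaging per-site frequencies, keeps poorly covered sites from dominating the estimate.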

Finally, we found that the mutation rate of every strain was dominated by C→U substitutions, which were ~ 5 times more common than any other base substitution (Fig. 2c). The rate with which C→U substitutions arose depended to a significant degree upon the upstream (i.e. 5′ adjacent) and downstream (i.e. 3′ adjacent) nucleotides. For example, we found that C→U mutations occur most commonly in a 5′-UCG-3′ context (Fig. 3a-d), consistent with analyses based on SARS-CoV-2 phylogeny(18). Because our measurements are independent of positive or negative selection though (which play a key role in published SARS-CoV-2 genome sequences), our analyses provide an unfiltered view of the impact of genetic context on viral mutagenesis. When taken together, these observations demonstrate that C→U substitutions add the greatest amount of genetic variation to the SARS-CoV-2 genome and provide the largest substrate for evolution to act upon, a conclusion that is also supported by more indirect observations(19).
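A context analysis of this kind can be sketched in a few lines. The sketch assumes 0-based genome positions, a dictionary of C→U read counts per position and a dictionary of sequencing depth per position; all names and the toy data are hypothetical.

```python
from collections import defaultdict

def c_to_u_rate_by_context(genome, c_to_u_reads, coverage):
    """C->U frequency per trinucleotide context (5' neighbour, C, 3' neighbour).

    genome: RNA sequence; c_to_u_reads / coverage: dicts keyed by 0-based position.
    """
    mutated = defaultdict(int)
    covered = defaultdict(int)
    for i in range(1, len(genome) - 1):
        if genome[i] != "C" or i not in coverage:
            continue
        context = genome[i - 1:i + 2]
        mutated[context] += c_to_u_reads.get(i, 0)
        covered[context] += coverage[i]
    return {ctx: mutated[ctx] / covered[ctx] for ctx in covered if covered[ctx] > 0}

# Toy genome with cytidines in a UCG context (positions 2 and 10) and an ACA context (position 6).
toy_genome = "AUCGAACAUUCGG"
toy_reads = {2: 6, 6: 1, 10: 4}
toy_coverage = {2: 1000, 6: 1000, 10: 1000}
print(c_to_u_rate_by_context(toy_genome, toy_reads, toy_coverage))
```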

Selection for nucleotide composition

It is likely that the mutation rate and spectrum of the SARS-CoV-2 genome affect the evolution of the virus in various ways. One of the most fundamental attributes of a genome is its nucleotide composition, which depends on the balance between the mutation spectrum and the intensity of selection for each of the four nucleotides. Using the mutational profile derived here, we estimate the equilibrium frequencies for all four nucleotides as: U = 0.4, A = 0.31, G = 0.21 and C = 0.08. This analysis translates into an equilibrium GC content of 29%, which is substantially higher than the 17% previously reported (19). However, this previous estimate is based on indirect estimations of mutation rates at 4-fold degenerate sites across lineages sequenced in GISAID, which might be impacted by selection. Regardless, both estimates are significantly lower than the observed 38% GC content of the SARS-CoV-2 genome, indicating selection for elevated GC content. Indeed, we found that the observed frequency of cytidines at 4-fold degenerate sites is ~ 1.75-times higher than predicted at mutational equilibrium (Table 2), suggesting that there is substantial selective pressure against excessive cytidine depletion, even at sites where these mutations would not affect amino acid composition. To gain more insight into the molecular mechanisms that suppress cytidine depletion at 4-fold degenerate sites, we examined the fitness landscape of C→U mutations in the SARS-CoV-2 genome.
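The idea of a mutational equilibrium can be made concrete with a small sketch. Given a rate for each of the 12 base substitutions, the equilibrium composition is the stationary distribution of the corresponding four-state process. The relative rates below are hypothetical placeholders chosen only so that C→U dominates; they are not the measured spectrum, and the function name and iteration count are likewise assumptions.

```python
BASES = "ACGU"

def equilibrium_frequencies(rates, iterations=5000):
    """Stationary base composition for substitution rates given as {(from, to): rate}.

    Rates are rescaled internally (the equilibrium does not depend on their
    absolute scale) and the composition is iterated until it settles.
    """
    out = {b: sum(rates.get((b, t), 0.0) for t in BASES if t != b) for b in BASES}
    scale = 2.0 * max(out.values())  # each base keeps at least half its mass per step
    freqs = {b: 0.25 for b in BASES}
    for _ in range(iterations):
        new = {}
        for t in BASES:
            stay = freqs[t] * (1.0 - out[t] / scale)
            gain = sum(freqs[s] * rates.get((s, t), 0.0) / scale
                       for s in BASES if s != t)
            new[t] = stay + gain
        freqs = new
    return freqs

# Hypothetical relative rates (arbitrary units), with C->U as the dominant substitution.
example_rates = {(s, t): 0.2 for s in BASES for t in BASES if s != t}
example_rates.update({("C", "U"): 5.0, ("G", "A"): 1.0, ("G", "U"): 1.0,
                      ("U", "C"): 0.8, ("A", "G"): 0.8})
eq = equilibrium_frequencies(example_rates)
print({b: round(f, 3) for b, f in eq.items()})

# Equilibrium GC content from the frequencies reported in the text.
reported = {"U": 0.40, "A": 0.31, "G": 0.21, "C": 0.08}
print(round(reported["G"] + reported["C"], 2))
```

With any spectrum in which C→U dominates, cytidine becomes the rarest base at equilibrium, which is the qualitative pattern described above.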

Fitness landscape of SARS-CoV-2

To determine the consequences of C→U mutations for viral evolution, we decided to characterize their impact on the fitness of the SARS-CoV-2 virus using a strategy previously employed for the polio virus(5). Briefly, the fitness of a mutation is related to its change in frequency between consecutive passages as described in the following equation:

$$f_{n} = f_{n-1} \times w + \mu_{n-1} \qquad \text{(Eq. 1)}$$

with f_n and f_n-1 the observed frequency of the mutation at passages n and n-1, w the relative fitness and µ the mutation rate. Because these analyses require large amounts of data, we selected one replicate of the SARS-CoV-2 Delta variant and sequenced 155 billion bases of its genome over the course of 7 passages, reaching 1.7 million times coverage. We identified 64,967 unique mutations across these passages, with each mutation being observed on average 42 times for a total of 2.7 million mutation observations. Importantly, the SARS-CoV-2 genome is ∼30,000 bases in length and each base can be mutated into 3 different nucleotides so that a total of ∼90K base substitutions are theoretically possible. Accordingly, we identified 66% of all the mutations that could possibly arise in the SARS-CoV-2 genome. Based on our selection criteria (see methods) this sequencing effort yielded enough data to calculate fitness values for 3,603 C→U mutations (Fig. 4a, Supplementary Table 1). We performed three tests to determine the veracity of these fitness values. First, we compared the fitness values for different types of mutations and found that (as expected) synonymous mutations were less deleterious than non-synonymous mutations (0.90 vs 0.82, P = 8.5×10^− 6), and non-synonymous mutations were less deleterious than non-sense mutations (0.82 vs 0.73, P = 0.003). In a second test, we examined the fitness values of mutations that were predicted to be either lethal or highly detrimental due to their absence in the UShER and Ensembl databases. Consistent with the idea that our fitness values hold valuable information, we found that these mutations displayed significantly lower fitness values compared to all the other mutations we detected (mean fitness: 0.48 vs 0.85, P = 5.7×10^− 4, Welch Two Sample t-test). 
In total, 47% of the mutations absent in these databases, but detected in our experiment, were assigned a fitness value of 0, a value that is only assigned to mutations that are expected to be either lethal or highly detrimental. In contrast, only 13% of the remaining C→U mutations received a fitness value of 0 (P = 4.5×10^−7, Chi-square test). In a third test, we compared our fitness estimates to studies that used changes in mutation frequency throughout the SARS-CoV-2 phylogeny to infer fitness values(20, 21). When we compared our values to the only other study to provide fitness estimates for both synonymous and nonsynonymous mutations(21), we found a highly significant correlation between our data and this independent dataset (r = 0.42, P < 2×10^−16). Together, these comparisons strongly support the idea that our algorithms provide accurate information about the impact of mutations on viral fitness.
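Eq. 1 can be rearranged to w = (f_n − µ) / f_n-1, which gives one fitness estimate per pair of consecutive passages. The sketch below averages these estimates and floors them at 0. It is a simplified illustration under a constant mutation rate, not the full estimation procedure described in the methods, and the function names and frequencies are hypothetical.

```python
def simulate_frequencies(f0, w, mu, passages):
    """Frequencies over consecutive passages following Eq. 1: f_n = f_(n-1) * w + mu."""
    freqs = [f0]
    for _ in range(passages - 1):
        freqs.append(freqs[-1] * w + mu)
    return freqs

def relative_fitness(freqs, mu):
    """Mean of per-passage estimates w = (f_n - mu) / f_(n-1), floored at 0."""
    estimates = [max((cur - mu) / prev, 0.0)
                 for prev, cur in zip(freqs, freqs[1:]) if prev > 0]
    if not estimates:
        raise ValueError("need at least two passages with non-zero frequency")
    return sum(estimates) / len(estimates)

mu = 3e-6  # mutation rate per nucleotide per passage, as estimated in the text
deleterious = simulate_frequencies(1e-5, 0.8, mu, passages=7)
print(round(relative_fitness(deleterious, mu), 3))

# A lethal mutation is regenerated each passage at frequency mu and never inherited.
print(relative_fitness([mu, mu, mu], mu))
```

A mutation whose frequency never rises above the mutation rate therefore receives a fitness of 0, which corresponds to the lethal or highly detrimental class discussed above.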

Paired bases contribute disproportionally to SARS-CoV-2 fitness

After establishing that our fitness values hold accurate information, we wanted to investigate why synonymous C→U mutations are selected against in the SARS-CoV-2 genome, even if they occur at 4-fold degenerate sites. Potentially, this phenomenon could be explained by stronger, more frequent purifying selection against synonymous mutations in SARS-CoV-2 compared to other viruses, such as the polio virus(5). It was recently shown that the SARS-CoV-2 genome adopts a highly specific secondary structure(22) and that bases that pair with each other to form these structures tend to display lower nucleotide diversity(23). Interestingly, we observed a similar specificity for secondary structures in our CirSeq dataset. The enzyme used to fragment viral RNA (RNAse III) prefers to cleave RNA at specific double-stranded structures, causing strong peaks and valleys in genome coverage that reflect the secondary structure of the SARS-CoV-2 genome. We found that these coverage peaks are identical between all the variants we tested, indicating that the secondary structure of the genome is highly conserved across the SARS-CoV-2 phylogeny (fig. S3). Based on these observations, we hypothesized that the need to preserve this secondary structure could be a significant factor driving purifying selection against synonymous mutations. To test this hypothesis, we split synonymous C→U mutations into two groups: those that form base-pair interactions (henceforth referred to as “paired” sites) and those that do not (henceforth referred to as “unpaired” sites)(22).

Consistent with the idea that there is strong purifying selection against synonymous mutations that affect secondary structures in the SARS-CoV-2 genome, we found that the average fitness value of synonymous C→U mutations was lower at paired sites compared to unpaired sites (0.76 vs 1.03, P < 2×10^−16, t-test, Fig. 4b-c). We observed a similar pattern for nonsynonymous mutations (average fitness: 0.65 vs 0.92 for paired and unpaired sites respectively, P < 2×10^−16, t-test, Fig. 4d-e). To support this idea further, we re-examined our fitness values with the help of an independent assessment of secondary structures based on SHAPE scores(24) and found a significant positive correlation between the SHAPE reactivity score and our fitness estimates for synonymous C→U mutations (r = 0.24, P < 2×10^−16). Taken together, these results strongly suggest that mutations that disrupt base-pairing interactions are more likely to be deleterious to SARS-CoV-2 fitness than those that do not.
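The paired versus unpaired comparison amounts to a two-sample test on fitness values. As a minimal sketch, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed with the standard library alone; the fitness values below are hypothetical, and a p-value would additionally require the t distribution (for example via `scipy.stats.ttest_ind` with `equal_var=False`).

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical fitness values for synonymous C->U mutations.
paired_fitness = [0.70, 0.80, 0.75, 0.80, 0.72]
unpaired_fitness = [1.00, 1.05, 0.98, 1.10, 1.02]
t_stat, dof = welch_t(paired_fitness, unpaired_fitness)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A negative statistic here indicates lower mean fitness at paired sites; Welch's form is used because the two groups need not share a variance.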

Because our fitness estimates are limited to C→U mutations, we used the fitness estimates previously published by Bloom and Neher(21) to investigate if purifying selection for synonymous mutations at paired sites was present for all types of base-substitutions. Restricting our analysis to base-substitutions with enough observations in both paired and unpaired categories (see methods), we found that synonymous mutations are significantly more deleterious at paired vs unpaired sites for U→C, G→U, G→C, C→U, C→A and A→U base substitutions (P ≪ 0.001 for all, Welch two-sample t-test, fig. S4). U→A, U→G, C→G and A→C substitutions did not yield enough observations to calculate fitness values, while two types of base-substitutions (G→A and A→G) showed no significant difference. Interestingly though, it was previously shown that A:C and G:U base pairs (which would arise from G→A and A→G mutations respectively) allow wobble base pairing in RNA molecules(25). Therefore, it is possible that these mutations do not significantly alter the secondary structure of the SARS-CoV-2 genome, even when they occur at paired sites, allowing them to escape purifying selection.

Paired bases display a reduced mutation rate

Our data suggests that the secondary structure of the SARS-CoV-2 genome is critical for viral fitness and that SARS-CoV-2 conserves these structures by strong purifying selection. However, the pace of evolution is also controlled by the mutation rate. Accordingly, we wanted to test the impact of the secondary structure on the mutation rate. To do so, we compared the rate of mutation between paired and unpaired bases and found that C→U mutations are ~3 times more frequent at unpaired bases compared to paired bases in all strains (P < 0.01, Fig. 4f, fig. S5). Other base substitutions are not increased at unpaired bases (Fig. 4f), indicating that the mechanism responsible for this observation is highly specific. A similar discrepancy is seen at mutational hot spots and cold spots, which are defined by locations where the mutation frequency either increases or decreases 10-fold. In hot spots, only 22.4% of nucleotides are predicted to be paired (vs 47.3% overall, P = 5×10^−15, chi-square test), while they make up 84.2% of bases in cold spots (P < 2×10^−16, chi-square test, Fig. 4g). Interestingly, unpaired cytidines are 140 times more prone to spontaneous deamination into uracil than paired bases(26), suggesting that a large fraction of C→U mutations could be caused by spontaneous cytidine deamination due to chemical damage. This hypothesis is supported by the observation that C→U mutations are also elevated at CpG sites that are expected to be methylated (methylated cytosines are 2 to 4 times more likely to be deaminated than unmethylated cytosines (27–29), Fig. 4h). Another possibility is that the reduced rate of cytidine deamination at paired sites reflects the preference of APOBEC proteins for ssRNA. For example, APOBEC3A has a strong preference for unpaired cytidines that are flanked by a 5′ uracil and a 3′ guanosine(30–32), the exact conditions that show the highest C→U mutation rate in our dataset (Fig. 3c). 
However, regardless of mechanism, these observations suggest that the secondary structure of the SARS-CoV-2 genome is not only preserved by strong purifying selection, but also by local changes in the mutation rate that spare paired bases. Together, these observations suggest a synergistic relationship between the secondary structure of the SARS-CoV-2 genome and its mutation rate, which reinforce each other to promote viral fitness.
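The hot spot and cold spot comparison can be sketched as a simple classification followed by a tally of paired bases. The sketch assumes that the 10-fold change is measured against a single baseline frequency; the baseline, the function names and the site list are hypothetical.

```python
def classify_site(freq, baseline, fold=10.0):
    """Label a site 'hot', 'cold' or 'neutral' relative to a baseline mutation frequency."""
    if freq >= baseline * fold:
        return "hot"
    if freq <= baseline / fold:
        return "cold"
    return "neutral"

def paired_fraction(sites, label, baseline):
    """Fraction of base-paired positions among sites carrying the given label.

    sites: list of (mutation_frequency, is_paired) tuples.
    """
    selected = [paired for freq, paired in sites
                if classify_site(freq, baseline) == label]
    return sum(selected) / len(selected) if selected else float("nan")

baseline = 3e-6  # assumed baseline: the genome-wide rate estimated in the text
toy_sites = [(4e-5, False), (3.5e-5, False), (5e-5, True),   # hot spots
             (2e-7, True), (1e-7, True), (2.5e-7, False),    # cold spots
             (3e-6, True)]                                   # neutral
print(paired_fraction(toy_sites, "hot", baseline))
print(paired_fraction(toy_sites, "cold", baseline))
```

The resulting fractions could then be compared with the genome-wide paired fraction in a chi-square test, as done for Fig. 4g.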

Fitness values, viral evolution and potential weaknesses of SARS-CoV-2

Because fitness values reflect the forces of natural selection, we wondered whether they could predict the evolution of the SARS-CoV-2 virus. Since the emergence of the Delta variant we sequenced for our fitness analysis, multiple variants have evolved that swept the globe. Each of these strains contains defining mutations that were positively selected for during the evolution of SARS-CoV-2 in human populations (Fig. 5a-b). Interestingly, some of these mutations were also detected in our short-term evolution experiment. When we examined the fitness values of the mutations that define strain 23H (the most advanced strain thus far), we found that these mutations displayed significantly higher fitness values compared to all other mutations detected in our evolution experiment (1.11 vs 0.84, P = 0.045, t-test). Thus, the fitness landscape obtained from our dataset could help predict the mutations that arise in future variants. Accordingly, mutations with high fitness values that have not been observed in known variants thus far could be of particular interest to researchers trying to predict the evolutionary trajectory of SARS-CoV-2(33, 34).

Conversely, we also identified an array of mutations with fitness values below 0.5, suggesting that these mutations are immediate targets for purifying selection. Because these mutations tend to alter critical groups of amino acids and disturb pivotal secondary structures they tend to cluster together when mapped onto the proteome. In Fig. 5c-h, we mapped the majority of these clusters onto the spike protein and the replisome (see also Supplementary-Video 1–4). Because these clusters highlight immutable components of the SARS-CoV-2 virus, they could be valuable targets for vaccines and treatments. Frequently, viruses develop resistance to vaccines and treatments by mutation, leading to variants that lack the targets that treatments or vaccines were developed against; however, because mutation of these essential components significantly lowers viral fitness, targeting these clusters could limit the number of escapees that emerge.

Single-stranded RNA viruses are often considered to be fast evolving organisms, with some of the highest known mutation rates. Indeed, a recent study estimated the mutation rate of SARS-CoV-2 to be as high as 1×10^−4 per nucleotide per round of infection(35). Our results indicate, though, that the mutation rate of SARS-CoV-2 is substantially lower than that of other ssRNA viruses, allowing ~3×10^−6 base-substitutions per nucleotide per viral passage. Despite this relatively low mutation rate, our measurements are in line with expectations based on genome size(2) and indirect estimates relying on alternative methods(36). An important contributor to the enhanced fidelity of the SARS-CoV-2 replisome is its proofreading domain, which reduces the error rate of multiple base substitutions(10, 14). In addition, we found that the complex secondary structure adopted by the viral genome (which results in large fractions of the genome engaging in base-pairing interactions with distant nucleotides) provides substantial protection against mutation. Regions of the genome engaged in base-pairing activity exhibit a 3-fold lower C→U mutation rate than unpaired regions. Therefore, maintaining the secondary structure of the genome is not only important to ensure viral fitness, it also minimizes the mutation rate. Potentially, the need to package a relatively large genome into a small viral capsid, or the need for more structural motifs that help regulate a larger, more complex genome, could drive the selection against disruptive mutations.

The importance placed on the conservation of secondary structures means that purifying selection is also strong on synonymous mutations in the SARS-CoV-2 genome. Synonymous mutations are already known to impact the fitness of other viruses(37, 38) both because of their role in destabilizing RNA secondary structures(37, 39, 40) and the impact of codon usage on translation efficiency(41), although the latter is still debated(42). However, the percentage of synonymous C→U mutations that are highly deleterious for SARS-CoV-2 is higher compared to synonymous mutations in other viruses(37). It has been reported, though, that the fraction of lethal mutations among synonymous mutations increases with genome size. For example, 3% of synonymous mutations are lethal for the 4.2 kb Qβ genome(37, 43), 10% for the 7.5 kb polio virus(5), 9% for the 9.5 kb TEV genome(43, 44) and 13% for the 11 kb VSV genome(37). Our studies now indicate that this fraction is 13.4% for the 30 kb SARS-CoV-2 genome. Note though, that in contrast to these other studies, our measurements exclusively represent the percentage of synonymous C→U mutations that are detrimental to viral fitness. Because C→U mutations dominate the mutation spectrum of the SARS-CoV-2 genome, it is likely that many non-essential cytidines have been replaced by uracil over the course of evolution of the virus. The observation that cytidines are markedly diminished at 4-fold degenerate sites provides strong evidence for this idea (Table 2). It is reasonable to assume then, that the cytidines that remain in the genome are somehow important for viral fitness, skewing the fitness impact of C→U mutations towards 0. In addition, C→U mutations could have an outsized impact on the stability of secondary structures because C:G base pairs are held together by 3 hydrogen bonds, while A:U base pairs are held together by 2. Thus, loss of a C:G base pair by mutation could have an outsized destabilizing effect on secondary structures and viral fitness.

One limitation of our fitness analysis is that the low mutation rate of the SARS-CoV-2 virus may have skewed our fitness values towards the two extremes of lethal and advantageous. Because the mutation rate is relatively low, it is possible for mutations to be lost from passage to passage because they were not present in the random subset of viral particles that infected the next culture, thereby mimicking the effect of lethal and highly detrimental mutations on viral fitness. If so, this phenomenon would increase the number of mutations with a fitness value of 0. However, we actively minimized this phenomenon by only calculating fitness values for C→U mutations that exceed the frequency at which sampling issues are likely to occur. In doing so though, we skewed our analysis towards mutations that are more common and thus more likely to be beneficial for viral fitness, which increased the percentage of mutations with a fitness value above 1. It is possible therefore, that our fitness values could be refined in the future with the help of greater sequencing depth; however, our study is already strong enough to demonstrate a clear evolutionary trade-off between the size and structure of the SARS-CoV-2 genome as they compete to optimize the fitness of the virus. We were able to expose this relationship by the outsized detrimental impact of synonymous C→U mutations on viral fitness.

The observation that synonymous mutations can strongly affect viral fitness has far-reaching implications for our understanding of the selective pressures acting on the evolution of the ssRNA viral genomes. The ratio of the rates of non-synonymous to synonymous substitutions (dN/dS) is one of the most widely used tests to detect the types of selection acting on coding sequences. This test relies on the assumption that synonymous mutations are mostly neutral and elevated dN/dS are often interpreted as evidence of positive selection. Recent studies comparing the sequences of different variants of SARS-CoV-2 have reported dN/dS in the range of ~ 0.5 to ~ 1.0(45), which is much higher than values typically observed in both prokaryotic and eukaryotic protein-coding genes. These elevated dN/dS values are typically interpreted as evidence for positive selection acting on the sequence of viral proteins, driven by an arms race between the virus and the host immune system. Our results show that, because of the strong constraints on maintaining the secondary structure via base-pairing, synonymous mutations can be as deleterious as non-synonymous mutations. Therefore, we suggest that the elevated dN/dS values observed in SARS-CoV-2 might stem from widespread purifying selection on synonymous mutations rather than positive selection for rapid evolution.

Finally, we suggest that these constraints on the secondary structure of the SARS-CoV-2 genome could be exploited during the development of therapeutics that target the genome. For example, methods based on genome editing such as CRISPR Cas13b or RNA interference (RNAi) have to rely on high sequence identity and could lose their efficiency if a mutation occurs in the targeted sequence(46, 47). Choosing targets inside regions that are important for the secondary structure of the genome should reduce the likelihood of facing mutations that escape treatment. Alternatively, it may be possible to exploit the importance of secondary structures for viral fitness with treatments that are aimed at the viral proteome. It is tempting to target amino acids that form essential components of proteins, because those amino acids are unlikely to escape treatment by mutation. However, our data now suggests that even amino acids that play a relatively unimportant role for viral fitness could be targets for antibody-based treatments and therapies because the bases that encode them are essential to maintain immutable secondary structures. Because these structures can be quite large, multiple bases with low fitness values cluster together in defined areas of the proteome (Fig. 5c-g), which could make them valuable targets for antibody-based approaches.

Cell lines

Vero C1008 African green monkey kidney cells, clone E6 (kindly provided by Assoc. Prof. Bosch, Veterinary Medicine, University Utrecht, The Netherlands), were cultured in Dulbecco's Modified Eagle Medium (DMEM) with L-Glutamine (Lonza) supplemented with 10% fetal bovine serum (FBS), 100 U/mL Penicillin/Streptomycin (Gibco) and 1% Ultraglutamine (250 mM in 0.85% NaCl) (Lonza) (cell line culture media) in a humidified incubator at 37 °C, 5% CO₂.

Calu-3 cells (kindly provided by Prof. Beekman, Hubrecht Institute, the Netherlands) were cultured in cell line culture media and maintained in a humidified incubator at 37 °C, 5% CO₂.

Human nasal epithelial cell culturing

Nasal brushing-derived human nasal epithelial cells (HNEC) (n = 1 healthy donors) were collected, isolated, and stored with informed consent of all participants and was approved by a specific ethical board for the use of biobanked materials TcBIO (Toetsingscommissie Biobanks), an institutional Medical Research Ethics Committee of the University Medical Center Utrecht (protocol ID: 16/586). HNEC were cultured as previously described (Amatngalim et al. 2022). In brief, cells (passage 5) were first expanded in 6-well culture plates coated with 50 µg/mL collagen IV (Sigma-Aldrich), using a defined expansion medium composed of: 50% (v/v) Bronchial epithelial cell medium-basal (BEpiCM-b; Sciencell), 23.5% (v/v) Advanced DMEM F12 (Thermo Fisher), 2% (v/v) B-27 (Thermo Fisher), 1% (v/v) GlutaMAX (Thermo Fisher), 10 mM HEPES (Thermo Fisher), 0.5 µg/mL (±)-Epinephrine hydrochloride (Sigma-Aldrich), 0.5 µg/mL Hydrocortisone (Sigma-Aldrich), 100 nM 3,3′,5-Triiodo-L-thyronine (Sigma-Aldrich), 1.25 mM N-Acetyl-L-cysteine (Sigma-Aldrich), 5 mM Nicotinamide (Sigma-Aldrich), 500 nM SB 202190 (Sigma-Aldrich), 1 µM DMH-1 (Selleck Chemicals), 1 µM A83-01 (Tocris), 5 µM Y-27632 (ROCKi) (Selleck Chemicals), 5 µM DAPT (Fisher Scientific), 25 ng/mL recombinant human FGF-7, 100 ng/mL recombinant human FGF-10, 5 ng/mL recombinant human EGF, 25 ng/mL recombinant human HGF (all Peprotech), 20% (v/v) Rspondin 1 conditioned medium (from Rspo1 cells Cultrex®), 1% (v/v) Penicillin-Streptomycin (Thermo Fisher), and 100 µg/mL Primocin (Invivogen). After reaching confluency, HNEC (0.2*10⁶ cells) were seeded on PureCol (Advanced BioMatrix) coated 24 well transwell inserts (0.4 µm pore size polyester membrane, Corning). Cells were first cultured in submerged conditions in expansion medium. 
Next, confluent monolayers were differentiated as air-liquid interface (ALI) cultures in differentiation medium, consisting of: 98.5% (v/v) Advanced DMEM F12, 0.5 µg/mL (±)-Epinephrine hydrochloride, 0.5 µg/mL Hydrocortisone, 100 nM 3,3′,5-Triiodo-L-thyronine, 1% (v/v) Penicillin-Streptomycin, 50 nM A83-01, 100 nM TTNPB (Cayman), 5 µM DAPT, and 0.5 ng/mL recombinant human EGF. ALI-HNEC cultures were differentiated for at least 18 days before use in experiments.

Viral culture

SARS-CoV-2/human/USA/WA-CDC-WA1/2020 was obtained from BEI Resources. Other SARS-CoV-2 strains were cultured from nasal swabs derived from patients from the University Medical Center Utrecht, the Netherlands. SARS-CoV-2/B1.1.7/Alpha/2021, SARS-CoV-2/B1.617.2/Delta/2021, SARS-CoV-2/B.1.351/Beta/2021, SARS-CoV-2/P.1/Gamma/2021 and SARS-CoV-2/B.1.1.529/Omicron/2021 were obtained by culturing in Vero E6 cells. One day prior to infection, 1.67 million Vero E6 cells were cultured in a 25 cm² filtered tissue culture flask (T25) in cell line culture medium. On the day of infection, the Vero E6 cells were washed in the culture flask with PBS (Lonza). Then, 1 mL of viral sample in universal transport medium (UTM) (Copan) was supplemented with 4 mL of cell line culture medium adapted to 2.5% FCS and 0.1 mg/mL Normocin (Invivogen), passed through a 0.22 µm Stericup Durapore filter (Millipore), and used to infect the cells for 2 hours at 37 °C and 5% CO₂. Subsequently, cells were washed with PBS, 5 mL of culture medium (2.5% FCS, 0.1 mg/mL Normocin) was added, and cells were incubated at 37 °C and 5% CO₂ until cytopathic effect (CPE) was observed.

Viral clone generation

Virus obtained from the first passage was serially diluted on Vero E6 cells in cell line culture medium by infecting 10,000 cells per well in a flat-bottom 96-well cell culture plate. Virus was harvested after 5 days when CPE was observed in less than 1/3 of a serial dilution series, which increases the likelihood of clonality. We harvested two clones each of SARS-CoV-2/human/USA/WA-CDC-WA1/2020, SARS-CoV-2/B1.1.7/Alpha/2021 and SARS-CoV-2/B1.617.2/Delta/2021, and one clone each of SARS-CoV-2/B.1.351/Beta/2021, SARS-CoV-2/P.1/Gamma/2021 and SARS-CoV-2/B.1.1.529/Omicron/2021. Viral load for serial passaging was increased by infecting a T25 tissue culture flask containing 1.67 million Vero E6 cells. Virus was harvested when CPE was observed throughout the culture.
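The clonality argument can be illustrated with a simple Poisson model. This model is an assumption made here for illustration and is not stated in the text; it also reads "less than 1/3" as the fraction of wells showing CPE at a given dilution. If infectious units land in wells at random with mean λ per well, a well shows CPE with probability 1 − e^(−λ), and the chance that a CPE-positive well was founded by a single infectious unit follows directly. The function names are hypothetical.

```python
import math


def mean_units_per_well(fraction_positive: float) -> float:
    """Mean infectious units per well implied by the fraction of CPE-positive wells.

    Assumes Poisson-distributed seeding (an illustrative assumption).
    """
    if not 0.0 < fraction_positive < 1.0:
        raise ValueError("fraction_positive must be strictly between 0 and 1")
    return -math.log(1.0 - fraction_positive)


def prob_single_founder(fraction_positive: float) -> float:
    """Probability that a CPE-positive well was seeded by exactly one infectious unit."""
    lam = mean_units_per_well(fraction_positive)
    # P(k = 1) / P(k >= 1) for a Poisson distribution with mean lam
    return lam * math.exp(-lam) / (1.0 - math.exp(-lam))


if __name__ == "__main__":
    for fraction in (1 / 3, 0.2, 0.1):
        p = prob_single_founder(fraction)
        print(f"{fraction:.2f} of wells positive: P(single founder) = {p:.2f}")
```

Under these assumptions, CPE in one third of the wells corresponds to roughly an 81% chance that a positive well is clonal, and the chance rises as the positive fraction falls.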

Viral passage

From the first passage, the 50% Tissue Culture Infectious Dose (TCID₅₀) was determined by limiting dilution in Vero E6 cells. One day prior to infection, 10,000 Vero E6 cells/well were seeded in a flat-bottom 96-well tissue culture plate (Costar) in cell line culture medium (2.5% FBS). Cells were infected with a 5-fold limiting dilution series in quadruplicate. Four days post infection, CPE was scored and TCID₅₀ was determined.
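The text does not state which estimator was used to convert CPE scores into a TCID₅₀ value. As a minimal sketch, the Spearman-Kärber method is one common choice for a 5-fold dilution series scored in quadruplicate. The function name, the example well counts and the starting dilution below are hypothetical.

```python
import math


def log10_tcid50_spearman_karber(positive_wells, wells_per_dilution,
                                 dilution_factor, first_dilution):
    """Return log10 of the TCID50 titer per inoculum volume (Spearman-Kaerber).

    positive_wells: number of CPE-positive wells at each dilution step,
        ordered from most to least concentrated.
    wells_per_dilution: replicate wells per step (4 for quadruplicates).
    dilution_factor: fold change between steps (5 for a 5-fold series).
    first_dilution: reciprocal of the most concentrated dilution (5 for 1:5).
    """
    if positive_wells[0] != wells_per_dilution or positive_wells[-1] != 0:
        # The estimator needs the series to span from 100% to 0% positive wells.
        raise ValueError("series must start fully positive and end fully negative")
    proportions = [n / wells_per_dilution for n in positive_wells]
    d = math.log10(dilution_factor)
    x0 = math.log10(first_dilution)
    return x0 - d / 2 + d * sum(proportions)


if __name__ == "__main__":
    # Hypothetical scores for dilutions 1:5, 1:25, 1:125, 1:625, 1:3125, 1:15625
    scores = [4, 4, 4, 3, 1, 0]
    log_titer = log10_tcid50_spearman_karber(scores, 4, 5, 5)
    print(f"TCID50 per inoculum volume: {10 ** log_titer:.0f}")
```

The result is expressed per inoculum volume, so it still has to be divided by the volume added to each well to obtain a titer per mL.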

Serial passaging of the clones of SARS-CoV-2/human/USA/WA-CDC-WA1/2020, SARS-CoV-2/B1.1.7/Alpha/2021 and SARS-CoV-2/B1.617.2/Delta/2021 was performed by infecting 5 million Vero E6 cells in a T75 filtered culture flask at a multiplicity of infection (MOI) of 0.1 in 18 mL of cell line culture medium (2.5% FBS). Cells were subsequently cultured for 5 days at 37 °C and 5% CO₂. TCID₅₀ was determined as described above to allow for subsequent infection at an MOI of 0.1. The remainder of the virus was concentrated using 30 kDa centrifugal filter tubes (Amicon), and viral RNA of the concentrated virus was isolated using a total RNA purification kit (Norgen Biotek) according to the manufacturer's protocol. SARS-CoV-2/human/USA/WA-CDC-WA1/2020, SARS-CoV-2/B1.1.7/Alpha/2021 and SARS-CoV-2/B1.617.2/Delta/2021 were passaged 7 times.
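The inoculum needed for each passage follows from the cell number, the target MOI and the measured titer. The sketch below treats one TCID₅₀ unit as one infectious unit, which is a simplifying assumption, and the example titer is hypothetical.

```python
def infectious_units_needed(n_cells: float, moi: float) -> float:
    """Infectious units required to infect n_cells at the given MOI."""
    return n_cells * moi


def inoculum_volume_ml(n_cells: float, moi: float, titer_per_ml: float) -> float:
    """Volume of virus stock (mL) that delivers the required infectious units.

    titer_per_ml is the stock titer in infectious units per mL; here one
    TCID50 unit is assumed to equal one infectious unit (a simplification).
    """
    if titer_per_ml <= 0:
        raise ValueError("titer_per_ml must be positive")
    return infectious_units_needed(n_cells, moi) / titer_per_ml


if __name__ == "__main__":
    cells = 5_000_000      # Vero E6 cells per T75 flask, from the text
    moi = 0.1              # target MOI, from the text
    titer = 1e6            # hypothetical stock titer (TCID50 per mL)
    print(f"infectious units: {infectious_units_needed(cells, moi):.0f}")
    print(f"inoculum volume: {inoculum_volume_ml(cells, moi, titer):.2f} mL")
```

With 5 million cells and an MOI of 0.1, about half a million infectious units are transferred per passage.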

To compare evolution rates between cell types, SARS-CoV-2/B1.617.2/Delta/2021 clone b passage 3 was used to infect Calu-3 cells in a similar fashion to the infection of Vero E6 cells, and SARS-CoV-2/B1.617.2/Delta/2021 clone b passage 2 was used to infect HNEC. Six ALI filters of HNEC were infected at the apical side at an MOI of 0.1 for two hours. Virus was washed away after 2 hours of infection, and supernatant was pooled and harvested 5 days post infection. RNA was extracted using a total RNA purification kit (Norgen Biotek).

Clones of SARS-CoV-2/B.1.351/Beta/2021, SARS-CoV-2/P.1/Gamma/2021 and SARS-CoV-2/B.1.1.529/Omicron/2021 were used to infect a T75 flask of Vero E6 cells at an MOI of 0.01, as an MOI of 0.1 could not be obtained. This was done for one passage. Virus and viral RNA were harvested as described above.

Library preparation and sequencing

We followed a refined CirSeq protocol to prepare sequencing libraries (Fritsch et al. 2018). First, RNA was DNase treated for 30 min at 37°C, followed by DNase inactivation (AM1926, Ambion). DNase-treated RNA was quantified with a Qubit RNA HS (High Sensitivity) Assay Kit (Q32852, Invitrogen) and a Qubit 4 Fluorometer (Q33238, Invitrogen). 500 ng RNA was then fragmented using RNase III (AM2290, Ambion) for 9 min at 37°C. After a clean-up with an Oligo Clean & Concentrator kit (D4061, Zymo Research), RNA fragments were circularized with T4 RNA ligase 1 (M0204S, New England Biolabs) for 2 hours at 25°C. Circular RNA was purified with an Oligo Clean & Concentrator kit and used to generate cDNA with tandem repeats by rolling-circle reverse transcription. For cDNA synthesis, circular RNA was first primed by incubation with random hexamers (N8080127, Thermo Fisher) for 10 min at 25°C. Then, the reaction was shifted to 42°C for 20 min to allow for primer extension and cDNA synthesis (18080044, Invitrogen). cDNA was purified with an Oligo Clean & Concentrator kit and second strand synthesis was performed with the NEBNext mRNA Second Strand Synthesis Module (E6111S, New England Biolabs). After a clean-up with an Oligo Clean & Concentrator kit, the remaining steps for library preparation were performed with the NEBNext Ultra RNA Library Prep Kit for Illumina (E7530L, New England Biolabs) and NEBNext Multiplex Oligos for Illumina (New England Biolabs) according to the manufacturer’s guidelines. Size selection and clean-up during sequencing library preparation were performed with AMPure XP Beads (A63881, Beckman Coulter). An Agilent TapeStation (G2991AA, Agilent) with appropriate High Sensitivity ScreenTapes and a Qubit 4 Fluorometer with HS Assay Kits were used for precise sizing and quantification of nucleic acids during different steps of library preparation. Paired-end reads (250 nt) were then generated using the Illumina NovaSeq 6000 System (SP Reagent Kit v1.5).

Initial data analysis

Rolling-circle RNA sequencing reads were processed as described previously (Fritsch et al. 2018), using the pipeline available at https://github.com/jfgout/tr-errors-pipeline-v1/tree/master. Briefly, reads are first trimmed using fastp (Chen et al. 2018) with default parameters, except for trimming poly-G ends (minimum size of 6) and requiring a minimum read size of 120 nucleotides after trimming. Trimmed reads are inspected to find repeats and to generate a consensus sequence from the stack of repeats. Only positions in the consensus that are supported by at least three repeats, with all repeats giving the same base call and with a cumulative quality score of at least 100, are considered reliable and are included in the analysis. Consensus sequences are mapped to the reference genome with kallisto (Bray et al. 2016), followed by a local optimization of the alignment with the seqan2 library (Döring et al. 2008). For every reliable consensus sequence mapped to the genome, we record the genotype supported by the consensus sequence (a call) to generate a table of all the reliable calls made at every position in the genome.
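The reliability criteria for consensus positions can be sketched as follows. This is not the code of the pipeline linked above. It is a minimal illustration that assumes the repeats have already been identified and aligned to equal length, and that per-base qualities are integer Phred scores. All function names are hypothetical.

```python
def consensus_call(bases, quals, min_repeats=3, min_total_quality=100):
    """Return the consensus base for one position, or None if it is unreliable.

    bases: base calls for this position, one per repeat.
    quals: matching Phred quality scores, one per repeat.
    """
    if len(bases) < min_repeats:
        return None                      # too few repeats cover this position
    if len(set(bases)) != 1:
        return None                      # repeats disagree on the base call
    if sum(quals) < min_total_quality:
        return None                      # cumulative quality is too low
    return bases[0]


def consensus_sequence(repeats, qualities):
    """Build a consensus from aligned repeats; unreliable positions become 'N'.

    repeats: list of equal-length strings (copies of the same RNA fragment).
    qualities: list of equal-length lists of Phred scores, one list per repeat.
    """
    consensus = []
    for column_bases, column_quals in zip(zip(*repeats), zip(*qualities)):
        call = consensus_call(column_bases, column_quals)
        consensus.append(call if call is not None else "N")
    return "".join(consensus)


if __name__ == "__main__":
    repeats = ["ACGT", "ACGT", "ACTT"]
    qualities = [[35, 35, 35, 35]] * 3
    print(consensus_sequence(repeats, qualities))
```

In this example the third position is masked because one repeat disagrees with the other two, which is how a reverse transcription or sequencing error confined to a single repeat is excluded.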

Mutation rate calculations

We downloaded the list of variants published by UShER (8), based on the alignment of about six million SARS-CoV-2 genomes, on May 23rd, 2023. Mutations that were never observed in this six-million-genome alignment were considered lethal and retained for calculation of the mutation rate. To prevent a small subset of non-lethal mutations missed by the alignment from biasing the result, we also excluded mutations that occurred at a frequency of more than 0.1%, and therefore restricted our analysis to genomic positions covered by at least 1,000 reliable consensus sequences. Mutation rates were then calculated, using the genomic sites that passed these criteria, for each of the twelve possible base substitutions by dividing the number of calls supporting the base substitution considered by the total number of calls at genomic positions with the nucleotide considered.
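The calculation described above can be sketched as follows. The input layout (a list of per-position call counts and a set of substitutions seen in the UShER data) and the function name are hypothetical, and the sketch applies the filters per substitution, which is one reading of the description rather than the exact implementation in the repository.

```python
from collections import defaultdict


def mutation_rates(sites, observed_in_usher, min_coverage=1000, max_frequency=0.001):
    """Estimate a rate for each base substitution from reliable consensus calls.

    sites: iterable of (position, reference_base, counts), where counts maps
        each base to the number of reliable calls at that position.
    observed_in_usher: set of (position, reference_base, alternative_base)
        substitutions seen in the genome alignment (treated as non-lethal).
    """
    mutant_calls = defaultdict(int)
    total_calls = defaultdict(int)
    for position, ref, counts in sites:
        coverage = sum(counts.values())
        if coverage < min_coverage:
            continue                          # too few calls at this position
        for alt in "ACGU":
            if alt == ref:
                continue
            if (position, ref, alt) in observed_in_usher:
                continue                      # keep only presumed-lethal mutations
            n_alt = counts.get(alt, 0)
            if n_alt / coverage > max_frequency:
                continue                      # likely non-lethal, missed by the alignment
            mutant_calls[(ref, alt)] += n_alt
            total_calls[(ref, alt)] += coverage
    return {sub: mutant_calls[sub] / total_calls[sub] for sub in total_calls}


if __name__ == "__main__":
    example_sites = [
        (1, "C", {"C": 99_990, "U": 10}),
        (2, "C", {"C": 500, "U": 1}),         # dropped: coverage below 1,000
        (3, "C", {"C": 100_000}),
    ]
    seen = {(3, "C", "U")}                    # hypothetical UShER observation
    for substitution, rate in sorted(mutation_rates(example_sites, seen).items()):
        print(substitution, rate)
```

Restricting the analysis to presumed-lethal mutations means each observed mutant call reflects a new mutation event rather than a mutation inherited from an earlier round of replication.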

Selection of mutations for fitness value calculations

Rare mutations may be absent from the subset of viral particles used to seed the next passage. Under this scenario, the bottleneck between passages would reset the frequency of these mutations to zero, so that only new mutations generated in the current passage could be detected, mimicking the pattern expected for lethal mutations. This would result in a systematic underestimation of the fitness values of mutations that occur at low frequency. Therefore, because we transfer approximately half a million viral particles between passages, we only included C→U mutations in our fitness analysis that occur at a rate greater than 1×10⁻⁵. At this frequency, at least 5 copies of a mutation should be transferred on average from one passage to the next.
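The arithmetic behind this threshold can be checked directly. The sketch below also estimates how often a mutation at the threshold frequency would be lost entirely at the bottleneck, using a Poisson approximation that is an assumption added here and not part of the original analysis. The function names are hypothetical.

```python
import math


def expected_copies(n_particles: float, frequency: float) -> float:
    """Expected number of mutant genomes carried through the bottleneck."""
    return n_particles * frequency


def prob_lost_at_bottleneck(n_particles: float, frequency: float) -> float:
    """Probability that no mutant genome is transferred (Poisson approximation)."""
    return math.exp(-expected_copies(n_particles, frequency))


if __name__ == "__main__":
    particles = 500_000     # approximate number of particles transferred, from the text
    threshold = 1e-5        # minimum mutation frequency, from the text
    print(f"expected copies: {expected_copies(particles, threshold):.1f}")
    print(f"P(lost): {prob_lost_at_bottleneck(particles, threshold):.4f}")
```

Under this approximation, a mutation at the threshold frequency is lost in fewer than 1% of transfers, while a mutation ten times rarer would be lost in more than half of them.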

Data and materials availability: All data and code required to reproduce these analyses (including the detailed list of mutations and observations at each position of the genome for each sample) are available on GitHub at: https://github.com/jfgout/SARS-CoV-2

Raw sequencing reads will be made available on the NCBI Short Read Archive database.

Acknowledgments: We thank Prof. J.M. Beekman (Hubrecht Institute, Utrecht, the Netherlands) for providing Calu-3 cells. We thank Dr. Jeffrey Jensen for his comments on an earlier draft of this manuscript.

Funding:

National Science Foundation award 2032784, RAPID: Probing SARS-CoV-2 evolution and vulnerabilities through its mutation and fitness landscape (MV, J-FG)

Health Holland LSHM20058 Clear Covid-19 project, ZonMw 114025009 (JS, MN)

Author contributions:

Conceptualization: JFG, MV, JS, MN

Formal analysis: JFG

Software: JFG

Methodology: JFG, MV, JS, MN

Investigation: JFG, JS, CC, BMV, SS, DdJ, GA, MN, MV

Visualization: JFG, MV

Funding acquisition: JFG, MV, MN

Project administration: JFG, MV, JS, MN

Supervision: JFG, MV, MN

Writing – original draft: JFG, MV, JS, MN

Writing – review & editing: JFG, MV, JS, MN

Competing interests: The authors declare no competing interests.

Tonkin-Hill G et al (2021) Patterns of within-host genetic diversity in SARS-CoV-2. Elife 10
Peck KM, Lauring AS (2018) Complexities of Viral Mutation Rates. J Virol 92
Minoche AE, Dohm JC, Himmelbauer H (2011) Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol 12:R112
Acevedo A, Andino R (2014) Library preparation for highly accurate population sequencing of RNA viruses. Nat Protoc 9:1760–1769
Acevedo A, Brodsky L, Andino R (2014) Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature 505:686–690
Whitfield ZJ et al (2020) Species-Specific Evolution of Ebola Virus during Replication in Human and Bat Cells. Cell Rep 32:108028
Dolan PT et al (2021) Principles of dengue virus evolvability derived from genotype-fitness maps in human and mosquito cells. Elife 10
Grass V et al (2022) Adaptation to host cell environment during experimental evolution of Zika virus. Commun Biol 5:1115
Jefferson T, Spencer EA, Brassey J, Heneghan C (2021) Viral Cultures for Coronavirus Disease 2019 Infectivity Assessment: A Systematic Review. Clin Infect Dis 73:e3884–e3899
Denison MR, Graham RL, Donaldson EF, Eckerle LD, Baric RS (2011) Coronaviruses: an RNA proofreading machine regulates replication fidelity and diversity. RNA Biol 8:270–279
McBroome J et al (2021) A Daily-Updated Database and Tools for Comprehensive SARS-CoV-2 Mutation-Annotated Trees. Mol Biol Evol 38:5819–5824
Turakhia Y et al (2021) Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat Genet 53:809–816
Harrison PW et al (2024) Ensembl 2024. Nucleic Acids Res 52:D891–D899
Moeller NH et al (2022) Structure and dynamics of SARS-CoV-2 proofreading exoribonuclease ExoN. Proc Natl Acad Sci U S A 119
Chung C et al (2023) Evolutionary conservation of the fidelity of transcription. Nat Commun 14:1547
Fritsch C et al (2021) Genome-wide surveillance of transcription errors in response to genotoxic stress. Proc Natl Acad Sci U S A 118
Gout JF et al (2017) The landscape of transcription errors in eukaryotic cells. Sci Adv 3:e1701484
Sung W et al (2015) Asymmetric Context-Dependent Mutation Patterns Revealed through Mutation-Accumulation Experiments. Mol Biol Evol 32:1672–1683
Rice AM et al (2021) Evidence for Strong Mutation Bias toward, and Selection against, U Content in SARS-CoV-2: Implications for Vaccine Design. Mol Biol Evol 38:67–83
Obermeyer F et al (2022) Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science 376:1327–1332
Bloom JD, Neher RA (2023) Fitness effects of mutations to SARS-CoV-2 proteins. Virus Evol 9:vead055
Lan TCT et al (2022) Secondary structural ensembles of the SARS-CoV-2 RNA genome in infected cells. Nat Commun 13:1128
Simmonds P (2020) Pervasive RNA Secondary Structure in the Genomes of SARS-CoV-2 and Other Coronaviruses. mBio 11
Sun L et al (2021) In vivo structural characterization of the SARS-CoV-2 RNA genome identifies host proteins vulnerable to repurposed drugs. Cell 184:1865–1883.e20
Garg A, Heinemann U (2018) A novel form of RNA double helix based on G.U and C.A(+) wobble base pairing. RNA 24:209–218
Frederico LA, Kunkel TA, Shaw BR (1990) A sensitive genetic assay for the detection of cytosine deamination: determination of rate constants and the activation energy. Biochemistry 29:2532–2537
Shen JC, Rideout WM 3rd, Jones PA (1994) The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Res 22:972–976
Lindahl T, Nyberg B (1974) Heat-induced deamination of cytosine residues in deoxyribonucleic acid. Biochemistry 13:3405–3410
Ehrlich M, Norris KF, Wang RY, Kuo KC, Gehrke CW (1986) DNA cytosine methylation and heat-induced deamination. Biosci Rep 6:387–393
Stavrou S, Ross SR (2015) APOBEC3 Proteins in Viral Immunity. J Immunol 195:4565–4570
Smith HC (2011) APOBEC3G: a double agent in defense. Trends Biochem Sci 36:239–244
Sharma S, Baysal BE (2017) Stem-loop structure preference for site-specific RNA editing by APOBEC3A and APOBEC3G. PeerJ 5:e4136
Rodriguez-Rivas J, Croce G, Muscat M, Weigt M (2022) Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes. Proc Natl Acad Sci U S A 119
Han W et al (2023) Predicting the antigenic evolution of SARS-COV-2 with deep learning. Nat Commun 14:3478
Bradley CC et al (2024) Targeted accurate RNA consensus sequencing (tARC-seq) reveals mechanisms of replication error affecting SARS-CoV-2 divergence. Nat Microbiol 9:1382–1392
Amicone M et al (2022) Mutation rate of SARS-CoV-2 and emergence of mutators during experimental evolution. Evol Med Public Health 10:142–155
Cuevas JM, Domingo-Calap P, Sanjuan R (2012) The fitness effects of synonymous mutations in DNA and RNA viruses. Mol Biol Evol 29:17–20
Lauring AS, Acevedo A, Cooper SB, Andino R (2012) Codon usage determines the mutational robustness, evolutionary capacity, and virulence of an RNA virus. Cell Host Microbe 12:623–632
Tubiana L, Bozic AL, Micheletti C, Podgornik R (2015) Synonymous mutations reduce genome compactness in icosahedral ssRNA viruses. Biophys J 108:194–202
Zanini F, Puller V, Brodin J, Albert J, Neher RA (2017) In vivo mutation rates and the landscape of fitness costs of HIV-1. Virus Evol 3:vex003
Le Nouen C et al (2014) Attenuation of human respiratory syncytial virus by genome-scale codon-pair deoptimization. Proc Natl Acad Sci U S A 111:13169–13174
Tulloch F, Atkinson NJ, Evans DJ, Ryan MD, Simmonds P (2014) RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies. Elife 3:e04531
Domingo-Calap P, Cuevas JM, Sanjuan R (2009) The fitness effects of random mutations in single-stranded DNA and RNA bacteriophages. PLoS Genet 5:e1000742
Carrasco P, de la Iglesia F, Elena SF (2007) Distribution of fitness and virulence effects caused by single-nucleotide substitutions in Tobacco Etch virus. J Virol 81:12979–12984
Yi K et al (2021) Mutational spectrum of SARS-CoV-2 during the global pandemic. Exp Mol Med 53:1229–1237
Fareh M et al (2021) Reprogrammed CRISPR-Cas13b suppresses SARS-CoV-2 replication and circumvents its mutational escape through mismatch tolerance. Nat Commun 12:4270
Becker J et al (2022) Ex vivo and in vivo suppression of SARS-CoV-2 with combinatorial AAV/RNAi expression vectors. Mol Ther 30:2005–2023

Tables 1 to 2 are available in the Supplementary Files section

There is NO Competing Interest.

SymonsSupplementaryTable1.csv
Supplementary Data Set 1
SymonsSupplementaryvideo1replisomehorizontal.mpg
Replisome-containing deleterious mutations-horizontal spin
SymonsSupplementaryvideo2replisomevertical.mpg
Replisome-containing deleterious mutations-vertical spin
SymonsSupplementaryvideo3Sproteinhorizontal.mpg
S-protein-containing deleterious mutations-horizontal spin
SymonsSupplementaryvideo4Sproteinvertical.mpg
S-protein-containing deleterious mutations-vertical spin
Tables.docx
SUPPLEMENTARYFIGURES.docx
