Overall strategy.
The mutation rates and spectra of RNA viruses (including SARS-CoV-2) are notoriously difficult to measure. For example, even though more than 10 million SARS-CoV-2 genomes have been documented across the globe (GISAID, https://gisaid.org/), this dataset only contains mutations that were successful enough to become major variants in patients. Accordingly, most studies of within-host genetic diversity are limited to variants with an allele frequency that exceeds 0.5%, while mutations that are detrimental to the virus are missed(1). In contrast, the negative correlation between mutation rate and genome size observed in viruses suggests that the spontaneous mutation frequency of the > 30kb SARS-CoV2 genome is < 1×10− 5 per base(2), which is significantly lower than the detection threshold of traditional sequencing methods(3). As a result, RNA-sequencing methods with improved fidelity are required to determine the mutation rate of SARS-CoV-2. With these considerations in mind, we used an ultra-accurate rolling-circle RNA consensus sequencing method termed CirSeq(4) to determine the mutation rate and spectrum of six SARS-CoV-2 variants. This method was previously used to determine the mutational landscape of other RNA viruses, including the polio virus(5), the Ebola virus(6), the Dengue virus(7) and the Zika virus(8). The improved accuracy of CirSeq relies on the circularization of short RNA fragments to synthesize long cDNA molecules that carry tandem repeats of the original RNA template. These tandem repeats can then be analyzed to generate a consensus sequence, which eliminates sequencing and reverse-transcription errors from the final sequencing results (fig. S1).
To explore the mutational landscape of SARS-CoV-2, we cultured the virus in VeroE6 cells, a preferred cell line for COVID-19 research because of its susceptibility to infection, efficient viral replication and permissiveness to mutations(9). Accordingly, VeroE6 cells can support a higher degree of viral genetic diversity than other cell lines, which is useful for studies that examine viral evolution during prolonged culture conditions. In total, we cultured 6 major strains of the SARS-CoV-2 virus, including the USA-WA1/2020, Alpha and Delta strains (corresponding to clades 19B, 20I and 21J respectively), and the Beta, Gamma and Omicron strains. Although each strain was cultured in duplicate, the majority of our experiments were performed on the USA-WA1/2020, Alpha and Delta strains, which we cultured over 7 serial passages, while the Beta, Gamma and Omicron strains were profiled for a single passage (Table 1). Finally, because the VeroE6 cells were derived from the kidney of an African green monkey, we wanted to make sure that our measurements were not skewed by this unique biological environment. Thus, we cultured the Delta strain for 1 passage in Calu-3 cells (a human lung adenocarcinoma cell line) as well, and in primary human nasal epithelial cells that were grown in an air-liquid-interface (Table 1). After each passage, we monitored the sequence of the SARS-CoV-2 genome by CirSeq to take a snapshot of their mutational landscape. Across all strains and conditions, we sequenced over ~ 200 billion bases and identified more than three million mutations. Finally, we assigned the most common mutations a fitness value to determine if they are selected for or against by the SARS-CoV-2 virus and mapped these mutations onto the genome to determine the biological basis for selection.
The mutation rate and spectrum of the SARS-CoV-2 genome.
After we profiled the mutational landscape of each strain, we used lethal and highly detrimental mutations to estimate their mutation rate. Because these mutations cannot be carried over between passages, they must be produced anew in each generation, so that their frequency is equal to the mutation rate(5). We identified lethal mutations with two strategies. First, we considered mutations to be lethal if they introduced a premature stop codon (PTC) in the open reading frame of the viral RNA-dependent RNA polymerase (RdRP), a protein that is essential for viral replication(10). Second, we screened > 6 million SARS-CoV-2 genomes that were previously aligned by UShER(11, 12) and Ensembl(13) for mutations that were absent from these databases. Each of these genomes represents the sequence of the most common variant in a patient, which precludes mutations that are either lethal or highly detrimental for viral fitness. Consistent with this idea, we found that mutations that were absent in both of these databases were significantly depleted from our dataset (fig. S2). Using the mutations identified by these two strategies we then calculated the overall mutation rate of the SARS-CoV-2 genome and found that ~ 3×10− 6 mutations arise per nucleotide per viral passage. This rate was relatively consistent across variants and replicates (Fig. 2a-b) and showed that the mutation rate of SARS-CoV-2 is ~ 10-fold lower than the poliovirus(5) and ~ 5-fold lower compared to the Dengue virus(7), two other RNA-based viruses previously examined by CirSeq. The improved fidelity of the SARS-CoV-2 virus is most likely due to the proofreading capacity of its RNA-dependent RNA polymerase (RdRp)(10, 14), which is absent in the polio and Dengue virus. In this context, it is important to note that G→A and U→C mutations displayed the largest reduction in mutation rate compared to polio (39-fold and 37-fold respectively). Intriguingly, these substitutions also increase the most when the proofreading activity of eukaryotic RNA polymerases II is compromised(15–17), suggesting the existence of a universal set of rules that govern the proofreading capabilities of RNA polymerases in eukaryotes and viruses.
Finally, we found that the mutation rate of every strain was dominated by C→U substitutions, which were ~ 5 times more common than any other base substitution (Fig. 2c). The rate with which C→U substitutions arose depended to a significant degree upon the upstream (i.e. 5′ adjacent) and downstream (i.e. 3′ adjacent) nucleotides. For example, we found that C→U mutations occur most commonly in a 5′-UCG-3′ context (Fig. 3a-d), consistent with analyses based on SARS-CoV-2 phylogeny(18). Because our measurements are independent of positive or negative selection though (which play a key role in published SARS-CoV-2 genome sequences), our analyses provide an unfiltered view of the impact of genetic context on viral mutagenesis. When taken together, these observations demonstrate that C→U substitutions add the greatest amount of genetic variation to the SARS-CoV-2 genome and provide the largest substrate for evolution to act upon, a conclusion that is also supported by more indirect observations(19).
Selection for nucleotide composition
It is likely that the mutation rate and spectrum of the SARS-CoV-2 genome affect the evolution of the virus in various ways. One of the most fundamental attributes of a genome is its nucleotide composition, which depends on the balance between the mutation spectrum and the intensity of selection for each of the four nucleotides. Using the mutational profile derived here, we estimate the equilibrium frequencies for all four nucleotides as: U = 0.4, A = 0.31, G = 0.21 and C = 0.08. This analysis translates into an equilibrium GC content of 29%, which is substantially higher than the 17% previously reported (19). However, this previous estimate is based on indirect estimations of mutation rates at 4-fold degenerate sites across lineages sequenced in GISAID, which might be impacted by selection. Regardless, both estimates are significantly lower than the observed 38% GC content of the SARS-CoV-2 genome, indicating selection for elevated GC content. Indeed, we found that the observed frequency of cytidines at 4-fold degenerate sites is ~ 1.75-times higher than predicted at mutational equilibrium (Table 2), suggesting that there is substantial selective pressure against excessive cytidine depletion, even at sites where these mutations would not affect amino acid composition. To gain more insight into the molecular mechanisms that suppress cytidine depletion at 4-fold degenerate sites, we examined the fitness landscape of C→U mutations in the SARS-CoV-2 genome.
Fitness landscape of SARS-CoV-2
To determine the consequences of C→U mutations for viral evolution, we decided to characterize their impact on the fitness of the SARS-CoV-2 virus using a strategy previously employed for the polio virus(5). Briefly, the fitness of a mutation is related to its change in frequency between consecutive passages as described in the following equation:
$${f}_{n }= {f}_{n-1}\times w + {\mu }_{n-1}$$
(Eq. 1)
with fn and fn-1 the observed frequency of the mutation at passages n and n-1, w the relative fitness and µ the mutation rate. Because these analyses require large amounts of data, we selected one replicate of the SARS-CoV-2 Delta variant and sequenced 155 billion bases of its genome over the course of 7 passages, reaching 1.7 million times coverage. We identified 64,967 unique mutations across these passages, with each mutation being observed on average 42 times for a total of 2.7 million mutation observations. Importantly, the SARS-CoV-2 genome is ∼30,000 bases in length and each base can be mutated into 3 different nucleotides so that a total of ∼90K base substitutions are theoretically possible. Accordingly, we identified 66% of all the mutations that could possibly arise in the SARS-CoV-2 genome. Based on our selection criteria (see methods) this sequencing effort yielded enough data to calculate fitness values for 3,603 C→U mutations (Fig. 4a, Supplementary Table 1). We performed three tests to determine the veracity of these fitness values. First, we compared the fitness values for different types of mutations and found that (as expected) synonymous mutations were less deleterious than non-synonymous mutations (0.90 vs 0.82, P = 8.5×10− 6), and non-synonymous mutations were less deleterious than non-sense mutations (0.82 vs 0.73, P = 0.003). In a second test, we examined the fitness values of mutations that were predicted to be either lethal or highly detrimental due to their absence in the UShER and Ensembl databases. Consistent with the idea that our fitness values hold valuable information, we found that these mutations displayed significantly lower fitness values compared to all the other mutations we detected (mean fitness: 0.48 vs 0.85, P = 5.7×10− 4, Welch Two Sample t-test). In total, 47% of the mutations absent in these databases, but detected in our experiment were assigned a fitness value of 0, a value that is only assigned to mutations that are expected to be either lethal or highly detrimental. In contrast, only 13% for the remaining C→U mutations received a fitness value of 0 (P = 4.5×10− 7, Chi-square test). In a third test, we compared our fitness estimates to studies that used changes in mutation frequency throughout the SARS-CoV-2 phylogeny to infer fitness values(20, 21). When we compared our values to the only other study to provide fitness estimates for both synonymous and nonsynonymous mutations(21), we found a highly significant correlation between our data and this independent dataset (r = 0.42, P < 2×10–16. Together, these comparisons strongly support the idea that our algorithms provide accurate information about the impact of mutations on viral fitness.
Paired bases contribute disproportionally to SARS-CoV-2 fitness
After establishing that our fitness values hold accurate information, we wanted to investigate why synonymous C→U mutations are selected against in the SARS-CoV-2 genome, even if they occur at 4-fold degenerate sites. Potentially, this phenomenon could be explained by stronger, more frequent purifying selection against synonymous mutations in SARS-CoV-2 compared to other viruses, such as the polio virus(5). It was recently shown that the SARS-CoV-2 genome adopts a highly specific secondary structure(22) and that bases that pair with each other to form these structures tend to display lower nucleotide diversity(23). Interestingly, we observed a similar specificity for secondary structures in our CirSeq dataset. The enzyme used to fragment viral RNA (RNAse III) prefers to cleave RNA at specific double-stranded structures, causing strong peaks and valleys in genome coverage that reflect the secondary structure of the SARS-CoV-2 genome. We found that these coverage peaks are identical between all the variants we tested, indicating that the secondary structure of the genome is highly conserved across the SARS-CoV-2 phylogeny (fig. S3). Based on these observations, we hypothesized that the need to preserve this secondary structure could be a significant factor driving purifying selection against synonymous mutations. To test this hypothesis, we split synonymous C→U mutations into two groups: those that form base-pair interactions (henceforth referred to as “paired” sites) and those that do not (henceforth referred to as “unpaired” sites)(22).
Consistent with the idea that there is strong purifying selection against synonymous mutations that affect secondary structures in the SARS-CoV-2 genome, we found that the average fitness value of synonymous C→U mutations was lower at paired sites compared to unpaired sites (0.76 vs 1.03, P < 2×10–16, t-test, Fig. 4b-c). We observed a similar pattern for nonsynonymous mutations (average fitness: 0.65 vs 0.92 for paired and unpaired sites respectively, P < 2×10–16, t test, Fig. 4d-e). To support this idea further, we re-examined our fitness values with the help of an independent assessment of secondary structures based on SHAPE scores(24) and found a significant positive correlation between the shape reactivity score and our fitness estimates for synonymous C→U mutations (r = 0.24, P < 2×10–16). Taken together, these results strongly suggest that mutations that disrupt base-pairing interactions are more likely to be deleterious to SARS-CoV-2 fitness than those that don’t.
Because our fitness estimates are limited to C→U mutations, we used the fitness estimates previously published by Bloom and Neher(21) to investigate if purifying selection for synonymous mutations at paired sites was present for all types of base-substitutions. Restricting our analysis to base-substitutions with enough observations in both paired and unpaired categories (see methods), we found that synonymous mutations are significantly more deleterious at paired vs unpaired sites for U→C, G→U, G→C, C→U, C→A and A→U base substitutions (P < < 0.001 for all, Welch two samples t-test, fig. S4). U→A, U→G, C→G and A→C substitutions did not yield enough observations to calculate fitness values, while two types of base-substitutions (G→A and A→G) showed no significant difference. Interestingly though, it was previously shown that A:C and G:U base pairs (which would arise from G→A and A→G mutations respectively) allow wobble base pairing in RNA molecules(25). Therefore, it is possible that these mutations do not significantly alter the secondary structure of the SARS-CoV-2 genome, even when they occur at paired sites, allowing them to escape purifying selection.
Paired bases display a reduced mutation rate
Our data suggests that the secondary structure of the SARS-CoV-2 genome is critical for viral fitness and that SARS-CoV-2 conserves these structures by strong purifying selection. However, the pace of evolution is also controlled by the mutation rate. Accordingly, we wamted to test the impact of the secondary structure on the mutation rate. To do so, we compared the rate of mutation between paired and unpaired bases and found that C→U mutations are ~ 3 times more frequent at unpaired bases compared to paired bases in all strains (P = < 0.01, Fig. 4f, fig. S5). Other base substitutions are not increased at paired bases (Fig. 4f), indicating that the mechanism responsible for this observation is highly specific. A similar discrepancy is seen at mutational hot spots and cold spots, which are defined by locations where the mutation frequency either increases or decreases 10-fold. In hot spots, only 22.4% of nucleotides are predicted to be paired (vs 47.3% overall, P = 5×10–15, chi-square test), while they make up 84.2% of bases in cold spots (P < 2×10–16, chi-square test (Fig. 4g). Interestingly, unpaired cytidines are 140 times more prone to spontaneous deamination into uracil than paired bases(26), suggesting that a large fraction of C→U mutations could be caused by spontaneous cytidine deamination due to chemical damage. This hypothesis is supported by the observation that C→U mutations are also elevated at CpG sites that are expected to be methylated (methylated cytosines are 2 to 4 times more likely to be deaminated than unmethylated cytosines (27–29), Fig. 4h). Another possibility is that the reduced rate of cytidine deamination at paired sites reflects the preference of APOBEC proteins for ssRNA. For example, APOBEC3A has a strong preference for unpaired cytidines that are flanked by a 5’ uracil and a 3’ guanosine(30–32), the exact conditions that show the highest C→U mutation rate in our dataset (Fig. 3c). However, regardless of mechanism, these observations suggest that the secondary structure of the SARS-CoV-2 genome is not only preserved by strong purifying selection, but also by local changes in the mutation rate that spare paired bases. Together, these observations suggest a synergistic relationship between the secondary structure of the SARS-CoV-2 genome and its mutation rate, which reinforce each other to promote viral fitness.
Fitness values, viral evolution and potential weaknesses of SARS-CoV-2
Because fitness values reflect the forces of natural selection, we wondered whether they could predict the evolution of the SARS-CoV-2 virus. Since the emergence of the Delta variant we sequenced for our fitness analysis, multiple variants have evolved that swept the globe. Each of these strains contains defining mutations that were positively selected for during the evolution of SARS-CoV-2 in human populations (Fig. 5a-b.). Interestingly, some of these mutations were also detected in our short-term evolution experiment. When we examined the fitness values of the mutations that define strain 23H (the most advanced strain thus far), we found that these mutations displayed significantly higher fitness values compared to all other mutations detected in our evolution experiment (1.11 vs 0.84, P = 0.045, t test). Thus, the fitness landscape obtained from our dataset could help predict the mutations that arise in future variants. Accordingly, mutations with high fitness values that have not been observed in known variants thus far could be of particular interest to researchers trying to predict the evolutionary trajectory of SARS-CoV-2(33, 34).
Conversely, we also identified an array of mutations with fitness values below 0.5, suggesting that these mutations are immediate targets for purifying selection. Because these mutations tend to alter critical groups of amino acids and disturb pivotal secondary structures they tend to cluster together when mapped onto the proteome. In Fig. 5c-h, we mapped the majority of these clusters onto the spike protein and the replisome (see also Supplementary-Video 1–4). Because these clusters highlight immutable components of the SARS-CoV-2 virus, they could be valuable targets for vaccines and treatments. Frequently, viruses develop resistance to vaccines and treatments by mutation, leading to variants that lack the targets that treatments or vaccines were developed against; however, because mutation of these essential components significantly lowers viral fitness, targeting these clusters could limit the number of escapees that emerge.