We collected 537 (183 pharyngeal, 241 sputum and 113 fecal) samples from 204 patients, which covered 34.4% of total cases in Beijing before April 30 (Figure 2A, Table S1). Using deep viral genome sequencing, we obtained 8.59G (IQR: 3.12G-17.38G) base pair per sample on average, of which 11.90% (IQR: 1.32%-53.92%) were mapped to the SARS-CoV-2 reference genome Wuhan-Hu-1 (accession: NC_045512.2)18. Samples with low viral genomic coverage were removed (see Methods) and eventually 402 (136 pharyngeal, 182 sputum and 84 fecal) samples from 170 patients underwent further analysis (Figure 2B, Figure S1A).
In each sample, we examined high-depth (≥100x) SARS-CoV-2 genomic sites for iSNVs, and filtered the iSNVs with stringent threshold(≥5%), which could sufficiently distinguished the true iSNVs from the sequencing errors (see Methods). These iSNVs with stringent threshold were widely distributed along the genome, and the iSNVs number in each sample was not affected by genomic coverage ratio or sequencing depth (R2, 0.074 and 0.006, respectively, Figure S1B, Figure S1C). Totally, we identified 7,037 iSNVs in the 374 samples with a median density of 0.53 iSNVs/kb, which is comparable to the iSNVs previously observed in Ebola virus 17, Yellow fever virus 19 and Influenza A 20 (Table S2). Notably, sequencing data showed that 28 samples did not contain iSNVs compared to the reference genome (Figure S1D).
Uneven distribution of intra-host variations in SARS-CoV-2 genomes
We examined the locations of iSNVs along the SARS-CoV-2 genome. Compared to the lower density of these iSNVs (0.58 iSNVs/kb), we observed higher iSNVs density in 5’-UTR (1.23 iSNVs/kb) and 3’-UTR (1.07 iSNVs/kb) (Figure 1E). 6,790 (96.49%) iSNVs were occurred in the coding regions. Most iSNVs (4,625, 68.11%) were identified in ORF1ab, following by S gene (903 iSNVs) and N gene (459 iSNVs) (Figure S1F). However, after we normalized iSNVs with gene length, the highest frequency of iSNVs was obtained in ORF8 (1.02 iSNVs/kb), following by N protein (0.906 iSNVs/kb) (Figure 2D), which were consistent with previous identification at SNP level 21. The analysis on allele position in codon position showed that 2,329, 2,178 and 2,283 iSNVs were occurred at the 1st, 2nd and 3rd codon position, respectively (Figure 2E). The Fisher-exact test on the iSNVs in codon position for each ORF displayed that ORF10 and E gene had more iSNVs at the 1st codon positions (Figure 2E).
We then examined the distribution of iSNVs among the patients, and found the iSNVs occurred in high frequency SNPs (hfSNPs) in the global SARS-CoV-2 genome database shared in more patients 7 (Figure 2F). In addition, most (63.08 %) of iSNVs have low values of minor allele frequency ranged from 0.05 to 0.20, which were rarely shared among multiple individuals (Figure S1E). 81.02% of iSNVs were only occurred in one patient, and 11.24% were found in two individuals. According to the neutral evolution model, we constructed a simple framework to establish the basic expectations of both allele frequencies and the genomic distance between two alleles of iSNVs in a stimulated model (Figure 2G, Figure 2H). Compared to the neutral evolution model, both allele frequencies and iSNVs distribution in the genome displayed an uneven distribution of these iSNVs, suggesting a purifying and positive selection on these iSNVs (Figure 2G, Figure 2H).
Increased genetic diversity with the disease progression
To uncover the dynamic change of viral iSNVs in COVID-19 patients, we performed spatio-temporal analysis on the iSNVs along with the epidemic period and the disease progression, as well as that from different specimens. We observed a steady increase of iSNVs along with the epidemic period among the patient population (estimated value from 0.15 to 0.83 iSNVs/kb within 97 days) (Figure 3A). Meanwhile, iSNVs also accumulated during the infection in individual hosts, from 0.52 iSNVs/kb on the initial day to 0.85 iSNVs/kb on 30-day post symptom onset (Figure 3B, Table S3). This accumulation of mutations could also be observed in 81.97% (50 out of 61) paired samples from COVID-19 patients (Figure S2A). No significant difference in genomic coverage was observed among different sample types or along the epidemic period (Figure S2B). The increased genetic diversity could be observed in all three different kinds of specimens (Figure S2C). Of note, while more iSNVs were occurred in feces than in sputum and pharyngeal swab samples during the early days of symptom onset, the accumulation of iSNVs in digest system is slower than that in respiratory system (Figure S2C, Table S3).
Next, we measured dynamic changes of the genetic diversity in ORFs with the disease progression, and observed a rapid accumulation of mutation in the S, N, ORF1ab and other genes. The accumulation in nonsynonymous is more rapid in all genes, in comparison to synonymous mutations that respected as the neutrality (Figure 3C), and S gene displayed the highest accumulation rate (Figure 3C). To test whether the accumulation of genetic diversity was caused by their fitness selections, we computed the iSNVs number of nonsynonymous and synonymous divergency among genes, and found that 5,197 iSNVs displayed as nonsynonymous mutations as and higher than that in synonymous (1,593). The ratio of nonsynonymous to synonymous variants was 3.26. The ratio of nonsynonymous to synonymous of iSNVs in S gene (ratio=5.31) was significantly higher than the other part of viral genome (Figure 3D). The mean values of minor allele frequencies of non-synonymous and synonymous iSNVs were 0.189 and 0.195, respectively (Figure S2B). The ratio (Ka/Ks) between the number of nonsynonymous substitutions per nonsynonymous site (Ka) and the number per of synonymous substitutions per synonymous site (Ks) in S gene was also increased from 1.01 to 2.46 with the disease progression. Assuming that Ks represent a neutral expectation, Ka/Ks values of S gene over than 1 in the late of disease progression, indicated signatures of positive selection with disease progression at least in S genes (Figure 3E). In addition, the fraction of nonsynonymous mutations in predict epitope region 22 was significantly higher than expected (27.57% vs. 25.74%, p-value=0.016), while nonsynonymous mutations out of epitopes regions was significantly lower (Figure 3F). The further correlation analysis between the fraction of nonsynonymous sites and the time of symptoms onsets, displayed nonsynonymous increased genes along with disease progression (Figure S2E), especially in S gene (Figure 3G).
RNA editing in the increase genetic diversity
Host-dependent RNA editing of APOBEC and ADARs has been acknowledged to cause the bias of mutational type in SARS-CoV-2 23, thus, we measured all mutational types on all iSNVs. The top five mutation types of iSNVs were U-to-C, followed by C-to-U, G-to-U, A-to-G, and G-to-A (Figure 4A). Among them, four were the common substitutions caused by APOBECs/ADARs 23 (Figure 4A). In addition, unlike the A-to-I RNA editing signal in human transcriptome, we did not observe an obvious depletion of G bases in position -1 that presents at A-to-I edited positions (Figure S3A). We also calculated the correlation between the minor allele frequencies of iSNVs and the time post symptom onset. Previous study has showed an induction of APOBECs triggered by coronavirus infection but ADAR did not 23. Consistently, the minor allele frequencies of C-to-U and G-to-A mediated by APOBEC-mediated RNA editing slightly increased in vivo. By contrast, the other mutational type including ADAR-mediated RNA editing in vivo did not increased along the infection time of patients (Figure 4B, Figure S3B). These results suggested RNA-editing of APOBECs were also impacted by the infectious of SARS-CoV-2, which turned back to change the mutation rates of C-to-U and A-to-G in SARS-CoV-2.
Influence of patient-derived immune-selection on the genetic diversity
To evaluate the influence of immune-selection on viral mutation, we measured the dynamic changes of the number of iSNVs within patient that stratified based on gender, age, illness severity and viral shedding time (Figure S4A, Table 1). Each sample was recalibrated based on symptom onset date (Figure 5A). The increased genetic diversity of iSNVs could be observed in all groups (Figure 5A). Female patients, mid-age (15-65y) patients, the patients with mild symptoms and the patients within 4-6 weeks viral shedding duration accumulated iSNVs much faster than other patients in each corresponding group (Figure 5A). Different slops and initial values were observed in these patient-groups, suggested a different fitness selection at the initial infection and the following infection stage along the day post symptom onset.
To further investigate the potential mutations influenced by patient-derived immune-selection, we compared the proportions population with or without iSNVs that frequently occurred in these patients. Among the 52 iSNVs sites that were highly occurred frequency in the population (shared by more than six patients), we identified 4, 5, 4 and 8 iSNVs significantly existed in server, old, long viral shedding time and male patients, compared to mild/moderate, young, short viral shedding time(<14 days) and female patients(Figure 5B). These iSNVs were distributed in ORF1ab, S, N, ORF6 and ORF8. Intriguingly, among the 52 high frequent iSNVs sites, we identified 27 iSNVs preferred occurred in severe patients; and 12 of them were hfSNPs sites. In contrast, the 25 iSNVs that showed preference in moderate patient did not overlapped with hfSNPs (Figure 5C, p-value < 0.001). This enrichment of hfSNPs were not observe in any other categories that stratified based on gender, age, and viral shedding time (Figure S4B), indicating that the propagation of these hfSNPs might under patient-derived immune-selection. Taken together, the accumulation and the different distribution of iSNVs in each group, hinted the possibility of fitness selections along the symptom onset.
Uneven purifying selection process from iSNVs to SNPs
To identify the mutations purified from iSNVs to SNPs, we compared the genomic site of iSNVs and SNPs in public database 6,7. Among 7037 iSNVs, 15.59% of iSNVs had been identified as SNPs before our observation period (May, 2020), and 11.28% of iSNVs was fixed from May, 2020 to December, 2020, and the rest iSNVs (73.13%) were still not fixed as SNPs (Figure 6A). Nonsynonymous iSNVs displayed a lower fixation rate than synonymous iSNVs (20.92% vs. 41.12%, p-value <0.001; Figure 6B), which is supported the model that nonsynonymous iSNVs rise to high frequency with an individual due to positive selection, but are less likely to become fixed in the population due to purifying selection. Then, we performed Fisher’s exact test to compare the proportion of fixed mutations in each gene, and S and ORF1ab had a lower fixation rates (21.04% and 20.56%, Figure 6C), either in nonsynonymous or synonymous sites (Figure S5A). Consistently, the nonsynonymous-to-synonymous ratio of iSNVs in ORF1ab, N and S (excluding D614G) were over than that estimated in identified SNPs, presenting as an uneven purifying selection in these genes. As disease progress, iSNVs fixation rates in nonsynonymous and synonymous sites were stable in the population (Figure 6D, Figure S5B), indicating a similar purifying selection with disease progress. Interestingly, we also observed that mutation in iSNVs might be earlier detected before they fixed in SNPs. For example, accumulative frequencies on C7051T alleles had been observed in our studies before May, whereas the first C7051T SNP was reported until June, 2020 (Figure 6F). All these data indicate iSNVs provide prospective and complement genetic information to illustrate the SARS-CoV-2 evolution.
Molecular function on variations in S protein before purifying selection
S protein drives the cellular entry binding on receptors and acts as a major determinant of host range, cell, tissue tropism, and pathogenesis of coronaviruses 24. Therefore, we further analyzed 21 of total 606 iSNVs sites identified in the coding region of S protein that caused 20 amino acids changes. 1) Nine were detected outside of RBD more than six individuals; including three linked iSNVs (A22298G, G22299A, and G22302A) with the substitutions of two amino acids were located ahead of RBD of protein (R246E and S247N, Figure 7A). 2) Eleven resided within the receptor-binding domain (RBD) or S1/S2 cleavage sites more than two individuals, including seven iSNVs were located in the receptor-binding motif (RBM). We compared the mutation sites of seven iSNVs in RBM in SARS-CoV-2 to the consensus sequencing in other animal (Bat and Pangolin) SARS-CoV-2-like coronavirus, and found that all these sites were heterogeneous (Figure 7A). The patients with these mutations have no contact history except for two patients with G485V (Table S4); no iSNVs emerged in the first time point of genome sequencing in patients with multiple time point (Figure S6A, Table S4); and no evidence supported these iSNVs were linked in the genome. Taken together, these data indicated that iSNVs in RBD seem to be generated by independently viral evolution.
To elucidate the effects of these mutations at the molecular level, we first used the SARS-COV-2 pseudovirus infection assay to assess the viral entry efficacy of 20 of the 22 S-protein mutants except S50L and M731V. Compared to reference strain, 18 of 20 tested mutants displayed a decreased (fold-change <0.25) or comparable efficacy (fold-change <4 or fold-change >0.25) of viral entry; only mutants R685G and D614G exhibited a similar level of increased viral entry efficacy (Figure 7B, fold-change >4). Since the residue L518V is far away from the binding interface between S protein and hACE2/CB6 and four mutants (N422K, E471K, G485V and Y505C) displayed a significant decrease (fold-change < 0.25) of viral entry efficacy, the other five mutants in RBD (V407L, L452Q, V483F, Q493H and Q498H) were test for the sensitivity to CB6 (Table S5). Wide type and D614G was included as control. Modest differences between these RBD variants and reference strain (within 4-fold) were observed in their susceptibility to CB6 (Table S5). It is worth mentioning that some variants including V483F, Q493H and Q498H were even more sensitive to CB6 compared with reference strain, which illustrate that CB6 antibody could still block the entry of virus with the existence of RBD mutation.
Then, we focused on the mutations within RBM and simulated the bonding of the corresponding mutants bound to human Angiotensin-converting enzyme 2, hACE2 25 and neutralizing antibodies CB6 26 using molecular dynamics simulations (Figure 7C). The simulation of the mutation L518V were not conducted as previously described. For comparison, we also simulated and inspected the binding of WT RBD to hACE2 and CB6, respectively. Cα root-mean-square deviation (Cα RMSD) of the complex of different mutant RBD bound to hACE2 varied within the similar ranges of the corresponding complex of WT RBD, suggesting that the 9 mutations may not induce dramatic conformational changes (Figure S6B). Then we looked into more structural details and found that different mutations could affect the binding to hACE2 in varied ways. For example, the residue 422 was mutated from the uncharged side chain N to the positive charged K in mutant N422K. Therefore, a much stronger hydrogen bond was formed between K422 and E406. In addition, strong repulsion between K422 and R403 and attraction between K422 and Y453 induced the broken of the hydrogen bonding between Y505 and E37 and Y453 and H34. As a result, both the numbers of hydrogen bond and contact area formed between the mutant RBD and hACE2 were reduced apparently (Figure 7D). As for the mutation G485V, since V has a relatively bulky side chain comparing to G, in order to reposition the bulky side chain, G485V mutation led to the escaping of F486 from the hydrophobic pocket formed by the residues including F28, L79, M82 and Y83 of hACE2. As a result, not only the contact area but also the numbers of hydrogen bond formed between mutant RBD and hACE2 was reduced. Especially, the hydrogen bond formed between Y505 of S protein and E37 of hACE2 which is important in binding was lost (Figure 7E). As a possible consequence, the binding of mutants N422K/G485V to hACE2 would be weakened. The binding free energy calculations using MM/GBSA method also indicated that the weakened binding of N422K/G485 and hACE2 (Figure 7C). The hydrogen bond and contact area changes of the other mutants were also inspected and discussed (Figure S6B). These data illustrated most of the mutants with observed iSNVs in study decreased the viral entry ability by the reduce of hydrogen bond and/or contact area formed between mutant RBD and hACE2. We also analyzed simulation results of the complex with the mutant RBD and the antibody CB6 (Figure S6C). Most of the mutants reduced the contact area a lot compared to WT, such that, the decreased binding affinities were observed which was manifested by the reduced number of hydrogen bond and higher binding free energy obtained in MM/GBSA calculations. Therefore, both the observation of mutations in pseudoviral infection assay and computation of their interaction displayed a weaken trends of viral infection in most of iSNVs identified in S protein.