Characteristics of SARS-CoV-2 isolates, structural genes, and protein sequences
The 3090 SARS-CoV-2 isolates harbor 16 E alleles, 40 M alleles, 131 N alleles, and 173 S alleles. These alleles correspond to 10, 14, 88 and 99 different amino acid sequences of E, M, N, and S proteins, respectively. Protein sequence comparison of the WH01 isolate with a SARSr-CoV isolate, bat-SL-CoVZC45, shows a similarity of 100% (75/75) in E protein, 98.65% (219/222) in M protein, 94.27% (395/419) in N protein, and 80.06% (1171/1273) in S protein. These results imply a close homology between SARS-CoV-2 and bat SARSr-CoV, particularly with E and M proteins. Alternatively, it indicates an extreme conservation of E and M proteins and their functions among coronaviruses[14].
Further analysis revealed that there are 14 single nucleotide polymorphisms (SNPs) of E gene, but only 5 single amino acid polymorphic (SAP) loci on the E protein. A similar result was observed for the M gene and protein, with 37 SNPs and 9 SAPs. In contrast, 126 SNPs and 75 SAPs were detected on N gene and protein, respectively. S protein, the most important component that mediates virus entry by receptor binding and membrane fusion and determines the infectivity of SARS-CoV-2 [15], harbors 155 SNPs and 90 SAPs. Considering the size of nucleotides and amino acid residues, N gene has the maximum sequence variability with 10.02% (126/1257) SNPs and 17.90% (75/419) SAPs, respectively. However, S gene has the most pairwise nucleotide differences among the four structural genes, indicating a relatively greater genetic diversity (Table 1).
Table 1. Genetic diversity indices of the 4 structural genes of SARS-CoV-2 isolates
Gene
|
Sequence, n*
|
Sequence length
|
h
|
π
|
S
|
θ
|
SD of θ
|
ƞ
|
E
|
2928
|
228
|
16
|
0.00012
|
14
|
0.00475
|
0.00232
|
15
|
M
|
2891
|
669
|
40
|
0.00018
|
37
|
0.00665
|
0.00145
|
40
|
N
|
2253
|
1260
|
131
|
0.00056
|
126
|
0.01081
|
0.00432
|
130
|
S
|
2339
|
3825
|
173
|
0.00075
|
155
|
0.00753
|
0.00093
|
169
|
h,Haplotypes,
π, Nucleotide diversity
S, Polymorphic sites
θ, Theta (per site) from S, population mutation ration
Ƞ, Total number of mutations
SD, Standard deviation
* Some bases of SARS-CoV-2 genomic sequences are not exactly identified; thus, the number of gene sequences (n) is less than 3090.
Distinct phylogenetic patterns of the four structural genes
Phylogenetic analysis revealed that all SARS-CoV-2 E proteins form three clusters. Similar to E protein, a phylogenetic tree of SARS-CoV-2 M proteins is formed from three clusters with few branches (Fig. 1a and b). Results suggest that both E and M genes may display a relatively high conservation during coronavirus evolution. In contrast, SARS-CoV-2 N and S proteins show distinct phylogenetic patterns compared with E and M proteins. Four and three main phylogenetic clusters with various branches are identified in the N and S proteins, respectively (Fig. 1c and d). Given the crucial roles of N and S proteins in virus transcription, assembly, and entry to host cells, the influence on infectivity based on whether SARS-CoV-2 isolates harbor different N and S variants (such as those clustered into different clades) remains unknown.
Purifying selection drives evolution at a whole structural gene level of SARS-CoV-2 during human to human transmission
Although numerous studies have demonstrated that recombination plays an important role on the emergence of SARS-CoV-2 [16-18], how this virus evolves during its global transmission has not yet been profiled. Therefore, we first analyzed intragenic recombination events of each structural gene using RDP4. Results indicate that no recombination events occurred among the alleles of each gene (data not shown). Potential recombination events were also assessed by reticulate network tree—phi test, using SplitsTree4 software. Although some internal nodes are present in N and S alleles, no clear evidence for recombination can be validated for each gene via phi test (P > 0.05) (Fig. 2). This result indicates a relatively stable state of SARS-CoV-2 during transmission, though a possible genetic interaction of different isolates may have occurred when it became a global pandemic [19, 20]. Tajima’s D, and Fu and Li’s D* and F* statistics were calculated to examine the mutation neutrality hypothesis of the four structural genes of SARS-CoV-2. Results reveal that the evolution of all four genes does not support population neutrality, but rather favors purifying selection (Table 2; Additional file 1: Figure S1). The average of all pairwise dN/dS ratios (ω) among alleles of each structural gene of SARS-CoV-2 is 0.5443 in E, 0.1562 in M, 0.07978 in N, and 0.4980 in S, respectively. In aggregate, these results suggest that at the whole gene level, inconsistent purifying selection is the main evolutionary force at work given the experimental conditions (Table 2; Additional file 1: Figure S1).
Table 2. Summary of neutrality of the four structural genes in SARS-CoV-2 isolates
Gene
|
Tajima’s D
|
Fu and Li’s D* test
|
Fu and Li’s F* test
|
dN
|
dS
|
dN/dS (ω)
|
Selection
|
E
|
-2.29974, P<0.01
|
-3.18477, P<0.02
|
-3.38505, P<0.02
|
0.006836
|
0.1256
|
0.5443
|
Purifying selection
|
M
|
-2.74611, P<0.001
|
-5.64276, P<0.02
|
-5.50855, P<0.02
|
0.001294
|
0.008296
|
0.1562
|
Purifying selection
|
N
|
-2.87598, P<0.001
|
-9.67153, P<0.02
|
-7.95879, P<0.02
|
0.000251
|
0.003146
|
0.07978
|
Purifying selection
|
S
|
-2.87646, P<0.001
|
-11.01171, P<0.02
|
-8.59037, P<0.02
|
0.000609
|
0.001223
|
0.4980
|
Purifying selection
|
For Tajima's D, Fu and Li’s D* and F* tests, the cutoff was set at 0.05.
The SARS-CoV2 S gene is operated on by positive selection at a definitive codon located at the C-terminal portion of the S1 subunit, and another potential codon located at the signal sequence
Guo et al. (2020) reported that the S gene of SARSr-CoV populations in their natural host, Chinese horseshoe bats, has evolved through positive selection at some codons[13]. As mentioned above, at the whole gene level, purifying selection is the main force driving the evolution of studied genes. Whether positive selection pressure accelerates the diversification of the structural genes of SARS-CoV-2 remains unclear. We therefore used codon-substitution models to estimate the ratio of nonsynonymous over synonymous substitutions (dN/dS), also known as “ω”. These codon substitution models include M0 (one‐ratio, one ω for all sites ), M1a (nearly neutral, two classes of sites, ω0 < 1 and ω1 = 1), M2a (positive selection, allows three site classes including ω0 < 1, ω1 =1, and ω2 > 1), M3 (discrete, allows unconstrained discrete distribution of ω among sites), M7 (β, fit to a β distribution for ω among sites) and M8 (β and ω > 1, fit to a beta distribution with an extra rate that allows ω > 1) and can be typed into a null model group (M0, M1a, M7), and a positive selection model group (M3, M2a, and M8) [21]. The role of recombination in the polymorphisms of the four analyzed genes is excluded because no intragenic recombination was detected (Fig. 2). By using a Maximum Likelihood (ML) method, we did not find any codon of either the E or M gene subject to positive selection (data not shown). Although a potential positive selection site in the N gene, 208A, is identified using an M3 model, but the LRT P-value is more than 0.05 (M0 Vs. M3) and the result is not validated by other models including the M8 model, suggesting that evidence for N gene positive selection is limited (Additional file 2: Table S1). For the S gene, we found the average ω to be 0.37199, calculated using the one‐ratio (M0) model of the codeML software package, suggesting that purifying selection operates S gene evolution during SARS-CoV-2 transmission among humans. In three LRTs, all alternative models (M3, M2a, and M8) are significantly better fits (P < 10-4) than relevant null models (M0, M1a, and M7), indicating that some S sites were subjected to strong positive selection (ω = 18.22175–20.61283) (Table 3). A single positive selection site (614D) is identified in the S gene with a posterior probability of 1.000 in all three models [22], clear evidence for this site experiencing positive selection while the virus transmits between human hosts. This result is furthermore validated using internal fixed effects likelihood (IFEL) and evolutionary fingerprinting methodology, implemented using the HyPhy software package (Fig. 3) [23-25]. To our surprise, the positive selection site is not located at the receptor binding domain (RBD) or receptor binding motif (RBM), as we anticipated, which generally play the most integral role in virus-receptor interaction and virus entry into host cells [26]. This result suggests that relative genetic stability of this motif benefits virus survival. Interestingly, the site under positive selection pressure always has a D614G (for the S gene is 1841A>G) mutation, implying that such a mutation may enhance virus adaptability in human hosts. Another potential positive selection site at codon 5 was also identified, and a L5F mutation (for the S gene is 13C>T) was always found, with posterior probabilities greater than 0.95, 0.93 and 0.92 (critical values) calculated by M3, M2a and M8 models (Table 3), respectively. A similar result was also confirmed using an evolutionary fingerprinting method (Additional file 3: Figure S2).
Table 3. Log-likelihood values and parameter estimates for the SARS-CoV-2 S gene sequences
Model
|
Ln L
|
Estimates of parameters
|
Model compared
|
LRT P-value
|
Positive sites
|
M3 (discrete)
|
-6766.339162
|
p0=0.96797, p1=0.02883, p2=0.00320
ω0=00.26126, ω1= 2.70530, ω2=20.61283
|
M0 vs. M3
|
0.000000001
|
5 L 0.958*, 28 Y 0.850, 221 S 0.901, 614 D 1.000***, 677 Q 0.891
|
M0 (one ratio)
|
-6790.072925
|
ω0=0.37199
|
Not Allowed
|
M2a(selection)
|
-6766.432802
|
p0=0.81731, p1=0.17872, p2=0.00397
ω0=0.17504, ω1=1.00000, ω2=18.76936
|
M1a vs. M2a
|
0.000004385
|
5 L 0.9258, 28 Y 0.812, 221 S 0.832, 614 D 1.000***, 677 Q 0.828
|
M1a (neutral)
|
-6778.770190
|
p0=0.70461, p1=0.29539
ω0=0.04395, ω1=1.00000
|
Not Allowed
|
M8(β & ω)
|
-6768.829411
|
p0=0.99578, p=0.40368, q=0.82224
p1= 0.00422, ω= 18.22175
|
M7 vs.M8
|
0.000030400
|
5 L 0.931, 28 Y 0.817 ,221 S 0.831, 614 D 1.000***, 677 Q 0.828
|
M7(β)
|
-6779.230494
|
p=0.00857, q=0.02623
|
Not Allowed
|
LnL is log likelihood; ω is ratio of dN/dS, LRT P-value indicates the value of the chi-square test; Parameters indicating positive selection are presented in bold; Positive selection sites were identified by the Bayes empirical Bayes (BEB) methods under M8 model. The posterior probabilities (p) ≥ 0.80 are shown, (p) ≥ 0.95 (p) ≥ 0.99, and (p) = 1.000 are indicated by *, ** and ***, respectively.
The evolutionary relationship of S gene alleles with and without D614G and L5F mutations
Phylogenetic trees of S gene alleles were derived to test the evolutionary relationship among alleles with and without the D614G mutation. As shown in Fig. 4a, the 173 alleles of the S gene clustered into four clades. Alleles with a D614G mutation could be found in all the 4 clades. A dominant clade contains 79 out of total 85 alleles with this mutation. The remaining 6 mutated S alleles were distributed among the other 3 clades. This result is supported by a parsimony network of S gene alleles created using PopART software (http://popart.otago.ac.nz) [27]. Two central alleles (representative virus isolates are WH01 and GZMU0019) and associated alleles around them form a star-scattering network, suggesting that the S gene may have two potential origins (Fig. 4b). All S alleles with a D614G mutation are closely related with a few point mutations, and comprise a scattered star structure, suggesting the expansion of the SARS-CoV-2 population with a D614G mutation on the S gene. In contrast, alleles of the N gene suggest a single ancestor, as analyzed by parsimony networking, though 3 phylogenetic clades were identified (Additional file 4: Figure S3).
A total of 5 alleles with L5F mutations were discovered, and all of them were located within a single clade, accounting for 83.33% of all alleles in the clade (Additional file 5: Figure S4a). Further parsimony network analyses revealed that S alleles with a L5F mutation are not closely related, but rather are distributed in both WH01 and GZMU0019 haplotype groups (Additional file 5: Figure S4b). No scattered star structure could be formed, indicating that the L5F mutation might arise from independent origins, unlike the D614G mutation.
Frequency of S alleles with D614G mutations increases in SARS-CoV-2 isolates during human to human virus transmission
Considering that mutations in a positive selection site should be beneficial to the survival of the individuals carrying the mutation, we postulate that the D614G (1841 A>G) mutation may increase the spread of SARS-CoV-2. Evidence is obtained from the haplotype network analysis aforementioned of S alleles (Fig. 4b). S gene haplotypes (alleles) with a D614G mutation (representative isolate, GZMU0019) have evolved many subtypes and comprise a star structure with GZMU0019 in the center. This starburst pattern with one haplotype in the center and many other haplotypes surrounding the central haplotype suggests a signature of rapid population expansion [28]. To further study whether SARS-CoV-2 isolates with a D614G mutation have a selective advantage in survival during transmission among humans we calculated the frequency of S alleles carrying a D614G mutation in each week from collected SARS-CoV-2 isolates from December 24, 2019 through April 20, 2020 (17 epidemiological weeks). Detailed information on these isolates including collection date, collection region, and accession or biosample number is available in Additional file 6: Table S2, and Additional file 7: Table S3.
In 173 S gene alleles, 85 carried a D614G mutation, accounting for 49.13% of the total. Similarly, 47 of 99 S proteins were found to carry a D614G mutation, accounting for 47.47% of the total. The first two isolates of our collected data, GWHABKF00000001 and WH01 (isolated in December 24, 2019 and December 26, 2019, respectively), carried 614D mutations in the S protein. The first SARS-CoV-2 isolate with a D614G mutation was isolated from a patient with COVID-19 on February 5, 2020 (week 7 in our dataset). After that, except for weeks 9 and 10 (possibly due to the small number of samples and sampling deviation), an increasing trend in the proportion of isolates carrying the D614G mutation in the S protein was noteworthy. In week 17, the last week of our dataset, 91.11% of SARS-CoV-2 isolates carried this mutation (Fig 5a; Additional file 6: Table S2). Further analyses revealed that the frequency of the D614G mutation in the S gene steadily increased when data from weeks 6 through 17 were combined (Fig 5b; Additional file 6: Table S2). To exclude the influence of sample size on the result (in some weeks, only 4–6 isolates were collected) we reorganized the dataset by taking both sample size and sampling time into account. Various panels of 200–300 isolates were studied, with similar results observed (Fig. 5c and d; Additional file 7: Table S3). Taken together, these results suggested that SARS-CoV-2 isolates with a D614G mutation might increase the virus’s ability to transmit and hence rapidly spread around the world.
D614G mutations in the S gene may destabilize the S protein trimer and promote receptor binding and membrane fusion
We found that the D614G mutation is located at the subdomain 2 (SD2) at the C-terminus of RBD and close to the two potential cleavage sites between S1 and S2 [29] (Fig. 6a). Considering positive selection is usually beneficial to the survival of the individual carrying the mutation, we speculate that the D614G mutation may facilitate structural conformation change to promote receptor binding or membrane fusion[8, 30], and in turn, improving infectivity. From the latest cryo-electron microscopy-determined (cryo-EM) structure of the SARS-CoV-2 S protein, the negatively charged sidechain of D614 points toward the positively charged sidechain of K854 from the neighboring monomer (Fig. 6b) [29] . The distance between the closest atoms of the two residues is 2.6 Å, which is an optimal distance to form a salt bridge (Fig. 6c). From the modelled structure with a D614G mutation, the distance is increased to 5.2 Å (Fig. 6d), which would potentially abolish the salt bridge and destabilize the integrity of the S trimer in wild type. It has been reported that human receptor ACE2 binds to an “open” conformation of the S protein, where RBD moves away from the core structure, and exposes its receptor binding surface. The entire S trimer then undergoes a series of dramatic conformation changes, including cleavages between S1 and S2, disassociation of S1, and post-fusion transformation of S2 [31, 32]. Mutations at cleavage sites and adding internal crosslinks to the S trimer would keep the protein in a stable and “closed” conformation, where the receptor binding surface of RBD is inaccessible [29, 33]. Therefore, we hypothesize that the highly transmissible D614G mutation, driven by positive evolutionary selection, promotes the accessibility of RBD by eradicating a critical salt bridge between S protein monomers, which subsequently triggers membrane fusion upon ACE2 binding.