Identification of 5 novel mutations
From the 474 sequences available in GenBank, a group of 100 SARS-CoV-2 genomes was found to have the nucleotide at position 25563 mutated from G to T (G25563T). The mutation was exclusive to US isolate sequences collected since March 2020 in GenBank (downloaded April 11, 2020). These mutants account for 21.1% (100/474) of all full genome sequences submitted to GenBank, or 27.9% (100/358) of the US full genome sequences in GenBank. Most of the G25563T isolates (94/100) also carried a C1059T mutation. Moreover, 16 of the G25563T isolates had an additional C27964T mutation, which accounts for 3.4% (16/474) of all full genome GenBank sequences, or 4.5% (16/354) of the US full genome sequences in GenBank. Among all 474 full genome sequences in GenBank, 48 collected from the US in March 2020 have a G29553A mutation. In addition, a mutation (C241T) was found in 30.8% (109/354) of US isolates, collected mostly in March 2020. The GenBank accessions of the isolates in which we found the novel mutations are shown in supplementary Table S1. Of the 5 mutations described above, 3 are substitutions in coding regions that result in amino acid changes (missense, or non-synonymous, mutations): C1059T, which changes amino acid 265 from T to I (T265I) in orf1ab; G25563T (Q57H) in orf3a; and C27964T (S24L) in orf8. The G29553A mutation is in a noncoding region upstream of orf10; the C241T mutation is in the 5’ untranslated region (5’UTR). To our knowledge, these mutations have not been described previously, and they were found only in isolates submitted mostly in and after March 2020 (including a few isolates in late February; Table 1). Representative images of the 5 mutations are shown in supplementary Figure S1.
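The correspondence between the nucleotide positions and the amino acid numbers reported above can be checked with simple arithmetic. The sketch below is illustrative only: the ORF start coordinates are an assumption taken from the NC_045512.2 reference annotation (they are not stated in the text), and the helper names are hypothetical.

```python
# Assumed 1-based ORF start coordinates from the NC_045512.2 reference
# annotation; these values are not given in the text.
ORF_STARTS = {"orf1ab": 266, "orf3a": 25393, "orf8": 27894}

# The three coding-region substitutions reported above: (ORF, genome position).
MUTATIONS = {
    "C1059T": ("orf1ab", 1059),
    "G25563T": ("orf3a", 25563),
    "C27964T": ("orf8", 27964),
}


def codon_position(nt_pos, orf_start):
    """Return (amino acid number, position within codon) for a 1-based nt.

    Only valid for positions read in the ORF's initial frame, i.e. upstream
    of the ribosomal frameshift site in orf1ab.
    """
    offset = nt_pos - orf_start
    if offset < 0:
        raise ValueError("position lies upstream of the ORF start")
    return offset // 3 + 1, offset % 3 + 1


def residue_numbers():
    """Map each mutation name to the amino acid number it affects."""
    return {
        name: codon_position(pos, ORF_STARTS[orf])[0]
        for name, (orf, pos) in MUTATIONS.items()
    }
```

Under these assumed coordinates, the computed residue numbers agree with the T265I, Q57H, and S24L designations used in the text.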
Proposed classification of the new SARS-CoV-2 isolates
Recently, SARS-CoV-2 isolates have been classified into 3 clusters (groups), namely groups A, B, and C, based on 3 mutations.2 The original isolates without mutations, collected in December 2019 from China, were classified as group A; isolates with the C8782T/Y and T28144C mutations were labeled group B (mutated from group A); group B isolates that additionally acquired G26144T were labeled group C. The isolates with the 3 nonsynonymous (missense) mutations identified in our study did not fall into group A, B, or C: they had many mutations relative to group A but lacked the marker mutations C8782T/Y and T28144C (group B) and G26144T (group C). To be consistent with the recent cluster (group) classification,2 we classified the isolates with novel amino acid changes as follows: isolates with C1059T (T265I) and G25563T (Q57H), which usually co-existed, are group D; isolates with the C27964T (S24L) change are group E.
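The grouping rules described above can be written as a small decision procedure. The following is a minimal sketch, not the pipeline used in this study: the function name, the representation of an isolate as a set of mutation names, and the order in which the markers are tested are all assumptions made for illustration.

```python
def classify_isolate(mutations):
    """Assign a group label (A-E) from an iterable of mutation names.

    Assumed precedence: E, then D, then C, then B, else A. Group E is tested
    first because the US group E isolates also carry the group D mutations.
    """
    muts = set(mutations)

    # Group B markers: C8782T/Y (Y is the IUPAC code for C or T) plus T28144C.
    has_b_markers = ("C8782T" in muts or "C8782Y" in muts) and "T28144C" in muts

    if "C27964T" in muts:   # S24L in orf8
        return "E"
    if "G25563T" in muts:   # Q57H in orf3a, usually together with C1059T
        return "D"
    if "G26144T" in muts:   # group C marker; treated here as sufficient alone
        return "C"
    if has_b_markers:
        return "B"
    return "A"              # no marker mutations
```

For example, `classify_isolate({"C1059T", "G25563T"})` returns `"D"`, and adding `"C27964T"` to the same set returns `"E"`. Testing the markers in a different order would change how isolates carrying markers of more than one group are labeled.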
The emerging geographic locations of group D and E SARS-CoV-2 isolates
The earliest SARS-CoV-2 sequences were collected from China in December 2019 (Table 1). Of the 19 early identified sequences, 12 were group A, 2 were group B, and 5 were group C. These data suggest that most of the isolates in the early stage of the outbreak were group A. They also reveal that mutations to groups B and C existed as early as December 2019. Similarly, Taiwan and India collected group A and B isolates in January 2020. In addition, Iran, Japan, Pakistan, Viet Nam, and Australia collected only group A isolates in January 2020 (Table 1). By the time the outbreak spread to Spain in February and March 2020, all isolates from that country in GenBank belonged to groups B and C. In the US, the SARS-CoV-2 isolates collected in the early stage (January 2020) were group A and group B, each accounting for about 50% of the isolates (9 of 17 group A and 8 of 17 group B). However, in March, the percentage of group A isolates dropped dramatically to 5.7% (17/300); isolates in group B and their variants in group C together accounted for 62% (179/300) of the isolates submitted from the country (Table 1). More strikingly, approximately one third of the US March 2020 isolates have at least 2 mutations identified in the current study. From the GenBank SARS-CoV-2 database (Table 1), we can see that the virus started mainly as group A, with a portion of variants mutated into group B and group C in December 2019. Thereafter, most isolates were group B and C. New mutants of groups D and E then started to emerge, accounting for nearly 40% of the US isolates in March 2020.
Although the GenBank data provide a fairly representative snapshot, they are not a complete picture. As of April 13, 2020, 8,126 sequences were available in the GISAID hCoV-19 (SARS-CoV-2) database. To validate our GenBank findings, we retrieved all complete or near-complete genomes (>29,160 nt) from the GISAID hCoV-19 database. We analyzed these 8,008 genomes with a focus on the new mutations (Table 2).
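The length criterion above (>29,160 nt) is straightforward to apply to a FASTA download. The sketch below shows one possible way to do so in plain Python; it is not the procedure used in this study. The function names are hypothetical, and counting every character of the sequence (including any ambiguous bases) toward the length is an assumption.

```python
MIN_LENGTH = 29160  # threshold quoted in the text; genomes must be longer


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def near_complete(records, min_length=MIN_LENGTH):
    """Keep records whose sequence is strictly longer than min_length.

    Assumption: every character counts toward the length, including
    ambiguous bases such as N.
    """
    return [(h, s) for h, s in records if len(s) > min_length]
```

A file could be filtered with `near_complete(read_fasta(open(path)))`, where `path` is a placeholder for the downloaded FASTA file.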
In the GISAID hCoV-19 database, 17.7% (1,417/8,008) and 0.6% (50/8,008) of the genomes were group D and E isolates, respectively (Table 2). In addition, we identified the novel C241T mutation in 55.3% (4,427/8,008) of the genomes. Consistent with our finding from the GenBank sequences, 43% of the US isolates belong to group D. Group D isolates are also widely distributed; they account for a substantial share of the isolates submitted to the GISAID hCoV-19 database from late February to March 2020: Canada (21.7%, 28/129), UK (6.4%, 175/2,726), France (53.9%, 110/204), Iceland (17.3%, 104/601), Australia (16.9%, 66/391), Netherlands (11.1%, 65/585), Belgium (12.1%, 39/322), Luxembourg (37.2%, 32/86), and Finland (40%, 16/40). Strikingly, no group D mutation was found in any of the SARS-CoV-2 isolates submitted by Italy (44) and Spain (105), although the outbreaks in those 2 countries were severe and began several weeks earlier than those in other parts of Europe and North America. We speculate that the group D mutations occurred in late February to early March 2020. Since group D isolates were found in multiple countries within a relatively short period of time, the mutation may have emerged in multiple countries independently. Among the 8,008 genomes in the GISAID hCoV-19 database, 50 (0.6%) had the C27964T (group E) mutation: 42 from the US, 2 from Canada, and 6 from Australia. Although this is a relatively small number, the mutation is in a coding region and results in an amino acid change, and is thus also worth attention. The 6 Australian group E isolates differ from those collected in the US in that they did not carry the group D mutations (G25563T and C1059T). Since the Australian group E isolates differ from those collected in the US and Canada, they possibly evolved in Australia independently.
The group B marker mutations (C8782T/Y and T28144C) and the group C marker mutation (G26144T) were found in 29.5%, 30.5%, and 6.3%, respectively, of 95 isolates collected before Feb 14, 2020.5 However, these mutations are absent from the genomes of the US group D and E isolates, suggesting that the US group D isolates evolved directly from the ancestral strains (group A). Another interesting finding of our study was the discovery of the G29553A mutation. It was found in 1.4% (110/8,008) of the GISAID SARS-CoV-2 genomes worldwide, or 6.9% (109/1,591) of the US SARS-CoV-2 genomes. The >100 G29553A isolates are, with one exception (Iceland), exclusively from the US. The mutation is in a noncoding region of the virus genome, and its significance is currently unknown.
The potential impact of the emergence of group D and E SARS-CoV-2 strains
The group D and group E defining mutations are found in orf3a and orf8, respectively, regions associated with the expression of accessory proteins. Accessory proteins are not required for viral replication but may affect viral virulence and pathogenesis.5 Orf3a is 72% conserved between SARS-CoV and SARS-CoV-2. Based on its function in SARS-CoV, it has been postulated that Orf3a is involved in cell apoptosis.7 Mutations in Orf3a in SARS-CoV-2 have also been shown to result in loss or change of epitopes, which may help the virus evade the host immune response.7 The missense mutations in these proteins may have clinical implications. First, patients who have already recovered from an earlier SARS-CoV-2 infection may have incomplete or reduced immunity when subsequently exposed to the newly emerging group D or group E SARS-CoV-2. Second, development of ELISA serologic testing must account for the potential epitope variability among different SARS-CoV-2 groups. The accuracy of serologic testing may be adversely affected by current and emerging mutations in these accessory proteins. Further study of the biochemical and clinical impact of the Q57H substitution in orf3a (group D) and the S24L substitution in orf8 (group E), especially on viral virulence, pathogenesis, and host immune response, is warranted. Most group D isolates also carried the missense C1059T mutation in orf1ab (T265I). Orf1ab encodes a replicase that is involved in viral transcription and replication.8 It would be important to further elucidate the role of the T265I substitution in viral replication.
Global efforts to increase sequencing of SARS-CoV-2 isolates will be critical for mutation monitoring and clinical correlation. In addition to supporting epidemiologic analysis, identifying new mutations in SARS-CoV-2 isolates may, among other efforts, shed light on vaccine development and help in evaluating current molecular testing protocols. Fortunately, none of the group D and E mutations that we identified were in the PCR targets of the protocols listed on the WHO website (WHO.int, accessed April 17, 2020).