InDels were identified at six mutation sites, named M1 to M6 (Table 1), in ORF3a, membrane (M), ORF7a, ORF7b, ORF8 and nucleocapsid (N), respectively (Fig. 1). Based on these InDels at six sites, betacoronavirus subgroup B (Materials and Methods) was classified into two classes: (1) the first class includes SARS-CoV-2 (from humans) and SARS2-like CoV (from animals), and (2) the second class includes SARS-CoV (from humans) and SARS-like CoV (from animals). As a variable genomic site, M1 has a length of 8 nt in betacoronaviruses of the second class and 11 nt in betacoronaviruses of the first class. M2, M3, M4 and M5 in betacoronaviruses of the first class have 3-nt deletions, which are complete codons, whereas M6 in betacoronaviruses of the first class has 6-nt deletion.
Table 1
Annotations of recombinant regions and mutation sites
CDS | Start | End | Length(nt) | Start | End | Length(nt) |
ORF1a | 266 | 13483 | 13217 | 266 | 13477 | 13212 |
ORF1b | 13768 | 21555 | 7788 | 13762 | 21549 | 7788 |
S | 21563 | 25384 | 3822 | 21556 | 25239 | 3684 |
ORF3a | 25393 | 26220 | 828 | 25248 | 26075 | 828 |
ORF3b/Nankai CDS | 25814 | 26281 | 468 | 25669 | 26136 | 468 |
E | 26245 | 26472 | 228 | 26100 | 26327 | 228 |
M | 26523 | 27191 | 669 | 26378 | 27043 | 666 |
ORF6 | 27202 | 27387 | 186 | 27054 | 27239 | 186 |
ORF7a | 27394 | 27759 | 366 | 27246 | 27611 | 366 |
ORF7b | 27756 | 27887 | 132 | 27608 | 27739 | 132 |
ORF8 | 27894 | 28259 | 366 | 27746 | 28114 | 369 |
N/ORF9a | 28274 | 29533 | 1260 | 28116 | 29375 | 1260 |
ORF9b | 28284 | 28577 | 294 | 28126 | 28419 | 294 |
ORF14 | 28734 | 28955 | 222 | 28576 | 28797 | 222 |
ORF15 | 29558 | 29674 | 117 | 29400 | 29516 | 117 |
RC1 | 21761 | 21796 | 36 | #AIHVSGTNGTKR | |
RC2 | 21971 | 22054 | 84 | #NDPFLGVYYHKNNKSWMESEFRVYSSAN |
RC3 | 22277 | 22348 | 72 | #QTLLALHRSYLTPGDSSSGWTAGA |
RC4 | 22874 | 22918 | 45 | #SNNLDSKVGGNYNYL | |
RC5 | 22964 | 23020 | 57 | #ISTEIYQAGSTPCNGVEGF | |
M1 | 26109 | 26119 | 11 | 25964 | 25974 | 11 |
M2 | 26449 | -3GAA | 26303 | -3GAA |
M3 | 27679 | -3GAG | 27530 | -3GAG |
M4 | 27882 | -3AAA | 27733 | -3AAA |
M5 | 27906 | -3ATT | 27757 | ATT* |
M6 | 29512 | -6AGCTTC | 29353 | -6AGCTTC |
Six mutation sites named M1 to M6 in the viral genomes of SARS-CoV-2 (GenBank: MN908947) and RmYN02 (GISAID: EPI_ISL_412977). M1 has a 11-nt length in the genomes of viruses from the SARS-CoV-2 cluster, while M1 has a 8-nt length in the genomes of viruses from the SARS-CoV and SARS-like CoV clusters. As for M2 to M6, the deleted nucleotides belong to the viral genome of SARS-CoV (GenBank: AY278489). * Since RmYN02 has a type 2 ORF8, it has the same allele at the M5 site as SARS-CoV. Besides the recombination event in ORF8, other recombination events occurred in five regions, named RC1 to RC5 in the S1 subunit of the gene S. # These amino acid sequences are encoded by RC1 to RC5 from SARS-CoV-2 (GenBank: MN908947). |
Recently, two betacoronavirus strains RmYN01 and RmYN02 (GISAID: EPI_ISL_412976 and EPI_ISL_412977) were detected from a bat (Rhinolophus malayanus) [5]. Since betacoronaviruses from subgroup B share many highly similar regions in their genome sequences (Fig. 1), it is very difficult to assemble them correctly using high-throughput sequencing (HTS) data from one sample. Therefore, EPI_ISL_412976 was only assembled into a partial sequence in a previous study [5]. However, the exact identification of viruses requires the complete genomes or even the full-length genomes. Using paired-end sequencing data, we reassembled these two virus genomes and obtained two full-length sequences to update EPI_ISL_412976 and EPI_ISL_412977 (Supplementary 1). Using 5' UTR barcodes, the betacoronaviruses RmYN01 and RmYN02 were identified as belonging to subgroup B. Using the InDels at six sites, RmYN01 and RmYN02 were further identified as belonging to the second and first classes, respectively. In addition, RmYN02 was also identified as a recombinant SARS2-like CoV strain.
A recombination event occurred in a gene named open reading frame 8 (ORF8), existing only in the genomes of betacoronaviruses from subgroup B. Based our analysis of 1265 betacoronaviruses, all of identified recombination events occurred in genes ORF1a, S and ORF8. Although many recombination events in ORF8 of betacoronaviruses have been reported in sequence analysis results, it is difficult to determine whether they were recombination events or small-size mutation (InDel & SNP) accumulation as most of them only occurred over very small genomic regions, excepting a few events (e.g., the 382-nt deletion in ORF8 [8]). We reported—for the first time—a recombination event at the whole-gene level in a bat, which had been co-infected by two betacoronavirus strains. Besides the recombination event in ORF8, other recombination events occurred over five regions, named RC1 to RC5 (Table 1) in the S1 region of the gene S. To initiate the coronavirus infection, the S protein encoded by the gene S need to be cleaved into the S1 and S2 subunits for receptor binding and membrane fusion.
Using a large segment spanning S2, ORF3a, ORF3b, envelope (E), M, ORF6, ORF7a, ORF7b, N, ORF9b, ORF14 and ORF15 (Table 1), phylogenetic tree 1 (Fig. 2A) showed that 21 betacoronaviruses from subgroup B (Materials and Methods) were classified into two major clades, corresponding to two classes classified using the InDels at six sites: (1) the first major clade, named the SARS-CoV-2 cluster, includes SARS-CoV-2 and SARS2-like CoV (from bats and pangolins), and (2) the second major clade includes two clusters—the SARS-CoV cluster including SARS-CoV and a few SARS-like CoVs (from bats or civets) and the SARS-like CoV cluster including all other SARS-like CoVs (from bats). Therefore, these 21 betacoronaviruses were classified into the SARS-CoV-2, SARS-CoV and SARS-like CoV clusters named clusters 1, 2 and 3, respectively.
Identified as belonging to the SARS-CoV-2 cluster, RmYN02 was thought to have a 3-nt deletion at the M5 site; however, it did not (Table 1). This led us to identify three types of ORF8 genes in the betacoronaviruses from subgroup B. Using only ORF8, phylogenetic tree 2 (Fig. 2B) also shows that the 21 betacoronaviruses were classified into the SARS-CoV-2, SARS-CoV, and SARS-like CoV clusters. However, this tree did not reflect the evolutionary relationship of the three clusters due to the recombination events of ORF8. Types 1, 2, and 3 ORF8 genes belong to the genomes of viruses from clusters 1, 2, and 3, respectively. Type 1 ORF8 genes possess low nucleotide identities (below 70%) to type 3 ORF8 genes, while type 2 ORF8 genes are so highly divergent from types 1 and 3 ORF8 genes, they cannot be aligned to calculate nucleotide identities between type 2 ORF8 genes and types 1 or 3 ORF8 genes. RmYN01 belongs to cluster 3 and has a type 3 ORF8, while RmYN02 belongs to cluster 1 but has a type 2 ORF8. Thus, RmYN02 was identified as a recombinant SARS2-like CoV strain.
Comparing phylogenetic tree 1 using large segments with tree 2 using only ORF8 genes, almost all viruses were consistently classified into the same clusters in both trees, except RmYN02 (GISAID: EPI_ISL_412977) and the SARS-like CoV strain WIV1 (GenBank: KF367457). WIV1 was classified into cluster 2 in tree 1 but cluster 3 in tree 2, as WIV1 has type 3 rather than type 2 ORF8 (Fig. 2B). WIV1, isolated from Chinese horseshoe bats (Rhinolophus sinicus), was considered the most closely related to SARS-CoVs in humans and civets [9]. A previous study predicted the immediate ancestor of SARS-CoV based on the following hypothesis: the ancestor of civet SARS-CoV is a recombinant virus with ORF8 originating from greater horseshoe bats (Rhinolophus ferrumequinum) and other genomic regions originating from different horseshoe bats [6]. Although whether these recombination events occurred in bats or civets remains unknown [6], analysis of all recombination events in S1 subunits of all betacoronaviruses may help to find an answer (See below).
As phylogenetic tree 1 was built without recombinant regions (i.e., ORF1a, S1 and ORF8), it revealed that SARS-CoV-2 is most closely related to the well-known strain RaTG13 (GenBank: MN996532) isolated from intermediate horseshoe bats (Rhinolophus affinis). Phylogenetic tree 2 shows that ORF8 of RaTG13 has the highest identity to that of SARS-CoV-2 (Fig. 2B). RmYN02 was classified into cluster 1 in tree 1 but cluster 2 in tree 2. This suggested that the recombination events happened across the SARS-CoV-2 and SARS-CoV clusters. In addition, all betacoronaviruses from pangolins (Manis javanica) used in the present study were identified as belonging to the SARS-CoV-2 cluster. However, pangolins are unlikely to be the intermediate host(s) of SARS-CoV-2 for two reasons: (1) betacoronaviruses from pangolins do not contain a junction FCS [3], and (2) the strains (e.g., GISAID: EPI_ISL_410721) are farther from SARS-CoV-2 than RaTG13 in the phylogenetic tree 1 (Fig. 2A).
Guided by joint analysis of both molecular function and phylogeny, we conducted further research on the biological functions of ORF8. RmYN01 and RmYN02 were simultaneously detected in a bat, providing a special opportunity to compare their copy numbers. As RmYN01 and RmYN02 have type 3 and type 2 ORF8 genes, respectively, the difference between copy numbers of RmYN01 and those of RmYN02 can be estimated by their relative RNA abundances to test a previous hypothesis that type 2 ORF8 genes increase replication efficiency of viruses. Aligning RNA-seq data to the genomes of RmYN01 and RmYN02, our calculation showed that the RmYN01 genome was covered 99.85% of its length with an average depth of 32.89 (Fig. 1A), while the RmYN02 genome was covered 99.89% with an average depth of 298.99 (Fig. 1B). The RNA abundance of RmYN02 is 9 times that of RmYN01. Although this does not rule out other factors that contribute to the differences in RNA abundances of the two virus strains, these results combined with other evidence [6] suggest that type 2 ORF8 genes increase replication efficiency of RmYN02.
By analysis of all recombination events in S1 subunits of all betacoronaviruses, we obtained the following results: (1) there were a few genotypes of each recombinant region (RC1 to RC5); and (2) and betacoronaviruses within the SARS-CoV-2, SARS-CoV and SARS-like CoV clusters had the same genotypes. These results suggested that the occurrences of all recombination events in bats were prerequisite for speciation of SARS-CoV and SARS-CoV-2. Further analysis showed that two recombination regions (RC4 to RC5) are localized in the receptor binding domain (RBD) of S1 (Fig. 3), while three other recombination regions (RC1, RC2 and RC3) are localized in the N-terminal domain (NTD). Almost all the secondary structures of five protein segments encoded by RC1 to RC5 are disordered, which are responsible for protein protein interaction (PPI). This suggested that the recombination of RC1 to RC5 improve the host adaptability of betacoronaviruses by enhancing interaction of RBD and NTD with their receptors, exhibiting that the positive selection of S was particularly strong [6]. Since both RBD and NTD have similar recombination events in their PPI regions, we proposed that NTD has a specific receptor as RBD has angiotensin-converting enzyme 2 (ACE2). Thus, the S1 subunit of SARS-CoV-2 may have more than one specific receptor (Fig. 3) as gp120 of HIV has the cluster of differentiation 4 receptor (CD4) and the C-C chemokine receptor 5 (CCR5).
Comprehensive analysis and reuse of data from different sources are necessary to determine the other receptor/s of SARS-CoV-2. A previous study identified two genetic susceptibility loci (rs11385942 at locus 3p21.31 and with rs657152 at locus 9q34.2) in COVID-19 patients with respiratory failure using genome-wide association analysis [10]. The locus 3p21.31 was associated to six genes SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6 and XCR1. However, this study only focused on the further analysis of the locus 9q34.2 and confirmed a potential involvement of the ABO blood-group system. These researchers did not notice that three chemokine receptors CCR9, CXCR6 and XCR1 merit further investigation as candidates of SARS-CoV-2 receptors. The analysis of bulk RNA-seq data showed high expression of CCR9 and XCR1 in thymus and CXCR6 in T cells, compared to other tissues and cell types [10]. In particular, the thymic cells were consistently negative for ACE2 and many CoVs can infect thymus [11]. By investigating interaction of three protein segments encoded by RC1 to RC3 in NTD (Table 1) with CCR9, CXCR6 and XCR1, we found that CCR9 is the most possible candidate among three chemokine receptors. However, the final determination of the other receptor/s of SARS-CoV-2 need more calculation and experiments on candidates at the whole-genome level. Our study did not rule out the possibilities of non-receptor proteins binding to NTD.
Rapid recombination and mutation, a strong first IRESs, a junction FCS and a type 2 ORF8 are four main factors contributing to transmission, virulence and host adaptability of coronavirus. By analysis of these four factors in 1,265 complete genomes of betacoronavirus, we concluded: (1) rapid recombination and mutation of viral genomes (particularly in the S gene) provide coronavirus the strong ability of cross-species transmission; (2) betacoronavirus jumps from bats to a new host species due to the host range expansion caused by recombination and mutations; (3) the ancestor of betacoronavirus (and even coronavirus) had a strong first IRES and at least one (most likely two) junction FCS, allowing for sustained transmission that can lead to significant outbreaks; (4) after jumping into a new host (e.g. hedgehog), betacoronavirus will be attenuated to spread widely in the host population by loss of junction FCSs or strong first IRESs and thus, it can not jump to another host species (e.g. human) if not triggered by new genetic events (e.g. acquiring a type 2 ORF8). The ancestor of betacoronavirus (and even coronavirus) had most likely two junction FCSs, which were close to N and C terminals. SARS-CoV-2 (GenBank: MN908947) has the junction FCS "RAAR" located at position 685 in the S protein, which is noted as R685. Although SARS-CoV-2 lost a junction FCS by substituting "KNTQ" (R779) for "RNTR", this FCS was inactivated in the alpha secondary structures. Due to the host range expansion, SARS-CoV-2 infected intermediate hosts (not necessary) and then humans before it attenuated. Originated from the same ancestor of SARS-CoV-2, SARS-like CoVs were attenuated by inactivation of the junction FCS "RNTR" (R761) and loss of the junction FCS (e.g R667) close to N terminal. By acquiring a type 2 ORF8, SARS-CoV infected civets and then humans before being attenuated. After a period, a attenuated variant of SARS-CoV with a 29-nt deletion in the ORF8 gene was reported [7]. Originated from the same ancestor, the betacoronavirus strains RaTG13, RmYN02 and betacoronaviruses from pangolins were attenuated variants of SARS-CoV-2 by loss of junction FCSs [3]. In the betacoronavirus subgroup C, middle east respiratory syndrome coronavirus (MERS-CoV) have two junction FCSs (Fig. 4). The first one "RSTR" (R694) is likely to be dysfunctional, resulting in attenuation, as there is a disulfide bond cross the FCS R694. However, the second junction FCS "RSVR" (R751) is functional. Due to the host range expansion, MERS-CoV infected camels and then humans before it attenuated. Originated from the same ancestor of MERS-CoV, other MERS-like CoVs (e.g. hedgehog coronavirus) were attenuated by loss of junction FCSs. In the betacoronavirus subgroup A (Fig. 4), all strains were attenuated from the ancestor of betacoronavirus by loss of strong first IRESs. Since almost all betacoronaviruses of the subgroup A have junction FCSs, they (e.g. OC43 and HKU1) can still infect human with attenuated virulence. In addition, the average arginine (R) pencentage of S proteins from betacoronaviruses of the subgroup A except mouse hepatitis virus (MHV) 2.63% is significantly lower than those from MHV, betacoronaviruses of the subgroup B, C and D (3.34%, 3.33%, 3.32% and 3.33%). This indicated that accumulated mutations caused attenuation by loss of arginine residues, since arginine residues usually reside at protease cleavage sites. In the betacoronavirus subgroup D, HKU9 does not have a junction FCS but have a strong first IRES. Therefore, HKU9 has the potential ability to cause an epidemic.