DNA and RNA sequences are composed of four nucleotides, and can also be regarded as polymers of 16 dinucleotides. The odds ratio of a dinucleotide, defined as the ratio of its observed to its expected frequency, indicates its relative abundance [9]. The genome of SCoV2 (29,903 nucleotides [2], sequence number NC_045512) consists of 29.94% A, 32.08% T (T is used here instead of U for simplicity), 19.61% G and 18.37% C. Thus, the expected frequency of the CG dinucleotide in the viral genome is 3.60% (i.e. 19.61% × 18.37%). However, only 439 CGs are observed, giving an observed frequency of 1.47% (i.e. 439/29,902, since a sequence of 29,903 nucleotides contains 29,902 dinucleotide positions). Therefore, the odds ratio of CG in SCoV2 is 0.41 (i.e. 1.47%/3.60%). Furthermore, the odds ratio of CG in the open reading frames (ORFs) of the virus is 0.39, the lowest among the 24 coronaviruses surveyed (Fig. 1a and Extended Data Table 1). Because a codon is composed of three nucleotides, a dinucleotide (e.g. CG) can occupy three possible locations relative to the reading frame. These are designated (CG)12, (CG)23 and (CG)31, where the digits give the codon positions of the two nucleotides (in (CG)31, C is at position 3 of one codon and G is at position 1 of the next). We found that the odds ratio of (CG)23 in the ORFs of SCoV2 is as low as 0.25, while those of (CA)23 and (CT)23 are as high as 1.54 and 1.92, respectively (Fig. 1c). Moreover, the odds ratio of (CG)31 in the ORFs of SCoV2 is 0.50, while those of (AG)31 and (TG)31 are 1.52 and 2.64, respectively (Fig. 1d). These data strongly suggest that (CG)23 has been mutated into (CA)23 and (CT)23, and that (CG)31 has been mutated into (AG)31 and (TG)31.
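The whole-genome odds ratio calculation described here can be sketched in Python. This is a minimal illustration, not the authors' code: the function names `dinucleotide_odds_ratio` and `odds_ratio_from_counts` are hypothetical, and the sketch assumes overlapping dinucleotide windows and an expected frequency equal to the product of the two single-nucleotide frequencies, as in the worked example in the text. It does not cover the codon-position-specific ratios such as (CG)23, whose expected frequencies are not defined in this section.

```python
from collections import Counter


def dinucleotide_odds_ratio(seq: str, dinuc: str) -> float:
    """Observed/expected frequency of `dinuc` in `seq` (overlapping windows).

    Assumption: expected frequency = product of the two base frequencies.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA as DNA, as in the text
    n = len(seq)
    base_counts = Counter(seq)
    # A sequence of n nucleotides has n - 1 dinucleotide positions.
    hits = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    observed = hits / (n - 1)
    expected = (base_counts[dinuc[0]] / n) * (base_counts[dinuc[1]] / n)
    return observed / expected


def odds_ratio_from_counts(count: int, length: int, f1: float, f2: float) -> float:
    """Same ratio, computed from summary numbers instead of a sequence."""
    observed = count / (length - 1)
    expected = f1 * f2
    return observed / expected


# Numbers quoted in the text for CG in the SCoV2 genome:
# 439 CGs, 29,903 nucleotides, 18.37% C and 19.61% G.
cg_ratio = odds_ratio_from_counts(439, 29903, 0.1837, 0.1961)
print(round(cg_ratio, 2))
```

With the figures quoted in the text, the rounded result reproduces the reported odds ratio of 0.41. Applying `dinucleotide_odds_ratio` to the actual NC_045512 sequence would require loading that sequence separately.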
These mutations are tolerated because very few of them change the encoded amino acid. Specifically, four codons contain (CG)23: TCG, CCG, ACG and GCG, which code for serine, proline, threonine and alanine, respectively. In all four, mutating the G at codon position 3 into T, C or A does not change the encoded amino acid. As for (CG)31, there are 16 codons with C at position 3. If this C is mutated into T, all 16 codons keep their meaning, and if it is mutated into A, 9 of the 16 codons still keep their meaning. We therefore conclude that SCoV2 has evolved to reduce CG in its ORFs mainly by mutating the G of (CG)23 and the C of (CG)31 into A or T. Among these mutations, C-to-T (i.e. C-to-U in RNA) occurs at a very high frequency, probably because it is the simplest way to change a nucleotide (C becomes U after deamination). In addition, the odds ratio of (CC)23 is much lower than those of (CA)23 and (CT)23. This does not mean that the G of (CG)23 has not been mutated into C to give (CC)23. Rather, the low odds ratio of (CC)23 results from the high frequency with which (CG)31 is mutated into (TG)31 (Fig. 1c and 1d). These views are also supported by the codon usage bias in SCoV2 (Fig. 2), which shows that A/T-ended codons are used much more frequently than their synonymous G/C-ended codons. Moreover, each of the four codons containing (CG)23 has the lowest usage among its synonymous codons.
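The synonymous-mutation counts in this paragraph can be checked against the standard genetic code. The sketch below is illustrative only: the helper `is_synonymous` and the variable names are hypothetical, and the codon table is the standard code written in the conventional TCAG order.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def is_synonymous(codon: str, position: int, new_base: str) -> bool:
    """True if replacing the base at `position` (1-based) keeps the amino acid."""
    mutated = codon[:position - 1] + new_base + codon[position:]
    return CODON_TABLE[mutated] == CODON_TABLE[codon]


# Codons containing (CG)23: C at position 2 and G at position 3.
cg23_codons = sorted(c for c in CODON_TABLE if c[1:] == "CG")
cg23_all_silent = all(
    is_synonymous(c, 3, b) for c in cg23_codons for b in "TCA"
)

# Codons with C at position 3, i.e. candidates for the C of (CG)31.
c_ending = [c for c in CODON_TABLE if c[2] == "C"]
silent_c_to_t = sum(is_synonymous(c, 3, "T") for c in c_ending)
silent_c_to_a = sum(is_synonymous(c, 3, "A") for c in c_ending)

print(cg23_codons, cg23_all_silent)
print(len(c_ending), silent_c_to_t, silent_c_to_a)
```

Under the standard code this yields the four (CG)23 codons named in the text, with every third-position change silent, and 16 C-ended codons of which 16 remain synonymous after C-to-T and 9 after C-to-A.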
The odds ratios of CG in the ORFs of other coronaviruses are also very low (mean value = 0.50, Extended Data Figure 1 and Extended Data Table 1). This could have a profound effect on viral replication, because the ORFs of coronaviruses are translated by host ribosomes immediately after the viral RNA is released into the cytoplasm of host cells [10]. The translation of viral RNA is affected by two factors. First, host ribosomes must be recruited to the 5’-UTR (untranslated region) of the viral RNA to initiate translation. Second, stem-loops formed by the ORFs of the viral RNA must be disrupted during translation. In contrast to the ORFs, the 5’-UTRs of coronaviruses have quite high odds ratios of CG (mean value = 0.84, Extended Data Table 2). This would facilitate the formation of a stable secondary structure that could serve as the internal ribosome entry site (IRES) [11–13] for host ribosomes (Extended Data Figure 2). Meanwhile, the viral RNA beginning at the translation start site (TSS) forms a relatively unstable secondary structure, because its stem-loops are maintained by fewer hydrogen bonds (an A-T base pair has one fewer hydrogen bond than a C-G base pair). Variations in the stability of the 5’-UTR and TSS-to-end regions of viral genomes could plausibly determine the virulence of different viruses, because high stability of the IRES structure means high efficiency in initiating translation, and high stability of the TSS-to-end region means high energy consumption during translation. For example, both the 5’-UTR and the TSS-to-end region of human MCoV are highly stable (Table 1). High stability of the 5’-UTR means that host ribosomes can be recruited to translate viral RNA at a high rate, and high stability of the ORFs means that more energy is consumed to disrupt stem-loops in the viral RNA during translation. Thus, normal translation of host cell mRNAs is greatly affected, suggesting that MCoV is highly virulent.
The 5’-UTRs of human SCoV and SCoV2 are less stable than that of MCoV, meaning that host ribosomes are not recruited to initiate translation of viral RNA at a high rate. Moreover, the TSS-to-end region of SCoV2 is less stable than that of SCoV (Table 1), meaning that less energy is consumed by translation of its viral RNA. Thus, SCoV2 is less virulent than SCoV. This conclusion is consistent with estimates of the case fatality ratios of MCoV, SCoV and SCoV2, which are 35%, 9% and 2.4%, respectively [14]. Three other human coronaviruses also differ in the stability of their 5’-UTR and TSS-to-end regions (Table 1). Specifically, human CoV 229E has low stability in the 5’-UTR and high stability in the TSS-to-end region, whereas human CoV NL63 and HKU1 have medium and low stability, respectively, in both regions. Such variations indicate that these coronaviruses could also differ in virulence.
It seems that the strategy of “reducing CG content to increase gene expression efficiency” has also been adopted by cellular organisms. As we have observed, CG in both the ORFs and the inter-genic regions of bacteria, archaea, fungi, plants and animals has an average odds ratio of 0.81, and that in the introns of fungi, plants and animals is as low as 0.69. At the time of our previous report [15], we did not know why CG has such a low odds ratio in the surveyed organisms. Now, after analysing the coronaviruses, we propose that low CG content in cellular organisms is likewise an evolutionary consequence of increasing gene expression efficiency, because lower CG content means fewer hydrogen bonds between DNA double strands of the same length. Expression of a gene with low CG content saves energy not only in separating the DNA double strands during transcription but also in disrupting stem-loops formed by the mRNA during translation. Coincidentally, CG is the very dinucleotide associated with mutational hotspots and CpG islands in the DNA sequences of cellular organisms. A mutational hotspot is defined as a CG with a methylated C, in which the methylated C is frequently mutated into T through deamination [16–18]. A CpG island is defined as a region of DNA with less methylated C, and such a region generally contains actively expressed genes [19–21]. The relationship between CG reduction and these two important features of cellular DNA sequences is worthy of further investigation.
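The hydrogen-bond argument can be illustrated with a simple count. The sketch below assumes a fully Watson-Crick-paired duplex, with 2 hydrogen bonds per A-T (or A-U) pair and 3 per G-C pair; the function name `duplex_hydrogen_bonds` and the example sequences are hypothetical, and the count is only a rough proxy, not a thermodynamic stability calculation.

```python
def duplex_hydrogen_bonds(strand: str) -> int:
    """Hydrogen bonds in a fully paired duplex formed by `strand` and its complement.

    Assumption: 2 bonds per A-T/A-U pair, 3 bonds per G-C pair.
    """
    seq = strand.upper().replace("U", "T")
    at_pairs = seq.count("A") + seq.count("T")
    gc_pairs = seq.count("G") + seq.count("C")
    return 2 * at_pairs + 3 * gc_pairs


# Two hypothetical strands of the same length (8 nt):
high_gc = "GCGCGCGC"   # every pair is G-C
lower_gc = "ACGTACGT"  # half of the pairs are A-T

print(duplex_hydrogen_bonds(high_gc), duplex_hydrogen_bonds(lower_gc))
```

For strands of equal length, each G-C pair replaced by an A-T pair removes one hydrogen bond, which is the sense in which lower G and C content reduces the bonds that must be broken.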
If reducing hydrogen bonds is the goal of base mutation, why is CG, rather than GC, GG or CC, taken as the target for mutation? An examination of the number of silent mutations of each dinucleotide at the various codon positions reveals that CG has the highest number (47) among these four dinucleotides (Table 1 and Extended Data Table 3). This explains why CG is the best target for mutation. Although CT has the same number of silent mutations as CG, it is not taken as the target, because a T-to-C or T-to-G mutation would increase the number of hydrogen bonds between potential base pairs, which contradicts the goal of mutation.
Our present study provides a novel insight into the evolution of human SCoV2. It is evident that this virus has evolved to strongly reduce CG in its ORFs. This reduction is achieved mainly by mutating the G of (CG)23 and the C of (CG)31 into A or T (Fig. 1). Meanwhile, a C or G that is not part of a CG may also be mutated. For example, TCA in S-type SCoV2 has been mutated into TTA in L-type SCoV2 [22], and GTC and GGT in SCoV2 isolated from France have been mutated into TTC and GTT, respectively, in SCoV2 from Wuhan (China) [23]. Although the mutated C or G is not part of a CG and is not at codon position 3, these mutations do reduce the C or G content of the viral RNA. We therefore speculate that G+C content may be used as an indicator of the degree of evolution of different SCoV2 isolates (i.e. the lower the G+C content, the higher the degree of evolution). However, this speculation presumes that mutations reducing C or G predominate in SCoV2. To test this presumption, further investigations are needed to identify and analyse the detailed mutational events occurring in different SCoV2 isolates.
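The proposed G+C indicator is straightforward to compute. The following minimal sketch uses a hypothetical function `gc_content` and applies it only to the codon examples quoted above; comparing real isolates would require their full genome sequences, which are not given here.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Codon changes quoted in the text: (codon before, codon after).
codon_changes = [("TCA", "TTA"), ("GTC", "TTC"), ("GGT", "GTT")]

for before, after in codon_changes:
    # Each quoted change replaces one C or G with T, lowering G+C content.
    print(before, round(gc_content(before), 3), "->", after, round(gc_content(after), 3))
```

In each of the three quoted codon changes, the G+C fraction of the codon falls by one third, consistent with the direction of change the speculation relies on.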