Sequence comparison and SNP analyses of related SARS-CoV-2 and RaTG13 genomes
We compared the frequency of mutations in SARS-CoV-2 (accession #MN908947) using the bat RaTG13 genome (MN996532.1) as a reference. The SARS-CoV-2 and RaTG13 genomes were 29903 and 29855 nt long, respectively. Compared to RaTG13, there were 7 insertions and 2 deletions in the SARS-CoV-2 genome, the longest insertion being a 12-nt GC-rich region between positions 23582 and 23583 of the bat sequence. There were 1136 SNPs between SARS-CoV-2 and RaTG13, corresponding to about 4% divergence, consistent with a previous study (Zhou et al. 2020). A summary of the mutation analysis is given in Supplementary Table S2. Characteristics of the single nucleotide variation (SNV) are shown in Fig. 1a. The C>U and reverse U>C transitions were by far the most frequent mutations, together accounting for 86% of all SNVs. Comparison with other isolates of human coronavirus from 2019/2020 yielded essentially the same results, since there is little or no variation between these genomes (see further below).
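The tally of substitution classes summarised in Fig. 1a can be illustrated with a minimal sketch. It assumes the two genomes have already been aligned pairwise and are supplied as equal-length strings with `-` for gaps. The function names, the toy sequences and the convention that a change is written as reference base > query base are illustrative assumptions, not the pipeline used in this study.

```python
from collections import Counter

def count_substitutions(ref_aln, qry_aln):
    """Count substitution classes (e.g. 'C>U') between two aligned sequences.

    Assumes both strings are rows of one pairwise alignment (equal length).
    Columns containing a gap ('-') or an ambiguous base are skipped.
    Changes are written reference>query; this direction is an assumption.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    bases = set("ACGU")
    # Treat DNA-style records (T) as RNA (U) so that classes read C>U, U>C.
    ref = ref_aln.upper().replace("T", "U")
    qry = qry_aln.upper().replace("T", "U")
    counts = Counter()
    for r, q in zip(ref, qry):
        if r in bases and q in bases and r != q:
            counts[f"{r}>{q}"] += 1
    return counts

def pair_share(counts, a="C", b="U"):
    """Fraction of all substitutions that are a>b or b>a."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts[f"{a}>{b}"] + counts[f"{b}>{a}"]) / total

# Toy alignment (hypothetical sequences, not taken from either genome).
toy_ref = "AUGCCUGA-CU"
toy_qry = "AUGUCCGAACA"
toy_counts = count_substitutions(toy_ref, toy_qry)
toy_share = pair_share(toy_counts)
```

On real data the two strings would be the aligned RaTG13 and SARS-CoV-2 rows, and `pair_share` would give the combined C>U and U>C fraction among all substitutions.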
The spike protein harbours a domain that binds the ACE2 receptor on the cell surface and is believed to be important for host-virus interactions. Within a stretch of 20 amino acids (positions 486–501 in the protein sequence), differences in five amino acids were previously identified between the SARS-CoV-2 and RaTG13 peptides (Andersen et al. 2020). Alignment of the corresponding 60-nt RNA sequence revealed 19 differences (31.7% divergence) (Fig. 2). Of these, 9 (about 50%) showed C>U (U>C) patterns. Three of the five polymorphic amino acids (bold) contained C>U mutations in their codons.
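The codon-level comparison described above can be sketched as follows. The example assumes two in-frame, gap-free RNA fragments of equal length and uses the standard genetic code. The function name and the toy fragments are hypothetical and are not the spike sequences shown in Fig. 2.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order (UUU, UUC, UUA, ...).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_differences(ref_rna, qry_rna):
    """List the codons that differ between two in-frame RNA fragments.

    Assumes equal length, a length divisible by 3, no gaps and only
    the bases A, C, G, U. Changes are written reference>query.
    """
    if len(ref_rna) != len(qry_rna) or len(ref_rna) % 3 != 0:
        raise ValueError("fragments must be in frame and of equal length")
    rows = []
    for i in range(0, len(ref_rna), 3):
        rc, qc = ref_rna[i:i + 3], qry_rna[i:i + 3]
        if rc == qc:
            continue
        changes = [f"{r}>{q}" for r, q in zip(rc, qc) if r != q]
        rows.append({
            "codon_index": i // 3,
            "ref_aa": CODON_TABLE[rc],
            "qry_aa": CODON_TABLE[qc],
            "changes": changes,
            # True when the codon carries a C>U or U>C change.
            "has_CU": any(c in ("C>U", "U>C") for c in changes),
        })
    return rows

# Toy fragments (hypothetical): F-Q-P versus F-H-L.
toy_rows = codon_differences("UUCCAGCCU", "UUUCACCUU")
nonsyn_with_cu = [r for r in toy_rows
                  if r["ref_aa"] != r["qry_aa"] and r["has_CU"]]
```

Filtering the rows for `ref_aa != qry_aa` gives the polymorphic amino acids, and `has_CU` marks those whose codons contain a C>U or U>C change.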
Mutation patterns in different SARS-CoV-2 isolates
We analysed 13 sequenced coronavirus isolates from different populations (Table 1), selected to represent worldwide virus diversity. The number of SNPs per genome ranged from two (three isolates) to six (isolate #MT093571 from Sweden). Of the 35 polymorphic sites, 14 were shared between at least two accessions and 21 were unique variants occurring in a single accession. The U>C SNP at position 28144 was found in three genomes (#MN985325 and #MT020880, both from the USA, and #MT066175 from Taiwan). The G>U SNP at position 26144 was found in #MT126808 (Brazil), #MT093571 (Sweden), #MT007544 (Australia) and #MT066156 (Italy). Variants involving C>U and U>C mutations were abundantly represented in the data set (Fig. 1b).
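The split into shared and unique variants can be illustrated with a short sketch. It assumes that each isolate's variants are available as a set of (position, change) pairs relative to a common reference. The isolate labels and variants below are hypothetical, and the counting convention (distinct variants rather than occurrences) is an assumption that may differ from the one behind the numbers reported above.

```python
from collections import Counter

def shared_and_unique(variants_by_isolate):
    """Split variants into those carried by >= 2 isolates and by exactly 1.

    variants_by_isolate maps an isolate label to an iterable of
    (position, 'ref>alt') tuples called against one common reference.
    """
    carriers = Counter()
    for variants in variants_by_isolate.values():
        # set() guards against a variant being listed twice for one isolate.
        carriers.update(set(variants))
    shared = {v for v, n in carriers.items() if n >= 2}
    unique = {v for v, n in carriers.items() if n == 1}
    return shared, unique

# Hypothetical toy input; labels and positions are illustrative only.
toy_variants = {
    "iso_A": {(100, "U>C"), (250, "C>U")},
    "iso_B": {(100, "U>C")},
    "iso_C": {(400, "G>U")},
}
toy_shared, toy_unique = shared_and_unique(toy_variants)
```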
CpG depletion analysis in coronaviruses
CpG depletion is a characteristic feature of eukaryotic genomes and of some viruses. We determined the frequency of the CpG dinucleotide in 17 human (including 8 SARS-CoV-2 accessions), 9 bat and 8 other animal betacoronaviruses (Fig. 3a). The number of CpGs per genome ranged from 664 to 1766. The lowest level (652) was found in the human NL63 strain (#MK334043.1) and the highest (1766) in the coronavirus of the Japanese bat Pipistrellus abramus. The RaTG13 coronavirus from the bat Rhinolophus affinis had 882 CpGs in its genome, 4 sites more than SARS-CoV-2 (accession). SARS-CoV-2 and the NL63 strain exhibited the lowest CpGobs/exp levels. A statistical evaluation of the data is presented as box plots (Fig. 3b).
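A CpG count and an observed/expected ratio of the kind reported above can be computed as in the sketch below. It assumes the commonly used definition CpGobs/exp = (number of CpG × sequence length) / (number of C × number of G). The exact formula used for Fig. 3 is not stated in this section, so this definition, the function name and the toy sequence are assumptions.

```python
def cpg_obs_exp(seq):
    """Return (CpG count, observed/expected CpG ratio) for one sequence.

    Assumed definition: obs/exp = n_CpG * L / (n_C * n_G), where L is the
    sequence length. Works for RNA or DNA letters, since only C and G matter.
    """
    s = seq.upper()
    n_c = s.count("C")
    n_g = s.count("G")
    # 'CG' cannot overlap itself, so str.count finds every occurrence.
    n_cpg = s.count("CG")
    if n_c == 0 or n_g == 0:
        return n_cpg, 0.0
    return n_cpg, n_cpg * len(s) / (n_c * n_g)

# Toy sequence (hypothetical): length 8, two C, three G, two CpG.
toy_cpg, toy_ratio = cpg_obs_exp("ACGUCGGA")
```

A ratio well below 1 indicates fewer CpGs than expected from the base composition, which is the sense in which a genome is described as CpG depleted.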