Epi-Clock: A sensitive platform to help understand pathogenic disease outbreaks and facilitate the response to future outbreaks of concern.

doi:10.21203/rs.3.rs-2062759/v5


Research Article


This work is licensed under a CC BY 4.0 License

Version 5


To predict potential epidemic outbreaks, we tested our strategy, Epi-Clock, which applies the novel ZHU algorithm to SARS-CoV-2 datasets collected before outbreaks to search for statistically significant mutational accumulation patterns correlated with outbreak events. Surprisingly, some inter-species genetic distances of Coronaviridae may represent intermediate states between species or subspecies in the evolutionary history of the family. Insertions and deletions in whole genome sequences from different hosts were each associated with important roles in host transmission and host shifts of Coronaviridae. Furthermore, we believe that non-nucleosomal DNA may play a dominant role in the divergence of SARS-CoV-2 lineages in different regions of the world because it lacks nucleosome protection. We suggest that strong selective variation among SARS-CoV-2 lineages is required to produce strong codon usage bias, which appears most significantly in B.1.640.2 and B.1.617.2 (Delta). Interestingly, we found that an increasing number of other types of substitutions, such as those resulting from the hitchhiking effect, accumulated, especially in the pre-breakout phase, even though some earlier substitutions were replaced by other dominant genotypes. In most validations, we could accurately predict the potential pre-phase of an outbreak, with a median interval of 5 days before the outbreak.

Divergence

Host bias

Mutation type

Codon usage bias

ZHU algorithm

Determining how viruses originated and diverged and how a disease was transmitted is extremely important for understanding viral disease outbreaks and facilitating responses to future outbreaks. Because viruses leave no fossil record, no references or ancestors are available for inferring their evolutionary processes. Several theories about the origin of viruses have nevertheless been reported, such as the degeneracy theory, DNA escape from plasmids or transposons, and origins from viroids or satellite viruses[1]. Several methods have also been proposed, including phylogenetic[2], neutral selection[3], shared-ancestry inference[4] and divergence[5] approaches. Most highly pathogenic mutations go extinct in the population, but adaptive mutations can become fixed[3]. To answer the above crucial questions about outbreaks[6], sequencing of viral samples, supplemented by epidemiological methods, could play an important role by providing nucleotide-level resolution data for outbreak-causing pathogens.

How does viral evolution occur across hosts? Why do pathogens successfully jump between some host species but not others? Sometimes preventing host shifts may come at a cost to other aspects of the pathogen’s fitness[7–10]. The movement and evolution of EBOV have been described[11], and specific amino acid substitutions in the EBOV GP have increased tropism for human cells and infectivity, enhancing the ability of the virus to transmit among humans[12]. The association between viral divergence and adaptation to host transfer is under strong selection. Nucleosome occupancy nearly eliminated cytosine deamination; remarkably, spontaneous mutations were suppressed by nucleosomes in a base-specific manner in eukaryotes[13]. Viral doublet histones are essential for viral infectivity, localize to cytoplasmic viral factories after virus infection and are ultimately found in mature virions[14]. Giant viruses belonging to the nucleocytoplasmic large DNA virus (NCLDV) group have histone-like genes in their genomes. Host‒virus arms races are a powerful source of adaptation, and most genes gained by NCLDVs from their different hosts are likely linked to viral defences[15].

To compare virus genomes before and after epidemics, the Global Epidemic and Mobility Model[16] and Epi-Factors[17] were used to search for the accumulation pattern of pathogenic mutations contributing to the outbreak, with the aim of preventing recurrent spread of the disease worldwide[18]. As a result of the virus evolving under immune system selection pressure in infected individuals, at least lineage B.1.1.7 has presumably arisen[19]. Lambda, a new variant of interest, is now spreading in some South American countries, which has been attributed to the T76I and L452Q mutations[20]. D614G, fixed during the first wave at the start of the pandemic, is the foundation of all subsequent waves of strains (C241T, C3037T, C14408T and A23403G)[21]. Tracking the spread of infectious diseases to assist in their control has traditionally relied on the analysis of case data gathered as an outbreak proceeds[22]. Here, we demonstrate our strategy, with the aim of developing a sensitive process for predicting the potential trigger of outbreaks to facilitate the response to future outbreaks of concern.

Divergence of the whole Coronaviridae family within different populations and adaptation to cross host species barriers.

To comprehensively explore the evolution of the whole family of Coronaviridae, we performed a genomic comparison of Coronaviridae to search for new age patterns related to virulence or transmissibility, as shown in Supplementary Fig. 1. In the analysis of intra-species and inter-species genetic distances of whole genome sequences, most inter-species genetic distances of Coronaviridae were longer than the intra-species distances, e.g., for SARS-CoV-2, SARS-CoV, SADS, NL63, MERS, London1, HKU5, HKU4, HKU3, HKU2, and BATS. For instance, the sequence similarities between SARS-CoV-2 and bat-SL-CoVZC45 and between SARS-CoV-2 and bat-SL-CoVZXC21 are nearly 88%, whereas the similarities between SARS-CoV-2 and SARS-CoV and between SARS-CoV-2 and MERS are 79% and 50%, respectively. Comparing the genetic distances of H1N1 and H3N2 with those of SARS-CoV-2 and SARS-CoV, we found that the p-distance computed in MEGA[23] between SARS-CoV-2 and Bat SARS-like CoV was nearly 0.13 and that between SARS-CoV-2 and SARS-CoV was 0.24. Both of these values are lower than the p-distances between H1N1 and H3N2 (0.8) and between H1N1 and H7N9 (0.73). Conversely, some inter-species genetic distances of Coronaviridae are shorter than the intra-species genetic distances, such as those for OC43 and 229E, which may represent intermediate states between species or subspecies in the evolutionary history of Coronaviridae.
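For readers unfamiliar with the measure, the p-distance is the proportion of compared sites at which two aligned sequences differ. The following is a minimal Python sketch, not the MEGA implementation used in this study. It assumes the two sequences are already aligned to equal length and that positions with gaps or ambiguity codes are skipped (pairwise deletion); the function name `p_distance` and the toy sequences are illustrative only.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGTU")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip gaps and ambiguity codes (pairwise deletion; an assumption here).
        if a not in valid or b not in valid:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy example: 2 differing sites out of 10 compared sites.
print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))  # 0.2
```

On this scale, the reported value of 0.13 would correspond to roughly 13 differing sites per 100 compared sites.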

Here, we explore the divergence times of species and subspecies of Coronaviridae, which revealed a diversity of ages within populations, each associated with new functions or important turning points. We examined the dated variants of different species or subspecies and generated an atlas of new ages with SARS-CoV-2 as the reference genome (Supplementary Fig. 2). SARS-CoV, HKU3, and BATS are very close to SARS-CoV-2 and far from OC43 and HKU1 in the S protein, although BATS has 5 kb insertions in Orf7a, Orf7b, and Orf8, and OC43, HKU9, and HKU1 have 15 kb deletions in Orf7a, Orf7b, Orf8 and N. Most strikingly, as shown in Fig. 1 and Supplementary Table 1 and visualized with Circos, the insertions and deletions of whole genome sequences between different hosts should play an important role in host transmission and host shifts of Coronaviridae. For instance, human coronavirus 229E in Camelus and alpaca (Vicugna pacos) has S protein sequences similar to those observed in Homo sapiens, whereas it has 500 bp deletions in striped leaf-nosed bats (Hipposideros and Macronycteris vittata). When used as the reference genome, Rhinacovirus in piglets (Sus scrofa) was similar to that in the wild greater horseshoe bat (Rhinolophus ferrumequinum) and divergent from that in the intermediate horseshoe bat (Rhinolophus affinis), in that the insertions and deletions all appear in the S protein, which may be related to host bias. It is likely that insertions and deletions of alphacoronavirus, human coronavirus OC43, Middle East respiratory syndrome-related coronavirus, severe acute respiratory syndrome-related coronavirus, Deltacoronavirus, and avian coronavirus accumulated in the S protein. The SARS-CoV-2 S proteins found in the Chinese rufous horseshoe bat (Rhinolophus sinicus), the big-eared horseshoe bat (Rhinolophus macrotis), dogs (Canis lupus familiaris), and the Malayan tiger (Panthera tigris jacksoni) are highly similar to the SARS-CoV-2 S protein in Homo sapiens, which suggests that the ancestral hosts of SARS-CoV-2 in Homo sapiens should be closely related to these species, as detailed in Supplementary Figs. 3–10. In summary, the more mutations viruses have, the greater the possibility of adaptation allowing them to cross host species barriers, which indicates that increased host diversity of a pathogen would cumulatively enlarge the population of potential hosts across different species or subspecies, and protect against the emergence of pathogens.

The distribution of rates of different mutation types and codon usage bias in different lineages of SARS-CoV-2.

Across different lineages of SARS-CoV-2, 538 nucleotide substitutions, such as C1876T(B.1.1.318), A2791G(B.1.1.529), C874T(B.1.1.7), C1055T(B.1.351), C1473T(B.1.525), C1034T(B.1.526), C2999T(B.1.617.1), C3012T(B.1.617.2), C787T(B.1.617.3), C3033T(B.1.621), C535T(B.1.640.1), C1154T(B.1.640.2), C515T(C.1.2), C569T(C.36.3), C3032T(C.37), T702C(P.1), and C1714T(P.3), were found in the NSP2, NSP3, NSP12, NSP13, S, N, NS8, and NS3 gene regions, as presented in Supplementary Fig. 11 and the “AA substitutes” sheet of Supplementary Table 1. We then examined the richness distribution of different mutation types across the whole genomes, in which the mutation rates of C->T, G->T and T->C were dominant in driving the divergence of different lineages of SARS-CoV-2, as shown in Fig. 2a and Supplementary Figs. 12–17. In particular, C->T mutation rates are highly enriched in all lineages of SARS-CoV-2, and G->T mutation rates are relatively enriched in B.1.1.318, B.1.1.529 (Omicron), B.1.617.1 (Kappa), B.1.617.3, B.1.621 (Mu), B.1.640.1, B.1.640.2, C.1.2, C.37 (Lambda), and P.3 (Theta). Conversely, the A->T, T->G, and G->C patterns are relatively rare. Some points with high mutation rates in the A->T distribution derive from B.1.1.7 (Alpha) and B.1.621 (Mu); several points with high mutation rates in the T->G distribution derive from B.1.526 (Iota), B.1.617.1 (Kappa), B.1.617.2 (Delta), and C.36.3; and some points in the G->C distribution derive from B.1.525 (Eta), B.1.617.2 (Delta), B.1.617.3, B.1.621 (Mu), B.1.640.1 and B.1.640.2. Therefore, we believe that nucleosomal hereditary material (DNA/RNA) undergoes fewer C->T mutations in SARS-CoV-2 because it is protected by viral doublet histones, which are essential for viral infectivity and viral defences and lead to adaptation to, and transmission among, multiple hosts. In contrast, non-nucleosomal deoxyribonucleic acid, which lacks the protection of nucleosomes, may play a dominant role in the divergence of various lineages of SARS-CoV-2 in different regions of the world.
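To make the notion of a per-type mutation rate concrete, the sketch below counts each substitution type (for example C->T) between an aligned reference and sample, and divides by the number of reference sites carrying the original base. This is a simplified Python illustration under stated assumptions, not the pipeline used in this study: the normalization by reference base count, the function name `substitution_rates`, and the toy sequences are all assumptions.

```python
from collections import Counter

BASES = set("ACGT")


def substitution_rates(reference: str, sample: str) -> dict:
    """Rate of each substitution type, e.g. 'C->T', between aligned sequences.

    The rate of X->Y is (number of X->Y changes) / (number of compared
    reference sites with base X). This normalization is an assumption.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    ref_sites = Counter()
    changes = Counter()
    # RNA genomes are often stored as cDNA; map U to T so both notations work.
    ref = reference.upper().replace("U", "T")
    smp = sample.upper().replace("U", "T")
    for r, s in zip(ref, smp):
        if r not in BASES or s not in BASES:
            continue  # skip gaps and ambiguity codes
        ref_sites[r] += 1
        if r != s:
            changes[f"{r}->{s}"] += 1
    return {kind: n / ref_sites[kind[0]] for kind, n in changes.items()}


# Toy example with one C->T, one G->T and one T->C change.
rates = substitution_rates("CCCGGTTA", "TCCGTTCA")
for kind in sorted(rates):
    print(kind, round(rates[kind], 3))
```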

Furthermore, we present the recent amino acid substitutions among different SARS-CoV-2 lineages, including B.1.617.2 (Delta), B.1.1.529 (Omicron), B.1.1.7 (Alpha), B.1.351 (Beta), P.1 (Gamma), C.37 (Lambda), B.1.621 (Mu), C.1.2, B.1.525 (Eta), B.1.526 (Iota), B.1.617.1 (Kappa), B.1.617.3, P.3 (Theta), B.1.1.318, C.36.3, B.1.640.1, and B.1.640.2, with a focus on the NSP2, NSP3, NSP5, NSP12, NSP13, Spike, NS3, NS8, and N proteins in Fig. 2b and Supplementary Table 1, which have strong effects on the divergence and evolutionary history of different lineages. This is especially true for the Spike protein, which contains a receptor-binding domain (RBD), a fusion domain and a transmembrane domain, and for the NSP3 protein, which corresponds to the N-terminus of SARS-CoV non-structural protein 3 (Nsp3) and related proteins. Interestingly, compared with the dominant amino acid substitutions, i.e., NSP12:P323L and S:D614G, there seem to be unique and specific amino acid substitutions in one lineage relative to another lineage. Examples include the NSP2:P129L, NSP2:E272G, NSP2:L462F, S:E154Q, S:I210T, and S:D936H substitutions in the B.1.640.1 lineage; NSP13:V371A, S:P129L, S:D1139H, and NS8:C37R substitutions in the B.1.640.2 lineage; NSP2:S358L and NSP3:E378V substitutions in the B.1.1.318 lineage; NSP3:A41V, NSP3:D821Y, NSP13:D105Y, S:S12F, S:G212V, and S:A899S substitutions in the C.36.3 lineage; S:K2Q, S:L280F, S:G313S, S:A368V, S:D736G, S:E1092K, and S:H1101Y substitutions in the P.3 (Theta) lineage; NSP3:A1T, S:Q779K, S:E1072K, NS8:T26I, and N:P67S substitutions in the B.1.617.3 lineage; NSP13:G206C, NSP13:M429I, S:Q1071H, and NS3:L15F substitutions in the B.1.617.1 (Kappa) lineage; NSP13:Q88H, S:D253G, NS3:P42L, N:P199L, and N:M234I substitutions in the B.1.526 (Iota) lineage; NSP3:T1198I, S:Q52R, S:P323F, S:Q677H, NS3:T89I, NS3:S92L, NS3:G172C, and N:A12G substitutions in the B.1.525 (Eta) lineage; NSP3:P822S, S:P25L, S:C136F, S:A879T, NS3:T223I, and NS3:D238Y substitutions in the C.1.2 lineage; NSP3:T237A, NSP3:A562T, NSP3:T720I, NSP13:P419S, S:Y146S, S:H147T, S:T205I, S:R346K, NS8:P38S, and NS8:S67F substitutions in the B.1.621 (Mu) lineage; S:P13L, S:G75V, S:V76I, and S:T428I substitutions in the C.37 (Lambda) lineage; S:L18F, S:T20N, S:P26S, S:D138Y, S:S247G, S:E341D, S:S371L, S:K977Q, and S:T1027I substitutions in the P.1 (Gamma) lineage; NSP3:K837N, S:D80A, NS3:L52F, and NS3:S171L substitutions in the B.1.351 (Beta) lineage; NSP3:T183I, NSP3:A890D, S:A570D, S:S982A, S:D1118H, NS8:R52I, NS8:K68Stop, NS8:Y73C, and N:S235F substitutions in the B.1.1.7 (Alpha) lineage; NSP3:K38R, NSP3:M84V, NSP3:V1069I, S:G339D, S:S371L, S:S373P, S:S375F, S:N440K, S:G446S, S:S477N, S:Q493R, S:G496S, S:Q498R, S:Y505H, S:T547K, and S:N764K substitutions in the B.1.1.529 (Omicron) lineage; and NSP3:A488S, NSP12:P227S, and S:P77L substitutions in the B.1.617.2 (Delta) lineage.

Provided that the amino acid composition of proteomes reflects the action of natural selection to enhance metabolic efficiency, synonymous codon usage bias serves as a measure of translation rates, and highly expressed proteins show an increased abundance of less energetically costly amino acids[24]. The compiled codon usage data for the different lineages of SARS-CoV-2 are presented in Supplementary Table 1, and the synonymous codon usage of all lineages is summarized in Fig. 2c. In the B.1.640.2 lineage, codon AGU was clearly biased for Ser; codon UUU was more strongly preferred for Phe in the B.1.640.1 and P.3 (Theta) lineages; and codon GCU was clearly biased for Ala in the B.1.617.2 (Delta) lineage. Here we summarize the mutation rates across closely related species over short time scales in a large effective population. That is, a large selective difference between lineages of SARS-CoV-2 is required for strong codon usage bias.

Clock-like prediction of focal outbreak points worldwide to provide warnings.

To relate epidemic outbreak points to key mutations, we set up Epi-Clock to predict potential epidemics and to present detailed mutation information for severely affected areas. We therefore analysed the whole evolutionary pathway along the timeline of SARS-CoV-2 lineages presented in Supplementary Table 2; that is, for each lineage, we report the earliest publicly collected samples in different regions of the world. At the same time, we summarized the smoothed distribution of new cases per million of SARS-CoV-2 in different severely affected areas, namely Africa, Asia, Europe, North America, Oceania, and South America, along the timeline from OWID[25] in Supplementary Figs. 18–24. We defined the amino acid substitutions according to the latest data reported around June 9th 2022. Interestingly, we found an increasing number of other types of substitutions (shown with the red asterisk) as the hitchhiking effect progressed; that is, a few other types increased markedly, especially in the pre-breakout phase, even though some substitutions were replaced by other dominant genotypes (shown with the red box) in Fig. 3a, Fig. 3b, Supplementary Table 2 and Supplementary Figs. 25–32.
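The Methods state that outbreak peaks were accepted only where new cases per million exceeded 30, and that samples were taken from 1 to 30 days before each outbreak. A minimal Python sketch of such a selection step follows. It assumes that a peak is a simple local maximum of the smoothed daily series and that the outbreak date is the peak date; neither detail is specified in the text. The function names and the toy series are hypothetical.

```python
def outbreak_peaks(new_cases_per_million, threshold=30.0):
    """Indices of local maxima whose value exceeds the threshold."""
    series = list(new_cases_per_million)
    peaks = []
    for i in range(1, len(series) - 1):
        is_local_max = series[i] > series[i - 1] and series[i] >= series[i + 1]
        if is_local_max and series[i] > threshold:
            peaks.append(i)
    return peaks


def pre_outbreak_days(peak_index, max_days=30):
    """Day indices from max_days before the peak up to 1 day before it."""
    start = max(0, peak_index - max_days)
    return list(range(start, peak_index))


# Made-up smoothed series: the maximum at index 3 passes the threshold,
# the smaller local maximum at index 8 does not.
series = [2, 5, 12, 40, 35, 10, 8, 20, 25, 15]
peaks = outbreak_peaks(series)
print(peaks)                        # [3]
print(pre_outbreak_days(peaks[0]))  # [0, 1, 2]
```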

We hypothesized that specific amino acid substitutions are triggers for outbreaks. To predict potential epidemic outbreaks, we proposed the ZHU algorithm presented in Fig. 4a and tested it on different true sets of SARS-CoV-2 data collected before outbreaks to search for significant mutational accumulation patterns related to the outbreak events. We found 171 statistically significant substitutions (significance level p < 0.05) as potential epi-factors within 55 different countries and regions. The ZHU prediction, which works in a manner similar to the Chinese abacus, cycles through N generations of training by GLM, with reordering according to the Akaike Information Criterion (AIC). Finally, we performed the ZHU prediction based on the weighted intercept estimates provided by the supporting information of the true sets, and assessed it on 42 validation sets in terms of positive precision, sensitivity, and accuracy.

However, our outbreak prediction was based only on significant amino acid substitutions and excluded other epi-factors. Notably, amino acid substitution type X13.NSP3_T183I.I was a potential epi-factor in Iran, the Netherlands and Norway, and X74.S_S371L.L was significantly correlated with the number of new cases in Finland, Canada and Brazil, as shown in Fig. 4b. Across N generations of GLM training and reordering, we found 171 significant substitutions as potential epi-factors within 55 different countries and regions, as listed in Supplementary Table 3. The performance of the ZHU algorithm on the 42 validation sets is presented in Fig. 4c. In most validations, the ZHU prediction accurately identified the potential pre-phase of the outbreak. By counting the number of true instances, we summarized the positive precision, sensitivity, and accuracy of Epi-Clock on the validation sets in Fig. 4d and Supplementary Table 3; the median interval before outbreaks was 5 days.

We found significant mutational accumulation according to the frequency distribution of amino acid substitution types among different lineages of SARS-CoV-2 in severely affected areas. A large epidemic has been described as a heterogeneous and spatially dissociated collection of transmission clusters of varying size, duration and connectivity[26]. Several studies have proposed epidemiological principles for pathogens, for example for two co-circulating lineages of influenza B virus[27] and the West African Ebola outbreak[22], as well as the existence of repeatable and predictable parallel adaptation, including cross-species transmission, drug resistance, and host immune escape[28, 29]. In particular, the Omicron strains, with their strong infectivity, mark the turning point of the SARS-CoV-2 epidemic and provide a great opportunity for viral adaptive evolution within a very large population of infected individuals. They show diverse amino acid substitutions at different sites, arising in hosts of different ages, species, or physiological environments. Apart from mutational shifts among different hosts, pathogenicity and fatality have gradually declined. Here, the ZHU prediction could accurately identify only the potential pre-phase of the outbreak, with a median interval of 5 days before outbreaks. It is therefore difficult to ascertain whether such a short lead time is sufficient for taking the precautionary measures needed to prevent future outbreaks. In only four countries or regions, namely Jamaica, Kyrgyzstan, Libya, and Palestine, could our model not be fitted, because sequencing samples from the outbreak phase were unavailable. As distinguished from EpiFactors[18], which includes 815 proteins, among them 95 histones and protamines, involved in epigenetic regulation, we extracted 171 significant amino acid substitutions as potential epi-factors within 55 different countries and regions, as presented in Supplementary Table 3. We believe that such severe epidemic events will not recur within the next few years, because SARS-CoV-2 has fully adapted to the human body. The strict and comprehensive pandemic control strategies implemented in Shanghai were able to reduce the number of people infected, so that the case fatality rate could be minimized and time could be bought for full vaccination coverage[30].

Here we summarized amino acid substitutions among different lineages to identify codon usage bias according to relative synonymous-codon usage, together with the dated variants of different species or subspecies and the unique and specific amino acid substitutions in individual lineages. The complexity of many host‒pathogen interactions is broadening the definition of a pathogen’s immunological niche[31]. Based on the host transmission and host shifts of Coronaviridae shown in Supplementary Figs. 3–10 and Supplementary Table 1, we illustrated the insertions and deletions of whole genome sequences between different hosts and the codon usage bias among different lineages. Wildlife host species richness has been proposed as an important predictor of disease emergence. Similarly, host populations with low biodiversity might harbour an increased risk of emergence. Conversely, high host biodiversity has also been linked to the ‘dilution effect’, with a decrease in disease risk[8]. How do influenza viruses evolve within human hosts? Relevant factors may include antigenic selection, antiviral treatment, tissue specificity, spatial structure, and multiplicity of infection[32]. Host switching often leads to viral emergence once the virus overcomes barriers to infection of the new host[10]. Host gene editing has been reported as the major source of existing SARS-CoV-2 mutations[41], and a higher rate of severe outcomes and considerable mortality exist in unvaccinated people, especially older adults. How has natural selection shaped immunity and host defence genes[33]? Natural selection causes micro-evolutionary changes that increase fitness, while random genetic drift is strongly linked to the gene size fixed in the population[34, 35]. Studies of long-term co-speciation[36] and host range[8], together with single-cell technologies[37, 38], could help explore the properties of host cells harbouring infection, the host’s pathogen-specific immune responses, and the mechanisms pathogens have evolved to escape host control.

Our study has several limitations that point to future work. Pathogens have always imposed strong selection pressure on the human genome[33]. Positive selection might be expected to follow viral emergence in a new host species, because it would entail a major boost in the number of susceptible hosts and a concomitant increase in fitness[39]. We examined the richness distribution of different mutation types across the whole genomes, in which the mutation rates of C->T, G->T and T->C were dominant in driving the divergence of different lineages. Consistent with reported vRNP structures, phosphorylation of the N protein in its disordered serine/arginine region weakens these interactions to generate less compact vRNPs that support other N protein functions in viral transcription[40]. All populations show evidence of insertions, deletions or substitutions that have driven the divergence of the whole family of Coronaviridae in response to natural selection, random genetic drift, host gene editing, and viral proofreading[35]. This suggests that these changes probably represent another independent acquisition of new “functional sequences” through either specific horizontal gene transfer or recombination events[41–43]. Age information is valuable for interpreting variants of functional and selective importance, for example when allele age estimates are used to infer the ancestry shared between individual genomes[4]. However, we still could not determine whether the codon usage bias is linked to natural selection or to genetic drift. In particular, closely related species do not usually exhibit major shifts in codon preferences, but changes in mutation rates over short time scales are quite common in large effective population sizes[44]. A common application of codon optimization is to increase gene expression levels when designing transgenes for the development of transgenic crops. Selection on codon usage is strong only in species with large effective population sizes. As the effective population size decreases, codon bias declines, and a long-term reduction leads to a major shift in genome evolution. Once the effective population size is small, genetic drift becomes dominant over natural selection[44].

We aim to develop a platform that will predict future pathogenic disease outbreaks. With the accumulation of sequencing datasets, the performance of our strategy will improve, and the approach will become more sensitive in predicting the potential trigger of epidemic outbreaks and in describing the spatiotemporal and geographical mutational landscape of pathogens, with special emphasis on SARS-CoV-2, which will ultimately facilitate responses to future outbreaks of concern. We believe that the host’s non-nucleosomal DNA plays a key role in the evolutionary divergence and emergence of new strains responsible for outbreaks.

To explore the divergence of the whole family of Coronaviridae, we computed intra-species and inter-species p-distances of whole genome sequences with MEGA11 and plotted the distances with the boxplot function of R[45]. To illustrate the atlas of new ages within different populations, we applied a genome alignment-based pipeline to infer the origin time of a given genomic region using a 6 bp sliding window and the numbers of dated variants, with insertions shown in red and deletions in blue. We scanned the whole genome sequences with sliding windows and summarized the mean mutation rates of the different mutation types. Using the reference genomes from different hosts, we identified host shifts within sliding windows in the evolution of Coronaviridae and presented them with Circos.
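The sliding-window scan described above can be illustrated with a short Python sketch. The 6 bp window size comes from the text; the step size (non-overlapping windows), the treatment of gap characters as differences, and the function name `windowed_mutation_rate` are assumptions made only for this illustration, which is not the pipeline used in the study.

```python
def windowed_mutation_rate(reference: str, sample: str, window: int = 6, step: int = 6):
    """Fraction of differing sites in each window of an aligned sequence pair.

    Returns a list of (window_start, rate) tuples. A gap character in either
    sequence counts as a difference, so insertions and deletions contribute.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    rates = []
    for start in range(0, len(reference) - window + 1, step):
        ref_win = reference[start:start + window]
        smp_win = sample[start:start + window]
        diffs = sum(1 for r, s in zip(ref_win, smp_win) if r != s)
        rates.append((start, diffs / window))
    return rates


# Toy alignment of 18 sites, scanned as three non-overlapping 6 bp windows.
ref = "AAAAAACCCCCCGGGGGG"
smp = "AAATAACCCCCCGGTTGG"
for start, rate in windowed_mutation_rate(ref, smp):
    print(start, round(rate, 3))
```

Setting `step=1` would give fully overlapping windows instead; which choice matches the original analysis is not stated in the text.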

Similarly, we identified nucleotide mutations and amino acid substitutions among different SARS-CoV-2 lineages. We computed mutation rates in sliding windows for the different mutation types. Then, we presented the distributions of the 144 amino acid substitutions of different lineages using GeneWise[46]. The codon usage counts were converted into relative synonymous-codon usage (RSCU) values[28], defined as the observed frequency of a codon divided by the frequency expected under the assumption of equal usage of the synonymous codons for an amino acid[28, 47], where $i$ indexes the amino acid, $j$ indexes the codon, $x_{ij}$ is the observed number of the $j$-th codon for the $i$-th amino acid, and $n_i$ is the number of alternative codons for the $i$-th amino acid.

$$\text{RSCU}_{ij} = \frac{x_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}}$$
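As a worked illustration of the RSCU definition above, the following Python sketch computes RSCU values from codon counts. Only two synonymous codon families are included to keep it short, and the counts are made up for the example; they are not values from Supplementary Table 1.

```python
# Synonymous codon families for two amino acids only (illustrative subset).
SYNONYMOUS = {
    "Phe": ["UUU", "UUC"],
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
}


def rscu(codon_counts: dict, families: dict = SYNONYMOUS) -> dict:
    """RSCU_ij = x_ij / ((1 / n_i) * sum_j x_ij) for every codon in `families`."""
    values = {}
    for amino_acid, codons in families.items():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid not observed; RSCU is undefined
        expected = total / len(codons)  # count expected under equal usage
        for codon in codons:
            values[codon] = codon_counts.get(codon, 0) / expected
    return values


# Made-up counts: UUU is used three times as often as UUC.
counts = {"UUU": 30, "UUC": 10, "GCU": 20, "GCC": 10, "GCA": 5, "GCG": 5}
print(rscu(counts))
# {'UUU': 1.5, 'UUC': 0.5, 'GCU': 2.0, 'GCC': 1.0, 'GCA': 0.5, 'GCG': 0.5}
```

An RSCU value above 1 indicates a codon used more often than expected under equal usage, and a value below 1 indicates one used less often.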

A neighbour-joining phylogenetic tree of the separate lineages of SARS-CoV-2 was constructed with MEGA v11.0.11. To predict the potential epidemic mutation patterns in the severely affected areas, we summarized testing data from OWID (Our World in Data) up to Feb 7th 2022, which were continually collated from official government sources worldwide. We plotted the frequency distribution of confirmed cases of SARS-CoV-2 in different severely affected areas, namely Africa, Asia, Europe, North America, Oceania, and South America, along the timeline. At the same time, we set the baseline of new cases per millions of substitutions as the internal control group and the widespread types of substitutions as the external control group so that we could exclude other effects on epidemic outbreak events. We excluded areas without any outbreak time, as labelled by the red box, and searched for mutation patterns that appeared to be related to the outbreak up to the appearance of the Omicron variant, because this coincided with the time of universal immunization[48, 49]. To avoid systematic errors, that is, errors caused by regions varying in other epi-factors such as population density, geographical environment, vaccination coverage rates and national or social rules and norms, we extracted the true sets using the control group in each individual country or region as the individual related values of new cases per million. To ensure that the outbreaks of new SARS-CoV-2 cases were genuine, we detected peaks following a strict standard: the value of new cases per million had to be above 30. Then, we split the sets of population samples from 1 day to 30 days before outbreaks at the same location according to the 144 amino acid substitutions or deletions shown in Fig. 2b and Supplementary Table 2. The dataset contained 13,740,300 observations, comprising 6,300 true sets and 2,181 features, and covered 117 different countries or regions on all five continents, divided into the 75 training sets and 42 validation sets listed in Supplementary Table 3. Then, we performed GLM (generalized linear model) analysis separately for each country or region to find the optimal mutational patterns in the training phase presented in Supplementary Table 3, as follows.

glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)
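The signature above is the usage of R's glm function. As a rough illustration of what a Gaussian GLM fit followed by AIC-based ordering involves, the Python sketch below fits one single-predictor model per feature and ranks the features by AIC. It is not the ZHU algorithm: the cycling over N generations and the weighting of intercept estimates are not reproduced, and the feature names (`sub_A`, `sub_B`), the data, and the one-predictor-at-a-time design are assumptions made only for this example.

```python
import math


def fit_gaussian_glm(x, y):
    """Least-squares fit of y = intercept + slope * x (Gaussian family, identity link)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor is constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, rss


def gaussian_aic(rss, n, n_coefficients):
    """AIC = -2 * maximized log-likelihood + 2 * (coefficients + 1 for the variance)."""
    rss = max(rss, 1e-12)  # guard against log(0) for a perfect fit
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + 2 * (n_coefficients + 1)


def rank_features_by_aic(features, cases):
    """Fit one model per feature and return feature names, lowest AIC first."""
    scored = []
    for name, values in features.items():
        _, _, rss = fit_gaussian_glm(values, cases)
        scored.append((gaussian_aic(rss, len(cases), 2), name))
    return [name for _, name in sorted(scored)]


# Made-up data: substitution frequencies on five days and new cases per million.
cases = [5, 9, 16, 22, 31]
features = {
    "sub_A": [0.1, 0.2, 0.3, 0.4, 0.5],  # tracks the case counts closely
    "sub_B": [0.5, 0.1, 0.4, 0.2, 0.3],  # essentially unrelated
}
print(rank_features_by_aic(features, cases))  # ['sub_A', 'sub_B']
```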

We found 171 statistically significant substitutions (significance level p < 0.05) as potential epi-factors within 55 different countries and regions. The ZHU prediction, which works in a manner similar to the Chinese abacus, cycles through N generations of training by GLM, with reordering according to the Akaike Information Criterion (AIC). Finally, we performed the ZHU prediction based on the weighted intercept estimates provided by the supporting information of the true sets, and assessed it on 42 validation sets in terms of positive precision, sensitivity, and accuracy, defined as follows.

Positive precision = $\frac{TP}{TP + FP}$

Sensitivity = $\frac{TP}{TP + FN}$

Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
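The three measures above can be computed directly from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). A minimal Python sketch follows; the function name and the example counts are made up for illustration and are not results from this study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Positive precision, sensitivity and accuracy from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Hypothetical counts for a single validation set.
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
# {'precision': 0.8, 'sensitivity': 0.8, 'accuracy': 0.8}
```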

DNA: Deoxyribonucleic acid
RNA: Ribonucleic acid
vRNP: Viral ribonucleoprotein
SARS-CoV-2: Severe acute respiratory syndrome coronavirus 2
RBD: Receptor-binding domain
EBOV: Ebola virus
GP: Glycoprotein
NCLDV: Nucleocytoplasmic large DNA virus
MVs: Marseilleviridae
D614G substitute
GLM: Generalized linear model
AIC: The Akaike Information Criterion
MEGA11: Molecular Evolutionary Genetics Analysis
RSCU: Relative synonymous-codon usage
OWID: Our World in Data
TP: True positive
FP: False positive
FN: False negative
TN: True negative
SARS-CoV: Severe acute respiratory syndrome-related coronavirus
SADS: Swine acute diarrhoea syndrome coronavirus
NL63: Human coronavirus NL63
MERS: Middle East respiratory syndrome-related coronavirus
London1: Betacoronavirus England 1
HKU5: Bat coronavirus HKU5
HKU4: Bat coronavirus HKU4
HKU3: Bat coronavirus HKU3
HKU2: Bat coronavirus HKU2
BATS: Bat SARS-like coronavirus WIV1
OC43: Human coronavirus OC43
HKU9: Rousettus bat coronavirus HKU9
HKU1: Human coronavirus HKU1
229E: Human coronavirus 229E
ORF: Open reading frame
NSP: Non-structural protein
S: Surface glycoprotein
E: Envelope protein
M: Membrane glycoprotein
N: Nucleocapsid phosphoprotein
HIV: Human immunodeficiency virus
H1N1pdm09: 2009 H1N1 Pandemic

Ethics approval and consent to participate

All protocols were approved by the Liferiver Science and Technology Institute, Shanghai ZJ Bio-Tech Co., Ltd and Use Committee (Shanghai, China).

Consent for publication

Not applicable.

Availability of data and materials

All shared mutations and substitutions from different hosts, lineages, or regions are available in the supplementary materials and at https://bioinfo.liferiver.com.cn/#/home.

Competing interests

We were funded by the Liferiver Science and Technology Institute of Shanghai ZJ Bio-Tech Co, Ltd. The authors declare that they have no competing interests.

Funding

Not applicable.

Author contributions

The authors’ responsibilities were as follows: JBS and CJ designed and conducted the research; CJ analysed the data and performed the analysis; CJ wrote the paper; JBS and CJ revised the manuscript; CJ had primary responsibility for the final content. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Cong Ji and Junbin (Jack) Shao.

Acknowledgements

We thank Han Chen and Guangzhong Wang for their valuable suggestions and comments on this work, and Si Chen, Yan Liu, and Hanyan Zhang for helpful discussions. Many facets of the user-interface design benefited from the help of Xingsheng Niu, Yajie Pan, and Wang Lu. We also thank Elizabeth Sung for flow and language corrections. All other data supporting the findings of this study and the computational code used in this study are available from the corresponding authors upon reasonable request. The other authors declare no competing interests.

Authors’ information

Affiliations

Liferiver Science and Technology Institute, Shanghai ZJ Bio-Tech Co., Ltd. Shanghai, China

Cong Ji, Junbin (Jack) Shao


The authors declare no competing interests.

file.png
Supplementary Figure 1. Intra-species and inter-species genetic distances of Coronaviridae. Coronaviridae include SARS-CoV-2, SARS-CoV, SADS, OC43, NL63, MERS, London1, HKU9, HKU5, HKU4, HKU3, HKU2, HKU1, BATS, and 229E.
2.png
Supplementary Figure 2. Atlas of new ages within different populations. Here, we separately present the divergence of species or subspecies from SARS-CoV-2, i.e., SARS-CoV, SADS, OC43, NL63, MERS, London1, HKU9, HKU5, HKU4, HKU3, HKU2, HKU1, BATS and 229E. Red represents insertions, and blue represents deletions.
SupplementaryFigure3.jpg
Supplementary Figure 3. The indels of whole genome sequences of human coronavirus 229E between different hosts. Several hosts are shown, such as (from outside to inside) Camelus, Hipposideros, Macronycteris vittata, and Vicugna pacos, while the genome of Homo sapiens was used as the reference genome.
SupplementaryFigure4.jpg
Supplementary Figure 4. The indels of whole genome sequences of Rhinacovirus between different hosts. Several hosts are shown, such as (from outside to inside) Rhinolophus ferrumequinum and Rhinolophus affinis, while the genome of Sus scrofa was used as the reference genome.
SupplementaryFigure5.jpg
Supplementary Figure 5. The indels of whole genome sequences of Tegacovirus between different hosts. Several hosts are shown, such as (from outside to inside) Felis catus, Canis lupus familiaris, and Feliformia, while the genome of Sus scrofa was used as the reference genome.
SupplementaryFigure6.jpg
Supplementary Figure 6. The indels of whole genome sequences of human coronavirus OC43 between different hosts. Several hosts are shown, such as (from outside to inside) Bos taurus, Bos grunniens, Canis lupus familiaris, Pan troglodytes verus, Sus scrofa, Hydropotes inermis, Bubalus bubalis, Camelus bactrianus, Equus caballus, Kobus ellipsiprymnus, Odocoileus virginianus, Rusa unicolor, Bovidae, Giraffa camelopardalis, and Vicugna pacos, while the genome of Homo sapiens was used as the reference genome.
SupplementaryFigure7.jpg
Supplementary Figure 7. The indels of whole genome sequences of Middle East respiratory syndrome-related coronavirus between different hosts. Several hosts are shown, such as (from outside to inside) Camelus dromedarius, Lama glama, Neoromicia capensis, Hypsugo savii, Pipistrellus kuhlii, and Vespertilio sinensis, while the genome of Homo sapiens was used as the reference genome.
SupplementaryFigure8.jpg
Supplementary Figure 8. The indels of whole genome sequences of severe acute respiratory syndrome coronavirus 2 between different hosts. Several hosts are shown, such as (from outside to inside) Mus musculus, Rhinolophus sinicus, Mustela lutreola, Rhinolophus macrotis, Canis lupus familiaris, Panthera tigris jacksoni, Rhinolophus affinis, Aselliscus stoliczkanus, Rhinolophus ferrumequinum, Chiroptera, Rhinolophus pusillus, Chaerephon plicatus, Chlorocebus aethiops, Paradoxurus hermaphroditus, Viverridae, Paradoxurus hermaphroditus, and Paguma larvata, while the genome of Homo sapiens was used as the reference genome.
SupplementaryFigure9.jpg
Supplementary Figure 9. The indels of whole genome sequences of Deltacoronavirus between different hosts. Several hosts are shown, such as (from outside to inside) Pycnonotus jocosus, Mareca, Lonchura striata, Zosteropidae, Falco, Chlamydotis, Columbidae, Galliformes, Passeridae, Muscicapidae, Turdus hortulorum, Ardeidae, Gallinula chloropus, and Coturnix japonica, while the genome of Sus scrofa was used as the reference genome.
SupplementaryFigure10.jpg
Supplementary Figure 10. The indels of whole genome sequences of avian coronavirus between different hosts. Several hosts are shown, such as (from outside to inside) Phasianinae, Anatidae, and Meleagris gallopavo, while the genome of Gallus gallus was used as the reference genome.
SupplementaryFigure11.pdf
Supplementary Figure 11. The whole-genome percentage nucleotide identity of different lineages of SARS-CoV-2. The lineages shown include P.3 (Theta), P.1 (Gamma), C.37 (Lambda), C.36.3, C.1.2, B.1.640.2, B.1.640.1, B.1.621 (Mu), B.1.617.3, B.1.617.2 (Delta), B.1.617.1 (Kappa), B.1.526 (Iota), B.1.525 (Eta), B.1.351 (Beta), B.1.1.7 (Alpha), B.1.1.529 (Omicron), and B.1.1.318.
SupplementaryFigure12.pdf
Supplementary Figure 12. The richness distribution of the mutation rate of C->T across the whole SARS-CoV-2 genome in different lineages. The lineages shown include P.3 (Theta), P.1 (Gamma), C.37 (Lambda), C.36.3, C.1.2, B.1.640.2, B.1.640.1, B.1.621 (Mu), B.1.617.3, B.1.617.2 (Delta), B.1.617.1 (Kappa), B.1.526 (Iota), B.1.525 (Eta), B.1.351 (Beta), B.1.1.7 (Alpha), B.1.1.529 (Omicron), and B.1.1.318.
SupplementaryFigure13.pdf
Supplementary Figure 13. The richness distribution of the mutation rate of G->T across the whole SARS-CoV-2 genome in different lineages. Lineages are the same as in Supplementary Figure 12.
SupplementaryFigure14.pdf
Supplementary Figure 14. The richness distribution of the mutation rate of A->T across the whole SARS-CoV-2 genome in different lineages. Lineages are the same as in Supplementary Figure 12.
SupplementaryFigure15.pdf
Supplementary Figure 15. The richness distribution of the mutation rate of T->C across the whole SARS-CoV-2 genome in different lineages. Lineages are the same as in Supplementary Figure 12.
SupplementaryFigure16.pdf
Supplementary Figure 16. The richness distribution of the mutation rate of T->G across the whole SARS-CoV-2 genome in different lineages. Lineages are the same as in Supplementary Figure 12.
SupplementaryFigure17.pdf
Supplementary Figure 17. The richness distribution of the mutation rate of G->C across the whole SARS-CoV-2 genome in different lineages. Lineages are the same as in Supplementary Figure 12.
SupplementaryFigure18.pdf
Supplementary Figure 18. Smoothed distribution of new SARS-CoV-2 cases per million in different severely affected areas, such as Africa, Asia, Europe, North America, Oceania, and South America, along the timeline.
SupplementaryFigure19.jpg
Supplementary Figure 19. Smoothed distribution of new SARS-CoV-2 cases per million in Africa along the timeline. Red boxes mark outbreaks that did not continue and were excluded from our analysis.
SupplementaryFigure20.jpg
Supplementary Figure 20. Smoothed distribution of new SARS-CoV-2 cases per million in Asia along the timeline.
SupplementaryFigure21.jpg
Supplementary Figure 21. Smoothed distribution of new SARS-CoV-2 cases per million in South America along the timeline. Red boxes mark outbreaks that did not continue and were excluded from our analysis.
SupplementaryFigure22.jpg
Supplementary Figure 22. Smoothed distribution of new SARS-CoV-2 cases per million in Europe along the timeline. Red boxes mark outbreaks that did not continue and were excluded from our analysis.
SupplementaryFigure23.jpg
Supplementary Figure 23. Smoothed distribution of new SARS-CoV-2 cases per million in North America along the timeline. Red boxes mark outbreaks that did not continue and were excluded from our analysis.
SupplementaryFigure24.jpg
Supplementary Figure 24. Smoothed distribution of new SARS-CoV-2 cases per million in Oceania along the timeline. Red boxes mark outbreaks that did not continue and were excluded from our analysis.
SupplementaryFigure25.pdf
Supplementary Figure 25. Frequency distribution of different types for substitutions 1-20. Different types of amino acid substitutes are represented by different colours, such as A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), Stop, T (Thr), V (Val), W (Trp), Y (Tyr).
SupplementaryFigure26.pdf
Supplementary Figure 26. Frequency distribution of different types for substitutions 21-40. Different types of amino acid substitutes are represented by different colours, such as A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), Stop, T (Thr), V (Val), W (Trp), Y (Tyr).
SupplementaryFigure27.pdf
Supplementary Figure 27. Frequency distribution of different types for substitutions 41-60. Different types of amino acid substitutes are represented by different colours, such as A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), Stop, T (Thr), V (Val), W (Trp), Y (Tyr).
SupplementaryFigure28.pdf
Supplementary Figure 28. Frequency distribution of different types for substitutions 61-80. Different types of amino acid substitutes are represented by different colours, such as A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), Stop, T (Thr), V (Val), W (Trp), Y (Tyr).
SupplementaryFigure29.pdf
Supplementary Figure 29. Frequency distribution of different types for substitutions 81-100. Different types of amino acid substitutes are represented by different colours, such as A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), Stop, T (Thr), V (Val), W (Trp), Y (Tyr).
SupplementaryFigure30.pdf
Supplementary Figure 30. Frequency distribution of different types for substitutions 101-120. Different types of amino acid substitutes are represented by different colours, such as A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), Stop, T (Thr), V (Val), W (Trp), Y (Tyr).
SupplementaryFigure31.pdf
Supplementary Figure 31. Frequency distribution of different types for substitutions 121-140. Different types of amino acid substitutes are represented by different colours, such as A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), Stop, T (Thr), V (Val), W (Trp), Y (Tyr).
SupplementaryFigure32.pdf
Supplementary Figure 32. Frequency distribution of different types for substitutions 141-158. Different types of amino acid substitutes are represented by different colours, such as A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), Stop, T (Thr), V (Val), W (Trp), Y (Tyr).
SupplementaryTable1.xlsx
Supplementary Table 1. Comprehensive information on Coronaviridae, especially different lineages of SARS-CoV-2. Sheet “Plague” contains data on the outbreak of epidemic disease in human history. Sheet “Cov” contains the taxonomy of Coronaviridae according to the NCBI. Sheet “branch” contains the studied groups related to the outgroup on the evolutionary branch of Coronaviridae. Sheet “host” contains the basic information of hosts of Coronaviridae. Sheet “AA substitutes” contains specific amino acid substitutions with different mutation types in different lineages of SARS-CoV-2. Sheet “Codon Usage” contains the codon numbers of different lineages of SARS-CoV-2. Sheet “RSCU” contains RSCU values for different lineages of SARS-CoV-2. Sheet “lineages” contains the global distribution of lineages of SARS-CoV-2.
SupplementaryTable2.xlsx
Supplementary Table 2. The comprehensive information of true sets. Sheets include the distribution of a total of 144 amino acid substitutions in different lineages, new cases per million from different countries/regions on different continents, and the statistics of different mutation types of amino acid substitutes.
SupplementaryTable3.xlsx
Supplementary Table 3. Comprehensive information of ZHU prediction. Sheets include 75 training sets and 43 validation sets, the information on the input data, GLM coefficient estimates and AIC values, 171 significant substitutions, prediction results for the training sets, GLM and reordered processing data, the ZHU prediction model, prediction results for the validation sets, and performance parameters.
Readme.txt
Supplementarymaterials.z01
Supplementarymaterials.z02
Supplementarymaterials.z03
Supplementarymaterials.z04
Supplementarymaterials.z05
Supplementarymaterials.z06
Supplementarymaterials.z07
Supplementarymaterials.z08
Supplementarymaterials.z09
Supplementarymaterials.z10
Supplementarymaterials.z11
Supplementarymaterials.z12
Supplementarymaterials.zip
