The various genomic features described in this study are based on the data obtained during the period between the emergence of the first wave and the peak of the last pandemic wave, when the SC2 pandemic started to subside. Our observations and analyses are described at several different levels of genomic demography during this period. Since some features, observed at the whole population or major genomic group (MGG) level, are caused by the features observed at one or more deeper levels of individual MGG or individual virus level, we contiguously numbered all 13 notable genomic features observed with
At the level of the whole population:
- There are approximately 44,000 unique sequence-variants of SC2. An examination of the curated whole-genome sequences of 406,004 SC2 viruses (from the NCBI database (2) as of the peak of the last pandemic wave, around February 2022) revealed that most of them are highly redundant for the ECR. We identified 43,678 unique sequences (the “44K” data) that we call “reliable” based on our filtering criteria (see Table S1 in B. Supplementary Figures and Tables of the SI). Thus, approximately 89% of all sequences are duplicates of the remaining 11%, which are the unique sequences that differ from the sequence of the Wuhan-Hu-1 strain (2), chosen as the reference sequence in this study.
- Mutational variations are found at 63% of the ECR positions of the reference sequence. Mutational events occurred very broadly, covering 18,655 positions of the ECR of the reference, which is approximately 63% of 29,549 ECR sequence positions (see Table S2 in B. Supplementary Figures and Tables of the SI). An overwhelming portion of the mutational positions showed very low mutational frequencies, suggesting that they did not contribute to the pandemic. However, there are 104 positions (approximately 0.35% of the ECR) where the frequencies of the unique mutated-genotypes are extremely high (approximately 90% or higher; see Observation #10 below).
- There are only two types of variational events. In the 44K data, there were only two types of variational (mutational) events observed during the pandemic period: the overwhelming portion (99.6%) of all variational events were point substitutions, and the remaining variational events were short indels, approximately 90% of which were deletional events (see Table S2 in B. Supplementary Figures and Tables of the SI).
- One to 77 variational events occurred per virus. Figure 1a shows that, despite the vast number of variants, the number of genomic variational events per virus ranged from one to 77 by the time of the peak of the last pandemic wave (February 2022). The average accumulation (red line) corresponds to 29 variational (mutational) events per virus per year, or approximately one variational event per 1000 loci per year.
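The whole-population figures above follow from simple counts. The sketch below is illustrative only: the helper `unique_fraction` and the toy sequences are assumptions made for this example, not the authors' pipeline. It deduplicates a small list of sequences and then rechecks the quoted percentages from the totals stated in the text.

```python
from collections import Counter

def unique_fraction(sequences):
    """Return (number of distinct sequences, distinct / total)."""
    counts = Counter(sequences)
    return len(counts), len(counts) / len(sequences)

# Toy input (hypothetical): five ECR sequences, three of them distinct.
toy = ["ACGT", "ACGT", "ACGA", "ACGT", "TCGT"]
n_unique, frac = unique_fraction(toy)

# Totals quoted in the text.
TOTAL_GENOMES = 406_004       # curated whole-genome sequences
UNIQUE_GENOMES = 43_678       # the "44K" data
ECR_LENGTH = 29_549           # ECR positions in the reference
VARIABLE_POSITIONS = 18_655   # positions with at least one variational event
EVENTS_PER_YEAR = 29          # average events per virus per year

pct_unique = 100 * UNIQUE_GENOMES / TOTAL_GENOMES          # roughly 11%
pct_variable = 100 * VARIABLE_POSITIONS / ECR_LENGTH       # roughly 63%
events_per_1000_loci = 1000 * EVENTS_PER_YEAR / ECR_LENGTH # roughly 1 per year
```

The rounding in the text (11%, 63%, one event per 1000 loci per year) is consistent with these ratios.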
At the level of four major genomic groups (MGGs):
- Four major genomic “clusters” based exclusively on genomic variational distance. Figure 2 shows the plot of principal component analysis (PCA) (10; see Principal Component Analysis (PCA) in A. Details of Materials and Methods in the SI) on the entire 44K data set. By the time of the peak of the last pandemic wave, all the unique genomic variants in the study could be grouped into four major clusters that were well segregated in the PCA space, exclusively based on genomic variational distances without assuming any evolutionary model. We named them Major Genomic Group (MGG) I, II, III, and IV (also see Fig. S2 (video) in B. Supplementary Figures and Tables of the SI). Among the many different labels we tested, the Variants of Concern (VOCs) of the World Health Organization (WHO) (7) showed the simplest correlation: MGG II, III and IV consisted almost exclusively of the Alpha, Delta, and Omicron variants of WHO, respectively, and MGG I consists of a mixture of the rest (other earlier VOCs and those not belonging to any designated VOC, including the reference group). When each MGG is magnified (not shown), it consists of multiple subgroups, each of which appears to be related to a different lineage, where members are separated by one or a few variational events in PCA space.
- Four major genomic “clades” based on the phylogenetic tree of the genomic variational distance and evolutionary model. Using an independent and different method on the entire 44K data set, Fig. 3 shows a circular representation of a phylogenetic tree in a distance space of genomic variational events under the assumptions of a particular evolutionary model of bifurcating/neighbor-joining and maximum parsimony of branch-lengths (BioNJ (11); see the BioNJ tree in A. Details of Materials and Methods of the SI). The figure also reveals four major “clades” of the tree, corresponding approximately to the four MGGs of the PCA in Observation #5 above.
- Pandemic waves (except the first wave) consisted of the variants from more than one MGG. The pandemic came in several waves. As shown in Fig. 1a, although the average number of unique sequences of the emerging variants gradually increased as the pandemic progressed (the red line in Fig. 1a), the ranges of the number of the variants undulate like waves as the pandemic progressed. Furthermore, since all the subsequent variants of the same sequence (not represented in Fig. 1) continue propagating as the pandemic progresses, each wave (except the first wave), at most time points, consists of the variants from more than one MGG as shown in Table S3 in B. Supplementary Figures and Tables in the SI. This observation becomes relevant in the Implications and Discussion: “Panvalent” vaccine design against the current and next wave(s) below.
- “Coemergence” of the founders of MGG II, III, and IV: Although the centers of the four MGG populations appear to have emerged sequentially (see the red line in Fig. 1), there are three lines of evidence for the approximate “coemergence” of the founders of the last three MGGs:
(a) Based on sample collection-dates: Each panel in Figure 1a-e shows the sample-collection date of each FUS (the first sample of each subgroup with a unique sequence of the ECR) on the X-axis vs. the count of genomic variational events on the Y-axis. These findings suggest that the members of MGG I evolved during a period of one and a half years (see Fig. 1b). Then, those of MGG II and MGG III (see Figs. 1c and 1d), as well as the “early” MGG IV (a small isolated subcluster located far left of the major population cluster of the MGG IV, the “late” MGG IV (see Fig. 1e)), emerged at almost the same time around Dec. 2020.
(b) Based on the cumulative branch lengths in the phylogenetic tree: An examination of the cumulative branch lengths of the founders of the three MGGs (II, III, and IV) in Fig. 3 shows that they are very similar, suggesting that the three founders emerged within a short time range.
(c) Most of the loci of very highly conserved variant-genotypes (evolutionarily selected) of each of the 3 late-emerging MGGs were not correlated with each other (see Observation #10 below) suggesting that each of the three MGGs evolved independently.
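Both the PCA clustering and the BioNJ tree described above start from distances between genomes measured in variational events. The sketch below is a minimal illustration under an assumption: the distance between two genomes is taken to be the number of events present in exactly one of them. The authors' exact distance definition is given in their SI; the genome names, loci, and group labels here are hypothetical.

```python
from itertools import combinations

def variational_distance(events_a, events_b):
    """Number of variational events present in exactly one of the two genomes."""
    return len(set(events_a) ^ set(events_b))

# Hypothetical genomes, each described by its (locus, variant) events
# relative to the reference.
genomes = {
    "g1a": {(241, "T"), (3037, "T")},
    "g1b": {(241, "T"), (3037, "T"), (500, "A")},
    "g2a": {(241, "T"), (3037, "T"), (1000, "T"), (2000, "A"), (3000, "T")},
    "g2b": {(241, "T"), (3037, "T"), (1000, "T"), (2000, "A"), (3000, "T"), (700, "G")},
}
group_of = {"g1a": "A", "g1b": "A", "g2a": "B", "g2b": "B"}  # illustrative labels

within, between = [], []
for a, b in combinations(genomes, 2):
    d = variational_distance(genomes[a], genomes[b])
    (within if group_of[a] == group_of[b] else between).append(d)
```

In this toy data every within-group distance is smaller than every between-group distance, which is the kind of separation that a model-free method such as PCA can expose without assuming an evolutionary model.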
At the level of the individual Major Genomic Group:
- Bimodal distribution of mutational frequencies: The loci of mutations in all four MGGs are uniformly distributed throughout the entire length of the ECR (see Fig. 4). However, at a given locus of each MGG, the distribution of the frequency of each variational event is highly bimodal: the variational frequencies of the overwhelming portion of all loci are lower than 10% (rarely conserved, i.e., random events as indicated by a green “band” formed by contiguous green dots) of the members of each MGG population, but for a very few loci, the frequency is more than 90%, i.e., very highly conserved or selected (see Observation #10 below), as shown by the vertical lines with dots on top.
- Highly-conserved variant-genotypes (HCVGs) for each MGG: Examination of the genotypes of each MGG in detail revealed that each MGG has a very small number of selected loci with HCVGs, which are present only among all variant viruses, but not among the founding viruses. These HCVGs have the following interesting characteristics and implications (also see Implications and Discussion below):
a. 129 HCVGs: Figure 4 shows that each of 4, 32, 35 and 58 loci of MGG I, II, III, and IV, respectively, has an HCVG (a variant genotype conserved in approximately 90% or more of the members of that MGG, shown as vertical lines with dots on top). These 129 HCVGs among the four MGGs are found at 104 unique loci (i.e., the union of the 129 loci from the 4 MGGs). Although they correspond to less than 0.4% of all loci of the reference sequence, they must play very important roles in the survival of the variant viruses in each MGG; thus, they are “naturally selected” for survival over host immunity.
b. “Universal” and “MGG-specific” HCVGs: There are only 4 HCVGs, at loci 241, 3037, 14408, and 23403 (see Fig. S3 in B. Supplementary Figures and Tables of the SI and Fig. 4), that are highly conserved in all four MGGs. Thus, they are “universal” HCVGs, and the rest are “MGG-specific” HCVGs. This observation suggests that the four HCVGs of MGG I (the first MGG to emerge) were inherited from the founder(s) of MGG I, which may have acquired the four mutations while in a nonhuman host, thus gaining the initial capacity to infect humans.
c. Coemergence of MGG-specific HCVGs: Almost all loci of the HCVGs of each of the remaining three MGGs are not correlated, i.e., they are highly conserved within each MGG only, suggesting that the three MGGs have evolved independently, and that their founders emerged at almost the same time (see Observation #8). This suggests that, although each wave of infection may appear to occur sequentially, the order of emergence of the founders of each wave may not be sequential.
d. Near-complete set of MGG-specific HCVGs: Between 88% and 97% of the members of each MGG had the complete set of their respective HCVGs (see Table S4 in B. Supplementary Figures and Tables of the SI).
e. Predictability of HCVGs of the next wave: Taken together, these observations suggest that the HCVGs are selected naturally during one wave, and some of the selected HCVGs are inherited in the following waves, or, at least, in the immediate next wave or waves. This implies that one or more sets of the current HCVGs from the four MGGs can be predicted to become the HCVGs of the new MGG(s) in the next wave, although which current MGG they will come from is not predictable.
The above genomic characteristics of HCVGs, (a) to (e), suggest a practical approach for designing a new type of vaccine (see “Panvalent” vaccine design for the current and next wave(s) in Implications and Discussion below; see also Fig. 4 and Fig. S3 in B. Supplementary Figures and Tables of the SI).
f. Interestingly, both Fig. 4 and Table 1 show that MGG IV has the largest number of MGG-specific HCVGs (58 HCVGs), mostly due to the presence of many HCVGs confined to the “spike” gene (S gene) and the N gene (nucleocapsid gene), suggesting that they may account for MGG IV members’ enhanced infectivity over other variants. Because of MGG IV’s unusual jump in variational changes (most of which are MGG-specific HCVGs), many hypotheses have been proposed on its origin (12-15).
- Two Big Bursts in mutational events at the beginning and the end of the pandemic period: Figure 5 shows a plot combining a box plot and a violin plot of the distribution of mutational event counts for the four MGGs. There are two jumps (“bursts”) in average mutational events: the first jump occurred at the very beginning of the pandemic, when the founder subgroup (containing the reference) gained approximately 18 mutational events, on average, to emerge as all variants of MGG I (minus the reference subgroup); the other jump occurred at the emergence of the last group, MGG IV, whose founders also gained 18 mutational events, on average, to emerge from MGG III, or 23 mutational events to emerge from MGG II. One possible interpretation is that the first burst allowed the virus to overcome a new environment, i.e., the human host’s immune system (not yet exposed to SC2), by generating a highly diverse population of viral variants to increase the chance of giving birth to a few highly human-infective variants; the second burst was to overcome the improved host immune system (which had by then acquired immunity through vaccines and natural SC2 infections) by, again, generating a highly diverse variant population. The virus won the first competition but partially lost the second, thus losing its pandemic infectivity while still maintaining endemic infectivity.
- HCVGs in a wide range of genomic regions. Table 1 shows that the HCVGs of SC2 are found in the coding as well as non-coding regions of the viral genome, suggesting that they play diverse roles in the survival of the variant SC2s by overcoming the wide spectrum of immune systems of human hosts, such as the antibody-mediated and the cell-mediated immune systems as well as the innate immune system of the host. It is noteworthy, however, that two small regions (the coding region of the S protein, a protein located on the outer surface of the virus, and the N protein, the nucleocapsid protein coating the viral RNA inside of the virus) have the highest density of the HCVGs. Their locations in the S region suggest that the HCVGs of the region may interfere with the initial interaction between the virus and host cells required for the initiation of viral infection. However, the HCVGs of the N region may have a direct or indirect role in overcoming the host immune system through their involvement in viral assembly, the host cell cycle, interferon-mediated innate immunity and/or other processes (16, 17, 18) via non-antibody-mediated mechanism(s).
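The HCVG bookkeeping in Observation #10 (a per-group frequency threshold, then the union and the intersection across groups) can be sketched as follows. This is a hedged illustration, not the authors' code: the function `find_hcvgs`, the group names, and the toy genomes are hypothetical, and the 0.90 cutoff stands in for the "approximately 90% or more" criterion in the text.

```python
from collections import Counter

HCVG_THRESHOLD = 0.90  # assumed cutoff for "approximately 90% or more"

def find_hcvgs(members, threshold=HCVG_THRESHOLD):
    """Return {locus: genotype} for variant genotypes carried by >= threshold of members.

    `members` is a list of dicts mapping locus -> variant genotype;
    loci carrying the reference genotype are simply absent.
    """
    counts = Counter((locus, gt) for m in members for locus, gt in m.items())
    n = len(members)
    return {locus: gt for (locus, gt), c in counts.items() if c / n >= threshold}

# Hypothetical toy groups of 10 members each.
group_x = [{241: "T", 100: "A"} for _ in range(9)] + [{241: "T"}]
group_y = [{241: "T", 200: "G"} for _ in range(10)]
group_y[0][300] = "C"  # rare event, carried by 1 of 10 members

hcvgs = {"X": find_hcvgs(group_x), "Y": find_hcvgs(group_y)}
all_loci = set().union(*(h.keys() for h in hcvgs.values()))             # union of HCVG loci
universal = set.intersection(*(set(h.items()) for h in hcvgs.values())) # conserved in every group

# Counts quoted in the text: HCVGs per MGG, unique loci, ECR length.
reported = {"I": 4, "II": 32, "III": 35, "IV": 58}
total_hcvgs = sum(reported.values())   # 129
pct_of_ecr = 100 * 104 / 29_549        # roughly 0.35%
```

In the toy data, locus 241 plays the role of a "universal" HCVG, while loci 100 and 200 are group-specific; the rare event at locus 300 falls far below the threshold.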
At the level of individual variations:
- Multiallelic Single Nucleotide Variations (SNVs): When all SNV loci (conserved and nonconserved) of all MGGs were compared with the reference, an unexpectedly high proportion of multiallelic SNVs was found in our 44K data. Table S2 in B. Supplementary Figures and Tables of the SI shows that the biallelic SNV loci account for approximately 68% of all SNV loci, the tri-allelic loci for approximately 28%, and the tetra-allelic loci for approximately 4%. This observation is in contrast to the single nucleotide polymorphisms (SNPs) found in the germline genomes of humans, the only other life form for which massive genomic information is available, where an overwhelming portion of all SNPs are biallelic (19).
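Classifying an SNV locus as bi-, tri-, or tetra-allelic amounts to counting the distinct bases seen at that locus. The sketch below assumes that the reference base counts as one allele, which is an assumption of this example rather than a definition taken from the text; the loci and bases are hypothetical.

```python
from collections import Counter

def allelic_class(reference_base, substituted_bases):
    """Classify a locus by the number of distinct bases seen, reference included.

    Assumes all bases are drawn from A, C, G, T.
    """
    alleles = {reference_base} | set(substituted_bases)
    labels = {1: "invariant", 2: "biallelic", 3: "triallelic", 4: "tetra-allelic"}
    return labels[len(alleles)]

# Hypothetical loci: locus -> (reference base, bases observed in variants).
snv_loci = {
    101: ("C", ["T", "T", "T"]),
    202: ("A", ["G", "T"]),
    303: ("G", ["A", "C", "T"]),
    404: ("C", ["T"]),
}
class_counts = Counter(allelic_class(ref, subs) for ref, subs in snv_loci.values())
```

Dividing each entry of `class_counts` by the number of SNV loci gives proportions of the kind reported in Table S2.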
Implications and Discussion: “Panvalent” vaccine design against the current and next wave(s)
Some of the demographic characteristics described in the Observations section above can be re-interpreted for the purpose of designing a “panvalent” vaccine against the current as well as the next possible wave of Covid-19:
“Preview” of the genomic features of the next-wave viruses. To prevent the next wave of a current endemic or pandemic viral infection, one of the most effective ways would be to vaccinate local or world populations, respectively, against two types of the disease-causing viral strains: (a) one against the current (and most recent past) strains and (b), more importantly, the other against any future strains that can bypass the host immune system established for the currently prevailing strains. Experts in the field can come up with good ideas about the first type, but how to design vaccines for the next wave is not usually obvious because of the lack of key information needed to predict the genomic features of future viruses. However, with SC2 we have, for the first time, a massive amount of information on the genomic features of the current and past SC2 virus variants, their epidemiology, physiology, and others covering one entire period of the pandemic consisting of multiple waves of the infection. Such data allow us to perform a “hypothetical preview” experiment, where we position ourselves at a given time point and divide the entire pandemic period into “past/present” and “future”. Then, we ask what highly conserved genomic features in the “past/present” variants are passed to the “future” variants at high frequency.
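The "hypothetical preview" experiment can be sketched as a date split followed by a comparison of conserved genotypes on either side of the cutoff. Everything below is an illustrative assumption: the function names, the 0.90 threshold, the sample dates, and the toy variants are not from the paper, and in practice the comparison would be made per genomic group rather than on the pooled population.

```python
from collections import Counter
from datetime import date

def conserved_genotypes(variant_sets, threshold=0.90):
    """Variant genotypes carried by at least `threshold` of the given genomes."""
    counts = Counter(v for s in variant_sets for v in s)
    n = len(variant_sets)
    return {v for v, c in counts.items() if c / n >= threshold}

def preview(samples, cutoff, threshold=0.90):
    """Split samples at `cutoff` and report which conserved genotypes carry over."""
    past = [s["variants"] for s in samples if s["collected"] <= cutoff]
    future = [s["variants"] for s in samples if s["collected"] > cutoff]
    past_hc = conserved_genotypes(past, threshold)
    future_hc = conserved_genotypes(future, threshold)
    return past_hc, future_hc, past_hc & future_hc

# Hypothetical samples: collection date plus (locus, genotype) variants.
samples = [
    {"collected": date(2020, 3, 1), "variants": {(241, "T"), (1000, "A")}},
    {"collected": date(2020, 6, 1), "variants": {(241, "T"), (1000, "A")}},
    {"collected": date(2020, 9, 1), "variants": {(241, "T"), (1000, "A")}},
    {"collected": date(2020, 11, 1), "variants": {(241, "T")}},
    {"collected": date(2021, 2, 1), "variants": {(241, "T"), (2000, "G")}},
    {"collected": date(2021, 5, 1), "variants": {(241, "T"), (2000, "G")}},
]
past_hc, future_hc, retained = preview(samples, cutoff=date(2020, 12, 31))
```

Here the genotype at locus 241 is conserved before the cutoff and retained afterward, while the genotype at locus 2000 is conserved only in the "future" set, mirroring the distinction between inherited and newly selected conserved genotypes.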
Inheritance of HCVGs from MGG I to the remaining 3 MGGs. As shown for SC2 in Fig. 2, only four MGGs emerged during the pandemic period (see Observations #5 and #6), and, as mentioned in Observation #10, each of 4, 32, 35 and 58 loci of MGG I, II, III, and IV, respectively, had an HCVG, i.e., approximately 90% or more of the members of the MGG had an identical and highly conserved (“naturally selected”) variant genotype (see Figure 4 and Fig. S3 in B. Supplementary Figures and Tables of the SI).
As mentioned in Observation #10b, the 4 HCVGs of MGG I of the first wave are also found in the rest of MGGs of the following waves, suggesting that they are inherited from MGG I. This interpretation suggests that the next new wave(s) may emerge from the member(s) of one or more of the current MGGs carrying their respective HCVGs (see below).
Predictability of inheritance of HCVGs from one or more of the four current MGGs to most or all of the next-wave viruses. Based on the pattern of inheritance of the HCVGs mentioned above (see Observation #10(e)), we predict that all or most of the founders of the next-wave viruses will emerge from one or more of the 4 current MGGs. Therefore, although we cannot predict which members from which MGG(s) will be selected as the founders for the next wave, we can predict that the HCVGs present in the variants of the next wave will be a subset of the union of the HCVG sets of the four current MGGs.
HCVGs and the broad spectrum of host immunity. As mentioned in Observation #12, the HCVGs are found in almost all coding and noncoding regions of the virus, suggesting that they play an important role in the survival of the variant SC2s by overcoming the broad spectrum of the host immune system, including innate and adaptive immunity. For example, it is likely that the role of the HCVGs of the S gene (coding for the “spike protein” on the outer surface of the virus) is to help the variants overcome the host’s antibody-mediated adaptive immunity acquired before the current wave of infection, while the role of the HCVGs of the N gene (encoding the nucleocapsid protein inside the virus) may be to overcome the host’s cell-mediated adaptive immunity, innate immunity, and/or other types of immunity acquired before the current infection (16, 17, and 18).
A new approach for developing vaccines for the next wave: Since predicting the founder subpopulation of the new wave from approximately 44,000 unique viral sequences is impossible, we can use a different approach, in which we select one viral sequence from each of the four current MGGs such that the four selected viral sequences together contain all the HCVGs of the four MGGs, i.e., the “panvalent” HCVGs. Since one or a small number of subpopulations from the current 4 MGGs are expected to emerge as the founder(s) of the next wave, all of the next-wave viruses would inherit all the HCVGs of the founding subpopulation. For a more specific example of this approach, see C. “Panvalent” vaccine design in the SI.
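The selection step described above can be sketched as picking, for each group, one member that carries the group's complete HCVG set, and then checking that the selected panel covers the union of all HCVG sets. This is a minimal illustration under stated assumptions: `pick_representative`, the group names, the member ids, and the variants are hypothetical, and the authors' actual procedure is the one described in their SI.

```python
def pick_representative(members, hcvg_set):
    """Return the id of the first member carrying every HCVG in hcvg_set, else None."""
    for member_id, variants in members.items():
        if hcvg_set <= variants:
            return member_id
    return None

# Hypothetical HCVG sets per group, as (locus, genotype) pairs.
hcvgs = {
    "group_1": {(241, "T")},
    "group_2": {(241, "T"), (1000, "A")},
}
# Hypothetical members per group: id -> variants carried.
members = {
    "group_1": {"v1": {(241, "T"), (50, "C")}},
    "group_2": {"v2": {(241, "T")}, "v3": {(241, "T"), (1000, "A"), (60, "G")}},
}

# One representative per group (this toy data always yields a match).
panel = {g: pick_representative(members[g], hcvgs[g]) for g in hcvgs}
panvalent = set().union(*hcvgs.values())
covered = set().union(*(members[g][panel[g]] for g in panel)) & panvalent
```

Because most members of each MGG are reported to carry the complete HCVG set of their group, a representative of this kind should be easy to find in real data; the check `covered == panvalent` confirms that the panel carries every HCVG.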
Similar approaches can be applied to design “panvalent” vaccines against infections caused by other viruses, or even by bacterial and other agents if sufficient whole-genome information is available for their variants.