The mutational landscape of SARS-CoV-2 varies at the dominant viral genome sequence and minor variant genomes4,8. After spill overs from the intermediate animal host in the Huanan Seafood Wholesale Market1,2, subsequent human-to-human infection has resulted in SARS-CoV-2 genomic diversification from the original zoonotic spill overs. These mutations have arisen through genomic changes from virus and host derived single nucleotide polymorphisms (SNPs), homologous and heterologous recombination events, and insertions and deletions (indels). The first major changes in the SARS-CoV-2 genome in the human population were the D614G substitution in the spike glycoprotein (spikeD614G) accompanied by the P323L substitution in the viral RNA dependent RNA polymerase (NSP12) (NSP12P323L). These changes were associated with increased transmissibility, replication and fitness5,6,9.
Diversity in dominant and through minor variant genomes exists in humans/animals infected with SARS-CoV-210 (Extended Data Fig. 1A). These genomes will have both synonymous (non-coding) and non-synonymous (coding) differences. Minor variant genomes may become dominant when they confer a selective advantage either through the specific repertoire of SNPs and/or recombination. Since the start of the COVID-19 pandemic, different dominant viral genome sequences and non-synonymous changes appear to rise and fall in the SARS-CoV-2 global sequences11,12. Genetic change has resulted in the sequential dominance of several Variants of Concern (VoCs) associated with waves of high case numbers throughout the pandemic. With time, VoCs have emerged with examples representing large genomic jumps from circulating strains, with many novel mutations having the appearance of occurring at once, such as with the emergence of Alpha13 and Omicron14. One hypothesis is that these divergent variants may stem from persistent infection in immune-deficient hosts13,15. In this study, artificial intelligence/machine learning (AI/ML) was used to investigate the evolutionary patterns of amino acids changes at a population level and used to predict the conserved and future mutable amino acid sites of the spike protein, informing vaccine and antibody therapies for SARS-CoV-2.
Genetic variation in SARS-CoV-2 increased with time.
To understand the emergence of minor variant genomes, and derive amino acid variation frequencies, sequence data gathered by COG-UK (COVID-19 Genomics UK consortium) was analysed. Sequences produced by the ARTIC amplification method and sequenced by Illumina NovaSeq 6000 were selected. In total, 96,559 SARS-CoV-2 consensus genomes were derived, together with the associated minor variant genomic information (representing approximately 100 genomes per day for nearly three years, starting in March 2020 up to 31st December 2022). The nucleotide sequences were in silico translated into amino acid sequences. The average variation frequencies of each amino acid were calculated to investigate the pattern of mutational emergence in SARS-CoV-2 proteins over the three-year period. We found the presence of artificial low-frequency genetic variants due to errors acquired during amplicon generation and sequencing16 (Extended Data Fig. 1B and C; see more details in Supplementary File 1). To mitigate this effect, the average variation frequencies of amino acids at the same site in different months was compared with each other to reveal differences in frequencies of amino acid changes, instead of comparing different sites. The average variation frequencies of amino acid sites in each protein were compared across time by month (Fig. 1a). All proteins (except ORF7a, ORF7b, ORF8 and ORF10) had a similar pattern of increasing diversity from the start of the pandemic until a slight decrease associated with the bottleneck and spread of the emergence of the Alpha variant in October 2020 and then subsequent increasing diversity at the minor genomic variant level between December 2020 and January 2021.
Emergence of amino acid substitutions associated with VoCs in the UK.
The monthly average variation frequency of each amino acid site in SARS-CoV-2 from April 2020 to October (1st to 15th) 2021 was compared to that of March 2020 (Extended Data Fig. 2 – animation of variation for each protein). Because the ARTIC primers were updated from version 3 to version 4 from 15th October 2021 (to mitigate amplicon drop-outs caused by the emergence of the Delta variant), the monthly average variation frequency from November 2021 to December 2022 was compared to that of October (16th to 31st) 2021. As a proportion of the viral genome/proteome, the data indicated that spike had the greatest variation, reflecting the fact that spike is under strong selection pressure from host immunity (Extended Data Fig. 2). As an example, the average variation frequencies of amino acid sites in April 2020 all data points revolved around the 1:1 line, indicating that the variation frequencies of amino acid sites in April were like those in March for spike (Extended Data Fig. 3A). The yellow-colored dot in the figure that separated from the major cluster is the D614G substitution with average variation frequency ~1.0 in either March or April 20205. The D614G substitution was then maintained in all subsequent variants in this study without a decline in the variation frequency over a three-year period (Extended Data Fig. 3).
Two variants (Alpha and Delta) were prevalent in the UK before the introduction and evolution of the Omicron variants (Extended Data Fig. 3). Mutations of Alpha were emerging from October 2020 and were fixed in January 2021 (Fig. 1B and Extended Data Fig. 3). Afterwards, mutations of Delta emerged from April 2021 (Fig. 1C and Extended Data Fig. 3) and were fixed by June 2021, whilst Alpha and Delta were coexisting in May 2021. Our previous findings showed that it took three months for the D614G mutation (associated with the Lineage B.1 variant) to increase in frequency from the minor variant level to the dominant sequence during the containment phase in the UK with the two variants coexisting in the same sample (and therefore person) at a high frequency in the second month5. Most of the mutations from the Alpha and Delta variants followed the same pattern of transition in the analyzed data, except P681H from the Alpha variant (Fig. 2A and Extended Data Fig. 3), and R158G, A222V and P681R from the Delta variant (Fig. 2B and Extended Data Fig. 3). P681H (Alpha) and R158G (Delta) became dominant sequence at a slower rate than other amino acid substitutions associated with these VoCs. Meanwhile, P681R (Delta) was maintained at a constant coexisting frequency (note that for Delta there were several co-circulating sub-lineages17). Substitutions associated with the first Omicron variant emerged from the minor variant genomes to the dominant sequence from December 2021 and became fixed by February 2022 (Extended Data Figs. 2 and 3). Afterwards, new mutations were continually emerging and withdrawing, associated with new Omicron sub-variants (Extended Data Figs. 2 and 3).
Evolutionary patterns and mutational signals from the start of the pandemic
To investigate whether we could predict or rank potential amino acids in spike that would be subject to evolutionary pressure, the W value (referred to as the Shapiro-Wilk statistic W - a test of normality) was used to measure the evolutionary patterns of amino acids in these samples. If we assume that the virus was not subjected to any evolutionary forces, the observed variations at amino acid sites would be a result of amplification and sequencing errors and/or stochastic errors during viral replication, in which case the frequencies of variations at an amino acid site in monthly samples should exhibit symmetry around their mean and display an approximate normal distribution and have a high W value (Fig. 3A, blue curve). These amino acids would cluster to the top when the relationship between the W value and average variation frequency is visualized (amino acids contained within the blue circle in Fig. 3B). Viruses under evolutionary pressure would have increased frequencies of variation in some individuals (samples) at amino acid sites relevant to adapting to the pressure. This would increase the thickness of the right tail, generating a right-skewed distribution and lower W value (Fig. 3A, green curve). This would result in amino acids that were present in the top blue cluster shifting downwards - such amino acid sites would appear in the green circle of Fig. 3B. Stronger evolutionary pressure would increase the density of higher variation frequencies, generating a left-skewed distribution and a much lower W value (Fig. 3A, red curve). This would result in the amino acid site shifting to the far right of the plot, as highlighted in the red circle of Fig. 3B (further details can be found in Supplementary File 1).
The relationship between the W value and average variation frequency was visualised for each amino site in spike for each month. Two separated clusters for amino acid variation in spike were observed throughout all the months analyzed (Extended Data Fig. 5 and 6 and Fig. 3B). Some amino acids in spike kept high W values close to 1 through all three years of the pandemic and some shifted between high and low W value clusters (Extended Data Figs. 5 and 6). All amino acid substitutions that defined the VoCs travelled through the cluster associated with a low W value before they became fixed in the population (Extended Data Figs. 6). However, we found that amino acid sites associated with the VoCs presented in the low W value cluster or shifted between the two clusters (under selection pressures) from start of the pandemic (Extended Data Figs. 5 and 6). Especially, substitutions associated with Omicron could be observed in the low W value cluster in the first three months of UK data analyzed in 2020, although this variant emerged in late 2021 with a proposed origin in Southern Africa14 (Extended Data Figs. 5 and 6). These mutational signals suggest that such variants could have emerged anywhere and was perhaps only a matter of time and number of people infected (assuming the same pattern of recombination that led to VoCs).
Predicting past and future mutable sites in SARS-CoV-2 spike
The determination of a W value for each amino acid in spike in a large data set, given the background context of evolution in a population, may provide a predictor for whether that site would be subject to future selection. To predict which sites in spike may be subject to future evolutionary pressure we used an AI/ML technique that identified models (patterns, structures, or relationships) in a dataset without any prior knowledge. The K-Medoid Clustering (pam) algorithm was used to separate the conserved amino acid sites from the mutable amino acid sites in spike using the W values of each amino acid site as variables by month. With this algorithm, amino acid sites with similar W values were grouped into the same cluster based on the W values from March 2020 to May 2020 for three months (pam3), March 2020 to August 2020 for six months (pam6), March 2020 to August 2020 for nine months (pam9), March 2020 to February 2021 for 12 months (pam12) and March 2020 to December 2022 for 34 months (pam34) (Fig. 4A and B, Extended Data Fig. 7).
To evaluate the performance of the models to cluster amino acid with similar W values and to exam whether these were related to potential variability we examined substitutions in spike associated with WHO VoCs (https://covariants.org updated on 2023-04-26, Supplementary information 1), substitutions that are most frequent (tracked by https://www.bv-brc.org updated on 2023-01-12) in frequencies of 0.001 and 0.005, (Supplementary information 2) and substitutions that appeared in any Pango lineage (collected by https://cov-lineages.org extracted on 2023-06-10, Supplementary information 3). In general, the pam clusters performed better when the models were calculated using W values from more months (Fig. 4, Extended Data Fig. 7 and Table 1). Amino acid sites within the cluster exhibiting high W values are potentially conserved, while substitutions are more likely to occur on those amino acids within the cluster with low W values. In all the developed models, only a small number of tested mutation sites were found in cluster 1, which had the highest W values. Most of the tested mutation sites were found in the cluster with the lowest W values (Extended Data Table 1). Using the pam12 model, approximately 13% of the mutation sites of interest were found in cluster 1, while approximately 87% were found in the other clusters (Extended Data Table 1). In other words, the pam12 model correctly clustered around 87% of the mutation sites. The pam12 model successfully predicted 27 out of 39 (69.2%) Omicron-specific mutation sites that emerged since late 2021 (Supplementary Information 1). Similarly, the pam34 model correctly clustered 98% of the mutation sites of interest (Extended Data Table 1). Both the pam12 and pam34 models also demonstrated excellent performance in clustering the mutation sites that occurred most frequently (0.001 and 0.005) (Extended Data Table 2).
These models were also applied to cluster the 413 substitution sites in spike found across all Pango lineages (total 2600), including 306 mutation sites (in 2560 lineages) that were identified since March 2020 (Supplementary Information 3). Although most of these Pango lineages were not prevalent and the mutations occurred at these sites were rare, the pam12 model was still able to predict approximately 75% of the 306 mutation sites. This method not only clustered amino acid sites with similar patterns of W values but also served as a good predictor for identifying amino acid sites with a high likelihood of substitution. The conserved amino acid sites identified by the pam12 model included 92% (587) of the conserved amino acid sites identified by the pam34 model, indicating that W values derived using 12 months from start of the pandemic could be utilized to develop an effective predictive model for future evolutionary sites in spike.
The pam34 spike model was developed based on a large dataset (96,209 samples) collected over a three-year period since the beginning of the pandemic. The virus has been subjected to various evolutionary forces that are present in these individuals (e.g. transmission, prior infection with SARS-CoV-2 and thus immunity, vaccination). With the exception of a virological process such as recombination and treatment with antivirals, it is unlikely that many additional evolutionary forces will act on the virus at a population level. Furthermore, no substitutions caused by the vaccine have been reported18. Therefore, we predict that the pam34 model would be expected to remain applicable for a significant period in the future. The sites were modelled onto the spike protein using PyMOL (Fig. 4C and D) and, for example, identifies position V483 on the receptor binding domain that is associated with escape to monoclonal antibody therapy19,20 and G486 that is associated with a loss in neutralisation21. This study identifies sites in the spike protein that are subject to change but also highlights those sites that remain conserved (Fig. 4C and D) and would logically inform part of a universal vaccine that would be protect against potential future variants.