A prediction of mutations in infectious viruses using artificial intelligence

doi:10.21203/rs.3.rs-4922705/v1


Research Article


This work is licensed under a CC BY 4.0 License


Many subtypes of SARS-CoV-2 have emerged since its early stages, with mutations showing regional and racial differences. These mutations significantly affected the infectivity and severity of the virus. This study aimed to predict the mutations that occur during the evolution of SARS-CoV-2 and to identify the key features for making these predictions. We collected and organized data on the lineage, date, clade, and mutations of SARS-CoV-2 from publicly available databases and processed them to predict the mutations. In addition, we utilized various artificial intelligence models to predict newly emerging mutations and created various training sets based on clade information. Using only mutation information resulted in low performance of the learning models, whereas incorporating clade differentiation resulted in high performance in machine learning models, including XGBoost (accuracy: 0.999). However, mutations fixed in the receptor-binding motif (RBM) region of Omicron resulted in decreased predictive performance. Using these models, we predicted potential mutation positions for 24C, following the recently emerged 24A and 24B clades. We identified a mutation at position Q493 in the RBM region. Our study developed effective artificial intelligence models and features for predicting new mutations in continuously evolving infectious viruses.

Machine Learning

Deep Learning

SARS-CoV-2

Clade

Mutation

Prediction

Coronavirus Disease 2019 (COVID-19) has become prevalent worldwide since 2019. Although its infectivity has decreased recently, it still occurs frequently (https://ourworldindata.org/). The World Health Organization (WHO) distinguishes pathological differences by lineage based on combinations of mutation types, categorizing SARS-CoV-2 into variants of concern (VOCs), variants of interest (VOIs), and variants being monitored (VBMs). Through its continuous evolution, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continuously produced mutations, resulting in 40 clades. These mutations are associated with disease severity and transmission to humans[1, 2].

Mutations that have occurred during the evolution of SARS-CoV-2 have mainly been observed in the receptor-binding domain (RBD) of the spike protein. The RBD binds to the host cell's ACE2 receptor and is a key target for vaccines and treatments. Because mutations in this domain facilitate immune evasion, they require close monitoring[3–5].

A substantial amount of epidemiological, genetic, and vaccine-related data have been accumulated for COVID-19. Numerous studies have been conducted to utilize these data effectively for diagnosis, prevention, and treatment. Mathematical and statistical models have been used to quantify the virulence and transmissibility of SARS-CoV-2 and to predict the spread of the omicron variant[6, 7]. Artificial intelligence is also used to build literature-based learning models for diagnosis and prognosis related to SARS-CoV-2 or to utilize clinical markers and clinical information for severity prediction or diagnosis of COVID-19[8–10]. However, the frequent mutations in SARS-CoV-2 make it substantially more difficult to develop treatments or vaccines compared to previous infectious diseases, making it necessary to predict these mutations.

Obermeyer et al. used machine learning to identify mutations occurring in different structures of SARS-CoV-2 and predict new lineages[11]. Additionally, Zhou et al. employed a phylogenetic tree-based sampling method that integrated temporal and sequence information to predict mutations[12]. Nevertheless, accurately predicting the exact mutations that branch from a phylogenetic tree remains challenging. Studies on mutation prediction using deep learning have also declined (Supplementary Fig. 1). Moreover, using mutation data from the early Omicron variant for training resulted in lower accuracy in predicting recent mutations.

To enhance the accuracy of mutation prediction, it is crucial to carefully select the features used in the analysis and to account for the correlations between mutations. We aimed to provide precise mutation information by leveraging key data for prediction, including mutation details, time-series information, and the phylogenetic relationships between lineages.

Data collection

We utilized the RBM region of the Spike protein (P0DTC2) from Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), obtained from UniProt. Specifically, the sequence N-terminal NSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPY C-terminal was used.
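As a small illustration of how this 72-residue segment relates to spike protein numbering, the sketch below maps a spike position to the reference amino acid in the quoted sequence. It assumes the segment starts at residue 437, in line with the RBM range (437–508) used in the pre-processing step. The names `RBM_SEQ`, `RBM_START`, and `residue_at` are illustrative and are not taken from the authors' code.

```python
# Reference RBM segment of the spike protein (P0DTC2), as quoted in the text.
RBM_SEQ = (
    "NSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCN"
    "GVEGFNCYFPLQSYGFQPTNGVGYQPY"
)
RBM_START = 437  # assumed spike position of the first residue (range 437-508)


def residue_at(position: int) -> str:
    """Return the reference amino acid at a spike position inside the RBM."""
    index = position - RBM_START
    if not 0 <= index < len(RBM_SEQ):
        raise ValueError(f"position {position} is outside the RBM segment")
    return RBM_SEQ[index]


if __name__ == "__main__":
    # Positions discussed in the text: L452, T478, Q493, N501.
    for pos in (452, 478, 493, 501):
        print(pos, residue_at(pos))
```

Under this numbering, the reference residues at positions 452, 478, 493, and 501 are L, T, Q, and N, which agrees with the substitution names used later in the text (L452R, T478K, Q493E, N501Y).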

Pre-processing for learning

We retrieved metadata from Nextstrain’s ncov open data page (https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst). The data include the host, the collection date and region of each virus sample, the sex and age of the sampled individual, the lineage, and mutation information. We filtered and processed the data using the host, collection date, lineage, and mutation information to create the training data.

Before the learning process, we filtered and standardized the data as follows. We formatted the dates in the YYYY-MM-DD format. We specified the host as 'human'. For the Pango lineage and Nextstrain clade, we removed Not a Number (NaN) and '?' values. For amino acid substitutions, we filtered and only used mutations found in the RBM (437–508) region. Through the filtering process, we extracted 8,411,025 samples from a total of 8,586,162 samples (Fig. 1).
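The filtering rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the column names (`host`, `date`, `pango_lineage`, `Nextstrain_clade`, `aaSubstitutions`) and the substitution format `S:N501Y` are assumed, and records without RBM substitutions are kept with an empty list, which the text does not specify.

```python
import re
from datetime import datetime

RBM_START, RBM_END = 437, 508  # RBM range used in the text

# Assumed substitution format: "S:N501Y" (gene, reference residue, position, new residue).
SPIKE_SUB = re.compile(r"^S:([A-Z])(\d+)([A-Z])$")
MISSING = ("", "?", "NaN")


def is_valid_date(text: str) -> bool:
    """True only for complete, real dates written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def rbm_substitutions(field: str) -> list[str]:
    """Keep only spike substitutions whose position falls inside the RBM."""
    kept = []
    for item in field.split(","):
        item = item.strip()
        match = SPIKE_SUB.match(item)
        if match and RBM_START <= int(match.group(2)) <= RBM_END:
            kept.append(item)
    return kept


def clean_records(records: list[dict]) -> list[dict]:
    """Apply the host, date, lineage, and clade filters to metadata rows."""
    cleaned = []
    for rec in records:
        if rec.get("host", "").lower() != "human":
            continue
        if not is_valid_date(rec.get("date", "")):
            continue
        if rec.get("pango_lineage", "") in MISSING:
            continue
        if rec.get("Nextstrain_clade", "") in MISSING:
            continue
        subs = rbm_substitutions(rec.get("aaSubstitutions", ""))
        cleaned.append({**rec, "rbm_substitutions": subs})
    return cleaned
```

Each row is expected to be a dictionary of strings, for example as produced by `csv.DictReader` on the decompressed TSV file.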

Next, we pre-processed the data as follows. We converted the date information to the number of days elapsed since the initial collection date and then normalized it to a range of 0–1 using a min-max scaler[13]. We used the mutation information from the RBM region of the Spike protein (P0DTC2) to create a 72-position mutation sequence. We connected each parent clade with the corresponding subclades involved in clade diversification to create the datasets (data available at honglab.catholic.ac.kr).
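The date normalization and the 72-position encoding can be illustrated with a short sketch. The min-max formula is the standard one, (x - min) / (max - min). The binary encoding (1 for a mutated position, 0 otherwise) is an assumption, since the text does not state how each of the 72 positions is represented. All function names are hypothetical.

```python
from datetime import date

RBM_START, RBM_LENGTH = 437, 72  # RBM range 437-508


def days_since_first(dates: list[str]) -> list[int]:
    """Days elapsed since the earliest collection date in the list."""
    parsed = [date.fromisoformat(d) for d in dates]
    first = min(parsed)
    return [(d - first).days for d in parsed]


def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_rbm(positions: list[int]) -> list[int]:
    """72-slot vector with 1 at each mutated RBM position and 0 elsewhere."""
    vector = [0] * RBM_LENGTH
    for pos in positions:
        index = pos - RBM_START
        if 0 <= index < RBM_LENGTH:
            vector[index] = 1
    return vector


if __name__ == "__main__":
    elapsed = days_since_first(["2020-01-01", "2020-01-11", "2020-01-06"])
    print(min_max_scale(elapsed))
    print(encode_rbm([452, 478]))
```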

Data processing

We used several models for the learning process: the machine learning models LightGBM[14], XGBoost[15], and Random Forest[16], and the deep learning model GRU (Fig. 1) (https://github.com/Honglab-Research/Covid-mutation-probability).

We constructed the training datasets for mutation prediction from three perspectives, taking into account the timeline of the clades, their mutations, and the clade branching points that reflect the differentiation process of SARS-CoV-2.

First, we conducted training using only mutation information, without considering any additional information such as time and differentiation data. We used a random state to randomly generate the training set, validation set, and test set from the entire dataset.
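A seeded random split of this kind can be sketched as below. The 80/10/10 proportions and the seed value are assumptions chosen for illustration, as the text does not report them, and `random_split` is a hypothetical helper.

```python
import random


def random_split(samples, seed=42, train=0.8, valid=0.1):
    """Shuffle with a fixed seed and cut into train, validation, and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


if __name__ == "__main__":
    train_set, valid_set, test_set = random_split(range(100), seed=0)
    print(len(train_set), len(valid_set), len(test_set))
```

Fixing the seed makes the split reproducible, but it ignores when and in which clade each sample arose, which is the limitation the next two dataset designs address.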

Second, we created training and test sets using temporal information. We defined the outbreak periods (waves) of SARS-CoV-2 as follows: wave 1 (March 2020 - June 2020), wave 2 (September 2020 - January 2021), wave 3 with the Alpha variant (January 2021 - June 2021), wave 4 with the Delta variant (July 2021 - October 2021), and wave 5 with the onset of the Omicron variant (from November 2021). Based on these periods, we organized three datasets. The first used wave 1 for training and wave 2 for testing, with the aim of predicting wave 3. The second used waves 1 and 2 for training and wave 3 for testing, with the aim of predicting wave 4. The third used wave 1 for training and wave 2 for testing, with the aim of predicting wave 4.
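The wave-based datasets can be expressed as a small lookup, sketched below. The day-level boundaries are assumptions (the text gives months only). Because waves 2 and 3 both include January 2021, this sketch assigns that month to wave 2 by checking the waves in order. `WAVES`, `TEMPORAL_SETS`, `wave_of`, and `select` are hypothetical names.

```python
from datetime import date

# Wave periods from the text; exact start and end days are assumed.
WAVES = {
    1: (date(2020, 3, 1), date(2020, 6, 30)),
    2: (date(2020, 9, 1), date(2021, 1, 31)),
    3: (date(2021, 1, 1), date(2021, 6, 30)),   # Alpha
    4: (date(2021, 7, 1), date(2021, 10, 31)),  # Delta
    5: (date(2021, 11, 1), None),               # Omicron onset, open-ended
}

# Train/test/target waves for the three temporal datasets described above.
TEMPORAL_SETS = {
    1: {"train": [1], "test": [2], "target": 3},
    2: {"train": [1, 2], "test": [3], "target": 4},
    3: {"train": [1], "test": [2], "target": 4},
}


def wave_of(day: date):
    """Return the first matching wave number, or None for dates between waves."""
    for number, (start, end) in WAVES.items():
        if day >= start and (end is None or day <= end):
            return number
    return None


def select(samples, waves):
    """Keep (collection_date, features) samples that fall in the given waves."""
    return [s for s in samples if wave_of(s[0]) in waves]
```

For example, `select(samples, TEMPORAL_SETS[2]["train"])` would return the samples collected during waves 1 and 2.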

Third, we created training datasets based on the clade differentiation process. The first training set comprised the clades prior to 21M (Omicron B.1.1.529): 19A, 19B, 20A, 20B, 20C, 20D, 20E (B.1.177), 20F (D.2), 20G, 20H, 20I, 20J, 21A (Delta, B.1.617.2), 21B (Kappa, B.1.617.1), 21C (Epsilon, B.1.427, B.1.429), 21D (Eta, B.1.525), 21E (Theta, P.3), 21F (Iota, B.1.526), 21G (Lambda, C.37), 21H (Mu, B.1.621), 21I (Delta), and 21J (Delta). Its validation set comprised the clades after 21M (Omicron B.1.1.529).

The second training set was created by adding clades 21K (Omicron BA.1) and 21L (Omicron BA.2) to the first training set. Its validation set comprised clades 22A (Omicron BA.4), 22B (Omicron BA.5), 22C (Omicron BA.2.12.1), 22D (Omicron BA.2.75), 22E (Omicron BQ.1), 22F (Omicron XBB), 23A (Omicron XBB.1.5), 23B (Omicron XBB.1.16), 23C (Omicron CH.1.1), 23D (Omicron XBB.1.9), 23E (Omicron XBB.2.3), 23F (Omicron EG.5.1), 23G (Omicron XBB.1.5.70), 23H (Omicron HK.3), and 23I (Omicron BA.2.86).

The third training set was created by adding clades 22A, 22B, 22C, and 22D to the second training set. Its validation set comprised clades 22E, 22F, 23A, 23B, 23C, 23D, 23E, 23F, 23G, 23H, and 23I.

The fourth training set was created by adding clades 22E, 22F, 23A, and 23B to the third training set. Its validation set comprised clades 23C, 23D, 23E, 23F, 23G, 23H, and 23I.

The fifth training set used the early Omicron clades 21M, 21K, and 21L, with the subsequent clades 22A, 22B, 22C, and 22D as the validation set. The sixth training set used 21M, 21K, 21L, 22A, 22B, 22C, and 22D, with 22E, 22F, 23A, and 23B as the test set (Fig. 2A) (https://github.com/Honglab-Research/Covid-mutation-probability).
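The six clade-based datasets can be written down compactly as cumulative lists, as in the sketch below. The grouping names and the `split_by_clade` helper are hypothetical. For training set 1, "clades after 21M" is read here as every later clade named in the text, which is an assumption. The held-out part is stored under the key `valid` for all sets, although the text calls it a test set for set 6.

```python
PRE_OMICRON = [
    "19A", "19B", "20A", "20B", "20C", "20D", "20E", "20F", "20G", "20H",
    "20I", "20J", "21A", "21B", "21C", "21D", "21E", "21F", "21G", "21H",
    "21I", "21J",
]
GROUP_21 = ["21K", "21L"]
GROUP_22A = ["22A", "22B", "22C", "22D"]
GROUP_22E = ["22E", "22F", "23A", "23B"]
GROUP_23C = ["23C", "23D", "23E", "23F", "23G", "23H", "23I"]

CLADE_SETS = {
    # Assumption: "clades after 21M" means all later clades listed in the text.
    1: {"train": PRE_OMICRON,
        "valid": GROUP_21 + GROUP_22A + GROUP_22E + GROUP_23C},
    2: {"train": PRE_OMICRON + GROUP_21,
        "valid": GROUP_22A + GROUP_22E + GROUP_23C},
    3: {"train": PRE_OMICRON + GROUP_21 + GROUP_22A,
        "valid": GROUP_22E + GROUP_23C},
    4: {"train": PRE_OMICRON + GROUP_21 + GROUP_22A + GROUP_22E,
        "valid": GROUP_23C},
    # Sets 5 and 6 use Omicron clades only.
    5: {"train": ["21M"] + GROUP_21, "valid": GROUP_22A},
    6: {"train": ["21M"] + GROUP_21 + GROUP_22A, "valid": GROUP_22E},
}


def split_by_clade(samples, set_id):
    """Split (clade, features) pairs into the training and held-out parts."""
    spec = CLADE_SETS[set_id]
    train = [s for s in samples if s[0] in spec["train"]]
    held_out = [s for s in samples if s[0] in spec["valid"]]
    return train, held_out
```

Writing the sets this way makes the design visible: sets 1 to 4 grow cumulatively from the pre-Omicron clades, whereas sets 5 and 6 start from the Omicron clades.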

Data analysis

The criteria for predicting the next mutation with the learning models were based on the lineage pipeline rules from Pangolin. A lineage-defining mutation is considered to occur when it appears in the first 80% of records for that lineage as logged by Pangolin, and this study defines new mutations according to the same criterion (https://cov-lineages.org/resources/pangolin.html, https://ncbiinsights.ncbi.nlm.nih.gov/2024/05/02/automated-lineage-definitions-ncbi-virus-sars-cov-2-variants-overview/). Mutation predictions were performed with the learning models on this basis. To evaluate model performance, we used accuracy, precision, recall, and F-score[17].
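For reference, the four evaluation metrics can be computed from true and predicted labels as in the sketch below. It assumes binary labels (1 for a mutated position, 0 otherwise) and takes the F-score to be F1, the harmonic mean of precision and recall. Both are assumptions, since the text does not spell out the label format or the F-score variant.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = mutated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    print(classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

When most positions are unmutated, accuracy alone can look high even if few mutations are found, which is one reason to report precision, recall, and F-score alongside it.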

Data investigation for SARS-CoV-2 mutation prediction

We investigated the clades and lineages of SARS-CoV-2 from its outbreak to the present, organizing the data necessary for training based on the mutation frequency. We focused on the RBM of the spike protein, which showed the highest mutation frequency (Supplementary Fig. 2).

The datasets required for training were structured as training sets that included the entirety of the SARS-CoV-2 clades (training sets 1, 2, 3, and 4) and training sets constructed using omicron clades with the highest number of mutations (training sets 5 and 6). Finally, datasets were created using a random state method for all the clades that emerged (Fig. 3A). In the training sets, the mutation occurrence positions were weighted towards specific locations within each set, showing a high frequency only at those positions. Therefore, refined data suitable for training were required (Fig. 2B).

Data construction for accurate predictions of SARS-CoV-2 mutations

Using clades from 2019 to May 2023, we aimed to predict SARS-CoV-2 mutations using machine learning (random forest, XGBoost, and LightGBM) and deep learning (GRU) models. Randomly extracted clades were first used to predict potential mutations. The results showed low accuracy for both the machine learning models and the deep learning model (Fig. 3A). For accurate mutation prediction, the time and region were therefore considered in addition to the mutation information. We generated datasets and performed modeling based on the timing of pandemic waves, their prevalence periods, and mutation information. In the data collected from Nextstrain, the time information refers to the collection date rather than the initial report date. In addition, we confirmed that most of the sequence collection and location data were from North America and Europe (Supplementary Fig. 3).

Waves 1 and 2 featured the wild type, whereas wave 3 was characterized by the Alpha variant with N501Y. Wave 4 was dominated by the Delta variant with L452R and T478K, and wave 5 saw the emergence of the Omicron variant. To examine the relationships between mutations across waves, we used pandemic wave data to predict subsequent mutations (Fig. 3B). We constructed three prediction models for the analysis: model ① predicting wave 3 using waves 1 and 2; model ② predicting wave 4 using waves 1, 2, and 3; and model ③ predicting wave 4 using waves 1 and 2. With XGBoost, the wave 3 prediction model ① showed an accuracy of approximately 0.32, the wave 4 prediction model ② approximately 0.168, and the wave 4 prediction model ③ approximately 0.022 (Fig. 3B).

In waves 1 and 2, approximately 40% of the mutation information available for training involved amino acid position 501, the site mutated in the Alpha variant of wave 3. In contrast, the mutations at positions 452 and 478 found in the Delta variant of wave 4 had a frequency of approximately 16.7%, making accurate mutation prediction challenging (Fig. 3C).

For the Omicron variant in wave 5, mutations were found at various locations, making it difficult to make predictions using only information from waves 1, 2, 3, and 4 (Fig. 3B). Additionally, for the recent clades 23I, 23H, 24A, and 24B, the mutations in the RBM were fixed at specific locations, presenting challenges for predicting new mutation sites (Fig. 3D).

Prediction of new SARS-CoV-2 mutations

We created training data with mutation, collection time, and clade information and trained each model accordingly using XGBoost. Earlier training sets had low accuracy (Training set 1, accuracy: 0.765; Training set 2, accuracy: 0.639; Training set 3, accuracy: 0.605; Training set 4, accuracy: 0.593). In contrast, training sets composed solely of Omicron data showed very high accuracy, with training sets 5 and 6 achieving an accuracy of 0.999 (Fig. 4A). Random Forest and LightGBM showed similar results to XGBoost, whereas GRU showed a lower performance (Supplementary Fig. 4 and Supplementary Table 4).

Using information from the recently reported clades 23H, 23I, 24A, and 24B, we investigated mutations that could occur in the RBM region. Based on clades 24A and 24B, we then predicted potential mutations in the RBM region. Mutations were predicted to be likely at positions 441, 444, 453, 475, 493, and 500. Mutations at position 493 were also observed in clade 23C (Fig. 4B).

Including the period in which numerous Omicron mutations occurred, together with temporal and clade differentiation information, improved mutation prediction performance. Compared to machine learning models such as XGBoost, LightGBM, and Random Forest, the GRU model showed relatively lower performance.

Configuration of a new mutation prediction algorithm in SARS-CoV-2 infectivity prediction

We incorporated an algorithm for predicting new SARS-CoV-2 mutations into our pre-existing infectivity prediction system (Artificial Intelligence analytics toolkit for predicting viral mutations in protEin). To run this algorithm within the system, we set up a server equipped with 96 CPU cores, 256GB RAM, and three RTX 8000 GPUs.

To use the mutation prediction feature, users access the mutation prediction page and input the RBM sequence they wish to analyze. The sequence of the RBM region of the Spike protein (P0DTC2) serves as a reference, which users can modify to generate mutated sequences. Once the desired sequence has been generated, the user submits a task to the server. The server receives the request, checks the available resources, and allocates them accordingly. The sequence is encoded on the CPU, and the mutation positions are then predicted on the GPU using pre-generated learning models. Upon task completion, the server sends the results back to the user, providing the predicted mutation positions and the model performance.
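The encode-then-predict step of this workflow might look like the sketch below. It is an illustration only and does not describe the deployed system: `find_substitutions` and `handle_request` are hypothetical names, the binary encoding is assumed, and `model` is a stand-in callable in place of a trained model running on a GPU.

```python
# Reference RBM segment (spike positions 437-508), as quoted in the Methods.
RBM_REF = (
    "NSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCN"
    "GVEGFNCYFPLQSYGFQPTNGVGYQPY"
)
RBM_START = 437


def find_substitutions(query: str) -> list[str]:
    """List substitutions such as 'N501Y' between the reference and a query."""
    if len(query) != len(RBM_REF):
        raise ValueError("query must have the same length as the reference RBM")
    return [
        f"{ref}{RBM_START + i}{alt}"
        for i, (ref, alt) in enumerate(zip(RBM_REF, query))
        if ref != alt
    ]


def handle_request(query: str, model) -> dict:
    """Encode the submitted sequence, then ask the model for per-position scores."""
    substitutions = find_substitutions(query)
    # Assumed encoding: 1 where the query differs from the reference, else 0.
    encoded = [int(ref != alt) for ref, alt in zip(RBM_REF, query)]
    scores = model(encoded)  # hypothetical callable standing in for a trained model
    return {"input_substitutions": substitutions, "scores": scores}


if __name__ == "__main__":
    # Introduce N501Y into the reference and run a dummy model that returns zeros.
    mutated = RBM_REF[:64] + "Y" + RBM_REF[65:]
    result = handle_request(mutated, lambda vec: [0.0] * len(vec))
    print(result["input_substitutions"])
```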

Previous infectious diseases either surged briefly and then disappeared or were primarily limited to specific regions[18–20]. Despite the implementation of stringent legal restrictions and limitations on interregional travel to curb the initial spread of COVID-19, more potent lineages have emerged over time[21–24].

SARS-CoV-2 has given rise to numerous lineages and clades, each distinguished by mutations that significantly accumulate as these lineages continue to evolve[25, 26]. Even in infectious diseases such as severe acute respiratory syndrome-associated coronavirus (SARS) and Middle East respiratory syndrome coronavirus (MERS), mutations occur during the epidemic period. However, these infectious diseases do not cause many mutations because of their relatively short durations or regional limitations[27–29].

The numerous mutations that occurred in SARS-CoV-2 resulted in the initially designed COVID-19 vaccines sometimes being ineffective against the Omicron lineage[30, 31]. Additionally, despite the development of appropriate therapeutics, some treatments have been reported to be ineffective owing to mutations[32–34]. Therefore, it is crucial to quickly and accurately predict mutations that occur during the evolution of infectious diseases.

Recently, research using artificial intelligence methods to analyze genomes has become increasingly prominent. Numerous studies have focused on predicting and preventing SARS-CoV-2 infection using genomic information.

Bhowmick et al. proposed two new mutations (P499S and T500R) based on a protein three-dimensional structure prediction algorithm. They estimated binding affinity through the interaction between the receptor-binding domain (RBD) and host cell receptor ACE2[35]. However, these mutations were inferred based on the original Wuhan-Hu-1 sequence, using physicochemical binding interactions. Therefore, they did not predict the mutations included in the current major lineages or consider clade differentiation and temporal elements of clades and lineages.

Saldivar-Espinoza et al. used machine learning to predict that mutations occurring multiple times independently throughout the evolution of the virus are more likely to result from host deaminase activity than from replication errors[36]. They also predicted the occurrence of mutations based on the SARS-CoV-2 structure. They calculated the values based on whether the changes and mutations in nucleotides belonged to a lineage-associated clade.

Because these predictions were made at the nucleotide level, they missed the pattern of amino acid substitutions in SARS-CoV-2. They also did not predict new mutations in the spike protein, which binds to host ACE2 and carries the most mutations.

Moreover, in the early lineages and clades, a small number of mutations occurred in the RBM region. When examining clades 19A-21E that predate Omicron, the major amino acid substitution sites were L452, S477, T478, E484(K), F490, and N501. The number of mutations was also limited to one or two, making accurate predictions challenging.

The temporary emergence and disappearance of mutations made it more challenging to determine the correlations between mutations, except for those at major substitution sites. This limitation affects the accuracy of predicting mutation locations. Analysis of the 12 clades that emerged by 2020 revealed that the eight clades other than 20F, 20H, 20I, and 20J did not have mutation sites with a frequency above 0.8. The highest frequency of the mutation sites did not exceed 0.3, and only five mutation sites had a frequency greater than 0.1, which made it challenging to identify the characteristics of specific mutations.
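The per-site frequencies referred to here can be computed as the fraction of samples in a clade that carry a mutation at each position. The sketch below shows one way to do this and to list the sites above a threshold such as 0.8 or 0.1. The function names and the input format (one set of mutated positions per genome) are assumptions for illustration.

```python
from collections import Counter


def site_frequencies(samples):
    """Fraction of samples with a mutation at each position.

    `samples` is a list of sets of mutated positions, one set per genome.
    """
    total = len(samples)
    if total == 0:
        return {}
    counts = Counter(pos for mutated in samples for pos in mutated)
    return {pos: n / total for pos, n in counts.items()}


def sites_above(frequencies, threshold):
    """Positions whose mutation frequency is strictly above the threshold."""
    return sorted(pos for pos, freq in frequencies.items() if freq > threshold)


if __name__ == "__main__":
    # Toy clade: nine genomes with a mutation at 501, one with 452 and 484.
    clade = [{501}] * 9 + [{452, 484}]
    freqs = site_frequencies(clade)
    print(sites_above(freqs, 0.8))
    print(sites_above(freqs, 0.05))
```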

In contrast, from the Omicron variant onward, specifically from clade 21M, mutations occurred at various positions in addition to the existing amino acid substitution sites, such as N440, V445, G446, F456, N460, E484(A), F486, F490, Q493, Q498, N501, and Y505. Because of these diverse mutations and the availability of numerous samples, the predictions were more accurate (Fig. 2). In the case of recent Omicron variants, the prediction accuracy decreased because of the increased similarity of mutations (Fig. 4).

In the predicted 24C clade, the Q493E mutation occurred at position 493, where a mutation had previously occurred and then disappeared. According to the mutation frequency data up to clade 23B, the likelihood of mutations occurring at position 493 gradually decreased. Considering that the predicted frequency was not high, the Q493E mutation was considered a transient mutation that may disappear again (Supplementary Fig. 5).

As the frequency of mutations becomes increasingly fixed at specific positions, future mutations are likely to occur either as temporary additional mutations at fixed locations or as changes in the type of amino acid in the existing mutations. Over time, the mutation process results in only transient or minor changes.

During the initial stages of the pandemic, mutations sporadically appeared and then disappeared, making it difficult to identify correlations between mutations. However, in recent samples, the difficulty in distinguishing between mutations arose from the increased similarity between samples. This difficulty can lead to overfitting in the prediction models and reduce the accuracy of the predictions.

Our study had several limitations. First, our study was limited to the RBM region of the spike protein, where many significant mutations were found. Our study could not fully assess mutation probabilities in other regions. Second, the data used in our study correspond to the time of genome sequencing rather than the initial occurrence of the mutation, indicating that the date information was not entirely reflective of evolutionary continuity. Third, the limited mutation information available before Omicron makes it difficult to predict the mutations that occur in Omicron.

Our study proposes an accurate method for predicting mutations in infectious diseases using mutation information and time data based on artificial intelligence. This approach aims to enhance the precision of the predictions.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Funding

This work was supported in part by National Research Foundation of Korea (NRF) grants funded by the Korea government (Ministry of Science and ICT) (NRF-2021M3H9A2097227, NRF: 2021M3A9I2080490, NRF-2022R1A2C3008162, and RS-2023-00220840), the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (RS-2023-00265923), and the Basic Medical Science Facilitation Program through the Catholic Medical Center of the Catholic University of Korea funded by the Catholic Education Foundation.

Author Contribution

D.H conceived and designed these studies. W.J.C, D.Y.S, D.S.C, and J.P gathered SARS-CoV-2 data and performed data analyses. W.J.C, D.Y.S, J.P, and D.H wrote the manuscript with input from all other authors.

Acknowledgement

Data Availability

The data are available on GitHub (https://github.com/Honglab-Research/Covid-mutation-probability).

References

Ghafari M, Hall M, Golubchik T, Ayoubkhani D, House T, MacIntyre-Cockett G, Fryer HR, Thomson L, Nurtay A, Kemp SA, et al: Prevalence of persistent SARS-CoV-2 in a large community surveillance study. Nature 2024, 626:1094–1101.
COVID-19 Forecasting Team: Past SARS-CoV-2 infection protection against re-infection: a systematic review and meta-analysis. Lancet 2023, 401:833–842.
Xue S, Han Y, Wu F, Wang Q: Mutations in the SARS-CoV-2 spike receptor binding domain and their delicate balance between ACE2 affinity and antibody evasion. Protein Cell 2024, 15:403–418.
Lan J, Ge J, Yu J, Shan S, Zhou H, Fan S, Zhang Q, Shi X, Wang Q, Zhang L, Wang X: Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature 2020, 581:215–220.
Harvey WT, Carabelli AM, Jackson B, Gupta RK, Thomson EC, Harrison EM, Ludden C, Reeve R, Rambaut A, COVID-19 Genomics UK (COG-UK) Consortium, et al: SARS-CoV-2 variants, spike mutations and immune escape. Nat Rev Microbiol 2021, 19:409–424.
Xu Z, Wei D, Zeng Q, Zhang H, Sun Y, Demongeot J: More or less deadly? A mathematical model that predicts SARS-CoV-2 evolutionary direction. Comput Biol Med 2023, 153:106510.
Oh J, Apio C, Park T: Mathematical modeling of the impact of Omicron variant on the COVID-19 situation in South Korea. Genomics Inform 2022, 20:e22.
Wang L, Zhang Y, Wang D, Tong X, Liu T, Zhang S, Huang J, Zhang L, Chen L, Fan H, Clarke M: Artificial Intelligence for COVID-19: A Systematic Review. Front Med (Lausanne) 2021, 8:704256.
Chadaga K, Prabhu S, Sampathila N, Chadaga R, Umakanth S, Bhat D, G SS: Explainable artificial intelligence approaches for COVID-19 prognosis prediction using clinical markers. Sci Rep 2024, 14:1783.
Mei X, Lee HC, Diao KY, Huang M, Lin B, Liu C, Xie Z, Ma Y, Robson PM, Chung M, et al: Artificial intelligence-enabled rapid diagnosis of patients with COVID-19. Nat Med 2020, 26:1224–1228.
Obermeyer F, Jankowiak M, Barkas N, Schaffner SF, Pyle JD, Yurkovetskiy L, Bosso M, Park DJ, Babadi M, MacInnis BL, et al: Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science 2022, 376:1327–1332.
Zhou B, Zhou H, Zhang X, Xu X, Chai Y, Zheng Z, Kot AC, Zhou Z: TEMPO: A transformer-based mutation prediction framework for SARS-CoV-2 evolution. Comput Biol Med 2023, 152:106264.
Patro S, Sahu KK: Normalization: A preprocessing stage. arXiv preprint arXiv:1503.06462 2015.
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y: Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 2017, 30.
Chen T, Guestrin C: XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016:785–794.
Breiman L: Random forests. Machine learning 2001, 45:5–32.
Olson DL, Delen D: Advanced data mining techniques. Springer Science & Business Media; 2008.
Peiris JS, Guan Y, Yuen KY: Severe acute respiratory syndrome. Nat Med 2004, 10:S88-97.
Cho SY, Kang JM, Ha YE, Park GE, Lee JY, Ko JH, Lee JY, Kim JM, Kang CI, Jo IJ, et al: MERS-CoV outbreak following a single patient exposure in an emergency room in South Korea: an epidemiological outbreak study. Lancet 2016, 388:994–1001.
Ebrahim SH, Maher AD, Kanagasabai U, Alfaraj SH, Alzahrani NA, Alqahtani SA, Assiri AM, Memish ZA: MERS-CoV Confirmation among 6,873 suspected persons and relevant Epidemiologic and Clinical Features, Saudi Arabia – 2014 to 2019. EClinicalMedicine 2021, 41:101191.
Leung K, Lau EHY, Wong CKH, Leung GM, Wu JT: Estimating the transmission dynamics of SARS-CoV-2 Omicron BF.7 in Beijing after adjustment of the zero-COVID policy in November-December 2022. Nat Med 2023, 29:579–582.
Walensky RP, Walke HT, Fauci AS: SARS-CoV-2 Variants of Concern in the United States-Challenges and Opportunities. JAMA 2021, 325:1037–1038.
Dong R, Hu T, Zhang Y, Li Y, Zhou XH: Assessing the Transmissibility of the New SARS-CoV-2 Variants: From Delta to Omicron. Vaccines (Basel) 2022, 10.
Jalali N, Brustad HK, Frigessi A, MacDonald EA, Meijerink H, Feruglio SL, Nygard KM, Ro G, Madslien EH, de Blasio BF: Increased household transmission and immune escape of the SARS-CoV-2 Omicron compared to Delta variants. Nat Commun 2022, 13:5706.
da Costa CHS, de Freitas CAB, Alves CN, Lameira J: Assessment of mutations on RBD in the Spike protein of SARS-CoV-2 Alpha, Delta and Omicron variants. Sci Rep 2022, 12:8540.
Gangavarapu K, Latif AA, Mullen JL, Alkuzweny M, Hufbauer E, Tsueng G, Haag E, Zeller M, Aceves CM, Zaiets K, et al: Outbreak.info genomic reports: scalable and dynamic surveillance of SARS-CoV-2 variants and mutations. Nat Methods 2023, 20:512–522.
Li W, Shi Z, Yu M, Ren W, Smith C, Epstein JH, Wang H, Crameri G, Hu Z, Zhang H, et al: Bats are natural reservoirs of SARS-like coronaviruses. Science 2005, 310:676–679.
Wong LR, Zheng J, Sariol A, Lowery S, Meyerholz DK, Gallagher T, Perlman S: Middle East respiratory syndrome coronavirus Spike protein variants exhibit geographic differences in virulence. Proc Natl Acad Sci U S A 2021, 118.
Kleine-Weber H, Elzayat MT, Wang L, Graham BS, Muller MA, Drosten C, Pohlmann S, Hoffmann M: Mutations in the Spike Protein of Middle East Respiratory Syndrome Coronavirus Transmitted in Korea Increase Resistance to Antibody-Mediated Neutralization. J Virol 2019, 93.
Lau JJ, Cheng SMS, Leung K, Lee CK, Hachim A, Tsang LCH, Yam KWH, Chaothai S, Kwan KKH, Chai ZYH, et al: Real-world COVID-19 vaccine effectiveness against the Omicron BA.2 variant in a SARS-CoV-2 infection-naive population. Nat Med 2023, 29:348–357.
Andrews N, Stowe J, Kirsebom F, Toffa S, Rickeard T, Gallagher E, Gower C, Kall M, Groves N, O'Connell AM, et al: Covid-19 Vaccine Effectiveness against the Omicron (B.1.1.529) Variant. N Engl J Med 2022, 386:1532–1546.
Bajema KL, Berry K, Streja E, Rajeevan N, Li Y, Mutalik P, Yan L, Cunningham F, Hynes DM, Rowneki M, et al: Effectiveness of COVID-19 Treatment With Nirmatrelvir-Ritonavir or Molnupiravir Among U.S. Veterans: Target Trial Emulation Studies With One-Month and Six-Month Outcomes. Ann Intern Med 2023, 176:807–816.
Pochtovyi AA, Kustova DD, Siniavin AE, Dolzhikova IV, Shidlovskaya EV, Shpakova OG, Vasilchenko LA, Glavatskaya AA, Kuznetsova NA, Iliukhina AA, et al: In Vitro Efficacy of Antivirals and Monoclonal Antibodies against SARS-CoV-2 Omicron Lineages XBB.1.9.1, XBB.1.9.3, XBB.1.5, XBB.1.16, XBB.2.4, BQ.1.1.45, CH.1.1, and CL.1. Vaccines (Basel) 2023, 11.
Takashita E, Kinoshita N, Yamayoshi S, Sakai-Tagawa Y, Fujisaki S, Ito M, Iwatsuki-Horimoto K, Halfmann P, Watanabe S, Maeda K, et al: Efficacy of Antiviral Agents against the SARS-CoV-2 Omicron Subvariant BA.2. N Engl J Med 2022, 386:1475–1477.
Bhowmick S, Jing T, Wang W, Zhang EY, Zhang F, Yang Y: In Silico Protein Folding Prediction of COVID-19 Mutations and Variants. Biomolecules 2022, 12.
Saldivar-Espinoza B, Macip G, Garcia-Segura P, Mestres-Truyol J, Puigbo P, Cereto-Massague A, Pujadas G, Garcia-Vallve S: Prediction of Recurrent Mutations in SARS-CoV-2 Using Artificial Neural Networks. Int J Mol Sci 2022, 23.


GNISupplementaryinformation.docx


Editorial decision: Revision requested
08 Sep, 2024
Reviews received at journal
08 Sep, 2024
Reviews received at journal
01 Sep, 2024
Reviewers agreed at journal
23 Aug, 2024
Reviews received at journal
21 Aug, 2024
Reviewers agreed at journal
20 Aug, 2024
Reviewers agreed at journal
20 Aug, 2024
Reviewers invited by journal
20 Aug, 2024
Editor assigned by journal
19 Aug, 2024
Submission checks completed at journal
19 Aug, 2024
First submitted to journal
16 Aug, 2024
