Generation of training, testing, and validation datasets
Data from resequencing microarrays
The NNM was constructed from input features extracted from images acquired from the resequencing microarray, using over 30 clinical samples from the Wyoming Public Health Laboratory, as detailed in our previous work [14]. The training and test datasets in this study consisted of measurements of the average pixel intensity and the standard deviation of pixel intensities from each sense and antisense probe at each base position in the genome (Fig. 1). After these data were extracted from the images, each base position was described by 48 variables used in the NNM, with each probe set consisting of eight nucleotide features, A, T, C, and G, on both the sense and antisense strands, as demonstrated in previous reports [14]. For each feature, the data included the average pixel intensity and the standard deviation of the pixel intensities at three different image exposure times: 0.25 s, 1 s, and 4 s (2 metrics × 3 exposures × 8 features = 48 measurements per base). This resulted in approximately 570,000 data points for training and validating the models. Additionally, we designed multinomial logistic regression and neural network meta-learner models, whose performance was tested on four clinical samples.
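For concreteness, the per-position feature layout described above can be sketched as follows; the column names are illustrative placeholders rather than those used in the actual extraction pipeline.

```python
# Minimal sketch of the 48-feature layout per base position described above.
# Column names are illustrative, not the names used in the original pipeline.
from itertools import product

metrics = ["mean_intensity", "std_intensity"]           # 2 pixel metrics
exposures = ["0.25s", "1s", "4s"]                        # 3 exposure times
probes = [f"{strand}_{base}"                             # 8 probe features
          for strand, base in product(["sense", "antisense"], "ATCG")]

feature_columns = [f"{m}_{e}_{p}" for m, e, p in product(metrics, exposures, probes)]
assert len(feature_columns) == 48                        # 2 x 3 x 8 = 48 per base
```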
Preprocessing/Model Selection
An extensive ensemble of over 12,000 supervised neural network models (SNNMs) was trained by varying neural network architectural hyperparameter combinations and two additional preprocessing parameters: ‘mid-select’ and ‘consensus-type’. Each permutation of the hyperparameter combination was thus a stand-alone model. The first novel preprocessing hyperparameter introduced was ‘mid-select’, which excludes from the training set a predetermined number of nucleotides at both ends of the sequenced genome because of indeterminacy in probe hybridization at the ends of the probed genome. For example, a mid-select parameter of 1000 would remove 2000 total positions from each training sample. The ‘consensus-type’ hyperparameter is described in the next section. Figures 2 and 3 portray the frequency of hyperparameters used in the total models and the top models (base models), respectively. To address the multilabel classification problem, with five distinct labels and one potential output at each genomic position, a transformation pipeline was applied to the categorical variables representing the base list (“A”, “C”, “G”, “T” and “N”) so that the model outputs a numerical representation of the most likely base at each position via the Softmax activation function, thus facilitating the classification capabilities of the model. During the preprocessing phase of the neural network, varying values of the mid-select hyperparameter in the range 0 ≤ x ≤ 10,000 were applied to the training data. The number of neurons spanned 5 ≤ x ≤ 11, the batch size spanned 500 ≤ x ≤ 5,000, and the layer depth was two or more. Additionally, nonlinear activation functions such as relu, selu and tanh were integrated within the network architecture, and to mitigate overfitting, the entire dataset was split into separate training (~ 74%) and validation (~ 26%) datasets at the beginning of the neural network model (NNM) assembly.
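A minimal sketch of how such a hyperparameter grid, including the ‘mid-select’ trimming step, could be enumerated is shown below; the specific values and function names are assumptions chosen to fall within the reported ranges, not the exact grid used in this work.

```python
# Illustrative enumeration of hyperparameter combinations, including the
# 'mid-select' preprocessing parameter; values are examples within the
# reported ranges, not the exact grid that was trained.
from itertools import product

mid_select   = [0, 1000, 5000, 10000]       # positions trimmed from each genome end
consensus    = ["reference", "nonhybrid", "hybrid"]
neurons      = range(5, 12)                 # 5 <= x <= 11
batch_sizes  = [500, 1000, 2500, 5000]      # within 500 <= x <= 5000
activations  = ["relu", "selu", "tanh"]
depths       = [2, 3, 4]                    # two or more layers

configs = [dict(mid_select=m, consensus=c, neurons=n, batch=b, act=a, depth=d)
           for m, c, n, b, a, d in product(mid_select, consensus, neurons,
                                           batch_sizes, activations, depths)]

def apply_mid_select(X, y, m):
    """Drop m positions from each end of a training sample (2*m removed in total)."""
    return (X[m:len(X) - m], y[m:len(y) - m]) if m else (X, y)
```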
Consensus-Type Hyperparameter Approach
Sequencing errors are inherent in sequencing technologies, including the resequencing method employed in our study and any next-generation sequencing technology [24]. Therefore, we created a ‘consensus-type’ preprocessing hyperparameter to increase confidence in selecting training targets to supervise our models. The consensus framework was based on evaluating base agreement across all training target options from the sense and antisense probes at each of the three available exposure times. At each position in the genome, an initial base call was made for each of the 6 datasets (sense and antisense readings at 3 different exposures) by taking the probe with the highest average hybridization signal intensity in its probe set. The size of the majority agreement among these base calls then assigns a consensus score of \(n\) to that position, where \(n\) can range between 2 and 6, inclusive. Therefore, during preprocessing, the training set at a consensus level of \(N\) is said to have satisfied the consensus requirements if \(n \ge N\) (Table 1). To remove ambiguities due to ties in agreement, \(N\) is restricted to values of 4, 5, or 6. Note that \(N\) is manually varied to process our training dataset, as described below.
Table 1: Examples of consensus score calculations (n) and majority agreements for the 6 probe sets. The final column indicates whether each example satisfies the consensus requirements for N = 5.
| n | Sense (0.25 s) | Antisense (0.25 s) | Sense (1 s) | Antisense (1 s) | Sense (4 s) | Antisense (4 s) | Base Call Majority Agreement | Satisfies Consensus Requirements? |
|---|---|---|---|---|---|---|---|---|
| 3 | A | T | A | C | G | A | A | No |
| 3 | A | A | A | T | T | T | A and T | No |
| 6 | T | T | T | T | T | T | T | Yes |
| 4 | A | T | A | A | T | A | A | No |
| 2 | C | A | C | T | T | A | A, C and T | No |
The reference bases of the SARS-CoV-2 genome were chosen as the initial targets; this reference consensus type therefore serves as a control. It retains the largest set of data for training but fails to address possible true variants within the samples. We then created a nonhybrid consensus algorithm to select targets on the basis of their experimental hybridization intensities. Each potential training example is checked against the consensus requirements: if they are satisfied, the consensus base is used as the training target; otherwise, that example is discarded. While this rarely changes the training target for a position from the reference, it does allow for true variants supported by the experimental data, but it reduces the overall training dataset size. A third consensus type, called the hybrid consensus, was created to combine the strengths of the two prior types. In the hybrid version, every training example is kept, but the target base call is changed from the reference only if the consensus requirements are satisfied and the majority consensus base differs from the reference at that position. With this consensus type, the training set size is kept at its maximum, and experimental hybridization intensities can still inform the model about true variants. Each type is employed to investigate the computational advantages of its preprocessing features and to provide a diverse set of models to be stacked at a later stage.
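The following sketch illustrates the consensus-score logic and the three consensus types described above; the function names and data layout are illustrative rather than the implementation used in this work.

```python
# Hedged sketch of the consensus-score logic and the three consensus types.
from collections import Counter

def consensus_score(calls):
    """calls: six initial base calls (sense/antisense x 3 exposures).
    Returns (majority_bases, n), where n is the size of the largest agreement."""
    counts = Counter(calls)
    n = max(counts.values())
    majority = [base for base, c in counts.items() if c == n]
    return majority, n

def training_target(calls, reference_base, consensus_type, N=5):
    """Pick the training target for one genome position, or None to discard it."""
    majority, n = consensus_score(calls)
    satisfied = n >= N and len(majority) == 1
    if consensus_type == "reference":       # control: always the reference base
        return reference_base
    if consensus_type == "nonhybrid":       # keep only consensus-satisfying examples
        return majority[0] if satisfied else None
    if consensus_type == "hybrid":          # keep all; override reference only on consensus
        return majority[0] if satisfied and majority[0] != reference_base else reference_base
    raise ValueError(consensus_type)

# Example: the Table 1 row with calls A,T,A,A,T,A gives n = 4, which fails N = 5.
print(training_target(list("ATAATA"), "A", "nonhybrid"))   # -> None (discarded)
```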
Selection criteria for clinical sample test data
During the validation process with clinical samples, we limited the analysis to samples with a cycle threshold (Ct) value less than or equal to 24 to ensure the reliability of the Stack Ensemble Neural Network Model for variant identification [25]. This requirement is a common, standard metric for ensuring the reliability of the sample input for predictions. The samples we analyzed were USA/WY-WYPHL-00024/2020, USA/WY-WYPHL-00026/2020, USA/WY-WYPHL-00032/2020, USA/WY-WYPHL-00036/2020, USA/WY-WYPHL-00041/2020, USA/WY-WYPHL-00044/2020, USA/WY-WYPHL-00059/2020, and USA/WY-WYPHL-00064/2020, which we refer to as WYOM 24, WYOM 26, WYOM 32, WYOM 36, WYOM 41, WYOM 44, WYOM 59, and WYOM 64, respectively. The samples with sufficient Ct values that are used in the main body of this work are WYOM 36, WYOM 41, WYOM 59, and WYOM 64. Data on the remaining samples are provided in the supplementary information.
Training the Neural Network Model
Neural Network Model Training and Architecture Assembly
Following the completion of the preprocessing phase, the training data comprising ~ 74% of the total dataset were curated and refined for optimal feature filtering, while the remaining ~ 26% of the total dataset was used for validation. L1 featurewise normalization was applied to the dataset to ensure standardized scales across the input features. The neuron layers were arranged symmetrically in a crescendo-like structure (Fig. 4) consisting of even- and odd-numbered layers. Using the Keras framework, we created a sequential model with an input dimension of 48, representing the number of features per base position. The NNM was completed with an output layer comprising five neurons and a Softmax activation function. To finalize the model configuration, categorical cross-entropy was designated as the loss function, the Adam optimizer was utilized, and the evaluation metric was set to ‘accuracy’. Iteration over the applicable hyperparameters and random initialization of the weights remove any symmetry that might otherwise be introduced into the architectural framework of the models.
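A minimal Keras sketch consistent with this configuration is shown below; the hidden layer widths and depth are illustrative placeholders within the ranges described above.

```python
# Minimal Keras sketch of one base model; layer sizes are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

def build_base_model(n_features=48, hidden=(8, 11, 8), activation="relu"):
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))          # 48 features per base position
    for width in hidden:                                 # crescendo-like hidden layers
        model.add(layers.Dense(width, activation=activation))
    model.add(layers.Dense(5, activation="softmax"))     # probabilities for A, C, G, N, T
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```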
The surveillance algorithm was designed to yield the maximum likelihood call, i.e., the highest probability base call among the five labels [‘A’, ‘T’, ‘C’, ‘N’ and ‘G’], for each position in the validation dataset via the Softmax activation function. Figure 5 illustrates the neural network model’s pipeline, depicting the trajectory of the base call predictions following the model training process.
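Converting the Softmax probabilities into maximum likelihood base calls can be sketched as follows; the label ordering is an assumption and must match the encoding used during training.

```python
# Sketch of turning Softmax probabilities into maximum likelihood base calls.
import numpy as np

LABELS = np.array(["A", "C", "G", "N", "T"])   # assumed label order from training

def call_bases(model, X):
    """X: (n_positions, 48) feature matrix -> one base call per position."""
    probs = model.predict(X)                   # shape (n_positions, 5)
    return LABELS[np.argmax(probs, axis=1)]
```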
Top Model Ensemble Stacking-Base Model Selection
The ensemble stacking approach was adopted to optimize the performance of the neural network model. Ensemble learning is a concept dating back to the early 1990s, and existing reports demonstrate that it enhances model performance relative to the generated base models [26–28] and has applications in various domains, including biotechnology and electricity generation forecasting [29–33]. Ensemble stacking addresses two important requirements: capturing regions where individual models’ performances are at their optimum and providing prediction capabilities on unseen samples.
The ensemble learning approach required a preprocessing step of selecting the top models on the basis of accuracy: the first and third quartiles of the top-model accuracies in the boxplot fell within the ~ 98% \(< x <\) ~ 99% interval, whereas the accuracies of all the models spanned the ~ 30% \(< x <\) ~ 99% interval (Figs. 6, S10), justifying the optimization approach of stacking the best-performing models by using their Softmax output vectors, \([P_A, P_C, P_G, P_N, P_T]\), as features for a meta-model. Because each hyperparameter set was trained with random weight initialization on three individual compute nodes during the fine-tuning process, we considered the accuracy of all three copies of each model during top-model selection. First, of the ~ 12,000 neural networks initially trained, the best model candidates were chosen by converting the average accuracies on a validation dataset to the standard normal distribution and selecting by ranked z score to obtain the most accurate models. The top 1,000 most accurate models were then reduced to 511 via the coefficient of variation to retain the most consistent models. Finally, the top models (base models) consisted of the 278 models that had an accuracy of 99% or greater on the validation dataset at the time of model saving, before stacking. Therefore, we use 5 × 278 = 1,390 features to train the meta-models. The stacking ensemble approach is illustrated in Fig. 7.
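The selection and stacking steps can be sketched as follows; the variable names and thresholds mirror the description above but are illustrative rather than the exact implementation.

```python
# Illustrative sketch of top-model selection and Softmax-output stacking.
import numpy as np

def select_top_models(mean_acc, cv, top_k=1000, keep=511, acc_cut=0.99):
    """mean_acc, cv: per-model mean validation accuracy and coefficient of variation
    computed over the three independently initialized copies of each model."""
    z = (mean_acc - mean_acc.mean()) / mean_acc.std()   # standardize accuracies
    by_z = np.argsort(z)[::-1][:top_k]                  # top models by ranked z score
    by_cv = by_z[np.argsort(cv[by_z])][:keep]           # retain the most consistent models
    return by_cv[mean_acc[by_cv] >= acc_cut]            # final base models (278 in this work)

def stacked_features(base_models, X):
    """Concatenate each base model's 5-way Softmax output into (n_positions, 5 * n_models)."""
    return np.hstack([m.predict(X) for m in base_models])
```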
Training of the Meta-Model Architecture
We investigated two model architectures as our meta-model for the newly formed training dataset with 1,390 features at each position: multinomial logistic regression and a neural network. For supervision, we used the reference genome bases as the assumed targets. We also utilized stratified K-fold cross-validation (K = 10) to facilitate hyperparameter tuning and model generalization. This cross-validation approach introduces lower variance than a single hold-out split does, which can be significant when the available data are limited. In our case, the multinomial logistic regression meta-learner was initialized with varying hyperparameter combinations, including a maximum of 100,000 iterations, a convergence threshold of 0.001 and an inverse regularization strength (C) to mitigate the risk of overfitting. The neural network meta-model utilized 250 epochs, a batch size of 200 and a validation-loss checkpointing configuration to prevent overfitting while simultaneously retaining the most accurate models. Upon completing the training of the meta-models, final predictions were obtained by testing the performance of the trained models on a test set of 4 unseen clinical samples.
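A minimal sketch of the multinomial logistic regression meta-learner with stratified 10-fold cross-validation, assuming a scikit-learn implementation, is given below; the solver defaults and helper names are illustrative.

```python
# Hedged sketch of the logistic regression meta-learner with stratified 10-fold CV.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def fit_meta_learner(X_stacked, y, C=1.0):
    """X_stacked: (n_positions, 1390) stacked Softmax features; y: target bases."""
    meta = LogisticRegression(max_iter=100_000, tol=0.001, C=C)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(meta, X_stacked, y, cv=cv, scoring="accuracy")
    meta.fit(X_stacked, y)                   # refit on the full training data
    return meta, scores.mean()
```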
Performance Evaluation of the Ensemble Architecture
To assess the performance of the two meta-models (multinomial logistic regression and neural network models) along with the base models, we utilized various methodologies for evaluating the supervised models; using a confusion matrix (Fig. 8 and Supplementary Figs. S1, S2 and S3), we determined the sensitivity, specificity, accuracy, precision and F score (Table 2). Additionally, we assessed the relationship between the quality of the base calls, expressed as a quality metric (Q score), and the overall accuracy of the model (pre- and poststacking). Since the Softmax activation function outputs a probability distribution over the 5 possible base calls, the call for each position is defined as the base with the highest probability, \(P_{max}\). Therefore, we calculate the Q score as \(Q = -\log_{10}(1 - P_{max})\). We also evaluated the relationship between coverage and the Q score to demonstrate the proficiency of the base-calling model over the 29,846 bases in each of the four test clinical samples.
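The Q-score calculation defined above can be applied directly to the Softmax outputs, as sketched below; the small epsilon guard is an implementation detail we add to avoid taking the logarithm of zero when \(P_{max}\) rounds to 1.

```python
# Sketch of the per-position Q-score calculation, Q = -log10(1 - P_max).
import numpy as np

def q_scores(probs, eps=1e-12):
    """probs: (n_positions, 5) Softmax outputs -> per-position Q scores."""
    p_max = probs.max(axis=1)
    return -np.log10(np.clip(1.0 - p_max, eps, None))   # eps guards against P_max == 1
```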