Generation of training, testing, and validation datasets
Data from resequencing microarrays
The NNM was constructed from input features extracted from images acquired from the resequencing microarray, using over 30 clinical samples from the Wyoming Public Health Laboratory, as detailed in our previous work [14]. The training and test datasets in this study consisted of measurements of the average pixel intensity and the standard deviation of pixel intensities from each sense and antisense probe at each base position in the genome (Fig. 1). After these data were extracted from the images, each base position was described by 48 variables used in the NNM, with each probe set consisting of eight nucleotide features, A, T, C, and G, on both the sense and antisense strands, as demonstrated in previous reports [14]. For each feature, the data included the average pixel intensity and the standard deviation of the pixel intensities at three different image exposure times: 0.25 s, 1 s, and 4 s (2 metrics × 3 exposures × 8 features = 48 measurements per base). This resulted in approximately 570,000 data points for training and validating the models. Additionally, we designed multinomial logistic regression and neural network meta-learner models, whose performance was tested on four clinical samples.
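For concreteness, the per-position feature layout described above can be sketched as follows; the column names are illustrative placeholders rather than those used in the actual extraction pipeline.

```python
# Minimal sketch of the 48-feature layout per base position described above.
# Column names are illustrative, not the names used in the original pipeline.
from itertools import product

metrics = ["mean_intensity", "std_intensity"]           # 2 pixel metrics
exposures = ["0.25s", "1s", "4s"]                        # 3 exposure times
probes = [f"{strand}_{base}"                             # 8 probe features
          for strand, base in product(["sense", "antisense"], "ATCG")]

feature_columns = [f"{m}_{e}_{p}" for m, e, p in product(metrics, exposures, probes)]
assert len(feature_columns) == 48                        # 2 x 3 x 8 = 48 per base
```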
Preprocessing/Model Selection
An extensive ensemble of over 12,000 supervised neural network models (SNNMs) was trained by varying neural network architectural hyperparameter combinations and two additional preprocessing parameters: ‘mid-select’ and ‘consensus-type’. Each permutation of the hyperparameter combination was thus a stand-alone model. The first novel preprocessing hyperparameter introduced was ‘mid-select’, which excludes from the training set a predetermined number of nucleotides at both ends of the sequenced genome because of indeterminacy in probe hybridization at the ends of the probed genome. For example, a mid-select parameter of 1000 would remove 2000 total positions from each training sample. The ‘consensus-type’ hyperparameter is described in the next section. Figures 2 and 3 portray the frequency of hyperparameters used in the total models and the top models (base models), respectively. To address the multilabel classification problem, with five distinct labels and one potential output at each genomic position, a transformation pipeline was applied to the categorical variables representing the base list (“A”, “C”, “G”, “T” and “N”) so that the model outputs a numerical representation of the most likely base at each position via the Softmax activation function, thus facilitating the classification capabilities of the model. During the preprocessing phase of the neural network, varying values of the mid-select hyperparameter in the range 0 ≤ x ≤ 10,000 were applied to the training data. The number of neurons spanned 5 ≤ x ≤ 11, the batch size spanned 500 ≤ x ≤ 5,000, and the layer depth was two or more. Additionally, nonlinear activation functions such as relu, selu and tanh were integrated within the network architecture, and to mitigate overfitting, the entire dataset was split into separate training (~ 74%) and validation (~ 26%) datasets at the beginning of the neural network model (NNM) assembly.
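A minimal sketch of how such a hyperparameter grid, including the ‘mid-select’ trimming step, could be enumerated is shown below; the specific values and function names are assumptions chosen to fall within the reported ranges, not the exact grid used in this work.

```python
# Illustrative enumeration of hyperparameter combinations, including the
# 'mid-select' preprocessing parameter; values are examples within the
# reported ranges, not the exact grid that was trained.
from itertools import product

mid_select   = [0, 1000, 5000, 10000]       # positions trimmed from each genome end
consensus    = ["reference", "nonhybrid", "hybrid"]
neurons      = range(5, 12)                 # 5 <= x <= 11
batch_sizes  = [500, 1000, 2500, 5000]      # within 500 <= x <= 5000
activations  = ["relu", "selu", "tanh"]
depths       = [2, 3, 4]                    # two or more layers

configs = [dict(mid_select=m, consensus=c, neurons=n, batch=b, act=a, depth=d)
           for m, c, n, b, a, d in product(mid_select, consensus, neurons,
                                           batch_sizes, activations, depths)]

def apply_mid_select(X, y, m):
    """Drop m positions from each end of a training sample (2*m removed in total)."""
    return (X[m:len(X) - m], y[m:len(y) - m]) if m else (X, y)
```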
Consensus-Type Hyperparameter Approach
Sequencing errors are inherent in sequencing technologies, including the resequencing method employed in our study and any next-generation sequencing technology [24]. Therefore, we created a ‘consensus-type’ preprocessing hyperparameter to increase confidence in selecting training targets to supervise our models. The consensus framework was based on evaluating base agreement across all training target options from the sense and antisense probes at each of the three available exposure times. At each position in the genome, an initial base call was made for each of the 6 datasets (sense and antisense readings at 3 different exposures) by taking the probe with the highest average hybridization signal intensity in its probe set. The size of the majority agreement among these base calls then assigns a consensus score of \(n\) to that position, where \(n\) can range between 2 and 6, inclusive. Therefore, during preprocessing, the training set at a consensus level of \(N\) is said to have satisfied the consensus requirements if \(n \ge N\) (Table 1). To remove ambiguities due to ties in agreement, \(N\) is restricted to values of 4, 5, or 6. Note that \(N\) is manually varied to process our training dataset, as described below.
Table 1: Examples of consensus score calculations (n) and majority agreements for the 6 probe sets. The final column indicates whether each example satisfies the consensus requirements for N = 5.
| n | Sense (0.25 s) | Antisense (0.25 s) | Sense (1 s) | Antisense (1 s) | Sense (4 s) | Antisense (4 s) | Base Call Majority Agreement | Satisfies Consensus Requirements? |
|---|---|---|---|---|---|---|---|---|
| 3 | A | T | A | C | G | A | A | No |
| 3 | A | A | A | T | T | T | A and T | No |
| 6 | T | T | T | T | T | T | T | Yes |
| 4 | A | T | A | A | T | A | A | No |
| 2 | C | A | C | T | T | A | A, C and T | No |
The reference bases of the SARS-CoV-2 genome were chosen as the initial targets; this reference consensus type therefore serves as a control. It retains the largest set of data for training but fails to address possible true variants within the samples. We then created a nonhybrid consensus algorithm to select targets on the basis of their experimental hybridization intensities. Each potential training example is checked against the consensus requirements: if they are satisfied, the consensus base is used as the training target; otherwise, that example is discarded. While this rarely changes the training target for a position from the reference, it does allow for true variants supported by the experimental data, but it reduces the overall training dataset size. A third consensus type, called the hybrid consensus, was created to combine the strengths of the two prior types. In the hybrid version, every training example is kept, but the target base call is changed from the reference only if the consensus requirements are satisfied and the majority consensus base differs from the reference at that position. With this consensus type, the training set size is kept at its maximum, and experimental hybridization intensities can still inform the model about true variants. Each type is employed to investigate the computational advantages of its preprocessing features and to provide a diverse set of models to be stacked at a later stage.
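The following sketch illustrates the consensus-score logic and the three consensus types described above; the function names and data layout are illustrative rather than the implementation used in this work.

```python
# Hedged sketch of the consensus-score logic and the three consensus types.
from collections import Counter

def consensus_score(calls):
    """calls: six initial base calls (sense/antisense x 3 exposures).
    Returns (majority_bases, n), where n is the size of the largest agreement."""
    counts = Counter(calls)
    n = max(counts.values())
    majority = [base for base, c in counts.items() if c == n]
    return majority, n

def training_target(calls, reference_base, consensus_type, N=5):
    """Pick the training target for one genome position, or None to discard it."""
    majority, n = consensus_score(calls)
    satisfied = n >= N and len(majority) == 1
    if consensus_type == "reference":       # control: always the reference base
        return reference_base
    if consensus_type == "nonhybrid":       # keep only consensus-satisfying examples
        return majority[0] if satisfied else None
    if consensus_type == "hybrid":          # keep all; override reference only on consensus
        return majority[0] if satisfied and majority[0] != reference_base else reference_base
    raise ValueError(consensus_type)

# Example: the Table 1 row with calls A,T,A,A,T,A gives n = 4, which fails N = 5.
print(training_target(list("ATAATA"), "A", "nonhybrid"))   # -> None (discarded)
```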
Selection criteria for clinical sample test data
During the validation process with clinical samples, we limited the analysis to samples with a cycle threshold (Ct) value less than or equal to 24 to ensure the reliability of the Stack Ensemble Neural Network Model for variant identification [25]. This requirement is a common, standard metric for ensuring the reliability of the sample input for predictions. The samples we analyzed were USA/WY-WYPHL-00024/2020, USA/WY-WYPHL-00026/2020, USA/WY-WYPHL-00032/2020, USA/WY-WYPHL-00036/2020, USA/WY-WYPHL-00041/2020, USA/WY-WYPHL-00044/2020, USA/WY-WYPHL-00059/2020, and USA/WY-WYPHL-00064/2020, which we refer to as WYOM 24, WYOM 26, WYOM 32, WYOM 36, WYOM 41, WYOM 44, WYOM 59, and WYOM 64, respectively. The samples with sufficient Ct values that are used in the main body of this work are WYOM 36, WYOM 41, WYOM 59, and WYOM 64. Data on the remaining samples are provided in the supplementary information.
Training the Neural Network Model
Neural Network Model Training and Architecture Assembly
Following the completion of the preprocessing phase, the training data comprising ~ 74% of the total dataset were curated and refined for optimal feature filtering, while the remaining ~ 26% of the total dataset was used for validation. L1 featurewise normalization was applied to the dataset to ensure standardized scales across the input features. The neuron layers were arranged symmetrically in a crescendo-like structure (Fig. 4) consisting of even- and odd-numbered layers. Using the Keras framework, we created a sequential model with an input dimension of 48, representing the number of features per base position. The NNM was completed with an output layer comprising five neurons and a Softmax activation function. To finalize the model configuration, categorical cross-entropy was designated as the loss function, the Adam optimizer was utilized, and the evaluation metric was set to ‘accuracy’. Iteration over the applicable hyperparameters and random initialization of the weights remove any symmetry that might otherwise be introduced into the architectural framework of the models.
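A minimal Keras sketch consistent with this configuration is shown below; the hidden layer widths and depth are illustrative placeholders within the ranges described above.

```python
# Minimal Keras sketch of one base model; layer sizes are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

def build_base_model(n_features=48, hidden=(8, 11, 8), activation="relu"):
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))          # 48 features per base position
    for width in hidden:                                 # crescendo-like hidden layers
        model.add(layers.Dense(width, activation=activation))
    model.add(layers.Dense(5, activation="softmax"))     # probabilities for A, C, G, N, T
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```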
The surveillance algorithm was designed to yield the maximum likelihood call, i.e., the highest probability base call among the five labels [‘A’, ‘T’, ‘C’, ‘N’ and ‘G’], for each position in the validation dataset via the Softmax activation function. Figure 5 illustrates the neural network model’s pipeline, depicting the trajectory of the base call predictions following the model training process.
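Converting the Softmax probabilities into maximum likelihood base calls can be sketched as follows; the label ordering is an assumption and must match the encoding used during training.

```python
# Sketch of turning Softmax probabilities into maximum likelihood base calls.
import numpy as np

LABELS = np.array(["A", "C", "G", "N", "T"])   # assumed label order from training

def call_bases(model, X):
    """X: (n_positions, 48) feature matrix -> one base call per position."""
    probs = model.predict(X)                   # shape (n_positions, 5)
    return LABELS[np.argmax(probs, axis=1)]
```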
Top Model Ensemble Stacking-Base Model Selection
The ensemble stacking approach was adopted to optimize the performance of the neural network model. Ensemble learning is a concept dating back to the early 1990s, and existing reports demonstrate that it enhances model performance relative to the generated base models [26–28] and has applications in various domains, including biotechnology and electricity generation forecasting [29–33]. Ensemble stacking addresses two important requirements: capturing regions where individual models’ performances are at their optimum and providing prediction capabilities on unseen samples.
The ensemble learning approach required a preprocessing step of selecting the top models on the basis of accuracy: the first and third quartiles of the top-model accuracies in the boxplot fell within the ~ 98% \(< x <\) ~ 99% interval, whereas the accuracies of all the models spanned the ~ 30% \(< x <\) ~ 99% interval (Figs. 6, S10), justifying the optimization approach of stacking the best-performing models by using their Softmax output vectors, \([P_A, P_C, P_G, P_N, P_T]\), as features for a meta-model. Because each hyperparameter set was trained with random weight initialization on three individual compute nodes during the fine-tuning process, we considered the accuracy of all three copies of each model during top-model selection. First, of the ~ 12,000 neural networks initially trained, the best model candidates were chosen by converting the average accuracies on a validation dataset to the standard normal distribution and selecting by ranked z score to obtain the most accurate models. The top 1,000 most accurate models were then reduced to 511 via the coefficient of variation to retain the most consistent models. Finally, the top models (base models) consisted of the 278 models that had an accuracy of 99% or greater on the validation dataset at the time of model saving, before stacking. Therefore, we use 5 × 278 = 1,390 features to train the meta-models. The stacking ensemble approach is illustrated in Fig. 7.
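The selection and stacking steps can be sketched as follows; the variable names and thresholds mirror the description above but are illustrative rather than the exact implementation.

```python
# Illustrative sketch of top-model selection and Softmax-output stacking.
import numpy as np

def select_top_models(mean_acc, cv, top_k=1000, keep=511, acc_cut=0.99):
    """mean_acc, cv: per-model mean validation accuracy and coefficient of variation
    computed over the three independently initialized copies of each model."""
    z = (mean_acc - mean_acc.mean()) / mean_acc.std()   # standardize accuracies
    by_z = np.argsort(z)[::-1][:top_k]                  # top models by ranked z score
    by_cv = by_z[np.argsort(cv[by_z])][:keep]           # retain the most consistent models
    return by_cv[mean_acc[by_cv] >= acc_cut]            # final base models (278 in this work)

def stacked_features(base_models, X):
    """Concatenate each base model's 5-way Softmax output into (n_positions, 5 * n_models)."""
    return np.hstack([m.predict(X) for m in base_models])
```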
Training of the Meta-Model Architecture
We investigated two model architectures as our meta-model for the newly formed training dataset with 1,390 features at each position: multinomial logistic regression and a neural network. For supervision, we used the reference genome bases as the assumed targets. We also utilized stratified K-fold cross-validation (K = 10) to facilitate hyperparameter tuning and model generalization. This cross-validation approach introduces lower variance than a single hold-out split does, which can be significant when the available data are limited. In our case, the multinomial logistic regression meta-learner was initialized with varying hyperparameter combinations, including a maximum of 100,000 iterations, a convergence threshold of 0.001 and an inverse regularization strength (C) to mitigate the risk of overfitting. The neural network meta-model utilized 250 epochs, a batch size of 200 and a validation-loss checkpointing configuration to prevent overfitting while simultaneously retaining the most accurate models. Upon completing the training of the meta-models, final predictions were obtained by testing the performance of the trained models on a test set of 4 unseen clinical samples.
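A minimal sketch of the multinomial logistic regression meta-learner with stratified 10-fold cross-validation, assuming a scikit-learn implementation, is given below; the solver defaults and helper names are illustrative.

```python
# Hedged sketch of the logistic regression meta-learner with stratified 10-fold CV.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def fit_meta_learner(X_stacked, y, C=1.0):
    """X_stacked: (n_positions, 1390) stacked Softmax features; y: target bases."""
    meta = LogisticRegression(max_iter=100_000, tol=0.001, C=C)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(meta, X_stacked, y, cv=cv, scoring="accuracy")
    meta.fit(X_stacked, y)                   # refit on the full training data
    return meta, scores.mean()
```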
Performance Evaluation of the Ensemble Architecture
To assess the performance of the two meta-models (multinomial logistic regression and neural network models) along with the base models, we utilized various methodologies for evaluating the supervised models; using a confusion matrix (Fig. 8 and Supplementary Figs. S1, S2 and S3), we determined the sensitivity, specificity, accuracy, precision and F score (Table 2). Additionally, we assessed the relationship between the quality of the base calls, expressed as a quality metric (Q score), and the overall accuracy of the model (pre- and poststacking). Since the Softmax activation function outputs a probability distribution over the 5 possible base calls, the call for each position is defined as the base with the highest probability, \(P_{max}\). Therefore, we calculate the Q score as \(Q = -\log_{10}(1 - P_{max})\). We also evaluated the relationship between coverage and the Q score to demonstrate the proficiency of the base-calling model over the 29,846 bases in each of the four test clinical samples.
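The Q-score calculation defined above can be applied directly to the Softmax outputs, as sketched below; the small epsilon guard is an implementation detail we add to avoid taking the logarithm of zero when \(P_{max}\) rounds to 1.

```python
# Sketch of the per-position Q-score calculation, Q = -log10(1 - P_max).
import numpy as np

def q_scores(probs, eps=1e-12):
    """probs: (n_positions, 5) Softmax outputs -> per-position Q scores."""
    p_max = probs.max(axis=1)
    return -np.log10(np.clip(1.0 - p_max, eps, None))   # eps guards against P_max == 1
```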