We conducted an analytical bench study to evaluate the performance of the MALDI-TOF-MS COVID-19 testing method using SARS-CoV-2 RNA PCR positive and negative samples. The goal of this study was to determine the accuracy, positive percent agreement (PPA), and negative percent agreement (NPA) of the MALDI-TOF-MS method relative to the PCR method, which served as the comparative method.
Study Population / Samples:
The study was approved by the UC Davis Institutional Review Board. Informed consent was obtained, and 226 nasal swab samples (anterior nares) preserved in saline transport media were obtained from the UC Davis Clinical Laboratory Biorepository. The study population included symptomatic and asymptomatic patients meeting COVID-19 testing criteria (i.e., patients who presented with or without symptoms at the time of collection) as well as asymptomatic volunteers tested as part of workplace screening. Saline viral transport media was used because of its widespread availability and compatibility with MALDI-TOF-MS techniques. Commercially available swabs (Copan, Murrieta, CA) were used for collection. All samples were stored at -70°C prior to testing.
MALDI-TOF-MS Method:
The study testing workflow is illustrated in Figure 1. Mass spectrometry testing was performed on a Shimadzu 8020 MALDI-TOF-MS analyzer (Shimadzu Scientific Instruments, Columbia, MD). Sample processing was conducted in a Class II biosafety cabinet. Nasal swabs were first plated directly onto the MALDI-TOF-MS target plate, followed by addition of α-cyano-4-hydroxycinnamic acid (CHCA) matrix in an ethanol, acetonitrile, and water solution with 3% trifluoroacetic acid (TFA). Plated samples were then inactivated by ultraviolet (UV) irradiation for 10 minutes, after which the target plate was transferred to the MALDI-TOF-MS analyzer for testing. The acquisition mass range was 2,000 to 20,000 Daltons. Ten laser shots were fired for each profile at a frequency of 100 Hz using a dithering pattern (1,000 shots per well in total). Post-acquisition baseline subtraction and smoothing were performed using MALDI Solutions software (Shimadzu Scientific Instruments, Columbia, MD) with the following parameters: Baseline Filter Width = 250, Smoothing = Gaussian, Smoothing Width = 50, and Peak Width = 5. Peak picking was performed with the MALDIQuant software package using the Threshold Apex algorithm, which assigns the peak mass to the highest point on the peak. With this protocol, the MALDI-TOF-MS analyzer completed 48 runs (samples and quality controls) every 20 minutes. Mass spectra were then standardized prior to ML analysis, with peak selection and alignment performed using MALDIQuant.
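The exact algorithms inside the vendor software are not public, but the pre-processing sequence described above (baseline subtraction, Gaussian smoothing, and apex-based peak picking) can be sketched with standard scientific-Python tools. Everything below is an illustrative assumption, not the MALDI Solutions or MALDIQuant implementation: the rolling-minimum baseline, the smoothing sigma, and the peak-height threshold are all stand-in choices, and the two-peak spectrum is synthetic.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, minimum_filter1d
from scipy.signal import find_peaks

def process_spectrum(mz, intensity, baseline_width=250, smooth_width=50):
    """Illustrative baseline subtraction, Gaussian smoothing, and
    threshold-apex peak picking; parameter names mirror the reported
    settings but the vendor algorithms are proprietary."""
    # Rolling-minimum baseline estimate (stand-in for the Baseline Filter)
    baseline = minimum_filter1d(intensity, size=baseline_width)
    corrected = intensity - baseline
    # Gaussian smoothing (sigma chosen as an assumption, not the vendor value)
    smoothed = gaussian_filter1d(corrected, sigma=smooth_width / 4)
    # Apex peak picking: the peak mass is the highest point on the peak
    idx, _ = find_peaks(smoothed, height=3 * np.median(smoothed), width=5)
    return mz[idx], smoothed[idx]

# Synthetic spectrum over the 2,000-20,000 Da acquisition range
mz = np.linspace(2000, 20000, 18000)
signal = 50 + 0.001 * mz                       # drifting baseline
for center in (4500.0, 9700.0):                # two synthetic protein peaks
    signal += 800 * np.exp(-((mz - center) ** 2) / (2 * 8.0 ** 2))
peak_mz, peak_int = process_spectrum(mz, signal)
```

Run on the synthetic spectrum, the sketch recovers the two planted peaks near 4,500 and 9,700 Da despite the drifting baseline.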
Comparative Method:
Residual saline transport media was tested by RT-PCR using Food and Drug Administration (FDA) emergency use authorized (EUA) assays (Table 1).5 These EUA assays were the cobas 6800 SARS-CoV-2 assay (Roche Molecular Systems, Pleasanton, CA) and a droplet digital RT-PCR assay (Bio-Rad, Hercules, CA). Briefly, the cobas 6800 SARS-CoV-2 EUA assay targets the open reading frame 1ab (ORF1ab) and envelope protein (E) gene regions, while the droplet digital RT-PCR method targets two regions within the nucleocapsid (N) protein gene. Both assays report sensitivity and specificity of >99% in their FDA EUA documentation. Two different assays were used because of supply constraints during the pandemic.
Machine Learning:
The machine learning (ML) aspects of this study were carried out through the Machine Intelligence Learning Optimizer (MILO) automated ML platform (MILO ML, LLC, Sacramento, CA), which has been described in several recent papers.13-16 Briefly, MILO includes an automated data processor, data feature selectors (an ANOVA F-statistic select-percentile selector and a random forest [RF] feature importances selector) and a feature set transformer (e.g., principal component analysis), followed by its custom supervised ML model builder, which uses custom hyperparameter search tools (i.e., grid search and random search) to find the optimal hyperparameter combinations for its embedded supervised algorithms (i.e., deep neural network [DNN], logistic regression [LR], naïve Bayes [NB], k-nearest neighbor [k-NN], support vector machine [SVM], RF, and XGBoost gradient boosting machine [GBM]). Ultimately, MILO employs a combination of unsupervised and supervised ML approaches drawn from a large set of algorithms, scalers, scorers, and feature selectors/transformers to create thousands of unique ML pipelines (Figure 2) that generate over a hundred thousand models, which are then statistically assessed to identify the best-performing model for a given task.
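MILO itself is a proprietary platform, but the core idea of its search, crossing scalers, feature-selection percentiles, algorithms, and hyperparameters into many candidate pipelines, can be sketched with scikit-learn. This is a deliberately tiny slice of such a search on synthetic stand-in data, not MILO's actual search space or API:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Stand-in data: rows = spectra, columns = MS peak intensities
X, y = make_classification(n_samples=300, n_features=60, n_informative=10,
                           random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("select", SelectPercentile(f_classif)),
                 ("model", LogisticRegression(max_iter=1000))])

# Scalers x feature percentiles x algorithms x hyperparameters,
# a miniature analogue of the thousands of pipelines MILO builds
param_grid = [
    {"scaler": [StandardScaler(), MinMaxScaler()],
     "select__percentile": [25, 50, 100],
     "model": [LogisticRegression(max_iter=1000)],
     "model__C": [0.1, 1.0, 10.0]},
    {"scaler": [StandardScaler(), MinMaxScaler()],
     "select__percentile": [25, 50, 100],
     "model": [SVC()],
     "model__C": [0.1, 1.0, 10.0]},
]
search = GridSearchCV(pipe, param_grid, cv=10, scoring="roc_auc").fit(X, y)
best_model = search.best_estimator_
```

Each grid entry yields a complete fitted pipeline; MILO's contribution is automating this combinatorial construction at a much larger scale and across seven algorithm families.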
For this study, we imported the trial data into MILO using COVID-19 status as the outcome measure. The aforementioned functions were then performed automatically by MILO. The data are first assessed to ensure that model training and the initial validation step are based on a balanced dataset. In the build phase of MILO, the first balanced dataset (Dataset A) is split into training and validation test sets in an 80:20 split, with 10-fold cross-validation applied to the training portion. Since many algorithms benefit from scaling, in addition to the unscaled data, the training dataset also underwent two possible scaling transformations (i.e., standard scaler and min-max scaler). To evaluate the effect of different features on model performance, various statistically significant feature subsets (i.e., subsets of MS peaks) or transformed feature sets were also selected to build new datasets with fewer or transformed features, which were fed into the supervised algorithms described above. The features selected in this step are derived from several well-established statistical/ML techniques, including the ANOVA F-statistic select-percentile selector and RF feature importances, or are transformed using principal component analysis.9 A large number of supervised ML models are then built from these datasets through MILO's various supervised algorithms (i.e., DNN, SVM, NB, LR, k-NN, RF, and GBM), scalers, hyperparameters, and feature sets. Notably, the final validation of each model within MILO is not based on the 20% test set generated from the initial training dataset (i.e., Dataset A); rather, each model's true performance is based on its predictive capability on an independent secondary dataset (Dataset B).
Ultimately, for final model validation, MILO's thousands of generated models are individually passed to the next phase of the MILO engine, the generalization assessment phase (Figure 2). This secondary testing approach markedly reduces the possibility of overfitted ML models, since each model's final performance measures are based solely on the independent secondary dataset (Dataset B), as noted above. The final model performance results are then tabulated by MILO's interface and reported as clinical sensitivity, specificity, accuracy, negative predictive value (NPV), positive predictive value (PPV), F1 score, receiver operating characteristic (ROC) curves, and Brier scores with reliability curves.
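The performance measures reported in the generalization phase are all standard quantities derivable from a model's predictions on the held-out Dataset B. As a sketch (not MILO's internal code), they can be computed with scikit-learn; the probability threshold of 0.5 and the toy labels below are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import (brier_score_loss, confusion_matrix,
                             f1_score, roc_auc_score)

def generalization_report(y_true, y_prob, threshold=0.5):
    """Performance measures of the kind MILO reports on an independent
    secondary dataset (illustrative implementation)."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
    }

# Toy example: predicted probabilities for eight held-out samples
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]
report = generalization_report(y_true, y_prob)
```

Because every number comes only from Dataset B predictions, a model that merely memorized Dataset A scores poorly here, which is the overfitting safeguard described above.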
Statistical Analysis:
Statistical analysis was performed using JMP software (SAS Institute, Cary, NC). Area under the ROC curve analysis was also performed, as was calculation of PPA and NPA, which served as surrogates for sensitivity and specificity. The use of PPA and NPA is recommended by the FDA because no proven "gold standard" for SARS-CoV-2 detection was available at the time.5,17 An independent principal component analysis (PCA) using scikit-learn was also performed on the more than 600 MS peaks evaluated here; its first three principal components (PC1-PC3; results not shown) highlighted many of the peaks shared with the MILO feature selector approach (i.e., RF feature importances [25%]) that yielded one of the best-performing ML models in this study.