In this study, our machine learning framework provides a proof-of-concept for predicting COVID-19 status (RT-PCR test result) relying only on baseline demographics, comorbidities, vitals, and lab values. The predictive models could prioritize sub-populations for COVID-19 testing in situations where testing capacity is limited, or they could be used in conjunction with clinical judgment or other predictive models (e.g., based on mobile phone data) to verify RT-PCR test results. Our predictive models also identified key clinical features that correlate with a positive test status, providing insights into efficient patient stratification and population screening. Moreover, the decision algorithm derived from the single-tree XGBoost model provides a simple, clinically operable method of stratifying sub-populations that can be replicated in other settings.
As the best-performing model in the test set, the multi-tree XGBoost model revealed several key clinical variables predictive of a positive RT-PCR test result. One key finding was that serum calcium level is a highly predictive feature of COVID-19 status; consistent with this, previous studies identified serum calcium as a biomarker of clinical severity and poor prognosis in COVID-19 patients.23,24 Our single-tree XGBoost model uses serum calcium level < 9.05 mg/dL as the first split in the decision tree (Fig. 4), which is also consistent with previous findings confirming the prevalence of hypocalcemia in severe COVID-19 patients.23 Notably, a serum calcium test typically has a rapid turnaround time (within a day) and thus may be valuable in complementing existing tests.
The development of acute respiratory distress syndrome (ARDS) and/or sepsis, along with their associated symptoms, has also been shown to be a key indicator of positive COVID-19 status.2,25 While the datasets used to train and test the machine learning models did not directly include symptoms of COVID-19, the trained models prioritized features that may contribute to COVID-19 positivity in both symptomatic and asymptomatic individuals. Our multi-tree XGBoost model identified features such as age, lab values (AST levels, total bilirubin levels), comorbidities (cancer, diabetes, HIV, smoking), vitals (oxygen saturation, temperature), and hematologic features (lymphocyte count, hemoglobin) as predictive of positive test status. Many of these features have previously been reported as markers of COVID-19 severity. For instance, abnormal liver function tests, which include elevated levels of AST and bilirubin, have previously been found to be markers of poor clinical outcome in COVID-19 patients.24–26 Decreased lymphocyte count (lymphopenia) and hemoglobin levels (anemia) have been confirmed to be positively correlated with serum calcium levels and severe COVID-19 disease progression.6,23,24,27
We highlight several deliberate design choices that we made for practical reasons in constructing the machine learning models. First, splitting the train and test sets by date mimics real-world situations, in which predictive models can only be trained on past data to make prospective predictions. The sampling of tested individuals likely differed across the two time periods in New York City, and the ability of our model to predict outcomes in a prospective cohort based on past data provides confidence in this approach. Second, our models rely solely on baseline features that can be easily obtained at the initial patient encounter, which has significant practical implications for prioritizing sub-populations for testing in areas with limited test kits or testing capacity. Nevertheless, the resulting models remain a proof-of-principle; given the different sampling populations and available lab tests, the best performance in COVID-19 status prediction can likely be achieved if each testing site derives its own predictive model. Our code is made openly available for future implementations.
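The date-based split described above can be illustrated with a minimal sketch. This is not the study's code: the record fields (`test_date`, `rt_pcr_positive`), the toy records, and the cutoff date are all hypothetical and chosen only to show the idea that every training record precedes every test record in time.

```python
from datetime import date


def split_by_date(records, cutoff, date_key="test_date"):
    """Split records into (train, test) by a cutoff date.

    Records dated strictly before `cutoff` go to the training set;
    records dated on or after `cutoff` go to the test set.
    """
    train = [r for r in records if r[date_key] < cutoff]
    test = [r for r in records if r[date_key] >= cutoff]
    return train, test


# Hypothetical toy records; field names and dates are illustrative only.
records = [
    {"patient_id": 1, "test_date": date(2020, 3, 10), "rt_pcr_positive": True},
    {"patient_id": 2, "test_date": date(2020, 3, 25), "rt_pcr_positive": False},
    {"patient_id": 3, "test_date": date(2020, 4, 2), "rt_pcr_positive": True},
    {"patient_id": 4, "test_date": date(2020, 4, 15), "rt_pcr_positive": False},
]

# Assumed cutoff for illustration; the study's actual split date is not used here.
cutoff = date(2020, 4, 1)
train, test = split_by_date(records, cutoff)
```

Unlike a random split, this arrangement never lets information from later encounters leak into training, so test-set performance is closer to what a site would see when applying a previously trained model to newly arriving patients.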
Our machine learning models have several limitations. First, the variable predicted by the models was the RT-PCR test result, which, despite being widely regarded as the gold standard6,7 for the diagnosis of COVID-19, is still prone to error and thus limits the peak performance of the models. We note that if data from a new test outperforming the RT-PCR test become available, our predictive models can be adapted to leverage the results of the new test. Second, the data contained a high proportion of missing values in certain variables, especially lab tests, which may have contributed to the low precision of our models. Although XGBoost models are compatible with missing values, the inclusion of more complete patient records may improve the performance of subsequent model versions. Finally, as more data become available, our machine learning models could be retrained and validated in other settings (e.g., health systems, testing sites, schools) to evaluate model performance and utility across populations.
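To make the missing-data limitation concrete, the following minimal sketch shows one way a site could quantify the proportion of missing values per variable before training. The column names and values are hypothetical toy data, not taken from the study, and a missing measurement is represented here as `None` or an absent key as an assumption of this example.

```python
def missing_fraction(rows, columns):
    """Return {column: fraction of rows in which the value is missing}.

    A value counts as missing if the key is absent or the value is None.
    """
    n = len(rows)
    if n == 0:
        return {c: 0.0 for c in columns}
    return {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}


# Hypothetical toy patient records; names and values are illustrative only.
rows = [
    {"age": 54, "calcium_mg_dl": 8.7, "ast_u_l": None},
    {"age": 37, "calcium_mg_dl": None, "ast_u_l": None},
    {"age": 71, "calcium_mg_dl": 9.4, "ast_u_l": 41.0},
    {"age": 29, "calcium_mg_dl": 9.1},  # AST not measured (key absent)
]

fractions = missing_fraction(rows, ["age", "calcium_mg_dl", "ast_u_l"])

# List variables from most to least incomplete.
for name, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {frac:.0%} missing")
```

A summary like this could help decide which lab variables are too sparsely recorded at a given site to be informative, even though the model itself tolerates missing entries.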
Overall, this study provides a proof-of-concept that predictive models of COVID-19 case status can be developed to help prioritize sub-populations for more efficient screening or to complement existing tests. Given that the COVID-19 pandemic continues to affect large fractions of populations,2,15 efficient screening of COVID-19 status, identification of risk factors for COVID-19 positivity, and stratification of patient populations will play a crucial role in allocating limited testing resources and facilitating patient management.