To our knowledge, this study is the first attempt to apply an integrated and interpretable ML analytical framework to a large series of COVID-19 patients in order to gain deeper insight into the risk factors that shaped their symptom status. We identified important demographic and clinical predictors of symptom status and unveiled their complex non-linear relationships. In general, we found that a combination of clinical and demographic features, including targeted inflammatory markers, radiographic findings, laboratory blood tests, and patients’ characteristics, was important in predicting the risk of being symptomatic.
Our ML model revealed that CRP levels, diffuse opacification on CXR, and respiratory rate were the most important features for predicting which patients infected with COVID-19 would subsequently develop symptoms (Fig. 1). These results are consistent with past studies, in which CRP levels have been identified as a potential biomarker for symptomatic COVID-19 patients [31]. Moreover, these findings support the notion that imaging modalities, such as chest X-rays and computed tomography, are important for diagnosing COVID-19 patients [32].
Community type, travel history, and transmission type were the most important demographic features for predicting the risk of symptomatic status (Fig. 1). Being a citizen-resident of Kuwait is associated with a higher socioeconomic status compared with non-citizen residents. Despite this, our results indicate that citizen-residents were more likely to develop symptoms (Fig. 2B). This finding is surprising, especially as Alkhamis et al. inferred that significant spreading and cluster events in migrant worker communities were substantially more severe than those among citizen-residents, owing to the migrant workers’ densely populated areas and poor living conditions [33]. A potential explanation is that migrant workers tend to represent a much younger subset of the population in Kuwait [34]. Our ML model inferred that patients with a recent travel history are less likely to develop symptoms (Fig. 2C). We attribute this finding to the government’s extensive intervention measures of testing and forced institutional quarantine of arriving travelers at the beginning of the epidemic in Kuwait [33]. In addition, healthcare workers were more likely to develop symptoms despite having access to personal protective equipment during their duties (Fig. 2F). Close and prolonged contact with a large number of COVID-19 patients while performing high-risk procedures, such as aerosol-generating procedures (e.g., intubation), may be a contributory risk factor [35].
Past studies have inferred a linear relationship between evidence of kidney injury on admission and the clinical course of COVID-19 [36]. Using our ML pipeline, we were able to explore this relationship further by modeling non-linear interactions between features (Fig. 3), and found that eGFR has the strongest overall interactions with other variables in shaping the risk of being symptomatic (Fig. 3A). Indeed, COVID-19 patients with low eGFR may be more likely to be severely ill on admission than patients with normal kidney function, as described elsewhere [37]. The underlying mechanism has been hypothesized to involve the angiotensin-converting enzyme 2 (ACE-2) receptor, whose expression has been shown to be 100 times greater in the kidney than in the lung. In addition to its function as a receptor for SARS-CoV-2 entry into the alveolar cells of the lung, the ACE-2 enzyme has also been shown to interact with the virus directly, affecting the renin-angiotensin-aldosterone system (RAAS) physiologically. This process might indicate that patients with chronic kidney disease (CKD) are more susceptible to developing complicated COVID-19, since they have high RAAS activity, resulting in systemically increased expression of ACE-2, a major entry site for the virus. Feature interaction plots with eGFR show that patients with low urea (Fig. 4A), elevated total protein (Fig. 4D), and hyponatremia (Fig. 3F) are at higher risk of being symptomatic. These results unveil the complexity of the acute disease phase upon admission, in which patients might experience multiple severe inflammatory processes and a negative fluid balance as a result of impaired renal function [38].
A general limitation of the present study is the population size and potential selection bias in our study population. That said, our data were collected from the official COVID-19 treatment hospital (i.e., Jaber Al-Ahmad Al-Sabah Hospital), which makes this population representative of the whole state of Kuwait. Furthermore, our ML pipeline cannot adequately characterize the uncertainty in the model predictions. Methods such as Bayesian additive regression trees (BART) are more robust in quantifying such uncertainty, although they are limited by their requirement for larger datasets and demanding computations [39]. An advantage of the present analytical pipeline is the applicability of Shapley values to interpreting, at a finer scale, how our model classifies symptom status (i.e., why one infected individual developed COVID-19 symptoms while another did not). For example, for a randomly selected patient from our cohort (Fig. 5A), having high CRP, low eGFR, hyponatremia, and diffuse opacification were associated with that patient becoming symptomatic.
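To make the per-patient interpretation concrete, the following is a minimal sketch of how exact Shapley values attribute a single prediction to individual features. It does not reproduce the study's pipeline: the scoring function `toy_risk`, the binary feature names, the weights, and the all-zero baseline are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance `x` relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    The cost grows exponentially with the number of features.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        # Model output when only the coalition's features take the patient's values.
        z = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(z)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(n):  # k = size of the coalition that f joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(z):
    # Hypothetical risk score, NOT the model fitted in the study.
    score = 0.3 * z["crp_high"] + 0.2 * z["opacification"]
    # Illustrative interaction: low eGFR counts for more alongside hyponatremia.
    score += z["egfr_low"] * (0.1 + 0.2 * z["hyponatremia"])
    return score


# Hypothetical patient with all four findings, compared with a patient with none.
patient = {"crp_high": 1, "opacification": 1, "egfr_low": 1, "hyponatremia": 1}
baseline = {name: 0 for name in patient}

phi = shapley_values(toy_risk, patient, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

In this toy setting the contributions sum to the difference between the patient's score and the baseline score, and the interaction term is split equally between low eGFR and hyponatremia. Libraries used in practice typically approximate these values rather than enumerating every coalition.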
By providing deeper insights into the underlying disease processes that dictate patients’ clinical course, our ML pipeline can potentially be used to risk-stratify patients. Biomarkers and demographic data can be used as a proxy for disease status, potentially eliminating the need for extensive testing, which has exhausted healthcare resources worldwide, particularly for COVID-19. In addition, ML models can be robust tools for COVID-19 case definitions and may therefore help avoid inaccurate mapping of epidemic trajectories through public health surveillance activities [40]. It is worth noting that while our ML model identified community type as an important feature (Fig. 1), it was not significantly associated with the study outcome (p-value = 0.270; Additional file 1) using traditional statistical methods. Thus, the p-values commonly used to assess the statistical significance of the association between two variables might not be a reliable measure of inference in population-based studies [41].
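One way a feature can rank as important in an ML model while showing no univariable association with the outcome is through interactions with other features. The sketch below illustrates this possibility with entirely hypothetical data; it is not drawn from the study and does not claim to explain the community-type result. The feature names `community` and `other`, the XOR outcome rule, and the permutation-style importance measure are all assumptions made for illustration.

```python
import random
from itertools import product

# Hypothetical population: every combination of two binary features, equally common.
rows = [{"community": a, "other": b} for a, b in product([0, 1], repeat=2)] * 25


def outcome(row):
    # Hypothetical rule: the outcome depends on the two features only jointly (XOR).
    return int(row["community"] != row["other"])


labels = [outcome(r) for r in rows]


def outcome_rate(value):
    # Outcome rate among rows where `community` equals `value`.
    selected = [y for r, y in zip(rows, labels) if r["community"] == value]
    return sum(selected) / len(selected)


# Univariable view: the outcome rate is identical in both community groups.
marginal_difference = outcome_rate(1) - outcome_rate(0)


def accuracy(data):
    # Accuracy of the rule itself, used here as a stand-in for a fitted model.
    return sum(outcome(r) == y for r, y in zip(data, labels)) / len(labels)


baseline_accuracy = accuracy(rows)

# Permutation-style importance: shuffle `community` and measure the accuracy drop.
rng = random.Random(0)
shuffled = [r["community"] for r in rows]
rng.shuffle(shuffled)
permuted = [{"community": c, "other": r["other"]} for c, r in zip(shuffled, rows)]
importance = baseline_accuracy - accuracy(permuted)

print(f"marginal difference in outcome rate: {marginal_difference:.3f}")
print(f"accuracy drop after shuffling 'community': {importance:.3f}")
```

Here the marginal comparison shows no difference between groups, yet shuffling the feature sharply degrades predictions, because its effect is expressed only in combination with the second feature.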