In this multicenter observational retrospective cohort study, we aimed to evaluate the feasibility and utility of ML models for predicting mortality in HD patients with COVID-19. Unlike most mortality studies [24], the present study is based only on demographics/comorbidities and clinical findings collected by several dialysis centers. It is an innovative initiative that uses one of the largest databases of patients with COVID-19 on dialysis in the world.
In a meta-analysis by Chen et al., the mortality rate in COVID-19 HD patients was 22.4% (95% CI: 17.9–27.1%); significant statistical heterogeneity among the studies was found (I² = 87.1%, p < 0.001), but no publication bias [11]. According to the same authors, patients from non-Asian countries had a higher mortality rate (26.7%, 95% CI: 22.5–31.0%), and in studies considered to be of good quality, mortality was estimated at 23.8% (95% CI: 20.2–27.6%), which is consistent with the overall mortality of 21.46% in the present study.
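For readers less familiar with the heterogeneity statistic quoted above, I² is commonly derived from Cochran's Q and the number of pooled studies. The following is a minimal sketch of that standard formula; the function name and the input values are illustrative assumptions and are not taken from Chen et al.

```python
def i_squared(q: float, n_studies: int) -> float:
    """Higgins' I^2 (in %) from Cochran's Q and the number of pooled studies."""
    df = n_studies - 1  # degrees of freedom of Q
    if q <= 0 or df <= 0:
        return 0.0
    # Share of total variability attributed to between-study heterogeneity,
    # truncated at zero when Q is smaller than its degrees of freedom.
    return max(0.0, 100.0 * (q - df) / q)


# Hypothetical example: Q = 100 across 11 studies (df = 10).
print(round(i_squared(100.0, 11), 1))  # 90.0
```

By the usual rule of thumb, values above roughly 75% are read as considerable heterogeneity, which is how the reported 87.1% should be interpreted.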
We observed a noticeable difference in the 90-day mortality in the presence of the following variables: dyspnea, advanced age, diabetes, absence of symptoms (asymptomatic), altered mental status, and arterial hypertension. It is worth mentioning that dyspnea is reported as one of the most prevalent clinical findings in several studies on COVID-19 HD patients, right after fever.
According to Chen et al., dyspnea was reported in 16 studies involving HD patients with COVID-19, affecting 438 of 1246 patients (35.2%; 95% CI: 16.9–36.6%) [11]. In the present study, dyspnea emerged not only as a frequent finding but also as the most relevant variable associated with mortality in HD patients. This result is consistent with the early-pandemic work of Zou et al., who found that dyspnea was an independent risk factor for death (OR = 1.146; 95% CI: 1.026 to 1.875; p = 0.034) [24].
Interestingly, as can be seen in Fig. 3B, the presence of dyspnea in patients over 60 years did not appear to increase the probability of death. In contrast, higher odds ratio values for mortality were observed in patients under 60 years with dyspnea. We believe that these findings should be further explored in future studies.
In the temporal validation subset, we observed an increase in the number of patients vaccinated with at least one dose. In addition, substantial advances in treatment protocols and in the understanding of the disease had emerged by that time. These changes caused a dataset shift. The increased vaccination rates generated a prior probability shift, as they modified the mortality distribution of patients with COVID-19.
Furthermore, vaccination altered the distribution of latent covariates relative to the observable covariates, generating a simple covariate shift. The progress in therapeutics changed the presentation of clinical findings and affected mortality, as reported in the RECOVERY study [25]. However, even in the face of all these disease modifiers, our algorithm performed consistently.
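The two kinds of shift described above can be illustrated with a simple comparison of marginal prevalences between a development cohort and a temporal validation cohort. The sketch below is only a crude first check under stated assumptions: the column names, the helper functions, and all numbers are synthetic and do not come from the study database.

```python
def prevalence(values):
    """Fraction of positive (1) entries in a list of binary indicators."""
    return sum(values) / len(values)


def shift_report(development, temporal, outcome="death"):
    """Compare marginal prevalences between two cohorts.

    A change in the outcome column suggests prior probability shift, P(y);
    a change in any other column suggests covariate shift, P(x).
    """
    report = {}
    for column in development:
        kind = "prior probability shift" if column == outcome else "covariate shift"
        delta = prevalence(temporal[column]) - prevalence(development[column])
        report[column] = (kind, round(delta, 3))
    return report


# Synthetic cohorts of 100 patients each (illustrative numbers only).
development = {
    "death": [1] * 25 + [0] * 75,
    "dyspnea": [1] * 40 + [0] * 60,
    "vaccinated": [0] * 100,
}
temporal = {
    "death": [1] * 15 + [0] * 85,
    "dyspnea": [1] * 30 + [0] * 70,
    "vaccinated": [1] * 60 + [0] * 40,
}

for column, (kind, delta) in shift_report(development, temporal).items():
    print(f"{column}: {kind}, change in prevalence = {delta:+.3f}")
```

Comparing marginals does not detect changes in the relationship between covariates and outcome, so a check of this kind complements, rather than replaces, evaluating the model on the temporal subset.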
It is worth noting that several algorithms proposed to predict mortality in patients with COVID-19 rely on invasive physiological and laboratory data, as in SAPS II and APACHE II [26], which consume a large amount of financial and human resources. Applying these scores can be particularly challenging in remote regions with limited access to laboratory testing.
To improve the performance of the models without invasive data, we used hyperparameter search algorithms, although the literature suggests that such a strategy may not be necessary when using RF [27]. Of note, even without invasive data, the AUROC was not inferior to that of previously proposed algorithms.
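As an illustration of what a hyperparameter search for RF can look like, the following minimal sketch uses scikit-learn's RandomizedSearchCV with AUROC as the selection metric. It is not the configuration used in this study: the synthetic data, the search space, and the number of iterations are all assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced binary outcome standing in for mortality (assumption).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search space; the ranges are illustrative only.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 4, 8],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,            # number of sampled configurations
    scoring="roc_auc",   # select by cross-validated AUROC
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

# The best configuration is refit on the full training set by default.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("best parameters:", search.best_params_)
print("held-out AUROC:", round(test_auc, 3))
```

Keeping the test split outside the search avoids optimistic bias, since the cross-validated score of the selected configuration is itself subject to selection.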
Traditional models based either on scores built from summed cutoff points or on logistic regression may fail to reflect the non-linear relationships between the outcome and the predictors, since they are essentially linear. Furthermore, models such as logistic regression are less able to capture interaction effects than decision trees. RF has been reported to exhibit approximately 70% higher prediction performance than logistic regression [28], an advantage that is maintained in medical and biomedical datasets. ML models are more flexible but also more susceptible to overfitting, since they learn directly from the data, generating hyperplanes with high variance [29]. In contrast, regression models rely on assumptions and a priori knowledge, and therefore show less variance and less overfitting [30].
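The point about interaction effects can be made concrete with a toy example in which the outcome depends only on the combination of two binary findings (an exclusive-or pattern). The sketch below is an assumption-laden illustration on synthetic data, not an analysis of the study cohort; the feature names in the comments are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two synthetic binary features, e.g. "finding A" and "finding B" (hypothetical).
X = rng.integers(0, 2, size=(2000, 2))
# Outcome is 1 only when exactly one finding is present: a pure interaction.
y = X[:, 0] ^ X[:, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Additive logistic regression: no linear boundary classifies all four
# feature combinations correctly.
lr_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# Random forest: trees split on one feature and then on the other,
# so the interaction is recovered without being specified.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf_acc = rf.fit(X_train, y_train).score(X_test, y_test)


def with_interaction(features):
    """Append the product of the two columns as an explicit interaction term."""
    return np.column_stack([features, features[:, 0] * features[:, 1]])


# Logistic regression with the interaction term added by hand.
lr_int = LogisticRegression().fit(with_interaction(X_train), y_train)
lr_int_acc = lr_int.score(with_interaction(X_test), y_test)

print(f"logistic regression (additive): {lr_acc:.2f}")
print(f"random forest:                  {rf_acc:.2f}")
print(f"logistic regression + A*B term: {lr_int_acc:.2f}")
```

The third model shows that logistic regression can represent the pattern once the interaction is supplied explicitly; the practical difference is that tree-based models can find such terms from the data.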
Using artificial intelligence, we successfully explored a large Brazilian database, applying different validation processes. All ML models used exhibited excellent performance in predicting 90-day mortality, especially those using combined data (D/C and CF). These models performed consistently in internal and temporal validation, even in the presence of data shifts, which supports their reproducibility, fidelity, uniformity, and possible future clinical implementation. It should be underscored that the multicenter nature of the study increases its external validity, especially considering the high variability of COVID-19 between different populations.
One of the highlights of this study is the use of simple, easy-to-obtain, strictly clinical variables, without laboratory or imaging tests, which would entail a high cost. Also, to the best of our knowledge, no previous studies have attempted to predict mortality in COVID-19 HD patients using ML.
Despite these strengths and its original design, the study has some limitations, such as the need to preprocess the data and the need for a safe and stable internet connection to ensure the security of patients' data. Further large-scale external validation in different populations is warranted before clinical deployment.
In summary, our study is the first attempt to develop ML models to predict mortality in patients with COVID-19 on HD relying on demographics/comorbidities and clinical features. We used three different ML models (Random Forest, Support Vector Machine, and TabNet) and reported their performance. Despite the study limitations, and considering the impact of this ongoing pandemic, our findings are noteworthy and could help in the management of this harmful disease in HD patients worldwide. In the future, the proposed models could allow fast and effective screening of COVID-19 HD patients to guide appropriate interventions and improve their prognosis while reducing costs.