Patient data collection
COVID-19-positive patients admitted between March and April 2020 were recruited from Al Kuwait Hospital, Dubai, UAE. The study was approved by the Ministry of Health and Prevention (MOHAP) Research Ethics Committee (approval number MOHAP/DXB-REC/MMM/NO.44/2020). Adult patients (above 18 years) with COVID-19 confirmed by a positive polymerase chain reaction (PCR) test on a nasopharyngeal sample were enrolled. Complete current and past medical history, demographic data, and any history of recent travel or contact with a confirmed or suspected case were documented. The main presenting symptoms were recorded, including fever, cough, fatigue, anorexia, shortness of breath (SOB), sputum production, myalgia, headache, confusion, rhinorrhea, sore throat, hemoptysis, vomiting, diarrhea, nausea, anosmia, and ageusia. Risk factors for severe illness were examined, including old age, cardiovascular disease (CVD), diabetes mellitus (DM), hypertension (HTN), prior stroke and/or transient ischemic attack, cancer, chronic lung disease, and chronic kidney disease (CKD).
Patient classification
Patients were classified according to the "Clinical Management of Critically Ill COVID-19 Patients" guidelines (Version 1, April 15, 2020) issued by MOHAP [6]. Accordingly, patients were classified as having mild illness, pneumonia, or severe pneumonia (fever or suspected respiratory infection, plus one of the following: respiratory rate > 30 breaths/min, severe respiratory distress, or SpO2 ≤ 93% on room air). Severe cases that need oxygen therapy but do not respond to titrated oxygen therapy require ICU treatment.
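The severe pneumonia criteria quoted above can be written as a simple rule. The following is a minimal sketch, not the study's actual procedure: the `Vitals` record and its field names are hypothetical, and the definitions of mild illness and pneumonia are not given in the text, so only the severe pneumonia check is shown.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    """Hypothetical admission record; field names are illustrative only."""
    fever_or_suspected_infection: bool
    respiratory_rate: float           # breaths/min
    severe_respiratory_distress: bool
    spo2_room_air: float              # percent, measured on room air


def meets_severe_pneumonia_criteria(v: Vitals) -> bool:
    """Fever or suspected respiratory infection, plus at least one of:
    respiratory rate > 30 breaths/min, severe respiratory distress,
    or SpO2 <= 93% on room air."""
    if not v.fever_or_suspected_infection:
        return False
    return (
        v.respiratory_rate > 30
        or v.severe_respiratory_distress
        or v.spo2_room_air <= 93
    )
```

Note that the thresholds are asymmetric: a respiratory rate of exactly 30 does not meet the criterion, whereas an SpO2 of exactly 93% does.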
Blood and Radiological tests
The following laboratory tests were retrieved (NR: normal range): (1) complete blood count, including neutrophil count (NR: 2-7 ×10³/mcL), lymphocyte count (NR: 1-3 ×10³/mcL), hemoglobin, Hb (NR: 12-15 g/dL), white cell count, WCC (NR: 4-11 ×10³/mcL), and platelet count (NR: 150-450 ×10³/mcL); (2) coagulation profile, including international normalized ratio, INR (NR: 0.8-1.29) and prothrombin time, PT (NR: 9.9-12.3 seconds); (3) electrolytes, including sodium, Na (NR: 136-145 mmol/L) and potassium, K (NR: 3.6-5.1 mmol/L); (4) renal function tests, including urea (NR: 2.5-6.5 mmol/L), creatinine (NR: 53-88 umol/L), and estimated glomerular filtration rate, eGFR (NR: 90-120 mL/min/1.73 m²); (5) liver function tests, including total serum bilirubin (NR: 3-17 umol/L), alanine aminotransferase, ALT (NR: 16-63 IU/L), aspartate aminotransferase, AST (NR: 15-37 U/L), alkaline phosphatase, ALP (NR: 46-116 IU/L), and albumin (NR: 34-50 g/L); and (6) inflammatory markers, including C-reactive protein, CRP (NR: 0-3 mg/L), D-dimer (NR: mg/dL), lactate dehydrogenase, LDH (NR: 85-227 IU/L), procalcitonin (NR: ug/L), and ferritin (NR: 8-388 mcg/L).
To assess the risk of severe disease, lymphopenia, neutrophilia, and ALT/AST, LDH, CRP, ferritin, D-dimer, and procalcitonin levels above the age- and gender-matched reference values were used as risk indicators. Findings on the admission chest X-ray (presence of bilateral air-space consolidation) and computed tomography (CT) scan (presence of bilateral peripheral ground-glass opacities) were documented.
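As an illustration of how such indicators might be derived from raw values, the sketch below flags results that fall outside the normal ranges listed above. It is a simplification under stated assumptions: it uses the single normal range quoted for each test rather than age- and gender-matched references, it omits D-dimer and procalcitonin because their ranges are not given in the text, and the function and key names are hypothetical.

```python
# Limits copied from the normal ranges quoted in the text.
# Assumption: one fixed range per test, not age- and gender-matched references.
LOWER_LIMITS = {"lymphocytes": 1.0}        # x10^3/mcL; below this = lymphopenia
UPPER_LIMITS = {
    "neutrophils": 7.0,                    # x10^3/mcL; above this = neutrophilia
    "alt": 63.0,                           # IU/L
    "ast": 37.0,                           # U/L
    "ldh": 227.0,                          # IU/L
    "crp": 3.0,                            # mg/L
    "ferritin": 388.0,                     # mcg/L
}


def risk_flags(labs: dict) -> dict:
    """Return a boolean flag for each available test in `labs`.

    Tests that are absent from `labs` produce no flag at all, so a
    missing value is never silently treated as normal.
    """
    flags = {}
    for name, limit in LOWER_LIMITS.items():
        if name in labs:
            flags[f"low_{name}"] = labs[name] < limit
    for name, limit in UPPER_LIMITS.items():
        if name in labs:
            flags[f"high_{name}"] = labs[name] > limit
    return flags
```

For example, `risk_flags({"lymphocytes": 0.7, "crp": 45.0})` marks both lymphopenia and raised CRP, and reports nothing for the tests that were not supplied.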
Statistical Analysis
Of the 70 input predictors, 7 (10%) had missing data (diarrhea, severity (critical or not), chest X-ray, prothrombin time, LDH, INR, and ferritin levels). Five of these variables had missing data percentages ranging from 16% to 34%. Records with missing input in these five variables were excluded (20 records), and the remaining 120 records were processed further. Of the 70 variables, those with more than 10% missing data were excluded from prediction model generation, leaving 65 variables. The remaining variables with missing inputs were tested for whether the data were missing completely at random (MCAR) using the expectation-maximization (EM) method in SPSS statistical software, version 16 (SPSS, Inc., Chicago, IL, USA). Data were considered MCAR because the significance value was higher than 0.05. Missing data were replaced with the estimated mean for each variable.
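The variable filtering and mean replacement steps described above can be sketched as follows. This is an illustration only: the study performed these steps in SPSS, the helper names are hypothetical, records are assumed to be dictionaries with `None` marking a missing value, and the plain observed mean is used here in place of the EM-estimated mean. The MCAR test itself is not reproduced.

```python
from statistics import mean


def missing_fraction(records, variable):
    """Fraction of records in which `variable` is missing (None or absent)."""
    missing = sum(1 for r in records if r.get(variable) is None)
    return missing / len(records)


def keep_variables(records, variables, max_missing=0.10):
    """Keep variables whose missing fraction does not exceed `max_missing`."""
    return [v for v in variables if missing_fraction(records, v) <= max_missing]


def impute_with_mean(records, variables):
    """Return copies of the records with missing values replaced by the
    observed mean of each variable. Assumption: the simple observed mean
    stands in for the EM-estimated mean used in the study."""
    filled = [dict(r) for r in records]      # leave the input untouched
    for v in variables:
        observed = [r[v] for r in records if r.get(v) is not None]
        if not observed:
            continue                         # nothing to estimate a mean from
        fill_value = mean(observed)
        for r in filled:
            if r.get(v) is None:
                r[v] = fill_value
    return filled
```

A typical use would be to call `keep_variables` first and pass its result to `impute_with_mean`, so that only the retained variables are filled in.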
Feature selection
Ensemble Feature Selection (EFS), an ensemble feature selection tool available as an R package, was used to identify the relative importance of each variable. It incorporates eight feature selection methods for binary classification [7]. Features with a cumulative score of more than 50% and a correlation of 0.7 with other features were selected, and their performance in comparison with all features was evaluated using receiver operating characteristic (ROC) analysis.
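To make the selection and evaluation steps concrete, the sketch below combines per-method importance scores into one ensemble score, keeps features above the 50% threshold, and computes the area under the ROC curve. It does not reproduce the EFS R package: the averaging scheme, the assumption that every method's score is already scaled to the range 0 to 1, and all function names are illustrative. The correlation criterion is left out because the text does not specify how it was applied.

```python
def ensemble_scores(per_method_scores):
    """Average each feature's importance across methods.

    Assumption: every method reports a score already scaled to [0, 1],
    so the average is also in [0, 1]."""
    return {f: sum(s) / len(s) for f, s in per_method_scores.items()}


def select_features(per_method_scores, threshold=0.5):
    """Keep features whose ensemble score is strictly above `threshold`."""
    scores = ensemble_scores(per_method_scores)
    return sorted(f for f, s in scores.items() if s > threshold)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison: the probability
    that a random positive case scores higher than a random negative
    case, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Comparing `roc_auc` for a model built on the selected features against one built on all features mirrors the comparison described above. The pairwise formulation is quadratic in the number of cases, which is acceptable for a cohort of this size.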