Prediction of COVID-19 Severity and Mortality in Hospitalized Children Using Machine Learning Tree-based Classifiers

doi:10.21203/rs.3.rs-4926945/v1


Research Article


https://doi.org/10.21203/rs.3.rs-4926945/v1

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Background

Children account for a large percentage of Coronavirus Disease 2019 (COVID-19) hospital admissions, but little is known about the features that predict disease severity or mortality in pediatric patients. Logistic regression, support vector machine, and ensemble machine learning algorithms were used to develop predictive models and identify prognostic factors for severity and mortality of COVID-19 in hospitalized children.

Methods

A total of 183 children under the age of 18 years with COVID-19 who were hospitalized in a referral hospital in Yazd province, Iran, from March 1, 2020 to August 1, 2021 were considered for this study. Logistic regression and machine learning classifiers, including support vector machine, decision tree, random forest, bagging classifier trees, gradient boosted decision trees, and adaptive boost classifier trees, were employed to predict non-severe versus severe or critical COVID-19 and death during hospitalization. The performance of each model was assessed through five-fold cross-validation, using evaluation metrics and the area under the curve. In addition, the best predictive models were used to identify significant factors distinguishing severe from non-severe groups, and survivors from non-survivors.

Results

Seven predictive models were developed using the medical files of 183 hospitalized children, comprising 94 (51.4%) non-severe and 89 (48.6%) severe cases, as well as 159 survivors and 24 (13%) non-survivors. For severity prediction on the balanced data, the decision tree and random forest algorithms had the highest accuracy, 73.3% and 68.7%, respectively. Based on the decision tree, respiratory distress and cough at the time of admission could be regarded as the key factors for estimating the likelihood of severe status. Gradient boosted decision trees and adaptive boost classifier trees performed best for mortality prediction on the balanced data, with accuracies of 88.8% and 87.7%, respectively. Cough at the time of admission, age of 1–13 years, and abnormal WBC could be considered predictive factors for death.

Conclusions

This study indicated that tree-based classifiers were the best machine learning approaches for predicting severity status and mortality in hospitalized children with COVID-19. Clinical symptoms at the time of admission were identified as the most predictive features by the optimal algorithms.

Machine Learning

COVID-19

Tree-based Classifiers

Ensemble Classifiers

Decision Tree

Random Forest

In December 2019, a new illness known as Coronavirus Disease 2019 (COVID-19) emerged in Wuhan, China, capturing global attention. The involvement of various organs, the diversity of symptoms, specialists' unfamiliarity with the behavior of the virus (especially at the beginning of the pandemic), the lack of highly effective vaccines in the early days of the outbreak, the absence of standard treatment protocols, and the rapid spread of the virus across the world all contributed to the rapid transformation of the disease into a pandemic. Although the prevalence of infection has been considerably lower in children compared to adolescents, the severity of disease has been higher in pediatric patients, and it has even resulted in fatalities.

The most frequent symptoms of COVID-19 reported in children are fever, cough, vomiting, diarrhea, sore throat, and dyspnea. Moreover, low oxygen saturation and elevated D-dimer levels could be considered the most common laboratory findings (1). Infants under the age of 1 year, children presenting with dyspnea, and those with underlying diseases have been categorized as high-risk groups for hospitalization, ICU admission, and potential mortality (2–5).

The inability to perform expensive diagnostic methods in many countries, particularly in the developing world, underscores the importance of alternative approaches such as machine learning techniques. These methods can predict the severity and prognosis of disease using basic characteristics, simple clinical symptoms, and laboratory results. Extensive research has been conducted on the application of machine learning (ML) methods for predicting prognosis and managing various illnesses. Furthermore, numerous studies have explored the use of ML algorithms for diagnosing COVID-19 infection and predicting the severity of the disease. ML and deep learning algorithms have demonstrated the ability to diagnose COVID-19 accurately, suggesting that these approaches could serve as valuable complements to existing diagnostic methods, including Immunoglobulin M (IgM), Immunoglobulin G (IgG), radiographic findings (CT-scan and chest X-ray), and reverse transcription-polymerase chain reaction (RT-PCR) (2). The application of computational techniques, particularly those trained on extensive clinical information, holds significant promise for COVID-19 diagnostic models. Such models could help alleviate the challenges posed by current limitations in testing, thereby allowing for more effective management and control of the disease (6). Additionally, ML algorithms might hold valuable clinical implications for early detection of children infected with COVID-19 (7). Although there are various predictive techniques for the severity of SARS-CoV-2 infection in adults, only a minority were created with data from children. Therefore, it is vital to develop a specific ML algorithm for severity prediction in children. In one study, a nomogram was employed to classify children by the likelihood of moderate/severe disease using nine independent risk factors (8).

This study aimed to assess tree-based ML algorithms and logistic regression for distinguishing severe from non-severe patients, and survivors from non-survivors, among children with confirmed COVID-19 enrolled at Shahid Sadoughi hospital. Furthermore, k-fold cross-validation with performance indices was used to evaluate the ML algorithms for predicting severity and mortality in children with COVID-19.

Design and setting of the study

All children aged 1 month to 18 years who were admitted to Shahid Sadoughi referral hospital in Yazd province between March 1, 2020 and August 1, 2021 were included in this study. Ultimately, information from a total of 183 cases was incorporated into the study. Demographics, clinical characteristics and laboratory findings at the time of admission, and the drugs used were collected for all cases. However, not all of these variables were suitable for inclusion in the present study.

In this research, COVID-19 diagnosis was established through a laboratory test (SARS-CoV-2-positive swab by RT-PCR), a clinical assessment based on defined symptoms as determined by clinicians, or CT-scan radiographic findings. The severity of disease served as one of the outcomes in this study and was defined, following WHO recommendations, in three categories: non-severe (absence of signs of severe or critical disease), severe (oxygen saturation (O2sat) lower than 94% on room air, tachypnea, and signs of severe respiratory distress), and critical (requiring life-sustaining treatment, acute respiratory distress syndrome, sepsis, and septic shock). Death during hospitalization was the other outcome and was recorded for thirteen percent (n = 24) of patients. In total, 89 (48.6%) children were severe or critical patients and 94 (51.4%) were recorded as non-severe.

Data acquisition, demographics, and clinical characteristics

Demographic attributes included sex, age, underlying diseases, hospitalization history, parental relationship, insurance status, and length of hospital stay, while clinical symptoms at the time of admission comprised fever, cough, respiratory distress, diarrhea, vomiting, and O2sat. Laboratory findings at the time of admission encompassed PCR test, C-reactive protein (CRP), white blood cell count (WBC), platelet count (PLT), serum levels of sodium (Na), potassium (K), and calcium (Ca), alanine transaminase (ALT), aspartate aminotransferase (AST), erythrocyte sedimentation rate (ESR), lactate dehydrogenase (LDH), and the presence of lymphopenia. Radiographic findings comprised lung CT-scan findings and chest X-ray (CXR) scoring.

This study was conducted under the Helsinki Declaration (2013) guidelines. The ethics committee of the Shahid Sadoughi University of Medical Sciences and Health Services of Yazd approved the research (IR.SSU.REC.1399.312).

Pre-processing and feature extraction

To optimize the development of the predictive models, pre-processing techniques were applied to the data set, which reduced its dimensions and prepared it for model building. Attributes with more than 30% missing values were not considered in the algorithms. Therefore, parental relationship and insurance status (demographic features), and ALT, AST, ESR, LDH values, and CRP categories (laboratory findings) were not considered in model specification. To handle missing data, model-based imputation methods were employed to fill the missing values in the remaining features. To improve the performance of each ML algorithm, age was categorized into four groups (1–12 months, 1–5 years, 5–13 years, and 13–18 years), the length of hospital stay was classified as 1 week or less versus more than 1 week, and laboratory tests were categorized as normal or abnormal based on medical references.
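The two pre-processing rules described above (dropping attributes with more than 30% missing values and binning age into four groups) can be sketched as follows. This is a minimal illustration, not the authors' code: the column names (`age_years`, `wbc`, `ldh`), the toy values, and the choice of which side of each age boundary is closed are all assumptions.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    """Drop sparse columns and bin age into the four groups named in the text."""
    # Keep only columns whose share of missing values does not exceed the threshold.
    keep = df.columns[df.isna().mean() <= max_missing]
    out = df[keep].copy()
    # Age is assumed to be recorded in years; boundary handling is an assumption.
    if "age_years" in out.columns:
        bins = [0, 1, 5, 13, 18]
        labels = ["1-12 months", "1-5 years", "5-13 years", "13-18 years"]
        out["age_group"] = pd.cut(out["age_years"], bins=bins,
                                  labels=labels, include_lowest=True)
    return out

# Toy data for illustration only (hypothetical, not the study data).
toy = pd.DataFrame({
    "age_years": [0.5, 3.0, 9.0, 16.0, 7.0],
    "wbc": [7000, np.nan, 16000, 4000, 9000],   # 20% missing: kept
    "ldh": [np.nan, np.nan, 450, np.nan, 600],  # 60% missing: dropped
})
clean = preprocess(toy)
print(clean.columns.tolist())
```

The model-based imputation step is not shown, because the text does not say which imputation model was used.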

For prediction of severity in hospitalized children, the demographic characteristics, clinical symptoms, and laboratory findings at the time of admission were considered. Consequently, each ML algorithm took into account 18 attributes with the aim of determining the most effective method for predicting severity. However, for forecasting mortality among hospitalized children with COVID-19, all features, including radiographic findings (CT-scan and CXR), were incorporated into the ML algorithms to enhance accuracy. Furthermore, the synthetic minority oversampling technique (SMOTE) was applied to the dataset to handle the imbalanced distribution of the mortality outcome, with a 13% death rate compared to an 87% survival rate. For severity prediction, SMOTE was also used to generate synthetic samples from the minority class, enabling a comparison of the predictive indices with those from the original data set. Subsequently, the various ML algorithms were applied to both the original and balanced datasets, and the outcomes were compared based on the model performance indices. Notably, the predictive models incorporated 183 samples, with severity status and death occurrence as two binary targets. Figure 1 shows the step-by-step data refinement and model selection procedure.
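The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the function name `smote_sketch`, the choice of k = 5, and the random toy features are hypothetical, and the text does not state which SMOTE implementation or variant the authors used (plain interpolation also produces fractional values for categorical features).

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)                   # pick a minority sample
        j = neighbours[i, rng.integers(k)]    # pick one of its k nearest neighbours
        gap = rng.random()                    # position on the segment from i to j
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Toy features mirroring the class sizes in the text: 24 deaths versus 159 survivors.
rng = np.random.default_rng(1)
X_deaths = rng.normal(size=(24, 4))
X_new = smote_sketch(X_deaths, n_new=159 - 24)
balanced_minority = np.vstack([X_deaths, X_new])
print(balanced_minority.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the oversampled class stays inside the range of the observed minority data.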

Model development

Supervised ML algorithms were built using a five-fold cross-validation approach to evaluate the effect of features on model performance. The original and balanced datasets were split into five randomly selected folds, with four folds used for training and the hold-out fold used for model testing. Logistic regression (LR), support vector machine (SVM), and decision tree (DT) as classic ML classifiers, and random forest (RF), bagging classifier trees (BC), gradient boosted decision trees (GBDT), and adaptive boost classifier trees (AdaBoost) as ensemble classifiers, used 70% of the data for training and 30% for testing in each iteration. All of these algorithms are supervised learning methods whose goal was classification. The validation data were used to assess the models' performance, and the iterative procedure was repeated until satisfactory outcomes were achieved. A brief overview of the ML classification techniques employed in this study is provided in the following sections.
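A five-fold cross-validation comparison of the seven classifiers might look like the sketch below. It uses scikit-learn with default hyperparameters on a synthetic stand-in data set; the stratified folds, the random seeds, and all hyperparameters are assumptions, since the text does not report the settings used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the severity data (183 x 18); not the study data.
X, y = make_classification(n_samples=183, n_features=18, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "BC": BaggingClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Stratified folds are an assumption; the text only says the folds were random.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

Each model is refitted on four folds and scored on the held-out fold, and the five scores are averaged, which matches the averaging of metrics described in the evaluation section.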

Logistic regression (LR)

LR is a supervised learning algorithm that estimates the probability of a binary dependent variable from continuous or discrete predictors using the sigmoid function. The LR model can calculate the probability that an individual experiences severe disease or death based on the input risk factors (9). The model is trained by estimating its parameters through the maximum likelihood technique. For a binary outcome, the logistic model is defined as follows:

$$\ln\left(\frac{p_{i}}{1-p_{i}}\right)=\sum_{j=0}^{k}\beta_{j}x_{j}\quad\text{or}\quad p_{i}=\frac{1}{1+e^{-\sum_{j=0}^{k}\beta_{j}x_{j}}}\qquad (1)$$

Here, $p_{i}$ denotes the probability that a sample belongs to a given category of the binary response variable (severity status or death occurrence). $p_{i}$ is commonly referred to as the "probability of success" and satisfies $0 \le p_{i} \le 1$, while $\beta$ represents the regression coefficients, k is the number of features, and $x_{1}, x_{2}, \dots, x_{k}$ represent the predictors (10).
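As a small worked example of the logistic model, the sketch below evaluates the predicted probability for a single binary predictor and converts a coefficient into an odds ratio via OR = exp(β). The intercept of -0.5 is hypothetical; the coefficient 1.11 is the value reported for respiratory distress in Table 3, used here only to show the arithmetic.

```python
import math

def predicted_probability(betas, x):
    """p = 1 / (1 + exp(-sum_j beta_j * x_j)), with x[0] = 1 for the intercept."""
    linear = sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-linear))

def odds_ratio(beta):
    """Odds ratio associated with a one-unit increase in a predictor."""
    return math.exp(beta)

# Hypothetical intercept (-0.5) and one binary predictor with beta = 1.11.
betas = [-0.5, 1.11]
p_absent = predicted_probability(betas, [1, 0])
p_present = predicted_probability(betas, [1, 1])
print(round(p_absent, 3), round(p_present, 3), round(odds_ratio(1.11), 2))
```

The ratio of the odds with and without the predictor equals exp(1.11), which is about 3.03 regardless of the intercept chosen.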

Decision Tree (DT)

DT is a basic supervised machine learning algorithm for classification. The classification DT model has a tree-like structure (nodes and branches) that is constructed from the features (11). The DT algorithm has several steps: developing trees with the input attributes as nodes; selecting the most important attribute for predicting the outcome; computing the information gained from each feature at each node of the tree (Gini index or entropy index) and choosing the highest; and repeating the previous steps to create subtrees from features not incorporated in previous nodes. The most widely used criteria for splitting and measuring impurity are the Gini index and the entropy index. For a categorical response with classes k = 1, 2, …, K, these indices are given by

$$I_{Gini}(\tau)=\sum_{k=1}^{K}\widehat{p}_{k\tau}\left(1-\widehat{p}_{k\tau}\right)\qquad (2)$$

$$\text{entropy}=-\sum_{k=1}^{K}\widehat{p}_{k\tau}\times\log\left(\widehat{p}_{k\tau}\right)\qquad (3)$$

with $\widehat{p}_{k\tau}$ defined as the proportion of observations from class k in node τ. Since $0\le\widehat{p}_{k\tau}\le 1$, it follows that $0\le-\widehat{p}_{k\tau}\log\widehat{p}_{k\tau}$ (12).
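The two impurity measures can be computed directly from the class labels in a node, as in the sketch below. The natural logarithm is an assumption (the text does not state the base), and the example node reuses the Table 1 counts for children with respiratory distress at admission (28 severe, 13 non-severe) purely as an illustration.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of observations from each class in a node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini(labels):
    """Gini index: sum over classes of p * (1 - p)."""
    return sum(p * (1 - p) for p in class_proportions(labels))

def entropy(labels):
    """Entropy index: minus the sum over classes of p * log(p), natural log assumed."""
    return -sum(p * math.log(p) for p in class_proportions(labels) if p > 0)

# Illustrative node: children with respiratory distress at admission (Table 1 counts).
node = ["severe"] * 28 + ["non-severe"] * 13
print(round(gini(node), 3), round(entropy(node), 3))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a split that lowers them is preferred.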

Over time, researchers have developed numerous DT algorithms, progressively improving their performance and their ability to handle diverse datasets.

Random Forest (RF)

RF is a prominent ensemble learning method that builds single tree models based on the CART algorithm (12). Leo Breiman introduced an algorithm that uses the bagging technique to generate multiple trees and then combines them into a robust ensemble method for prediction (13). In this method, a number of single DTs are grown on bootstrapped training samples, with a random sample of features drawn from the full feature set as split candidates. For categorical outcomes, the predicted class $\widehat{C}_{b}(x)$ of each decision tree in the ensemble is recorded, and the most frequently occurring class across all generated trees is selected by

$$\widehat{C}_{B}(x)=\text{majority vote}\left\{\widehat{C}_{b}(x)\right\}\qquad (4)$$

The instability of a single DT can be controlled by averaging (voting), while the growing process generates decorrelated trees that decrease the variance effectively (12). Consequently, RF often demonstrates a significant improvement in prediction accuracy compared to the CART algorithm.
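The bootstrap-plus-majority-vote mechanism in equation (4) can be sketched by hand with scikit-learn decision trees. This is an illustration on synthetic data, not the authors' configuration: the number of trees (B = 25), the `max_features="sqrt"` setting, and the seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the study data).
X, y = make_classification(n_samples=183, n_features=18, random_state=0)
rng = np.random.default_rng(0)

B = 25  # odd number of trees, so a binary vote cannot tie
trees = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample with replacement
    # A random subset of features is considered as split candidates at each node.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    trees.append(tree.fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])  # shape (B, n_samples)
# Majority vote for labels 0/1: class 1 wins when more than half of the trees vote for it.
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((forest_pred == y).mean())
```

Dropping the `max_features` argument turns the same loop into plain bagging of decision trees, the technique described in the next section.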

Bagging Classifier (Bootstrap Aggregating)

Bootstrap aggregating (bagging) is an ensemble technique that aims to enhance model stability and reduce the variance of a learning method. This approach combines the predictions of models fitted to multiple bootstrap samples (b subsets drawn with replacement from the training dataset), where resampling can lead to noticeable changes in the fitted predictor. The final prediction is computed by averaging (for regression) or voting (for classification) over the predictions of all individual models. Bagging achieves its best results when the ensemble members exhibit high variance while maintaining low bias. Additionally, the bagging approach is able to capture the diversity in the input data because of bootstrap sampling. Although this method can increase accuracy considerably, the features selected during the bagging procedure may not be interpretable (14).

Boosting Classifier

Boosting is a type of ensemble ML meta-algorithm capable of creating a strong classifier from weak learners to reduce the bias and variance of predictions. The key concept of boosting is to iteratively apply the base algorithm to altered versions of the input data (14). In particular, boosting proceeds in several stages: training a weak learner on the input data, calculating predictions from the weak learner, identifying misclassified training samples and training the remaining weak learners on the adjusted training dataset, and repeating the previous steps to obtain the weighted base learners (15). This learning approach is poorly suited to noisy datasets, because increasing emphasis is placed on the weights given to noisy samples. Nevertheless, boosting-based ensemble methods could be considered the most successful ML algorithms for prediction. Adaptive Boost classifier (AdaBoost) and eXtreme Gradient Boost (XGBoost) are two ML frameworks that use the concept of boosting to improve the performance and accuracy of predictive models. The AdaBoost classifier is a robust ML algorithm that employs decision trees with a single split as the base learners. The AdaBoost learning process includes training a base classifier (DT), adjusting the sample weights according to the classifier's predictions, and then training the subsequent classifier on the reweighted samples (14, 16).
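The reweighting step at the heart of AdaBoost can be illustrated with a single round of the classic discrete formulation for labels in {-1, +1}. The text does not say which AdaBoost variant was used, so the formulas below and the toy labels are assumptions chosen for clarity.

```python
import numpy as np

def adaboost_round(y_true, y_pred, weights):
    """One discrete AdaBoost round; assumes a weighted error strictly between 0 and 1."""
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)          # weighted error of the weak learner
    alpha = 0.5 * np.log((1 - err) / err)                  # say of this weak learner in the final vote
    new_w = weights * np.exp(-alpha * y_true * y_pred)     # up-weight mistakes, down-weight hits
    return alpha, new_w / new_w.sum()

# Toy labels: the weak learner misclassifies only the sample at index 1.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])
w0 = np.full(5, 0.2)
alpha, w1 = adaboost_round(y_true, y_pred, w0)
print(alpha, w1)
```

After one round the misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly. This also shows why label noise is a concern: a mislabelled sample keeps gaining weight.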

Gradient Boosted Decision Tree (GBDT)

Gradient boosting is a specific implementation of the boosting technique that produces a robust ensemble classifier. In this approach, decision trees are mainly used as the base learners, so it is referred to as gradient boosted decision tree (GBDT). The main idea of this algorithm is to train each base learner (DT) to be highly correlated with the negative gradient of the loss function of the entire ensemble, in order to create a robust classifier (14). The greatest strength of GBDT is that, like other boosting algorithms, it can identify complex patterns in the input data because it is designed to correct the inaccuracies of the initial model during training. However, a noisy input dataset can reduce the accuracy of the boosted model and potentially lead to overfitting. This technique is ideally suited for situations involving limited or small datasets (17).
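Fitting each new tree to the negative gradient of the loss can be sketched as below for the binary log loss, where the negative gradient with respect to the raw score is simply y minus the predicted probability. The learning rate, tree depth, number of rounds, and synthetic data are assumptions, and this simplified loop omits the leaf-value refinements used by full GBDT implementations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, raw):
    p = np.clip(sigmoid(raw), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic stand-in data (not the study data).
X, y = make_classification(n_samples=183, n_features=18, random_state=0)

raw = np.zeros(len(y))        # initial raw score (log-odds) for every sample
learning_rate = 0.1
losses = [log_loss(y, raw)]
for _ in range(20):
    residual = y - sigmoid(raw)   # negative gradient of the log loss w.r.t. the raw score
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    raw += learning_rate * tree.predict(X)   # small step in the direction the tree suggests
    losses.append(log_loss(y, raw))

print(round(losses[0], 4), round(losses[-1], 4))
```

Each round corrects part of what the current ensemble still gets wrong, so the training loss falls as trees are added.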

Support Vector Machine (SVM)

SVM is a commonly employed supervised ML technique, mostly used for classification. The key concept of this approach is to set a hyperplane in a high-dimensional space that separates the classes. The steps of the SVM algorithm are selecting the hyperplane that best divides the classes, finding the margin (the distance between the hyperplane and the nearest data points), and selecting the hyperplane with the largest margin (18). The SVM identifies the best separating hyperplane by maximizing the margin between two decision boundaries. Mathematically, the distance between hyperplane (5) and hyperplane (6) is defined as $D=\frac{2}{\lVert w\rVert}$, where the two hyperplanes are

$$w^{T}x+b=-1\qquad (5)$$

$$w^{T}x+b=1\qquad (6)$$
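The margin formula can be checked numerically by placing one point on each boundary hyperplane and measuring the distance between them. The weight vector and bias below are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import numpy as np

def margin_width(w):
    """Distance between the hyperplanes w.x + b = -1 and w.x + b = +1."""
    return 2.0 / np.linalg.norm(w)

# Hypothetical weight vector and bias.
w = np.array([3.0, 4.0])
b = -2.0

# Points on each boundary, taken along the direction of w.
x_plus = w * (1 - b) / np.dot(w, w)     # satisfies w.x + b = +1
x_minus = w * (-1 - b) / np.dot(w, w)   # satisfies w.x + b = -1
print(margin_width(w), np.linalg.norm(x_plus - x_minus))
```

With a norm of 5 for w, both computations give a margin of 0.4, so a smaller norm of w corresponds to a wider margin.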

Data Partitioning and Performance evaluation

The ML algorithms outlined above (LR, SVM, DT, RF, BC, AdaBoost, and GBDT) were employed to predict severity status and mortality in hospitalized children with COVID-19. The input data were divided into training and testing sets in a 70:30 ratio. Subsequently, for both the original and balanced datasets, the performance metrics were computed using 5-fold cross-validation and the ROC curve. The area under the curve (AUC), accuracy, F1 score, precision, sensitivity, specificity, and negative predictive value (NPV) were reported, all obtained by averaging the metrics calculated from the individual cross-validation runs. The ML algorithms were implemented in Python, version 3.9 (Python Software Foundation, Python Language Reference; available at https://www.python.org/) (19). The NumPy, pandas, and Scikit-learn libraries were used for fitting the ML algorithms, and a cross-validation library was applied for evaluating model performance. Moreover, the Matplotlib and seaborn libraries were employed for visualization of the results.
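All of the reported threshold-based metrics follow from the four cells of a confusion matrix, as the sketch below shows. The counts in the example are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics derived from true/false positives and negatives."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),            # negative predictive value
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical confusion-matrix counts for one held-out fold.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a cross-validated setting these values would be computed once per fold and then averaged.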

Demographics and baseline statistical analysis

A total of 183 registered children with COVID-19 were included in this epidemiologic study, of whom 89 (48.63%) experienced severe or critical status. The mean ± standard deviation (SD) age of patients was 6.95 ± 5.59 years (range 1 month to 18 years); male patients accounted for 50.3% (92 cases) and female patients for 49.7% (91 cases). At the time of admission, 119 (68.8%) patients had high body temperature (> 38 °C), 42 (31.3%) had low oxygen saturation (< 94%), 41 (23.7%) had respiratory distress, and 81 (44.3%) patients had at least one comorbidity. Of all patients, 24 (13%) died, the largest share of whom were aged 1–5 years. Demographics, clinical symptoms at the time of admission, and laboratory findings are provided in Tables 1 and 2.

Table 1

Characteristics, Clinical symptoms, and laboratory findings of severe and non-severe children with COVID-19 in referral center in Yazd
Variable	Total (N = 183)	Non-Severe (N = 94)	Severe (N = 89)	P value^*
Age				0.777
1 month- 1year	28 (15.3)	16 (17)	12 (13.5)
1 year- 5 years	59 (32.2)	31 (33)	28 (31.5)
5 years- 13 years	60 (32.8)	31 (33)	29 (32.6)
13 years- 18 years	36 (19.7)	16 (17)	20 (22.5)
Sex				0.826
male	92 (50.3)	46 (49)	45 (50.6)
female	91 (49.7)	48 (51)	44 (49.4)
History of hospitalization				0.017
yes	49 (26.8)	18 (19.1)	31 (34.8)
no	134 (73.2)	76 (80.9)	58 (65.2)
Underlying disease				0.857
yes	81 (44.3)	41 (43.6)	40 (44.9)
no	102 (55.7)	53 (56.4)	49 (55.1)
Parental relationship				0.031
yes	57 (35.2)	23 (27.4)	34 (43.6)
no	105 (64.8)	61 (72.6)	44 (56.4)
Insurance				0.907
yes	133 (80.1)	67 (79.8)	66 (80.5)
no	33 (19.9)	17 (20.2)	16 (19.5)
Fever- on admission				0.001
yes	119 (68.8)	50 (57.5)	69 (80.2)
no	54 (31.2)	37 (42.5)	17 (19.8)
Cough- on admission				0.134
yes	59 (34.1)	25 (28.7)	34 (39.5)
No	114 (65.9)	62 (71.3)	52 (60.5)
Respiratory distress- on admission				0.006
yes	41 (23.7)	13 (14.9)	28 (32.6)
no	132 (76.3)	74 (85.1)	58 (67.4)
Diarrhea- on admission				0.03
yes	47 (27.2)	30 (34.5)	17 (19.8)
no	126 (72.8)	57 (65.5)	69 (80.2)
Vomiting- on admission				0.409
yes	46 (26.6)	21 (23.9)	25 (29.4)
no	127 (73.4)	67 (76.1)	60 (70.6)
O2 saturation				< 0.001
< 94%	42 (31.3)	1 (1.5)	41 (60.3)
>=94%	92 (68.7)	65 (98.5)	27 (39.7)
PCR test				0.064
positive	128 (69.9)	60 (63.8)	68 (76.4)
negative	55 (30.1)	34 (36.2)	21 (23.6)
CRP				0.305
0 & 1⁺	32 (32.7)	12 (27.3)	20 (37)
2⁺ & 3⁺	66 (67.3)	32 (72.7)	34 (63)
WBC (×10³/µL)				0.683
5000–15000 (normal)	104 (60.8)	53 (62.4)	51 (59.3)
< 5000/ >15000	67 (39.2)	32 (37.6)	35 (40.7)
PLT (×10³/µL)				0.354
15000–45000 (normal)	113 (66.9)	54 (63.5)	59 (70.2)
< 15000/ >45000	56 (33.1)	31 (36.5)	25 (29.8)
Na (mEq/L)				0.997
135–145 (normal)	95 (60.5)	46 (60.5)	49 (60.5)
120–135/ >145	62 (39.5)	30 (39.5)	32 (39.5)
K (mEq/L)				0.33
3.5–5.5 (normal)	133 (88.1)	68 (90.7)	65 (85.5)
< 3.5/ >5.5	18 (11.9)	7 (9.3)	11 (14.5)
Ca (mg/dL)				0.007
8.8–10.8 (normal)	43 (49.5)	29 (63)	14 (34.1)
< 8.8/ >10.8	44 (50.5)	17 (37)	27 (65.9)
ALT (U/L)				0.632
8–35 (normal)	65 (59.6)	28 (57.1)	37 (61.7)
> 35	44 (40.4)	21 (42.9)	23 (38.3)
AST (U/L)				0.598
15–45 (normal)	63 (57.3)	30 (60)	33 (55)
> 45	47 (42.7)	20 (40)	27 (45)
ESR (mm/h)				0.049
<= 30 (normal)	49 (51.6)	28 (62.2)	21 (42)
>30	46 (48.4)	17 (37.8)	29 (58)
LDH (U/L)				0.193
<=500 (normal)	45 (47.4)	24 (54.5)	21 (41.2)
>500	50 (52.6)	20 (45.5)	30 (58.8)
Lymphopenia				0.192
yes	105 (62.9)	55 (67.9)	50 (58.1)
no	62 (37.1)	26 (32.1)	36 (41.9)
CT findings				0.001
normal	19 (23.2)	12 (44.4)	7 (12.7)
abnormal	63 (76.8)	15 (55.6)	48 (87.3)
CXR score				< 0.001
normal	115 (62.8)	73 (77.7)	42 (47.2)
abnormal	68 (37.2)	21 (22.3)	47 (52.8)
Hospital stays (days)				< 0.001
<=7	141 (77)	84 (89.4)	57 (64)
> 7	42 (23)	10 (10.6)	32 (36)
Data presented as frequency (percent). *Chi-square test was used to compare the frequencies. Significance level is 0.05
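The chi-square comparison named in the table footnote can be reproduced for any 2 × 2 row of the table. The sketch below uses the Pearson statistic without continuity correction, which is an assumption because the text does not say whether a correction was applied; the example counts are the "History of hospitalization" row of Table 1.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# History of hospitalization: yes (18 non-severe, 31 severe), no (76 non-severe, 58 severe).
stat, p = chi_square_2x2(18, 31, 76, 58)
print(round(stat, 2), round(p, 3))
```

For this row the uncorrected test gives a P value close to the 0.017 reported in the table.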

Table 2

Characteristics, clinical symptoms, and laboratory findings of survivors and non-survivors of children with COVID-19 in referral center in Yazd
Variable	Total (N = 183)	Survivor (N = 159)	Non-Survivor (N = 24)	P value^*
Age				0.231
1 month- 1year	28 (15.3)	22 (13.8)	6 (25)
1 year- 5 years	59 (32.2)	51 (32)	8 (33.3)
5 years- 13 years	60 (32.8)	56 (35.2)	4 (16.7)
13 years- 18 years	36 (19.7)	30 (18.9)	6 (25)
Sex				0.682
male	92 (50.3)	79 (49.7)	13 (54.2)
female	91 (49.7)	80 (50.3)	11 (45.8)
History of hospitalization				0.006
yes	49 (26.8)	37 (23.3)	12 (50)
no	134 (73.2)	122 (76.7)	12 (50)
Underlying disease				0.137
yes	81 (44.3)	67 (42)	14 (58.3)
no	102 (55.7)	92 (58)	10 (41.7)
Parental relationship				0.430
yes	57 (35.2)	48 (34.0)	9 (42.9)
no	105 (64.8)	93 (66)	12 (57.1)
Insurance				0.830
yes	133 (80.1)	115 (79.9)	18 (81.8)
no	33 (19.9)	29 (20.1)	4 (18.2)
Fever- on admission				0.158
yes	119 (68.8)	101 (66.9)	18 (81.8)
no	54 (31.2)	50 (33.1)	4 (18.2)
Cough- on admission				0.092
yes	59 (34.1)	55 (36.4)	4 (18.2)
No	114 (65.9)	96 (63.6)	18 (81.8)
Respiratory distress- on admission				0.042
yes	41 (23.7)	32 (21.2)	9 (40.9)
no	132 (76.3)	119 (78.8)	13 (59.1)
Diarrhea- on admission				0.60
yes	47 (27.2)	40 (26.5)	7 (31.8)
no	126 (72.8)	111 (73.5)	15 (68.2)
Vomiting- on admission				0.104
yes	46 (26.6)	37 (24.5)	9 (40.9)
no	127 (73.4)	114 (75.5)	13 (59.1)
O2 saturation				0.028
< 94%	42 (31.3)	34 (28.3)	8 (57.1)
>=94%	92 (68.7)	86 (71.7)	6 (42.9)
PCR test				0.393
positive	128 (69.9)	113 (71.1)	15 (62.5)
negative	55 (30.1)	46 (28.9)	9 (37.5)
CRP				0.171
0 & 1⁺	32 (32.7)	26 (30.2)	6 (50)
2⁺ & 3⁺	66 (67.3)	60 (69.8)	6 (50)
WBC (×10³/µL)				0.642
5000–15000 (normal)	104 (60.8)	89 (60.1)	15 (65.2)
< 5000/ >15000	67 (39.2)	59 (39.9)	8 (34.8)
PLT (×10³/µL)				0.624
15000–45000 (normal)	113 (66.9)	98 (67.6)	15 (62.5)
< 15000/ >45000	56 (33.1)	47 (32.4)	9 (37.5)
Na (mEq/L)				0.413
135–145 (normal)	95 (60.5)	84 (61.8)	11 (52.4)
120–135/ >145	62 (39.5)	52 (38.2)	10 (47.6)
K (mEq/L)				0.776
3.5–5.5 (normal)	133 (88.1)	115 (87.8)	18 (90)
< 3.5/ >5.5	18 (11.9)	16 (12.2)	2 (10)
Ca (mg/dL)				0.066
8.8–10.8 (normal)	43 (49.5)	38 (54.3)	5 (29.4)
< 8.8/ >10.8	44 (50.5)	32 (45.7)	12 (70.6)
ALT (U/L)				0.765
8–35 (normal)	65 (59.6)	56 (60.2)	9 (56.3)
> 35	44 (40.4)	37 (39.8)	7 (43.8)
AST (U/L)				0.084
15–45 (normal)	63 (57.3)	57 (60.6)	6 (37.5)
> 45	47 (42.7)	37 (39.4)	10 (62.5)
ESR (mm/h)				0.678
<= 30 (normal)	49 (51.6)	42 (52.5)	7 (46.7)
>30	46 (48.4)	38 (47.5)	8 (53.3)
LDH (U/L)				0.672
<=500 (normal)	45 (47.4)	40 (48.2)	5 (41.7)
>500	50 (52.6)	43 (51.8)	7 (58.3)
Lymphopenia				0.07
yes	105 (62.9)	95 (65.5)	10 (45.5)
no	62 (37.1)	50 (34.5)	12 (54.5)
CT findings				0.031
normal	63 (76.8)	50 (72.5)	13 (100)
abnormal	19 (23.2)	19 (27.5)	0 (0)
CXR score				0.064
normal	115 (62.8)	104 (65.4)	11 (45.8)
abnormal	68 (37.2)	55 (34.6)	13 (54.2)
Hospital stays (days)				0.001
<=7	141 (77)	129 (81.1)	12 (50)
> 7	42 (23)	30 (18.9)	12 (50)
Data presented as frequency (percent). *Chi-square test was used to compare the frequencies. Significance level is 0.05


Results of ML methods

First, LR, as a common method, was used to estimate the effect of each feature for early prediction of severity and mortality in children with confirmed COVID-19. Table 3 presents the feature importance using regression coefficients, odds ratios (OR), and P values. Based on the results, respiratory distress at the time of admission and lymphopenia had the highest ORs for predicting severity and mortality, respectively, in children with COVID-19. Overall, clinical symptoms at the time of admission could be considered predictive features of severity status, while abnormal findings on CXR and lung CT-scan had a predictive effect on the odds of mortality among hospitalized children.

Table 3

Logistic Regression Classifier features in decreasing order of their importance for predicting the severity and mortality of children with COVID-19 in referral center in Yazd
Variables		Severe Patients			Variables		Non-Survivors
Variables		Coefficients	OR	P value	Variables		Coefficients	OR	P value
Respiratory Distress	yes	1.11	3.03	< 0.001	Lymphopenia	yes	1.2	3.32	< 0.001
Respiratory Distress	no	-			Lymphopenia	no	-
Vomiting	yes	0.75	2.45	0.004	Diarrhea	yes	-0.46	0.64	0.029
Vomiting	no	-			Diarrhea	No
Fever	yes	0.71	2.02	0.004	Respiratory Distress	yes	0.30	1.35	0.057
Fever	no	-			Respiratory Distress	no	-
Cough	yes	0.53	1.51	0.064	CXR Score	abnormal	0.26	1.29	0.065
Cough	no	-			CXR Score	normal	-
Na	abnormal	0.37	1.30	0.094	CT findings	abnormal	0.25	1.28	0.065
Na	normal				CT findings	normal	-
PCR	Positive	0.26	1.27	0.10	Underlying diseases	yes	0.24	1.27	0.10
PCR	Negative	-			Underlying diseases	no
Underlying diseases	yes	0.24	1.19	0.11	Fever	yes	0.1	1.1	0.13
Underlying diseases	no	-			Fever	no


The performance of LR and the ML algorithms, assessed using 5-fold CV, is reported in Table 4. The performance of all ML methods was satisfactory after applying the synthetic minority oversampling technique (SMOTE), which removed the imbalance in the dataset. Although the RF and SVM methods had similar AUCs for predicting severity in children, the highest sensitivity and accuracy (75.3% and 73.3%, respectively) were obtained by the DT algorithm. As shown in Table 4, balancing the dataset increased the AUC of DT markedly compared to the other methods (from 57% to 73.6%), indicating that this algorithm was better able to discriminate between severe and non-severe children. However, there was no remarkable difference in the performance of the RF algorithm before and after using SMOTE, and for some ML methods, such as GBDT and AdaBoost, the AUC decreased after balancing the data. The results in Table 4 suggest that the DT and RF techniques could be considered the most suitable ML algorithms for estimating severity in children with confirmed COVID-19, capturing the complicated and non-linear relationships between the attributes and the outcome.

The accuracy of the GBDT method for predicting mortality was 88.8%, the best performance among the algorithms. The AdaBoost and SVM models had an accuracy of 87.7% and could be regarded as the second-best algorithms for predicting mortality in children with COVID-19. Although the AUC of SVM was the highest, at 97.8% in the balanced data, the GBDT technique performed satisfactorily and ranked second with an AUC of 96%. Considering the performance indices of the DT and BC algorithms in Table 4, it was clear that they were not suitable methods for predicting mortality in children with COVID-19.

The diagnostic ability of the ML algorithms was assessed with ROC curves, separately for the severity and mortality outcomes (Fig. 2). The balanced dataset was used to estimate the AUC of each ML algorithm. Evaluating the methods with 5-fold CV gave results that differed slightly from those shown in Table 4. Given the 5-fold CV results, DT and RF could be proposed for predicting severity, and GBDT, SVM, and AdaBoost for estimating mortality in children with COVID-19. Overall, boosting ensemble methods performed better than the other ML models in predicting mortality in hospitalized children.
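As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and then averages it over folds. The function names and the two toy folds are hypothetical and do not come from the study data; how the study aggregated AUC across folds is not stated in the text.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auc(folds):
    """Average the per-fold AUC over a list of (labels, scores) pairs."""
    return sum(roc_auc(y, s) for y, s in folds) / len(folds)


# Hypothetical predicted probabilities for two validation folds.
fold_1 = ([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.7])
fold_2 = ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(roc_auc(*fold_1), roc_auc(*fold_2), mean_auc([fold_1, fold_2]))
```

Because the AUC depends only on how cases are ranked, a model can have the highest AUC without having the highest accuracy at a fixed decision threshold, as seen for SVM and GBDT in the mortality results.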

Figure 3 shows how the probability of severity and of death depended on multiple features of the children. As shown in Fig. 3.a, respiratory distress at the time of admission could be regarded as the most important attribute for predicting severity status in children with COVID-19. Furthermore, cough, a positive PCR test, and fever at the time of admission could be viewed as predictive features for severe and critical status in children. As shown in Fig. 3.b, cough at admission was the most important variable for mortality prediction. In addition, age of 1–13 years, abnormal WBC (< 5000 or > 15000 ×10³/µL), and CXR findings played key roles in predicting mortality.
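One way a tree-based model ranks such features is by how much a split on the feature reduces class impurity. The sketch below computes the information gain of a binary split using an entropy impurity. The group sizes of 94 non-severe and 89 severe children are taken from the study, but the way they are divided by each symptom is hypothetical, as are the function names and the use of base-2 logarithms; the importance measure behind Fig. 3 may differ.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy of the parent node minus the size-weighted entropy of its branches."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Class counts are [non-severe, severe]; the parent node holds all 183 children.
parent = [94, 89]
# Hypothetical splits: [counts with the symptom, counts without the symptom].
distress_split = [[20, 60], [74, 29]]
fever_split = [[50, 50], [44, 39]]

gain_distress = information_gain(parent, distress_split)
gain_fever = information_gain(parent, fever_split)
print(round(gain_distress, 3), round(gain_fever, 3))
```

Under these made-up counts, the respiratory distress split separates the two classes far better than the fever split, so a tree would choose it first and assign it a higher importance.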

Table 4

Performance criteria for the ML methods for predicting the severity and mortality of children with COVID-19 in the referral center in Yazd
Severity
Method	Dataset	AUC	Accuracy	F1 Score	Precision	Sensitivity	Specificity	NPV
LR	Original	60	63.6	61.9	64.2	64	69.5	59.2
LR	Balanced	62.5	64.2	62.5	66.5	62	70.8	60.7
DT	Original	57	72.7	72.6	73	74.6	58.6	62.9
DT	Balanced	73.6	73.3	73.4	74	75.3	59.2	57.1
RF	Original	67.9	65.4	63.9	66.7	63.3	70	51.8
RF	Balanced	68.7	68.7	67.3	68.4	67.9	68.1	53.6
BC	Original	60.1	61.1	59	61.3	59.6	65	48.1
BC	Balanced	63.9	61.1	54	65	46	65	46.4
GBDT	Original	64.6	58.9	51.2	58.5	54.6	70	51.8
GBDT	Balanced	61.9	49	35.8	53.3	52.6	61.9	46.4
AdaBoost	Original	65.3	61.8	61.4	63.2	63.3	59	48.1
AdaBoost	Balanced	58.3	60.4	61.2	60.5	64.6	65.2	53.5
SVM	Original	68.6	56.3	55.8	56.5	61.3	73.6	51.8
SVM	Balanced	67.6	58.9	58	58.1	62	70	51.8
Mortality
Method	Dataset	AUC	Accuracy	F1 Score	Precision	Sensitivity	Specificity	NPV
LR	Original	74	90	0.0	0.0	90.7	0.0	0.0
LR	Balanced	93.3	80.6	79.8	80.4	80.3	82.6	87.7
DT	Original	41.7	90	0.0	0.0	90.1	0.0	0.0
DT	Balanced	64	63.2	60.8	63	62.4	58.6	65.3
RF	Original	77.9	90	0.0	0.0	90.1	0.0	0.0
RF	Balanced	83.4	73.2	73.9	70.1	78.8	23	42.8
BC	Original	63.9	84.7	90.3	94	94	0.0	0.0
BC	Balanced	63.9	55	42	59	33	59.2	32.6
GBDT	Original	73	89	36.6	26.6	40	0.0	0.0
GBDT	Balanced	96	88.8	87.5	90.5	98	64.3	36.7
AdaBoost	Original	65	85.4	23.3	16.6	40	0.0	0.0
AdaBoost	Balanced	89.8	87.7	88.8	83.4	96	68	34.7
SVM	Original	82	90	0.0	0.0	90.9	0.0	0.0
SVM	Balanced	97.8	87.7	85.8	90.6	83.3	69.5	32.6
Indices presented in percent.
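The indices in Table 4 all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions and applies them to a deliberately trivial classifier that labels every child a survivor, using the 159 survivors and 24 non-survivors reported in this study, with death treated as the positive class. The function name and the convention of returning 0 when a denominator is empty are assumptions for illustration, not the study's code.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the Table 4 indices (in percent) from confusion-matrix counts."""
    def pct(num, den):
        # Assumed convention: an undefined ratio (empty denominator) is reported as 0.
        return 100.0 * num / den if den else 0.0

    precision = pct(tp, tp + fp)
    sensitivity = pct(tp, tp + fn)
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {
        "accuracy": pct(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": pct(tn, tn + fp),
        "npv": pct(tn, tn + fn),
        "f1": f1,
    }


# A classifier that never predicts death: no true or false positives.
always_survivor = classification_metrics(tp=0, fp=0, tn=159, fn=24)
print({name: round(value, 1) for name, value in always_survivor.items()})
```

On imbalanced data, such a classifier reaches an accuracy of about 87% while its F1 score is 0, which may help explain why accuracy alone is a poor guide for the Original mortality rows and why the balanced results are emphasized.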

In the present study, LR, DT, and SVM as classic supervised ML classifiers, and four ensemble classifiers (RF, BC, AdaBoost, and GBDT), were implemented to predict severity status and death in hospitalized children with COVID-19. The performance of the algorithms was measured by accuracy, AUC, F1 score, precision, sensitivity, specificity, and NPV. According to these indices, while all of the methods showed reasonable performance, the ensemble ML algorithms demonstrated greater predictive efficiency than the others. DT and RF had the highest accuracy for predicting severity status, and GBDT and AdaBoost had the best performance for mortality prediction. Based on the optimal predictive models, respiratory distress and cough at the time of admission could be considered the key factors for estimating the likelihood of severity status and death, respectively.

A number of ML algorithms have been proposed to predict outcomes in COVID-19 studies (7, 20–25). One study investigated the early detection of COVID-19 in 5664 children admitted to medical centers, based on laboratory findings and using supervised learning techniques. ANN, RF, SVM, DT, and GBDT were used to identify COVID-19, and a standard 10-fold cross-validation procedure was used to evaluate the performance of the five ML algorithms. The results revealed that classification and regression tree (CART) models had the highest accuracy (92.5%) for binary classes based on laboratory outcomes. Leukocytes, monocytes, potassium, and eosinophils were the most important features for predicting COVID-19 in admitted children (20). In line with that study, DT with the CART algorithm had the highest accuracy for predicting severity status in hospitalized children with COVID-19 in the current study; however, none of the laboratory findings were identified as important features for predicting critical or severe status in patients.

In 2020, Ma et al. predicted chest CT results among RT-PCR-positive pediatric patients aged 16 years or younger using a Bayesian optimization process. In that study, the researchers used the clinical symptoms and laboratory results of 102 children with normal and 142 children with abnormal CT findings to compare the performance of the suggested approach with regular techniques in learning models. Based on the presented results, the Bayesian optimization achieved an AUC of 0.84, with an accuracy of 0.82 and a sensitivity of 0.84 for predicting CT outcomes. Their results showed that age, lymphocyte, neutrophil, ferritin, and C-reactive protein were the para-clinical results most related to CT findings in pediatric patients with positive RT-PCR testing (21). Radiographic findings were omitted when predicting severity status in the present study, because an abnormal CT result could itself be considered a crucial predictor of severe and critical status in pediatric patients with COVID-19. Based on the results, clinical symptoms including respiratory distress, cough, and fever at the time of admission were identified as the most important features for predicting severity status in hospitalized children. Although the children’s ages were nearly the same in both studies, the non-parametric models used for predicting severity yielded considerably different results.

A cross-sectional study was conducted on 566 children in Serbia between 2020 and 2022. The research included 280 pediatric patients with PCR-confirmed COVID-19 and 286 children with respiratory symptoms and a negative PCR result. The researchers used six ML techniques (RF, SVM, linear discriminant analysis, ANN, K-nearest neighbors, and DT) to help healthcare providers detect children with COVID-19 during rapid triage. According to the performance indices, RF and SVM showed the highest accuracy, 85% and 82.1%, respectively, and the most prominent features for predicting COVID-19 at an early stage were mean platelet volume (MPV), WBC, mean corpuscular hemoglobin concentration (MCHC), platelet distribution width (PDW), and absolute lymphocyte count (LYM) (7). Although the outcomes of the present study differed, tree-based algorithms, especially DT and RF for severity status and GBDT and AdaBoost for mortality prediction, were likewise identified as the most effective ML techniques. However, the features recognized as predictive of the outcomes were dissimilar.

A study at a children's hospital in China between 2022 and 2023 included 544 hospitalized children with COVID-19, with 243 and 301 in the mild and severe categories, respectively. Potential attributes, including patient characteristics and medical information, were considered for the prediction algorithms, which included LR, RF, XGBoost, AdaBoost, categorical boosting (CatBoost), and light gradient boosting machine (LightGBM). The performance of each ML model was evaluated with 5-fold cross-validation (24). The study identified the RF + TomekLinks model as the best method, with an AUC of 82.1% when using the 10 most significant variables, which was compatible with the findings of the current study.

A systematic review in 2024 investigated the research methodologies, computational modeling strategies, and performance assessment standards used by studies employing ML techniques to establish clinical predictive models for children and adolescents infected with COVID-19. Ten studies published from 01/01/2020 to 10/25/2023 were included, and the most widely used ML methods were tree-based models, such as XGBoost, DT, and CatBoost, and artificial neural networks (ANNs), such as the multilayer perceptron (25). The review showed that ML models could potentially yield accurate clinical predictive models that improve patient care by identifying high-risk individuals who may benefit from early interventions or personalized treatments. Despite these successful results, the consistency of reporting on model development and validation approaches was not satisfactory, whereas such reporting was provided in the present research.

Some strengths of this study should be acknowledged. Model training was restricted to hospitalized children, who made up a significant proportion of COVID-19 hospital admissions. Furthermore, multiple factors that might be related to the severity status and mortality of the disease in children, namely basic characteristics, clinical symptoms, biochemical features, and radiographic findings, were included in the ML algorithms. Utilizing ensemble ML algorithms, particularly boosting models, together with SMOTE enhanced the model validation indices considerably. This improvement enabled the identification of key features associated with severe disease outcomes. Ultimately,

However, this study had certain limitations that should be considered in further research. First, it was an observational investigation based on data from a single-center referral hospital during the initial stage of the global COVID-19 outbreak. As a result, a limited number of children were included. Additionally, due to inadequate documentation, some parameters influencing the prognosis of the illness were not accurately recorded from the onset. Second, the blood counts, biomarker parameters, and clinical symptoms used in model training were obtained at the time of admission. Including the dynamic changes of important laboratory findings and clinical symptoms could improve model validation, resulting in a potentially better clinical predictive model. Third, more complex algorithms, such as ANNs, which were not considered in this study, could potentially enhance predictive capability compared with tree-based models.

To summarize, five tree-based classifiers and two classic supervised techniques were implemented to predict COVID-19 prognosis using demographics, clinical symptoms and laboratory findings at the time of admission, and CT-scan and CXR results. The results demonstrated that COVID-19 severity and mortality in children could be predicted by tree-based ML classifiers. The most important features were respiratory distress and cough at the time of admission for predicting severity status and death, respectively. Adding age categories and other clinical symptoms could increase the strength of prediction.

Ethical Approval and consent to participate

This study was approved by the ethics committee of the Shahid Sadoughi University of Medical Sciences and Health Services of Yazd (IR.SSU.REC.1399.312). The study was conducted under the Helsinki Declaration (2013) guidelines and later amendments. All participants older than 16 years provided written informed consent prior to enrolling in the study; for individuals younger than 16 years, written informed consent was obtained from their parents or legal guardians.

Consent for publication

Not applicable. No personal or identifiable data were collected during the conduct of the study.

Availability of data and materials

The registered data used to provide the findings of this study are restricted by the Ethics Committee of Shahid Sadoughi University of Medical Sciences and Health Services in order to protect patient privacy. The datasets used during the current study are available from Mehran Karimi at [email protected] on reasonable request for researchers who meet the criteria for access to confidential data.

Competing Interests

The authors declare no conflict of interests.

Funding

This research was supported by a research grant from Shahid Sadoughi University of Medical Sciences and Health Services, Yazd, Iran [COVID-19 Internal Grant Announcement].

Author contributions

Mehran Karimi, Farimah Shamsi, and Zahra Nafei conceived and designed the study, and provided guidance on the research methodology. Mehran Karimi and Zahra Nafei supervised the overall project, reviewed and edited the manuscript, and provided critical revisions to the manuscript. Elahe Akbarian assisted in data curation, contributed to data analysis and model evaluation, and provided critical revisions to the manuscript. Farimah Shamsi conducted the data analysis and model training, drafted the manuscript, reviewed and edited the manuscript, and provided critical revisions to the manuscript. All the authors have read and approved the final version of the manuscript for submission.

Acknowledgements

The authors would like to thank the study team, Iman Borhani, Faezeh Dehghan, and Anahid Afnani, for medical information registration. Additionally, the authors thank Alireza Emarati for data entry, management, and organization.

1. Mansourian M, Ghandi Y, Habibi D, Mehrabi S. COVID-19 infection in children: A systematic review and meta-analysis of clinical features and laboratory findings. Archives de Pédiatrie. 2021;28(3):242–8.
2. Madani S, Shahin S, Yoosefi M, Ahmadi N, Ghasemi E, Koolaji S, et al. Red flags of poor prognosis in pediatric cases of COVID-19: the first 6610 hospitalized children in Iran. BMC Pediatr. 2021;21(1):563.
3. Armin S, Fahimzad SA, Rafiei Tabatabaei S, Mansour Ghanaiee R, Marhamati N, Ahmadizadeh SN, et al. COVID-19 Mortality in Children: A Referral Center Experience from Iran (Mofid Children’s Hospital, Tehran, Iran). Can J Infect Dis Med Microbiol. 2022;2022(1):2737719.
4. Shafaei B, Nafei Z, Karimi M, Behniafard N, Shamsi F, Faisal M, et al. Which Groups of Children Are at More Risk of Fatality during COVID-19 Pandemic? A Case‐Control Study in Yazd, Iran. Can J Infect Dis Med Microbiol. 2023;2023(1):8838056.
5. Shamsi F, Karimi M, Nafei Z, Akbarian E. Survival and Mortality in Hospitalized Children with COVID-19: A Referral Center Experience in Yazd, Iran. Can J Infect Dis Med Microbiol. 2023;2023(1):5205188.
6. Li WT, Ma J, Shende N, Castaneda G, Chakladar J, Tsai JC, et al. Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis. BMC Med Inf Decis Mak. 2020;20:1–13.
7. Dobrijević D, Vilotijević-Dautović G, Katanić J, Horvat M, Horvat Z, Pastor K. Rapid Triage of Children with Suspected COVID-19 Using Laboratory-Based Machine-Learning Algorithms. Viruses. 2023;15(7):1522.
8. Ng DC-E, Liew C-H, Tan KK, Chin L, Ting GSS, Fadzilah NF, et al. Risk factors for disease severity among children with Covid-19: a clinical prediction model. BMC Infect Dis. 2023;23(1):398.
9. Collins GS, Moons KG, Debray TP, Altman DG, Riley RD. Systematic reviews of prediction models. Systematic Reviews in Health Research: Meta-Analysis in Context. 2022:347–76.
10. Arreola EV, Irimata K, Wilson JR. Common errors of interpretation in biostatistics. Biostatistics Epidemiol. 2020;4(1):238–46.
11. Song Y-Y, Ying L. Decision tree methods: applications for classification and prediction. Shanghai archives psychiatry. 2015;27(2):130.
12. Kern C, Klausch T, Kreuter F, editors. Tree-based machine learning methods for survey research. Survey research methods. NIH Public Access; 2019.
13. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
14. Mienye ID, Sun Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access. 2022;10:99129–49.
15. Li Q-F, Song Z-M. High-performance concrete strength prediction based on ensemble learning. Constr Build Mater. 2022;324:126694.
16. Azmi SS, Baliga S. An overview of boosting decision tree algorithms utilizing AdaBoost and XGBoost boosting strategies. Int Res J Eng Technol. 2020;7(5):6867–70.
17. Jiang J, Wang R, Wang M, Gao K, Nguyen DD, Wei G-W. Boosting tree-assisted multitask deep learning for small scientific datasets. J Chem Inf Model. 2020;60(3):1235–44.
18. Gaye B, Zhang D, Wulamu A. Improvement of support vector machine algorithm in big data background. Math Probl Eng. 2021;2021(1):5594899.
19. Python. https://www.python.org/
20. Al Mamlook R, Al-Mawee W, Alden AYQ, Alsheakh H, Bzizi H, editors. Evaluation of machine learning models to forecast COVID-19 relying on laboratory outcomes characteristics in children. IOP Conference Series: Materials Science and Engineering; 2021: IOP Publishing.
21. Ma H, Ye Q, Ding W, Jiang Y, Wang M, Niu Z, et al. Can clinical symptoms and laboratory results predict CT abnormality? initial findings using novel machine learning techniques in children with COVID-19 infections. Front Med. 2021;8:699984.
22. Piparia S, Defante A, Tantisira K, Ryu J. Using machine learning to improve our understanding of COVID-19 infection in children. PLoS ONE. 2023;18(2):e0281666.
23. Pavliuk O, Kolesnyk H. Machine-learning method for analyzing and predicting the number of hospitalizations of children during the fourth wave of the COVID-19 pandemic in the Lviv region. J Reliable Intell Environ. 2023;9(1):17–26.
24. Liu P, Xing Z, Peng X, Zhang M, Shu C, Wang C, et al. Machine learning versus multivariate logistic regression for predicting severe COVID-19 in hospitalized children with Omicron variant infection. J Med Virol. 2024;96(2):e29447.
25. dos Santos AL, Pinhati C, Perdigão J, Galante S, Silva L, Veloso I, et al. Machine learning algorithms to predict outcomes in children and adolescents with COVID-19: A systematic review. Artif Intell Med. 2024:102824.

Editorial decision: Revision requested
29 Oct, 2024
Reviews received at journal
22 Oct, 2024
Reviews received at journal
10 Oct, 2024
Reviewers agreed at journal
09 Oct, 2024
Reviewers agreed at journal
07 Oct, 2024
Reviews received at journal
02 Oct, 2024
Reviewers agreed at journal
30 Sep, 2024
Reviewers agreed at journal
30 Sep, 2024
Reviewers agreed at journal
28 Sep, 2024
Reviewers agreed at journal
24 Sep, 2024
Reviewers agreed at journal
23 Sep, 2024
Reviewers invited by journal
23 Sep, 2024
Editor invited by journal
27 Aug, 2024
Editor assigned by journal
27 Aug, 2024
Submission checks completed at journal
27 Aug, 2024
First submitted to journal
16 Aug, 2024
