3.1 Baseline Characteristics
In this study, a total of 55 variables from 6691 participants were extracted. After deletion of variables and samples with ≥ 30% missing values, 42 variables from 6680 COPD patients were included in the study (see Table S3 for details of missing data). There were 1286 cases of AECOPD and 5394 cases of regular COPD. As shown in Fig. 2, the prevalence of AECOPD was higher in men than in women; higher in divorced/separated/widowed and single COPD patients than in married patients and those of unknown marital status; and lower in the White population than in other populations. The proportion of smokers among AECOPD patients was greater than that of nonsmokers. The age distributions of smokers versus nonsmokers and of male versus female patients were broadly consistent between the AECOPD and regular COPD groups.
Figure 2 Distribution of baseline characteristics. (A) Prevalence of AECOPD by gender, marital status, ethnicity, and smoking status; (B) age distribution of smokers and non-smokers in COPD/AECOPD; (C) age distribution of males and females in COPD/AECOPD.
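The ≥ 30% missing-data screen described above can be sketched as follows; this is a minimal illustration with a hypothetical toy DataFrame and column names, not the study's actual pipeline (which drops variables first and then samples).

```python
import numpy as np
import pandas as pd

def filter_missing(df, threshold=0.30):
    """Drop variables (columns), then samples (rows), whose fraction
    of missing values is >= threshold."""
    df = df[df.columns[df.isna().mean() < threshold]]   # drop sparse columns
    return df.loc[df.isna().mean(axis=1) < threshold]   # drop sparse rows

# Hypothetical toy data: one column and one row exceed 30% missing.
toy = pd.DataFrame({
    "age":  [65, 70, np.nan, 80],
    "sofa": [3, np.nan, np.nan, np.nan],   # 75% missing -> dropped
    "wbc":  [9.1, 7.4, np.nan, 8.8],
})
clean = filter_missing(toy)  # "sofa" column and the all-missing row are removed
```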
3.2 Univariate Analysis
The distribution of AECOPD patients across the different factors and the results of the univariate analysis are presented in Table S4. Univariate analysis was performed using the chi-square test for categorical variables and the nonparametric Mann–Whitney U test for continuous variables, with the significance level \(\alpha\) set at 0.10. The results showed that 33 factors, including gender, marital status, insurance, smoking, CAD, CHF, PVD, LAMA, pneumonia, sepsis, CKD, LABA, inhaled corticosteroid, oral corticosteroid, beta blockers, calcium channel blocker, diuretics, antiplatelet, nitrates, SOFA, APSIII, OASIS, WBC, neutrophils, lymphocytes, platelet, hemoglobin, sodium, potassium, bicarbonate, RBC, creatinine, and glucose (see Table S4 for the remaining factors), differed significantly between groups in the prevalence of AECOPD (P < 0.1).
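A univariate screen of this kind (chi-square for categorical factors, Mann–Whitney U for continuous ones, retaining factors with p < 0.10) can be sketched with SciPy; the helper name, the synthetic data, and the column names are illustrative assumptions, not the study's code.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

def univariate_screen(df, outcome, categorical, alpha=0.10):
    """Chi-square test for categorical factors, Mann-Whitney U test for
    continuous ones; keep factors with p < alpha."""
    pvals = {}
    y = df[outcome]
    for col in df.columns.drop(outcome):
        if col in categorical:
            _, p, _, _ = chi2_contingency(pd.crosstab(df[col], y))
        else:
            g0 = df.loc[y == 0, col].dropna()
            g1 = df.loc[y == 1, col].dropna()
            _, p = mannwhitneyu(g0, g1, alternative="two-sided")
        pvals[col] = p
    return {c: p for c, p in pvals.items() if p < alpha}

# Synthetic demo: "wbc" is shifted in the AECOPD group, "noise" is not.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
demo = pd.DataFrame({
    "aecopd": y,
    "wbc": 8 + 2.5 * y + rng.normal(0, 1, n),
    "noise": rng.normal(0, 1, n),
})
selected = univariate_screen(demo, "aecopd", categorical=set())
```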
3.4 Model Establishment and Evaluation
The results of the internal validation and the external validation of each model are summarized in Table S6 and Table 2, respectively. As can be observed from Table S6, all models achieved extremely high specificity on the class-imbalanced dataset, but with a sensitivity of only about 0.3. This indicates that both conventional models and machine learning algorithms performed poorly at identifying AECOPD patients when the data were class-imbalanced. On the unbalanced dataset, the XGBoost model obtained the highest sensitivity (0.321), F1 score (0.404), and G-mean (0.546), along with high AUC, accuracy, and specificity, giving it the relatively best classification performance. After the data were balanced with resampling techniques, the sensitivity of all models improved. The Safe-Level-SMOTE-balanced LR model had the highest recall (0.721), and the corresponding F1 values and G-means of the models also improved; however, no model achieved the highest score on every evaluation metric.
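The pattern described above (near-perfect specificity with near-zero sensitivity on imbalanced data) is easy to reproduce from a confusion matrix. The sketch below computes the section's four headline metrics for a degenerate "always predict the majority class" classifier; the helper name and toy labels are illustrative.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Sensitivity, specificity, F1 and G-mean from a binary confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return {"sensitivity": sens, "specificity": spec,
            "f1": f1, "g_mean": float(np.sqrt(sens * spec))}

# A classifier that always predicts the majority (non-AECOPD) class:
y_true = [1] * 2 + [0] * 8          # 20% minority class
m = imbalance_metrics(y_true, [0] * 10)
# accuracy would be 0.80, yet sensitivity, F1 and G-mean are all 0,
# mirroring the SVM row of Table 2 (specificity 1.000, sensitivity 0.000)
```

This is why the section leans on F1 and G-mean rather than accuracy alone: G-mean collapses to zero as soon as either class is missed entirely.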
In the test set, whose predictive performance was broadly consistent with the internal validation, all models again showed high specificity and low sensitivity on the class-imbalanced dataset. This suggests that the various algorithms still failed to identify AECOPD patients effectively on new imbalanced data. The LR model obtained the relatively best prediction performance on the unbalanced dataset. After the data were balanced by resampling, the sensitivity of all models improved, in some cases from approximately 0.4 to over 0.80, and the corresponding F1 values and G-means also rose. As in the training set, no model received the maximum score on every evaluation metric in the external validation.
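The resampling step can be illustrated with the simplest balancing scheme, random oversampling of the minority class. This is only a stand-in: the study's SMOTE-family methods synthesize new minority samples by interpolation rather than duplicating existing ones, and its under-sampling methods (OSS, NC) instead remove majority samples.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples at random until all classes are
    equally represented (a simple stand-in for SMOTE-family methods)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        c_idx = np.flatnonzero(y == c)
        extra = rng.choice(c_idx, n_max - n, replace=True)  # top up to n_max
        idx.extend(np.concatenate([c_idx, extra]))
    idx = np.asarray(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])  # 2 positives vs 8 negatives
X_bal, y_bal = random_oversample(X, y)        # now 8 vs 8
```

In practice the `imbalanced-learn` package provides SMOTE, KMeans-SMOTE and Neighborhood Cleaning Rule implementations with a common `fit_resample(X, y)` interface.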
Because no model attained the best values on all metrics in either the training set or the test set, it is difficult to evaluate model performance objectively. To this end, we calculated a combined score for each model using a ranked-assignment technique and judged the model with the highest score to have the best overall performance. The scores and rankings of all models are summarized in Table 2; the LightGBM model with NC (Neighborhood Cleaning Rule) under-sampling performed best in the test set, with a score of 543.5.
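One plausible reading of the ranked-assignment technique is sketched below: each metric is ranked across models (ties share the average rank, which is how fractional scores such as 543.5 can arise) and the ranks are summed into a combined score. The exact weighting used in the study is not specified here, so the metric values and model names in the demo are illustrative.

```python
import pandas as pd

def combined_score(metric_table):
    """Rank each metric column across models (higher value -> higher rank,
    ties share the average rank) and sum the ranks into a combined score."""
    ranks = metric_table.rank(method="average", ascending=True)
    return ranks.sum(axis=1).sort_values(ascending=False)

# Hypothetical three-model excerpt with three of the six metrics:
demo = pd.DataFrame(
    {"AUC":        [0.77, 0.75, 0.73],
     "Sensitivity": [0.56, 0.40, 0.30],
     "G-mean":     [0.68, 0.60, 0.54]},
    index=["NC_LGBM", "OS_LGBM", "LGBM"],
)
scores = combined_score(demo)  # NC_LGBM ranks first on every metric
```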
Table 2
Summary of model performance for external validation data

| Model | AUC | Accuracy | Sensitivity | F1 | Specificity | G-mean | Score | Rank |
|---|---|---|---|---|---|---|---|---|
| SVM | 0.742 | 0.807 | 0.000 | 0.000 | 1.000 | 0.000 | 192.5 | 69 |
| LR | 0.763 | 0.795 | 0.443 | 0.454 | 0.879 | 0.624 | 461.5 | 9 |
| MLP | 0.762 | 0.805 | 0.396 | 0.439 | 0.902 | 0.598 | 415.0 | 21 |
| RF | 0.767 | 0.820 | 0.083 | 0.151 | 0.996 | 0.287 | 291.0 | 54 |
| GBDT | 0.748 | 0.812 | 0.262 | 0.349 | 0.943 | 0.497 | 224.5 | 62 |
| XGBoost | 0.727 | 0.812 | 0.280 | 0.365 | 0.939 | 0.513 | 181.0 | 72 |
| LGBM | 0.764 | 0.820 | 0.314 | 0.402 | 0.941 | 0.543 | 331.0 | 39 |
| CatBoost | 0.769 | 0.821 | 0.254 | 0.353 | 0.956 | 0.493 | 308.0 | 49 |
| OS_SVM | 0.757 | 0.821 | 0.313 | 0.403 | 0.942 | 0.543 | 306.5 | 50 |
| OS_LR | 0.761 | 0.765 | 0.534 | 0.466 | 0.820 | 0.661 | 485.0 | 7 |
| OS_MLP | 0.752 | 0.800 | 0.387 | 0.427 | 0.899 | 0.589 | 328.0 | 41 |
| OS_RF | 0.762 | 0.828 | 0.220 | 0.331 | 0.973 | 0.463 | 290.5 | 55 |
| OS_GBDT | 0.736 | 0.805 | 0.396 | 0.439 | 0.902 | 0.598 | 313.0 | 46 |
| OS_XGBoost | 0.721 | 0.809 | 0.329 | 0.399 | 0.924 | 0.551 | 203.0 | 66 |
| OS_LGBM | 0.755 | 0.808 | 0.407 | 0.450 | 0.904 | 0.606 | 410.5 | 23 |
| OS_CatBoost | 0.756 | 0.805 | 0.345 | 0.405 | 0.915 | 0.561 | 300.0 | 51 |
| NC_SVM | 0.753 | 0.766 | 0.526 | 0.464 | 0.823 | 0.658 | 444.0 | 14 |
| NC_LR | 0.759 | 0.743 | 0.643 | 0.491 | 0.767 | 0.702 | 525.0 | 2 |
| NC_MLP | 0.752 | 0.777 | 0.500 | 0.463 | 0.842 | 0.649 | 424.5 | 17 |
| NC_RF | 0.766 | 0.800 | 0.340 | 0.396 | 0.910 | 0.556 | 312.5 | 47 |
| NC_GBDT | 0.749 | 0.776 | 0.505 | 0.464 | 0.840 | 0.651 | 420.0 | 18 |
| NC_XGBoost | 0.742 | 0.787 | 0.443 | 0.445 | 0.869 | 0.621 | 348.5 | 36 |
| NC_LGBM | 0.769 | 0.770 | 0.557 | 0.482 | 0.820 | 0.676 | 543.5 | 1 |
| NC_CatBoost | 0.756 | 0.793 | 0.363 | 0.403 | 0.896 | 0.570 | 293.0 | 52 |
| S_SVM | 0.761 | 0.616 | 0.788 | 0.442 | 0.575 | 0.673 | 426.5 | 16 |
| S_LR | 0.760 | 0.530 | 0.863 | 0.414 | 0.451 | 0.624 | 330.0 | 40 |
| S_MLP | 0.752 | 0.552 | 0.837 | 0.419 | 0.485 | 0.637 | 315.5 | 43 |
| S_RF | 0.750 | 0.670 | 0.725 | 0.459 | 0.657 | 0.690 | 452.0 | 12 |
| S_GBDT | 0.713 | 0.787 | 0.368 | 0.399 | 0.887 | 0.571 | 190.5 | 70 |
| S_XGBoost | 0.719 | 0.656 | 0.643 | 0.419 | 0.660 | 0.651 | 257.0 | 60 |
| S_LGBM | 0.742 | 0.661 | 0.702 | 0.444 | 0.651 | 0.676 | 380.0 | 29 |
| S_CatBoost | 0.733 | 0.781 | 0.453 | 0.444 | 0.860 | 0.624 | 315.0 | 44 |
| SI_SVM | 0.762 | 0.610 | 0.793 | 0.439 | 0.567 | 0.670 | 432.0 | 15 |
| SI_LR | 0.762 | 0.524 | 0.863 | 0.411 | 0.443 | 0.618 | 338.0 | 38 |
| SI_MLP | 0.752 | 0.564 | 0.839 | 0.426 | 0.499 | 0.647 | 339.0 | 37 |
| SI_RF | 0.743 | 0.661 | 0.720 | 0.450 | 0.647 | 0.683 | 406.5 | 25 |
| SI_GBDT | 0.716 | 0.781 | 0.396 | 0.417 | 0.873 | 0.588 | 226.0 | 61 |
| SI_XGBoost | 0.717 | 0.649 | 0.663 | 0.421 | 0.645 | 0.654 | 260.0 | 59 |
| SI_LGBM | 0.741 | 0.674 | 0.676 | 0.444 | 0.674 | 0.675 | 374.0 | 32 |
| SI_CatBoost | 0.731 | 0.782 | 0.443 | 0.439 | 0.863 | 0.618 | 291.5 | 53 |
| SIC_SVM | 0.762 | 0.643 | 0.772 | 0.454 | 0.612 | 0.687 | 496.0 | 5 |
| SIC_LR | 0.762 | 0.553 | 0.832 | 0.417 | 0.486 | 0.636 | 362.0 | 34 |
| SIC_MLP | 0.754 | 0.582 | 0.803 | 0.426 | 0.530 | 0.652 | 351.5 | 35 |
| SIC_RF | 0.742 | 0.701 | 0.679 | 0.466 | 0.706 | 0.692 | 456.5 | 10 |
| SIC_GBDT | 0.720 | 0.791 | 0.358 | 0.397 | 0.894 | 0.566 | 184.0 | 71 |
| SIC_XGBoost | 0.740 | 0.694 | 0.658 | 0.453 | 0.702 | 0.680 | 394.5 | 28 |
| SIC_LGBM | 0.749 | 0.673 | 0.718 | 0.458 | 0.662 | 0.689 | 447.0 | 13 |
| SIC_CatBoost | 0.737 | 0.775 | 0.469 | 0.445 | 0.848 | 0.631 | 325.5 | 42 |
| SL_SVM | 0.761 | 0.648 | 0.744 | 0.448 | 0.625 | 0.682 | 456.0 | 11 |
| SL_LR | 0.762 | 0.592 | 0.814 | 0.434 | 0.539 | 0.662 | 419.0 | 20 |
| SL_MLP | 0.757 | 0.627 | 0.769 | 0.443 | 0.593 | 0.675 | 419.5 | 19 |
| SL_RF | 0.757 | 0.680 | 0.728 | 0.467 | 0.669 | 0.698 | 518.5 | 3 |
| SL_GBDT | 0.742 | 0.785 | 0.394 | 0.414 | 0.878 | 0.588 | 268.5 | 58 |
| SL_XGBoost | 0.707 | 0.628 | 0.676 | 0.412 | 0.617 | 0.646 | 224.0 | 63 |
| SL_LGBM | 0.742 | 0.668 | 0.707 | 0.451 | 0.659 | 0.683 | 410.5 | 23 |
| SL_CatBoost | 0.739 | 0.751 | 0.510 | 0.441 | 0.808 | 0.642 | 313.5 | 45 |
| KS_SVM | 0.736 | 0.761 | 0.415 | 0.401 | 0.844 | 0.591 | 220.0 | 65 |
| KS_LR | 0.744 | 0.724 | 0.601 | 0.456 | 0.753 | 0.673 | 412.0 | 22 |
| KS_MLP | 0.752 | 0.749 | 0.534 | 0.450 | 0.800 | 0.654 | 402.0 | 26 |
| KS_RF | 0.736 | 0.772 | 0.376 | 0.388 | 0.867 | 0.571 | 193.0 | 68 |
| KS_GBDT | 0.736 | 0.808 | 0.316 | 0.389 | 0.926 | 0.541 | 202.5 | 67 |
| KS_XGBoost | 0.731 | 0.725 | 0.544 | 0.432 | 0.768 | 0.646 | 287.5 | 56 |
| KS_LGBM | 0.753 | 0.779 | 0.474 | 0.452 | 0.852 | 0.635 | 397.0 | 27 |
| KS_CatBoost | 0.755 | 0.809 | 0.386 | 0.438 | 0.910 | 0.593 | 366.5 | 33 |
| MW_SVM | 0.761 | 0.669 | 0.736 | 0.461 | 0.653 | 0.693 | 505.0 | 4 |
| MW_LR | 0.762 | 0.565 | 0.824 | 0.422 | 0.504 | 0.644 | 378.5 | 31 |
| MW_MLP | 0.755 | 0.666 | 0.744 | 0.462 | 0.648 | 0.694 | 486.0 | 6 |
| MW_RF | 0.754 | 0.754 | 0.565 | 0.469 | 0.799 | 0.672 | 467.0 | 8 |
| MW_GBDT | 0.741 | 0.794 | 0.355 | 0.399 | 0.899 | 0.565 | 221.5 | 64 |
| MW_XGBoost | 0.730 | 0.706 | 0.593 | 0.437 | 0.732 | 0.659 | 308.5 | 48 |
| MW_LGBM | 0.745 | 0.741 | 0.549 | 0.449 | 0.786 | 0.657 | 379.0 | 30 |
| MW_CatBoost | 0.747 | 0.800 | 0.363 | 0.412 | 0.905 | 0.573 | 280.0 | 57 |

Note: OS = One-Sided Selection, NC = Neighborhood Cleaning Rule, S = SMOTE, SI = SMOTE-IPF, SIC = SMOTE-IPF (CatBoost), SL = Safe-Level-SMOTE, KS = KMeans-SMOTE, MW = MWMOTE.
Moreover, comparison diagrams of each evaluation metric for the same model under different resampling methods were plotted to investigate how the various resampling methods improved classification performance. As shown in Fig. 4, the AUC values of a given model were relatively concentrated across resampling methods. Since composite indexes are more reasonable criteria than any single index, only the two composite indexes F1 and G-mean were used as the main basis for judging joint model performance. The results indicated that the SVM model performed best under MWMOTE processing, while LR, MLP, RF, GBDT, XGBoost, LGBM, and CatBoost performed best under NC, MWMOTE, SL, NC, SIC, NC, and SL balancing, respectively, essentially matching the combined scores in Table 2. This suggests that the pairing of class-balancing method and classification algorithm on a given dataset is itself a factor affecting modelling performance and warrants further investigation.
In addition, each model's internal validation performance did not differ substantially from its external validation performance (see Figure S2 for details), indicating that the models' generalization performance is satisfactory.
To further improve model performance, we performed Voting ensembling on the top three models under each class-balancing method and assigned rankings in the same way to select the optimal model. As shown in Table 3 and Fig. 5, the joint MWMOTE-Voting model ranked first among all models, followed by the NC-Voting model. Most Voting-ensemble models ranked in the top 10, all heterogeneous-ensemble models ranked in the top 30 overall, and the heterogeneous ensembles outperformed their base classifiers on several metrics, indicating that further integration of models is one way to improve the predictive performance of dominant models.
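A top-3 Voting ensemble of the kind described above can be sketched with scikit-learn's `VotingClassifier`. The base estimators and the synthetic imbalanced dataset below are hypothetical stand-ins, not the paper's actual top-3 models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~80/20 split, like AECOPD vs regular COPD).
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Soft voting averages the base models' predicted probabilities.
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gbdt", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
vote.fit(X_tr, y_tr)
acc = vote.score(X_te, y_te)
```

In the study's setup, resampling would be applied to the training fold before fitting, and the combined rank score rather than accuracy would decide the winner.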
We further plotted the KS curves for each of the nine heterogeneous-ensemble models (Fig. 6). As Fig. 6 shows, the models' KS values are centered around 0.41, indicating adequate discrimination.
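The KS (Kolmogorov-Smirnov) statistic behind these curves is the maximum gap between the cumulative score distributions of positives and negatives, equivalently the maximum of |TPR - FPR| over all thresholds. A minimal sketch (helper name and toy scores are illustrative):

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """KS statistic: max |TPR - FPR| over all decision thresholds."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    order = np.argsort(-y_score)           # sort by score, descending
    y_sorted = y_true[order]
    tpr = np.cumsum(y_sorted == 1) / max((y_true == 1).sum(), 1)
    fpr = np.cumsum(y_sorted == 0) / max((y_true == 0).sum(), 1)
    return float(np.max(np.abs(tpr - fpr)))

# Perfectly separated scores give KS = 1; random scores give KS near 0.
y = np.array([1, 1, 0, 0])
ks = ks_statistic(y, np.array([0.9, 0.8, 0.2, 0.1]))
```

A KS around 0.41, as reported here, sits between these extremes and is conventionally read as adequate discrimination.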
Table 3
Comparison of model performance after Voting ensemble.

| Model | AUC | Accuracy | Sensitivity | F1 | Specificity | G-mean | Rank |
|---|---|---|---|---|---|---|---|
| Voting | 0.771 | 0.814 | 0.371 | 0.434 | 0.920 | 0.584 | 29 |
| OS_Voting | 0.769 | 0.808 | 0.404 | 0.448 | 0.904 | 0.605 | 18 |
| NC_Voting | 0.764 | 0.758 | 0.604 | 0.490 | 0.795 | 0.693 | 2 |
| S_Voting | 0.763 | 0.649 | 0.762 | 0.455 | 0.622 | 0.688 | 8 |
| SI_Voting | 0.761 | 0.645 | 0.762 | 0.452 | 0.617 | 0.685 | 13 |
| SIC_Voting | 0.763 | 0.664 | 0.749 | 0.462 | 0.643 | 0.694 | 4 |
| SL_Voting | 0.767 | 0.646 | 0.757 | 0.452 | 0.620 | 0.685 | 9 |
| KS_Voting | 0.761 | 0.760 | 0.529 | 0.458 | 0.815 | 0.656 | 17 |
| MW_Voting | 0.769 | 0.696 | 0.712 | 0.474 | 0.692 | 0.702 | 1 |
3.5 Visualization of Feature Importance
Figure 7(A) and Fig. 7(B) show the Shapley value plots. Figure 7(A) depicts the overall feature importance, i.e., the mean absolute contribution of each feature to the model's predictions. Figure 7(B) depicts the Shapley values for individual samples: the horizontal coordinate is the Shapley value, and the colors encode the feature values, with red dots representing high values and blue dots representing low values; the irregular overlapping of points reflects their dispersion.
As shown in Fig. 7(A-B), the most important risk factor for AECOPD in COPD patients was pneumonia: patients with pneumonia were more likely to develop AECOPD than COPD patients without pneumonia. OASIS came second, and the remaining factors, in order, were oral corticosteroid, calcium channel blocker, bicarbonate, beta blockers, inhaled corticosteroid, etc.
Figure 7 Interpretation of the LightGBM model. (A) SHAP overall feature importance chart; (B) distribution of per-sample Shapley values.
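In practice these plots are produced with the `shap` package's TreeExplainer on the fitted LightGBM model. To show what a Shapley value is without depending on that library, the self-contained sketch below computes exact Shapley values for a single prediction by averaging marginal contributions over all feature subsets (absent features are replaced by a baseline value); the function and toy linear model are illustrative assumptions.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values for one prediction of model f at point x,
    replacing 'absent' features by their baseline values."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                x_with, x_without = baseline.copy(), baseline.copy()
                for j in S:
                    x_with[j], x_without[j] = x[j], x[j]
                x_with[i] = x[i]                       # add feature i
                phi[i] += w * (f(x_with) - f(x_without))
    return phi

# Toy linear model: Shapley values reduce to w_i * (x_i - baseline_i),
# and they sum to f(x) - f(baseline) (the efficiency property).
w = np.array([2.0, -1.0, 0.5])
f = lambda v: float(w @ v)
x = np.array([1.0, 3.0, 2.0])
base = np.zeros(3)
phi = exact_shapley(f, x, base)
```

SHAP's TreeExplainer computes the same quantities efficiently for tree ensembles such as LightGBM instead of enumerating all \(2^n\) subsets.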