3.1 Baseline Characteristics
In this study, a total of 55 variables from 6691 participants were extracted. After deletion of variables and samples with ≥ 30% missing values, 42 variables from 6680 COPD patients were included in the study (see Table S3 for details of missing data). There were 1286 cases of AECOPD and 5394 cases of regular COPD. As shown in Fig. 2, the prevalence of AECOPD was higher in men than in women; higher in divorced/separated/widowed and single COPD patients than in married patients and those of unknown marital status; and lower in the White population than in other populations. The proportion of smokers among AECOPD patients was greater than that of nonsmokers. The age distributions of smokers versus nonsmokers and of male versus female patients were broadly consistent between the AECOPD and regular COPD groups.
Figure 2 Distribution of baseline characteristics. (A) Prevalence of AECOPD by gender, marital status, ethnicity, and smoking status; (B) age distribution of smokers and non-smokers in COPD/AECOPD; (C) age distribution of males and females in COPD/AECOPD.
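The ≥ 30% missing-data screen described above can be sketched as follows; this is a minimal illustration with a hypothetical toy DataFrame and column names, not the study's actual pipeline (which drops variables first and then samples).

```python
import numpy as np
import pandas as pd

def filter_missing(df, threshold=0.30):
    """Drop variables (columns), then samples (rows), whose fraction
    of missing values is >= threshold."""
    df = df[df.columns[df.isna().mean() < threshold]]   # drop sparse columns
    return df.loc[df.isna().mean(axis=1) < threshold]   # drop sparse rows

# Hypothetical toy data: one column and one row exceed 30% missing.
toy = pd.DataFrame({
    "age":  [65, 70, np.nan, 80],
    "sofa": [3, np.nan, np.nan, np.nan],   # 75% missing -> dropped
    "wbc":  [9.1, 7.4, np.nan, 8.8],
})
clean = filter_missing(toy)  # "sofa" column and the all-missing row are removed
```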
3.2 Univariate Analysis
The distribution of AECOPD patients across the different factors and the results of the univariate analysis are presented in Table S4. Univariate analysis was performed using the chi-square test for categorical variables and the nonparametric Mann–Whitney U test for continuous variables, with the significance level \(\alpha\) set at 0.10. The results showed that 33 factors, including gender, marital status, insurance, smoking, CAD, CHF, PVD, LAMA, pneumonia, sepsis, CKD, LABA, inhaled corticosteroid, oral corticosteroid, beta blockers, calcium channel blocker, diuretics, antiplatelet, nitrates, SOFA, APSIII, OASIS, WBC, neutrophils, lymphocytes, platelet, hemoglobin, sodium, potassium, bicarbonate, RBC, creatinine, and glucose (see Table S4 for the remaining factors), differed significantly between groups in the prevalence of AECOPD (P < 0.1).
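A univariate screen of this kind (chi-square for categorical factors, Mann–Whitney U for continuous ones, retaining factors with p < 0.10) can be sketched with SciPy; the helper name, the synthetic data, and the column names are illustrative assumptions, not the study's code.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

def univariate_screen(df, outcome, categorical, alpha=0.10):
    """Chi-square test for categorical factors, Mann-Whitney U test for
    continuous ones; keep factors with p < alpha."""
    pvals = {}
    y = df[outcome]
    for col in df.columns.drop(outcome):
        if col in categorical:
            _, p, _, _ = chi2_contingency(pd.crosstab(df[col], y))
        else:
            g0 = df.loc[y == 0, col].dropna()
            g1 = df.loc[y == 1, col].dropna()
            _, p = mannwhitneyu(g0, g1, alternative="two-sided")
        pvals[col] = p
    return {c: p for c, p in pvals.items() if p < alpha}

# Synthetic demo: "wbc" is shifted in the AECOPD group, "noise" is not.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
demo = pd.DataFrame({
    "aecopd": y,
    "wbc": 8 + 2.5 * y + rng.normal(0, 1, n),
    "noise": rng.normal(0, 1, n),
})
selected = univariate_screen(demo, "aecopd", categorical=set())
```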
3.4 Model Establishment and Evaluation
The results of the internal validation and the external validation of each model are summarized in Table S6 and Table 2, respectively. As can be observed from Table S6, all models achieved extremely high specificity on the class-imbalanced dataset, but with a sensitivity of only about 0.3. This indicates that both conventional models and machine learning algorithms performed poorly at identifying AECOPD patients when the data were class-imbalanced. On the unbalanced dataset, the XGBoost model obtained the highest sensitivity (0.321), F1 score (0.404), and G-mean (0.546), along with high AUC, accuracy, and specificity, giving it the relatively best classification performance. After the data were balanced with resampling techniques, the sensitivity of all models improved. The Safe-Level-SMOTE-balanced LR model had the highest recall (0.721), and the corresponding F1 values and G-means of the models also improved; however, no model achieved the highest score on every evaluation metric.
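The pattern described above (near-perfect specificity with near-zero sensitivity on imbalanced data) is easy to reproduce from a confusion matrix. The sketch below computes the section's four headline metrics for a degenerate "always predict the majority class" classifier; the helper name and toy labels are illustrative.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Sensitivity, specificity, F1 and G-mean from a binary confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return {"sensitivity": sens, "specificity": spec,
            "f1": f1, "g_mean": float(np.sqrt(sens * spec))}

# A classifier that always predicts the majority (non-AECOPD) class:
y_true = [1] * 2 + [0] * 8          # 20% minority class
m = imbalance_metrics(y_true, [0] * 10)
# accuracy would be 0.80, yet sensitivity, F1 and G-mean are all 0,
# mirroring the SVM row of Table 2 (specificity 1.000, sensitivity 0.000)
```

This is why the section leans on F1 and G-mean rather than accuracy alone: G-mean collapses to zero as soon as either class is missed entirely.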
In the test set, whose predictive performance was broadly consistent with the internal validation, all models again showed high specificity and low sensitivity on the class-imbalanced dataset. This suggests that the various algorithms still failed to identify AECOPD patients effectively on new imbalanced data. The LR model obtained the relatively best prediction performance on the unbalanced dataset. After the data were balanced by resampling, the sensitivity of all models improved, in some cases from approximately 0.4 to over 0.80, and the corresponding F1 values and G-means also rose. As in the training set, no model received the maximum score on every evaluation metric in the external validation.
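The resampling step can be illustrated with the simplest balancing scheme, random oversampling of the minority class. This is only a stand-in: the study's SMOTE-family methods synthesize new minority samples by interpolation rather than duplicating existing ones, and its under-sampling methods (OSS, NC) instead remove majority samples.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples at random until all classes are
    equally represented (a simple stand-in for SMOTE-family methods)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        c_idx = np.flatnonzero(y == c)
        extra = rng.choice(c_idx, n_max - n, replace=True)  # top up to n_max
        idx.extend(np.concatenate([c_idx, extra]))
    idx = np.asarray(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])  # 2 positives vs 8 negatives
X_bal, y_bal = random_oversample(X, y)        # now 8 vs 8
```

In practice the `imbalanced-learn` package provides SMOTE, KMeans-SMOTE and Neighborhood Cleaning Rule implementations with a common `fit_resample(X, y)` interface.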
Because no model attained the best values on all metrics in either the training set or the test set, it is difficult to evaluate model performance objectively. To this end, we calculated a combined score for each model using a ranked-assignment technique and judged the model with the highest score to have the best overall performance. The scores and rankings of all models are summarized in Table 2; the LightGBM model with NC (Neighborhood Cleaning Rule) under-sampling performed best in the test set, with a score of 543.5.
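One plausible reading of the ranked-assignment technique is sketched below: each metric is ranked across models (ties share the average rank, which is how fractional scores such as 543.5 can arise) and the ranks are summed into a combined score. The exact weighting used in the study is not specified here, so the metric values and model names in the demo are illustrative.

```python
import pandas as pd

def combined_score(metric_table):
    """Rank each metric column across models (higher value -> higher rank,
    ties share the average rank) and sum the ranks into a combined score."""
    ranks = metric_table.rank(method="average", ascending=True)
    return ranks.sum(axis=1).sort_values(ascending=False)

# Hypothetical three-model excerpt with three of the six metrics:
demo = pd.DataFrame(
    {"AUC":        [0.77, 0.75, 0.73],
     "Sensitivity": [0.56, 0.40, 0.30],
     "G-mean":     [0.68, 0.60, 0.54]},
    index=["NC_LGBM", "OS_LGBM", "LGBM"],
)
scores = combined_score(demo)  # NC_LGBM ranks first on every metric
```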
Table 2
Summary of model performance for external validation data

| Model | AUC | Accuracy | Sensitivity | F1 | Specificity | G-mean | Score | Rank |
|---|---|---|---|---|---|---|---|---|
| SVM | 0.742 | 0.807 | 0.000 | 0.000 | 1.000 | 0.000 | 192.5 | 69 |
| LR | 0.763 | 0.795 | 0.443 | 0.454 | 0.879 | 0.624 | 461.5 | 9 |
| MLP | 0.762 | 0.805 | 0.396 | 0.439 | 0.902 | 0.598 | 415.0 | 21 |
| RF | 0.767 | 0.820 | 0.083 | 0.151 | 0.996 | 0.287 | 291.0 | 54 |
| GBDT | 0.748 | 0.812 | 0.262 | 0.349 | 0.943 | 0.497 | 224.5 | 62 |
| XGBoost | 0.727 | 0.812 | 0.280 | 0.365 | 0.939 | 0.513 | 181.0 | 72 |
| LGBM | 0.764 | 0.820 | 0.314 | 0.402 | 0.941 | 0.543 | 331.0 | 39 |
| CatBoost | 0.769 | 0.821 | 0.254 | 0.353 | 0.956 | 0.493 | 308.0 | 49 |
| OS_SVM | 0.757 | 0.821 | 0.313 | 0.403 | 0.942 | 0.543 | 306.5 | 50 |
| OS_LR | 0.761 | 0.765 | 0.534 | 0.466 | 0.820 | 0.661 | 485.0 | 7 |
| OS_MLP | 0.752 | 0.800 | 0.387 | 0.427 | 0.899 | 0.589 | 328.0 | 41 |
| OS_RF | 0.762 | 0.828 | 0.220 | 0.331 | 0.973 | 0.463 | 290.5 | 55 |
| OS_GBDT | 0.736 | 0.805 | 0.396 | 0.439 | 0.902 | 0.598 | 313.0 | 46 |
| OS_XGBoost | 0.721 | 0.809 | 0.329 | 0.399 | 0.924 | 0.551 | 203.0 | 66 |
| OS_LGBM | 0.755 | 0.808 | 0.407 | 0.450 | 0.904 | 0.606 | 410.5 | 23 |
| OS_CatBoost | 0.756 | 0.805 | 0.345 | 0.405 | 0.915 | 0.561 | 300.0 | 51 |
| NC_SVM | 0.753 | 0.766 | 0.526 | 0.464 | 0.823 | 0.658 | 444.0 | 14 |
| NC_LR | 0.759 | 0.743 | 0.643 | 0.491 | 0.767 | 0.702 | 525.0 | 2 |
| NC_MLP | 0.752 | 0.777 | 0.500 | 0.463 | 0.842 | 0.649 | 424.5 | 17 |
| NC_RF | 0.766 | 0.800 | 0.340 | 0.396 | 0.910 | 0.556 | 312.5 | 47 |
| NC_GBDT | 0.749 | 0.776 | 0.505 | 0.464 | 0.840 | 0.651 | 420.0 | 18 |
| NC_XGBoost | 0.742 | 0.787 | 0.443 | 0.445 | 0.869 | 0.621 | 348.5 | 36 |
| NC_LGBM | 0.769 | 0.770 | 0.557 | 0.482 | 0.820 | 0.676 | 543.5 | 1 |
| NC_CatBoost | 0.756 | 0.793 | 0.363 | 0.403 | 0.896 | 0.570 | 293.0 | 52 |
| S_SVM | 0.761 | 0.616 | 0.788 | 0.442 | 0.575 | 0.673 | 426.5 | 16 |
| S_LR | 0.760 | 0.530 | 0.863 | 0.414 | 0.451 | 0.624 | 330.0 | 40 |
| S_MLP | 0.752 | 0.552 | 0.837 | 0.419 | 0.485 | 0.637 | 315.5 | 43 |
| S_RF | 0.750 | 0.670 | 0.725 | 0.459 | 0.657 | 0.690 | 452.0 | 12 |
| S_GBDT | 0.713 | 0.787 | 0.368 | 0.399 | 0.887 | 0.571 | 190.5 | 70 |
| S_XGBoost | 0.719 | 0.656 | 0.643 | 0.419 | 0.660 | 0.651 | 257.0 | 60 |
| S_LGBM | 0.742 | 0.661 | 0.702 | 0.444 | 0.651 | 0.676 | 380.0 | 29 |
| S_CatBoost | 0.733 | 0.781 | 0.453 | 0.444 | 0.860 | 0.624 | 315.0 | 44 |
| SI_SVM | 0.762 | 0.610 | 0.793 | 0.439 | 0.567 | 0.670 | 432.0 | 15 |
| SI_LR | 0.762 | 0.524 | 0.863 | 0.411 | 0.443 | 0.618 | 338.0 | 38 |
| SI_MLP | 0.752 | 0.564 | 0.839 | 0.426 | 0.499 | 0.647 | 339.0 | 37 |
| SI_RF | 0.743 | 0.661 | 0.720 | 0.450 | 0.647 | 0.683 | 406.5 | 25 |
| SI_GBDT | 0.716 | 0.781 | 0.396 | 0.417 | 0.873 | 0.588 | 226.0 | 61 |
| SI_XGBoost | 0.717 | 0.649 | 0.663 | 0.421 | 0.645 | 0.654 | 260.0 | 59 |
| SI_LGBM | 0.741 | 0.674 | 0.676 | 0.444 | 0.674 | 0.675 | 374.0 | 32 |
| SI_CatBoost | 0.731 | 0.782 | 0.443 | 0.439 | 0.863 | 0.618 | 291.5 | 53 |
| SIC_SVM | 0.762 | 0.643 | 0.772 | 0.454 | 0.612 | 0.687 | 496.0 | 5 |
| SIC_LR | 0.762 | 0.553 | 0.832 | 0.417 | 0.486 | 0.636 | 362.0 | 34 |
| SIC_MLP | 0.754 | 0.582 | 0.803 | 0.426 | 0.530 | 0.652 | 351.5 | 35 |
| SIC_RF | 0.742 | 0.701 | 0.679 | 0.466 | 0.706 | 0.692 | 456.5 | 10 |
| SIC_GBDT | 0.720 | 0.791 | 0.358 | 0.397 | 0.894 | 0.566 | 184.0 | 71 |
| SIC_XGBoost | 0.740 | 0.694 | 0.658 | 0.453 | 0.702 | 0.680 | 394.5 | 28 |
| SIC_LGBM | 0.749 | 0.673 | 0.718 | 0.458 | 0.662 | 0.689 | 447.0 | 13 |
| SIC_CatBoost | 0.737 | 0.775 | 0.469 | 0.445 | 0.848 | 0.631 | 325.5 | 42 |
| SL_SVM | 0.761 | 0.648 | 0.744 | 0.448 | 0.625 | 0.682 | 456.0 | 11 |
| SL_LR | 0.762 | 0.592 | 0.814 | 0.434 | 0.539 | 0.662 | 419.0 | 20 |
| SL_MLP | 0.757 | 0.627 | 0.769 | 0.443 | 0.593 | 0.675 | 419.5 | 19 |
| SL_RF | 0.757 | 0.680 | 0.728 | 0.467 | 0.669 | 0.698 | 518.5 | 3 |
| SL_GBDT | 0.742 | 0.785 | 0.394 | 0.414 | 0.878 | 0.588 | 268.5 | 58 |
| SL_XGBoost | 0.707 | 0.628 | 0.676 | 0.412 | 0.617 | 0.646 | 224.0 | 63 |
| SL_LGBM | 0.742 | 0.668 | 0.707 | 0.451 | 0.659 | 0.683 | 410.5 | 23 |
| SL_CatBoost | 0.739 | 0.751 | 0.510 | 0.441 | 0.808 | 0.642 | 313.5 | 45 |
| KS_SVM | 0.736 | 0.761 | 0.415 | 0.401 | 0.844 | 0.591 | 220.0 | 65 |
| KS_LR | 0.744 | 0.724 | 0.601 | 0.456 | 0.753 | 0.673 | 412.0 | 22 |
| KS_MLP | 0.752 | 0.749 | 0.534 | 0.450 | 0.800 | 0.654 | 402.0 | 26 |
| KS_RF | 0.736 | 0.772 | 0.376 | 0.388 | 0.867 | 0.571 | 193.0 | 68 |
| KS_GBDT | 0.736 | 0.808 | 0.316 | 0.389 | 0.926 | 0.541 | 202.5 | 67 |
| KS_XGBoost | 0.731 | 0.725 | 0.544 | 0.432 | 0.768 | 0.646 | 287.5 | 56 |
| KS_LGBM | 0.753 | 0.779 | 0.474 | 0.452 | 0.852 | 0.635 | 397.0 | 27 |
| KS_CatBoost | 0.755 | 0.809 | 0.386 | 0.438 | 0.910 | 0.593 | 366.5 | 33 |
| MW_SVM | 0.761 | 0.669 | 0.736 | 0.461 | 0.653 | 0.693 | 505.0 | 4 |
| MW_LR | 0.762 | 0.565 | 0.824 | 0.422 | 0.504 | 0.644 | 378.5 | 31 |
| MW_MLP | 0.755 | 0.666 | 0.744 | 0.462 | 0.648 | 0.694 | 486.0 | 6 |
| MW_RF | 0.754 | 0.754 | 0.565 | 0.469 | 0.799 | 0.672 | 467.0 | 8 |
| MW_GBDT | 0.741 | 0.794 | 0.355 | 0.399 | 0.899 | 0.565 | 221.5 | 64 |
| MW_XGBoost | 0.730 | 0.706 | 0.593 | 0.437 | 0.732 | 0.659 | 308.5 | 48 |
| MW_LGBM | 0.745 | 0.741 | 0.549 | 0.449 | 0.786 | 0.657 | 379.0 | 30 |
| MW_CatBoost | 0.747 | 0.800 | 0.363 | 0.412 | 0.905 | 0.573 | 280.0 | 57 |

Note: OS = One-Sided Selection, NC = Neighborhood Cleaning Rule, S = SMOTE, SI = SMOTE-IPF, SIC = SMOTE-IPF (CatBoost), SL = Safe-Level-SMOTE, KS = KMeans-SMOTE, MW = MWMOTE.
Moreover, comparison diagrams of each evaluation metric for the same model under different resampling methods were plotted to investigate how the various resampling methods improved classification performance. As shown in Fig. 4, the AUC values of a given model were relatively concentrated across resampling methods. Since composite indexes are more reasonable criteria than any single index, only the two composite indexes F1 and G-mean were used as the main basis for judging joint model performance. The results indicated that the SVM model performed best under MWMOTE processing, while LR, MLP, RF, GBDT, XGBoost, LGBM, and CatBoost performed best under NC, MWMOTE, SL, NC, SIC, NC, and SL balancing, respectively, essentially matching the combined scores in Table 2. This suggests that the pairing of class-balancing method and classification algorithm on a given dataset is itself a factor affecting modelling performance and warrants further investigation.
In addition, each model's internal validation performance did not differ substantially from its external validation performance (see Figure S2 for details), indicating that the models' generalization performance is satisfactory.
To further improve model performance, we performed Voting ensembling on the top three models under each class-balancing method and assigned rankings in the same way to select the optimal model. As shown in Table 3 and Fig. 5, the joint MWMOTE-Voting model ranked first among all models, followed by the NC-Voting model. Most Voting-ensemble models ranked in the top 10, all heterogeneous-ensemble models ranked in the top 30 overall, and the heterogeneous ensembles outperformed their base classifiers on several metrics, indicating that further integration of models is one way to improve the predictive performance of dominant models.
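A top-3 Voting ensemble of the kind described above can be sketched with scikit-learn's `VotingClassifier`. The base estimators and the synthetic imbalanced dataset below are hypothetical stand-ins, not the paper's actual top-3 models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~80/20 split, like AECOPD vs regular COPD).
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Soft voting averages the base models' predicted probabilities.
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gbdt", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
vote.fit(X_tr, y_tr)
acc = vote.score(X_te, y_te)
```

In the study's setup, resampling would be applied to the training fold before fitting, and the combined rank score rather than accuracy would decide the winner.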
We further plotted the KS curves for each of the nine heterogeneous-ensemble models (Fig. 6). As Fig. 6 shows, the models' KS values are centered around 0.41, indicating adequate discrimination.
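The KS (Kolmogorov-Smirnov) statistic behind these curves is the maximum gap between the cumulative score distributions of positives and negatives, equivalently the maximum of |TPR - FPR| over all thresholds. A minimal sketch (helper name and toy scores are illustrative):

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """KS statistic: max |TPR - FPR| over all decision thresholds."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    order = np.argsort(-y_score)           # sort by score, descending
    y_sorted = y_true[order]
    tpr = np.cumsum(y_sorted == 1) / max((y_true == 1).sum(), 1)
    fpr = np.cumsum(y_sorted == 0) / max((y_true == 0).sum(), 1)
    return float(np.max(np.abs(tpr - fpr)))

# Perfectly separated scores give KS = 1; random scores give KS near 0.
y = np.array([1, 1, 0, 0])
ks = ks_statistic(y, np.array([0.9, 0.8, 0.2, 0.1]))
```

A KS around 0.41, as reported here, sits between these extremes and is conventionally read as adequate discrimination.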
Table 3
Comparison of model performance after Voting ensemble.

| Model | AUC | Accuracy | Sensitivity | F1 | Specificity | G-mean | Rank |
|---|---|---|---|---|---|---|---|
| Voting | 0.771 | 0.814 | 0.371 | 0.434 | 0.920 | 0.584 | 29 |
| OS_Voting | 0.769 | 0.808 | 0.404 | 0.448 | 0.904 | 0.605 | 18 |
| NC_Voting | 0.764 | 0.758 | 0.604 | 0.490 | 0.795 | 0.693 | 2 |
| S_Voting | 0.763 | 0.649 | 0.762 | 0.455 | 0.622 | 0.688 | 8 |
| SI_Voting | 0.761 | 0.645 | 0.762 | 0.452 | 0.617 | 0.685 | 13 |
| SIC_Voting | 0.763 | 0.664 | 0.749 | 0.462 | 0.643 | 0.694 | 4 |
| SL_Voting | 0.767 | 0.646 | 0.757 | 0.452 | 0.620 | 0.685 | 9 |
| KS_Voting | 0.761 | 0.760 | 0.529 | 0.458 | 0.815 | 0.656 | 17 |
| MW_Voting | 0.769 | 0.696 | 0.712 | 0.474 | 0.692 | 0.702 | 1 |
3.5 Visualization of Feature Importance
Figure 7(A) and Fig. 7(B) show the Shapley value plots. Figure 7(A) depicts the overall feature importance, i.e., the mean absolute contribution of each feature to the model's predictions. Figure 7(B) depicts the Shapley values for individual samples: the horizontal coordinate is the Shapley value, and the colors encode the feature values, with red dots representing high values and blue dots representing low values; the irregular overlapping of points reflects their dispersion.
As shown in Fig. 7(A-B), the most important risk factor for AECOPD in COPD patients was pneumonia: patients with pneumonia were more likely to develop AECOPD than COPD patients without pneumonia. OASIS came second, and the remaining factors, in order, were oral corticosteroid, calcium channel blocker, bicarbonate, beta blockers, inhaled corticosteroid, etc.
Figure 7 Interpretation of the LightGBM model. (A) SHAP overall feature importance chart; (B) distribution of per-sample Shapley values.
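In practice these plots are produced with the `shap` package's TreeExplainer on the fitted LightGBM model. To show what a Shapley value is without depending on that library, the self-contained sketch below computes exact Shapley values for a single prediction by averaging marginal contributions over all feature subsets (absent features are replaced by a baseline value); the function and toy linear model are illustrative assumptions.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values for one prediction of model f at point x,
    replacing 'absent' features by their baseline values."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                x_with, x_without = baseline.copy(), baseline.copy()
                for j in S:
                    x_with[j], x_without[j] = x[j], x[j]
                x_with[i] = x[i]                       # add feature i
                phi[i] += w * (f(x_with) - f(x_without))
    return phi

# Toy linear model: Shapley values reduce to w_i * (x_i - baseline_i),
# and they sum to f(x) - f(baseline) (the efficiency property).
w = np.array([2.0, -1.0, 0.5])
f = lambda v: float(w @ v)
x = np.array([1.0, 3.0, 2.0])
base = np.zeros(3)
phi = exact_shapley(f, x, base)
```

SHAP's TreeExplainer computes the same quantities efficiently for tree ensembles such as LightGBM instead of enumerating all \(2^n\) subsets.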