Baseline Characteristics
In the health checkup cohort of 40899 subjects, the mean (SD) ages for males and females were 47.4 (14.0) and 45.4 (13.6) years, respectively. By the end of the follow-up period, 4055 HUA cases (2770 males and 1285 females) had been diagnosed, giving an incidence rate of 99.15/1000 person-years. The baseline characteristics of the 36844 non-HUA subjects and the 4055 HUA subjects are shown in Table 1.
Table 1
Baseline characteristics of subjects in different groups.
| Characteristic | Non-HUA (N = 36844) | HUA (N = 4055) | P-value |
|---|---|---|---|
| Gender | | | |
| Female | 17088 (46.4%) | 1285 (31.7%) | < 0.001 |
| Male | 19756 (53.6%) | 2770 (68.3%) | |
| Age | 47.4 (14.0) | 45.4 (13.6) | < 0.001 |
| BMI | 24.1 (3.44) | 26.0 (3.50) | < 0.001 |
| SBP | 126 (17.7) | 129 (16.8) | < 0.001 |
| DBP | 76.5 (11.3) | 79.4 (11.3) | < 0.001 |
| ALT | 19.1 (23.9) | 25.9 (21.2) | < 0.001 |
| AST | 18.8 (11.3) | 21.2 (9.56) | < 0.001 |
| GGT | 24.1 (23.2) | 35.1 (33.9) | < 0.001 |
| TBil | 12.2 (5.59) | 12.9 (5.87) | < 0.001 |
| TP | 73.6 (3.97) | 74.6 (3.91) | < 0.001 |
| Alb | 47.1 (2.63) | 47.8 (2.68) | < 0.001 |
| BUN | 4.71 (1.26) | 5.02 (1.18) | < 0.001 |
| Cr | 70.9 (16.0) | 76.9 (13.4) | < 0.001 |
| EGFR | 106 (15.5) | 103 (15.1) | < 0.001 |
| TG | 1.30 (0.802) | 1.71 (1.06) | < 0.001 |
| TC | 4.78 (0.922) | 4.99 (0.939) | < 0.001 |
| HDL | 1.37 (0.303) | 1.28 (0.270) | < 0.001 |
| LDL | 2.68 (0.703) | 2.86 (0.731) | < 0.001 |
| FBG | 5.11 (1.25) | 5.12 (1.07) | 0.481 |
| WBC | 6.21 (1.54) | 6.65 (1.52) | < 0.001 |
| NEUT | 3.43 (1.13) | 3.65 (1.14) | < 0.001 |
| BUA | 297 (64.8) | 362 (52.3) | < 0.001 |
| Fatty_liver | | | |
| Non-Fatty_liver | 22102 (60.0%) | 1435 (35.4%) | < 0.001 |
| Fatty_liver | 14742 (40.0%) | 2620 (64.6%) | |
Feature Selection
Predictive features were filtered by LASSO regression, which selected 15 of the 23 candidate variables: age, gender, BMI, GGT, TBil, TP, BUN, Cr, EGFR, TG, TC, FBG, WBC, BUA, and fatty liver status, as shown in Fig. 1. The left panel is the LASSO coefficient path diagram, in which each curve traces the coefficient of one variable; variables whose coefficients reached zero first were excluded. The right panel is the feature importance diagram, which ranks the features by their coefficients to show how strongly each is related to the outcome.
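The shrink-to-zero behaviour behind the coefficient path can be illustrated with a minimal sketch. It assumes a squared-error LASSO fitted by cyclic coordinate descent on synthetic data; the study's outcome is binary and its actual software and settings are not stated here, so the function names, data, and penalty values below are all hypothetical.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes centred columns and a centred response, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j
            rho = 0.0
            for i in range(n):
                partial = y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * partial
            beta[j] = soft_threshold(rho / n, lam) / col_sq[j]
    return beta

def centre(values):
    """Subtract the mean so the values sum to zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]

# Synthetic data (not from the study): y depends strongly on x0,
# weakly on x1, and not at all on x2.
rng = random.Random(0)
n = 200
cols = [centre([rng.gauss(0, 1) for _ in range(n)]) for _ in range(3)]
X = [[cols[j][i] for j in range(3)] for i in range(n)]
y = centre([2.0 * X[i][0] + 0.5 * X[i][1] + rng.gauss(0, 0.5) for i in range(n)])

# Coefficient path: as the penalty lam grows, weaker features reach zero first.
for lam in (0.01, 0.1, 0.3, 1.0, 3.0):
    beta = lasso_coordinate_descent(X, y, lam)
    print(lam, [round(b, 3) for b in beta])
```

Printing the coefficients over a grid of penalties gives the numbers a path diagram would plot: each feature's coefficient is driven to exactly zero once the penalty exceeds its correlation with the remaining residual.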
Construction of Prediction Models
First, 14445 non-HUA subjects and 14185 HUA subjects were generated from the training set using the ROSE sampling method. Three individual machine learning models (SVM, C5.0, and XGBoost) were then trained, with hyperparameters tuned by grid search. Next, a gradient boosting machine (GBM) model was used to stack these three models into the final ensemble model. XGBoost had the largest relative influence in the ensemble model, as shown in Fig. 2A. The hyperparameter tuning processes of the component models XGBoost, C5.0, and SVM are shown in Figs. 2B, 2C, and 2D, respectively. The area under the receiver operating characteristic curve (AUC) showed an increasing trend with boosting iterations.
The AUCs for each of the 10 bootstrapped datasets are depicted in Fig. 3; they varied across subsets for the three machine learning models. The correlations between each pair of models were also examined and showed statistically significant differences, which indicated that each model captured distinct aspects of the data. Stacking the three models therefore had a good chance of improving predictive performance further.
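As a rough illustration of how an AUC can vary across bootstrapped datasets, the sketch below computes the AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one, then repeats this on 10 resamples. It is a simplification: it resamples a fixed set of toy predictions instead of refitting models on each resample, and the labels, scores, and function names are hypothetical.

```python
import random

def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_aucs(labels, scores, n_boot=10, seed=0):
    """AUC on n_boot resamples drawn with replacement from the same cases."""
    rng = random.Random(seed)
    n = len(labels)
    results = []
    while len(results) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            results.append(auc(ys, [scores[i] for i in idx]))
    return results

# Toy labels and predicted probabilities (not from the study)
labels = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.30, 0.55, 0.60, 0.15]

print("AUC on all cases:", auc(labels, scores))
print("Bootstrap AUCs:", [round(a, 3) for a in bootstrap_aucs(labels, scores)])
```

Running the same resampling for several models and comparing the resulting AUC vectors is one way to see whether the models behave differently on different subsets.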
Evaluation of Prediction Models
For ease of comparison, the ROC curves of the four models on the validation set are depicted in a single plot, as shown in Fig. 4A. The stacking ensemble model, with an AUC of 0.854, outperformed the other three models (SVM, C5.0, and XGBoost, with AUCs of 0.848, 0.851, and 0.849, respectively). The ensemble model was also better calibrated than the other three models, with fewer deviations from the diagonal, as shown in Fig. 4B. Other evaluation metrics, including accuracy, sensitivity, specificity, PPV, NPV, and F1 score, are presented in Table 2; the ensemble model scored highest on all of them except specificity, which further supported its superiority over the other three models.
Table 2
Other performance metrics of different models on the validation set.
| Model | Accuracy | Sensitivity | Specificity | Positive Predictive Value | Negative Predictive Value | F1 Score |
|---|---|---|---|---|---|---|
| SVM | 0.9230 | 0.8133 | 0.9340 | 0.5544 | 0.9802 | 0.6593 |
| C5.0 | 0.9283 | 0.8297 | 0.9382 | 0.5753 | 0.9820 | 0.6795 |
| XGBoost | 0.9245 | 0.8709 | 0.9299 | 0.5562 | 0.9862 | 0.6788 |
| Ensemble | 0.9307 | 0.8766 | 0.9361 | 0.5806 | 0.9868 | 0.6986 |
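
For readers less familiar with the metrics in Table 2, the sketch below shows how each is derived from a confusion matrix and how the F1 score follows from PPV and sensitivity. The confusion-matrix counts are hypothetical; only the SVM values passed to `f1_from_ppv_and_sensitivity` are taken from Table 2.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true cases that were flagged
    specificity = tn / (tn + fp)   # share of non-cases that were cleared
    ppv = tp / (tp + fp)           # share of flagged subjects who are cases
    npv = tn / (tn + fn)           # share of cleared subjects who are non-cases
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }

def f1_from_ppv_and_sensitivity(ppv, sensitivity):
    """F1 is the harmonic mean of PPV (precision) and sensitivity (recall)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)

# Hypothetical counts, chosen only to illustrate the formulas
for name, value in classification_metrics(tp=80, fp=60, tn=840, fn=20).items():
    print(f"{name}: {value:.4f}")

# SVM row of Table 2: PPV 0.5544 and sensitivity 0.8133
print(round(f1_from_ppv_and_sensitivity(0.5544, 0.8133), 4))  # 0.6593
```

Because HUA cases are a minority of the cohort, PPV and F1 stay moderate even when accuracy and NPV are high, which is the pattern seen in every row of the table.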
Ensemble Model Interpretation
To better illustrate our stacking ensemble model, the iBreakdown algorithm, which detects interactions, was used to generate subject-level explanations. The features contributing to the future development of HUA were estimated for six randomly selected subjects, which showed that BUA, gender, age, GGT, EGFR, BMI, TP, TG, and Cr were associated with an increased risk of developing HUA. Being female and relatively younger, together with having higher BUA, BMI, GGT, TP, TG, Cr, and FBG values, can increase the risk of developing HUA, as shown in Fig. 5.
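The idea behind a subject-level break-down explanation can be sketched as follows. This is a simplified version without the interaction detection that iBreakdown adds: features are fixed to the subject's values one at a time, and each feature's contribution is the resulting change in the mean prediction over a reference sample. The toy model, its coefficients, the feature order, and the data are hypothetical and are not the study's ensemble.

```python
import math

def toy_model(row):
    """Hypothetical risk model; the coefficients are illustrative only."""
    z = -12.0 + 0.025 * row["BUA"] + 0.12 * row["BMI"]
    return 1.0 / (1.0 + math.exp(-z))

def break_down(predict, background, instance, order):
    """Sequential break-down attributions for one subject.

    predict:    function mapping a feature dict to a predicted probability
    background: list of feature dicts used as the reference sample
    instance:   feature dict of the subject being explained
    order:      sequence in which features are fixed to the subject's values
    """
    current = [dict(row) for row in background]  # copy so the input is untouched
    baseline = sum(predict(r) for r in current) / len(current)
    contributions = {}
    previous = baseline
    for name in order:
        for r in current:
            r[name] = instance[name]
        mean_pred = sum(predict(r) for r in current) / len(current)
        contributions[name] = mean_pred - previous
        previous = mean_pred
    return baseline, contributions

# Hypothetical reference sample and subject
background = [
    {"BUA": 280, "BMI": 22},
    {"BUA": 300, "BMI": 24},
    {"BUA": 320, "BMI": 25},
    {"BUA": 350, "BMI": 27},
]
subject = {"BUA": 400, "BMI": 28}

baseline, contributions = break_down(toy_model, background, subject, ["BUA", "BMI"])
print("mean prediction:", round(baseline, 4))
for name, delta in contributions.items():
    print(name, round(delta, 4))
print("subject prediction:", round(toy_model(subject), 4))
```

Once every feature has been fixed, the baseline plus the contributions equals the model's prediction for that subject, which is what makes the per-feature bars in such a plot additive. In this simple version the contributions depend on the chosen feature order.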
Extra Validation of the Ensemble Model
To further validate our model’s applicability in the health checkup population, we used another cohort from the same hospital, enrolled over a different period (Jan 1, 2022, to May 31, 2023); its baseline characteristics are shown in Table 3. Of the 8559 subjects, 804 incident HUA cases had been diagnosed by the end of the follow-up period, giving an incidence rate of 93.94/1000 person-years. The stacking ensemble model, with an AUC of 0.846, outperformed the other three models (SVM, C5.0, and XGBoost, with AUCs of 0.839, 0.835, and 0.840, respectively), as shown in Fig. 6A. The calibration curves and other metrics, shown in Fig. 6B and Table 4, also indicated favorable performance of the ensemble model.
Table 3
Baseline characteristics of the extra-validation set in different groups.
| Characteristic | Non-HUA (N = 7755) | HUA (N = 804) | P-value |
|---|---|---|---|
| Gender | | | |
| Female | 3529 (45.5%) | 288 (35.8%) | < 0.001 |
| Male | 4226 (54.5%) | 516 (64.2%) | |
| Age | 50.1 (14.2) | 48.4 (14.3) | 0.00127 |
| BMI | 24.4 (3.37) | 26.3 (3.67) | < 0.001 |
| SBP | 129 (18.2) | 132 (17.3) | < 0.001 |
| DBP | 78.0 (11.5) | 80.8 (11.6) | < 0.001 |
| ALT | 18.7 (16.2) | 25.0 (32.6) | < 0.001 |
| AST | 18.7 (8.86) | 21.5 (21.0) | < 0.001 |
| GGT | 24.3 (25.6) | 34.9 (54.3) | < 0.001 |
| TBil | 12.2 (5.60) | 12.6 (5.52) | 0.0344 |
| TP | 72.6 (3.99) | 73.7 (3.83) | < 0.001 |
| Alb | 46.6 (2.56) | 47.3 (2.49) | < 0.001 |
| BUN | 4.79 (1.24) | 5.12 (1.26) | < 0.001 |
| Cr | 73.2 (14.9) | 78.1 (14.2) | < 0.001 |
| EGFR | 102 (16.0) | 99.0 (15.5) | < 0.001 |
| TG | 1.30 (0.782) | 1.71 (0.989) | < 0.001 |
| TC | 4.83 (0.927) | 5.06 (0.984) | < 0.001 |
| HDL | 1.35 (0.293) | 1.25 (0.277) | < 0.001 |
| LDL | 2.87 (0.760) | 3.08 (0.840) | < 0.001 |
| FBG | 5.24 (1.34) | 5.22 (1.13) | 0.735 |
| WBC | 6.20 (1.54) | 6.64 (1.54) | < 0.001 |
| NEUT | 3.39 (1.16) | 3.62 (1.15) | < 0.001 |
| BUA | 299 (65.3) | 362 (56.4) | < 0.001 |
| Fatty_liver | | | |
| Non-Fatty_liver | 4489 (57.9%) | 263 (32.7%) | < 0.001 |
| Fatty_liver | 3266 (42.1%) | 541 (67.3%) | |
Table 4
Other performance metrics of different models on the extra validation set.
| Model | Accuracy | Sensitivity | Specificity | Positive Predictive Value | Negative Predictive Value | F1 Score |
|---|---|---|---|---|---|---|
| SVM | 0.9059 | 0.8756 | 0.9090 | 0.4996 | 0.9860 | 0.6362 |
| C5.0 | 0.9095 | 0.8818 | 0.9123 | 0.5104 | 0.9868 | 0.6466 |
| XGBoost | 0.8996 | 0.8706 | 0.9026 | 0.4811 | 0.9854 | 0.6197 |
| Ensemble | 0.9096 | 0.8955 | 0.9110 | 0.5106 | 0.9883 | 0.6504 |
Clinical Use of the Ensemble Model
To facilitate the use of our ensemble model in clinical practice, we built a dynamic risk calculator for HUA, as shown in Fig. 7. To use the calculator, select or type the appropriate values into the corresponding fields and click “Submit” to obtain the probability of developing HUA in the future. To further assess the calculator’s clinical value, we performed decision curve analysis (DCA). The threshold probability is the predicted risk of HUA above which a clinician using the calculator would judge that a subject might benefit from intervention. The decision curve shows that using the calculator based on the ensemble model to predict the risk of HUA is clinically beneficial when the threshold probability ranges from around 10% to 80%, and that it is more advantageous than the other three models, as shown in Fig. 8.
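The quantity plotted on a decision curve is the net benefit, conventionally defined as TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The sketch below computes it for a model and for the "treat all" reference strategy, using toy labels and predicted probabilities that are not from the study; the function names are hypothetical.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(labels, threshold):
    """Net benefit of intervening on every subject regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy outcomes and predicted probabilities (not from the study)
labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
probs = [0.90, 0.70, 0.60, 0.20, 0.10, 0.40, 0.30, 0.05, 0.15, 0.25]

# "Treat none" always has a net benefit of 0, so it is the zero line.
for pt in (0.1, 0.2, 0.3, 0.5, 0.8):
    model_nb = net_benefit(labels, probs, pt)
    all_nb = net_benefit_treat_all(labels, pt)
    print(f"pt={pt:.1f}  model={model_nb:.3f}  treat_all={all_nb:.3f}")
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both the "treat all" value and zero, which is how a beneficial threshold range is read off a decision curve.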