Ensemble Machine Learning Prediction of Hyperuricemia Based on a Prospective Health Checkup Population

doi:10.21203/rs.3.rs-3287684/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Objectives

An accurate prediction model for hyperuricemia (HUA) is urgently needed. This study aimed to develop a stacking ensemble prediction model for the risk of hyperuricemia and to identify the contributing risk factors.

Methods

A prospective health checkup cohort of 40899 subjects was randomly divided into training and validation sets at a ratio of 7:3, and the ROSE sampling technique was then used to handle class imbalance. LASSO regression was employed to select important predictive features. A stacking ensemble model was constructed from three individual models: Support Vector Machine (SVM), Decision Tree C5.0 (C5.0), and eXtreme Gradient Boosting (XGBoost). The models were validated using the area under the receiver operating characteristic curve (AUC), the calibration curve, and metrics including accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score on both the validation set and the extra-validation set. The iBreakdown algorithm was used to interpret the otherwise black-box ensemble model and to identify contributing risk factors.

Results

Fifteen important features were selected from 23 clinical variables. The stacking ensemble model, with an AUC of 0.854, outperformed its three component models, SVM, C5.0, and XGBoost, which had AUCs of 0.848, 0.851, and 0.849, respectively. Calibration, together with accuracy, sensitivity, PPV, NPV, and F1 score, also favored the ensemble model over the three component models; only C5.0 had a marginally higher specificity. The contributing risk factors were estimated from six randomly selected subjects, which suggested that being female and relatively younger, together with higher BUA, BMI, GGT, TP, TG, Cr, and FBG values, can increase the risk of HUA. To further validate the model’s applicability in the health checkup population, we used another cohort of 8559 subjects, in which the ensemble model also performed favorably, with an AUC of 0.846.

Conclusions

In this study, a stacking ensemble prediction model for the risk of HUA was developed. It outperformed the individual machine-learning models that compose it, and the contributing risk factors were identified.

Keywords: hyperuricemia, prediction model, machine learning, stacking ensemble, risk factors

Introduction

Hyperuricemia (HUA) is a disease characterized by elevated blood uric acid due to disorders of purine metabolism and/or impaired uric acid excretion. In recent years, the global prevalence of HUA has gradually increased, and a meta-analysis showed that the prevalence of HUA in China has reached 13.30%, corresponding to approximately 177 million people [1]. A large body of evidence indicates that HUA is closely related to the development of cardiovascular diseases and is an independent risk factor for myocardial infarction, stroke, and other diseases [2]. HUA usually has an insidious onset; once symptoms appear, it has often progressed to gout, and it is frequently complicated by diabetes, hypertension, obesity, and other diseases [3]. It has therefore gradually become a serious public health problem.

Machine learning is a type of artificial intelligence technology that enables computers to automatically extract useful information and knowledge from large amounts of data and to make predictions and decisions. It has a wide range of applications, such as natural language processing, image recognition, financial risk prediction, and medical diagnosis. Ensemble learning is a machine learning strategy that leverages multiple models to enhance prediction accuracy: the predictions of several models are combined to generate a more robust and accurate forecast. There are three main types of ensemble learning algorithms, bagging, boosting, and stacking, each with its own way of combining models [4, 5]. Stacking involves training multiple models with different algorithms on the same dataset. The predictions generated by these models are then combined by a second-level model, known as the meta-learner, to produce a more accurate and robust prediction. In this approach, the individual models are referred to as first-level learners, while the meta-learner learns how to optimally combine their outputs [6]. We aimed to use the stacking ensemble technique to build an HUA risk prediction model, integrating the results of Support Vector Machine (SVM), Decision Tree C5.0 (C5.0), and eXtreme Gradient Boosting (XGBoost) models to improve the final performance.
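
The two-level structure described above can be illustrated with a minimal, standard-library Python sketch. It is not the model built in this study: the data are synthetic, the three first-level learners are single-feature logistic models standing in for SVM, C5.0, and XGBoost, the meta-learner is a logistic regression, and a simple hold-out split (sometimes called blending) replaces resampling. All names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent; returns [bias, w1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_logistic(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def make_data(n, rng):
    """Synthetic binary outcome driven by three features plus noise."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        X.append(x)
        y.append(1 if sum(x) + rng.gauss(0, 0.5) > 0 else 0)
    return X, y

rng = random.Random(0)
X_base, y_base = make_data(200, rng)  # trains the first-level learners
X_meta, y_meta = make_data(200, rng)  # trains the meta-learner
X_test, y_test = make_data(200, rng)  # held out for evaluation

# First-level learners: each one sees a single feature only.
base_models = [fit_logistic([[x[j]] for x in X_base], y_base) for j in range(3)]

def first_level_outputs(x):
    return [predict_logistic(w, [x[j]]) for j, w in enumerate(base_models)]

# Meta-learner: trained on the first-level predictions, not on the raw features.
meta_model = fit_logistic([first_level_outputs(x) for x in X_meta], y_meta)

def stacked_predict(x):
    return predict_logistic(meta_model, first_level_outputs(x))

stacked_accuracy = sum(
    (stacked_predict(x) >= 0.5) == bool(t) for x, t in zip(X_test, y_test)
) / len(y_test)
print(f"stacked accuracy on held-out data: {stacked_accuracy:.3f}")
```

The point of the sketch is that the meta-learner never sees the raw features, only the outputs of the first-level learners, which is what distinguishes stacking from simply fitting one larger model.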

Thus far, various studies worldwide have identified different risk factors associated with the occurrence of HUA, such as age, gender, hypertension, obesity, hypercholesterolemia, hypertriglyceridemia, and impaired fasting glucose [1, 7–12]. Moreover, several prediction models for HUA have been developed. Cao et al. created two Cox regression models, one for males and one for females, using routine health checkup data from urban Han Chinese adults [13]. Zeng et al. developed an artificial neural network prediction model based on dietary risk factors in Chinese adults [14]. Lee et al. explored multiple machine learning algorithms to predict HUA status in Korean individuals over the age of 40 [15]. Huang et al. developed an HUA risk model for diabetic kidney disease patients based on a retrospective study [16]. Gao et al. developed a random forest prediction model for HUA based on a Chinese basic health checkup population [17]. Zheng et al. developed an HUA prediction model for steelworkers using an occupational health examination dataset [18]. Chen et al. established a simple risk prediction model for HUA in Chinese adults using three individual machine learning techniques [19]. However, these studies did not explore enough predictive variables, their predictive accuracy and discrimination were not satisfactory in their validation sets, and they lacked practical clinical applications. Therefore, it is necessary to develop a more accurate prediction model for the risk of HUA using the ensemble strategy, together with an easy-to-use risk calculator, in order to improve model performance and enhance HUA prediction in real clinical settings.

Methods

Study Design and Participants

This prospective cohort study was based on the database of a large longitudinal health checkup cohort at the First Affiliated Hospital of Shandong First Medical University and was approved by the Ethics Committee of this hospital. Subjects without HUA at their first checkup in 2021 and without any missing variables were enrolled. All subjects were followed up for one year, and their HUA status was assessed at the end of follow-up in 2022.

Data Collection and Preprocessing

By reviewing previous studies, we identified 23 variables from routine health checkup data that are possibly associated with HUA: age, gender, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), γ-glutamyl transpeptidase (GGT), total bilirubin (TBil), total protein (TP), albumin (Alb), blood urea nitrogen (BUN), creatinine (Cr), estimated glomerular filtration rate (EGFR), triglycerides (TG), total cholesterol (TC), high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), fasting blood glucose (FBG), white blood cell count (WBC), neutrophil count (NEUT), baseline uric acid (BUA), and fatty liver status. BMI was calculated as weight (kg) divided by the square of height (m²). SBP and DBP were measured on the right upper arm after the subjects had rested in a seated position for 5 min. After a 12-hour fasting period, peripheral blood samples were collected in the morning to measure the following blood variables: ALT, AST, GGT, TBil, TP, Alb, BUN, Cr, EGFR, TG, TC, HDL, LDL, FBG, WBC, NEUT, and BUA. All laboratory tests were performed by laboratory specialists following standard protocols at the Department of Laboratory. Fatty liver status was diagnosed by certified imaging physicians through abdominal ultrasound examination. The diagnostic threshold for HUA was set at a serum uric acid level of 420 μmol/L for males and 360 μmol/L for females [20].
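
The two derived quantities defined above, BMI and HUA status, can be written down as a short Python sketch. The function names are hypothetical, and the sketch assumes that a value strictly above the sex-specific threshold counts as HUA, which the text does not state explicitly.

```python
# Sex-specific serum uric acid thresholds in μmol/L, as given in the text.
HUA_THRESHOLD_UMOL_L = {"male": 420.0, "female": 360.0}

def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2

def has_hua(serum_uric_acid_umol_l, sex):
    """Assumption: HUA means a value strictly above the sex-specific threshold."""
    return serum_uric_acid_umol_l > HUA_THRESHOLD_UMOL_L[sex]

print(round(bmi(70.0, 1.75), 1))   # 22.9
print(has_hua(400.0, "female"))    # True
print(has_hua(400.0, "male"))      # False
```

The last two lines show the consequence of the sex-specific definition: the same uric acid level of 400 μmol/L is classified as HUA in a female but not in a male.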

Statistical Analysis

Descriptive analysis of the baseline characteristics was performed. Statistical significance for quantitative data was evaluated using Student’s t-test or the nonparametric Wilcoxon test, and the Chi-square test was employed for qualitative data.

The prediction models were constructed and evaluated as follows. First, LASSO regression, which shrinks the coefficients of uninformative variables to zero through an L1 penalty, was used for feature selection [21, 22]; it retained 15 of the 23 clinical variables. Next, the final dataset was randomly divided into a training set comprising 70% of the subjects and a validation set comprising the remaining 30% [23, 24]. To handle the imbalance between the observed classes and obtain a stable prediction model, the ROSE sampling technique from the R ROSE package was applied to the training set [25]; it down-sampled the majority class and synthesized new data in the minority class. The models were then trained with the R caretEnsemble package. SVM, C5.0, XGBoost, and the stacking ensemble of these three models were developed on the training set using the 15 selected features.

We then conducted internal validation on the validation set and estimated the area under the receiver operating characteristic curve (AUC) together with accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score. The AUC and these classification metrics assess discrimination, that is, the ability of a model to distinguish individuals who develop the disease from those who do not. The calibration curve of each model was also plotted to assess the agreement between predicted and observed risk. Furthermore, the iBreakdown algorithm was used to interpret the otherwise black-box ensemble model [26] and to identify contributing risk factors. Lastly, we developed a dynamic risk calculator based on the R shiny package for ease of clinical use and further assessed its validity.
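
For reference, the threshold-based metrics named above all follow from the four cells of a confusion matrix. The Python sketch below uses the standard definitions; the counts in the example are hypothetical and are not taken from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (HUA = positive class)."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }

# Hypothetical counts for an imbalanced validation set (10% positives).
example = classification_metrics(tp=80, fp=60, tn=840, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With only 10% positives in this hypothetical example, the PPV (0.5714) is far lower than the sensitivity (0.8000) and specificity (0.9333). This pattern is typical of imbalanced outcomes and is why PPV and F1 score are worth reporting alongside accuracy.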

All statistical tests were two-sided with a type I error of 0.05, and p-values < 0.05 were considered statistically significant. Statistical analysis was carried out using R version 4.2.2 and Python version 3.10.8.

Results

Baseline Characteristics

In the health checkup cohort of 40899 subjects, the mean (SD) ages of the non-HUA and HUA groups were 47.4 (14.0) and 45.4 (13.6) years, respectively. At the end of the follow-up period, 4055 HUA cases (2770 males and 1285 females) were diagnosed, resulting in an incidence rate of 99.15/1000 person-years. The baseline characteristics of the 36844 non-HUA subjects and 4055 HUA subjects are shown in Table 1.

Table 1

Baseline characteristics of subjects in different groups. Values are n (%) for categorical variables and mean (SD) for continuous variables.
	Non-HUA (N = 36844)	HUA (N = 4055)	P-value
Gender
Female	17088 (46.4%)	1285 (31.7%)	< 0.001
Male	19756 (53.6%)	2770 (68.3%)
Age	47.4 (14.0)	45.4 (13.6)	< 0.001
BMI	24.1 (3.44)	26.0 (3.50)	< 0.001
SBP	126 (17.7)	129 (16.8)	< 0.001
DBP	76.5 (11.3)	79.4 (11.3)	< 0.001
ALT	19.1 (23.9)	25.9 (21.2)	< 0.001
AST	18.8 (11.3)	21.2 (9.56)	< 0.001
GGT	24.1 (23.2)	35.1 (33.9)	< 0.001
TBil	12.2 (5.59)	12.9 (5.87)	< 0.001
TP	73.6 (3.97)	74.6 (3.91)	< 0.001
Alb	47.1 (2.63)	47.8 (2.68)	< 0.001
BUN	4.71 (1.26)	5.02 (1.18)	< 0.001
Cr	70.9 (16.0)	76.9 (13.4)	< 0.001
EGFR	106 (15.5)	103 (15.1)	< 0.001
TG	1.30 (0.802)	1.71 (1.06)	< 0.001
TC	4.78 (0.922)	4.99 (0.939)	< 0.001
HDL	1.37 (0.303)	1.28 (0.270)	< 0.001
LDL	2.68 (0.703)	2.86 (0.731)	< 0.001
FBG	5.11 (1.25)	5.12 (1.07)	0.481
WBC	6.21 (1.54)	6.65 (1.52)	< 0.001
NEUT	3.43 (1.13)	3.65 (1.14)	< 0.001
BUA	297 (64.8)	362 (52.3)	< 0.001
Fatty_liver
Non-Fatty_liver	22102 (60.0%)	1435 (35.4%)	< 0.001
Fatty_liver	14742 (40.0%)	2620 (64.6%)
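
The incidence rate reported for this cohort can be reproduced from the case and subject counts. The sketch below assumes that every subject contributes exactly one person-year of follow-up, which matches the one-year design but ignores any variation in individual follow-up time; the function name is hypothetical.

```python
def incidence_per_1000_person_years(cases, subjects, follow_up_years=1.0):
    """Incidence rate, assuming equal follow-up time for every subject."""
    return 1000.0 * cases / (subjects * follow_up_years)

main_cohort_rate = incidence_per_1000_person_years(cases=4055, subjects=40899)
print(round(main_cohort_rate, 2))  # 99.15
```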

Feature Selection

Predictive features were selected by LASSO regression, and 15 of the 23 variables were retained: age, gender, BMI, GGT, TBil, TP, BUN, Cr, EGFR, TG, TC, FBG, WBC, BUA, and fatty liver status, as shown in Fig. 1. The left panel is the LASSO coefficient path diagram, in which each curve represents the trajectory of the coefficient of one variable; the variables whose coefficients reached zero first were excluded. The right panel is the feature importance diagram, which ranks the features by their coefficients to show how strongly each is related to the outcome.
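
The penalty behind this selection can be written out explicitly. The formulation below is the standard L1-penalized logistic regression objective as implemented in coordinate-descent software such as that of reference [22]. It assumes a binary logistic model, which the text does not state, with $y_i \in \{0, 1\}$ the HUA outcome of subject $i$, $x_i$ the vector of $p = 23$ candidate variables, and $\lambda \ge 0$ the penalty strength.

```latex
\hat{\beta}(\lambda) =
  \arg\min_{\beta_0,\,\beta}
  \left\{
    -\frac{1}{n} \sum_{i=1}^{n}
      \left[ y_i \left( \beta_0 + x_i^{\top} \beta \right)
             - \log\!\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right) \right]
    + \lambda \sum_{j=1}^{p} \left| \beta_j \right|
  \right\}
```

As $\lambda$ increases, more coefficients $\beta_j$ are driven exactly to zero, which is what the coefficient path diagram traces; the variables with non-zero coefficients at the chosen $\lambda$ are the ones retained.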

Construction of Prediction Models

First, 14445 non-HUA subjects and 14185 HUA subjects were generated from the training set using the ROSE sampling method, and the three individual machine learning models, SVM, C5.0, and XGBoost, were trained using a grid search strategy for hyperparameter tuning. A gradient boosting machine (GBM) model was then applied as the meta-learner to stack these three individual models into the final ensemble model. XGBoost had the largest relative influence in the ensemble model, as shown in Fig. 2A. The hyperparameter tuning processes of the component models, XGBoost, C5.0, and SVM, are shown in Figs. 2B, 2C, and 2D, respectively. The AUC showed an increasing trend with the number of boosting iterations.

The AUC of each model on each of the 10 bootstrapped datasets was obtained, as depicted in Fig. 3; the values varied across resamples for all three machine learning models. The correlations between each pair of models were also examined. The models differed significantly from one another, which indicated that each model captured distinct aspects of the data. Stacking these three machine learning models therefore had a good chance of improving predictive performance further.

Evaluation of Prediction Models

For ease of comparison, the ROC curves of the four models on the validation set were depicted in a single plot, as shown in Fig. 4A. The stacking ensemble model, with an AUC of 0.854, outperformed the other three models, SVM, C5.0, and XGBoost, which had AUCs of 0.848, 0.851, and 0.849, respectively. Moreover, the ensemble model was better calibrated than the other three models, with fewer deviations from the diagonal, as shown in Fig. 4B. Other metrics are presented in Table 2: the ensemble model had the highest accuracy, sensitivity, PPV, NPV, and F1 score, while its specificity (0.9361) was slightly lower than that of C5.0 (0.9382).

Table 2

Other performance metrics of different models on the validation set.
Model	Accuracy	Sensitivity	Specificity	Positive Predictive Value	Negative Predictive Value	F1 Score
SVM	0.9230	0.8133	0.9340	0.5544	0.9802	0.6593
C5.0	0.9283	0.8297	0.9382	0.5753	0.9820	0.6795
XGBoost	0.9245	0.8709	0.9299	0.5562	0.9862	0.6788
Ensemble	0.9307	0.8766	0.9361	0.5806	0.9868	0.6986
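
The AUC used to compare the models can be read as the probability that a randomly chosen HUA case receives a higher predicted risk than a randomly chosen non-case, with ties counted as one half. The Python sketch below computes it directly from that definition. The scores and labels are hypothetical, and the pairwise loop is meant for illustration rather than for cohorts of this size.

```python
def auc(scores, labels):
    """AUC as the fraction of (case, non-case) pairs ranked correctly."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    non_cases = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(cases) * len(non_cases))

# Hypothetical predicted risks and observed outcomes for five subjects.
example_auc = auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 0, 1, 0, 0])
print(round(example_auc, 4))  # 0.8333
```

Because the AUC depends only on the ranking of predicted risks, it does not change with the classification cut-off, unlike the metrics in Table 2.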

Ensemble Model Interpretation

To better interpret our stacking ensemble model, the iBreakdown algorithm, which also detects interactions between features, was used to produce subject-level explanations. The features contributing to the future development of HUA were estimated using six randomly selected subjects, which showed that BUA, gender, age, GGT, EGFR, BMI, TP, TG, and Cr were associated with an increased risk of developing HUA. Being female and relatively younger, together with having higher BUA, BMI, GGT, TP, TG, Cr, and FBG values, can increase the risk of developing HUA, as shown in Fig. 5.
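
The idea behind such subject-level explanations can be sketched in a few lines of Python. The sketch implements only the basic sequential break-down step: features are fixed to the subject's values one at a time, and the change in the mean prediction over a background dataset is attributed to each feature. It does not implement the interaction detection that iBreakdown adds, and the toy risk function, feature values, and names are hypothetical.

```python
def mean_prediction(model, rows):
    return sum(model(r) for r in rows) / len(rows)

def break_down(model, background, instance, order):
    """Sequential break-down attributions (no interaction detection)."""
    rows = [dict(r) for r in background]      # work on copies
    baseline = mean_prediction(model, rows)   # average prediction
    contributions = {}
    previous = baseline
    for feature in order:
        for r in rows:
            r[feature] = instance[feature]    # fix feature to the subject's value
        current = mean_prediction(model, rows)
        contributions[feature] = current - previous
        previous = current
    return baseline, contributions

# Hypothetical toy model: risk increases with BUA and BMI.
def toy_risk(row):
    return 0.05 + 0.001 * (row["BUA"] - 250) + 0.01 * (row["BMI"] - 20)

background = [
    {"BUA": 250, "BMI": 22},
    {"BUA": 300, "BMI": 24},
    {"BUA": 350, "BMI": 26},
]
subject = {"BUA": 380, "BMI": 28}

baseline, contributions = break_down(toy_risk, background, subject, ["BUA", "BMI"])
print(round(baseline, 2), {k: round(v, 2) for k, v in contributions.items()})
```

By construction, the baseline plus the sum of the contributions equals the model's prediction for the subject, so each bar in a break-down plot can be read as a step from the average prediction toward the individual one. For models with interactions, the attributions depend on the order in which features are fixed, which is the issue that interaction-aware methods such as iBreakdown address.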

Extra Validation of the Ensemble Model

To further validate our model’s applicability in the health checkup population, we used another cohort from the same hospital, enrolled over a different timespan, from Jan 1, 2022, to May 31, 2023; its baseline characteristics are shown in Table 3. At the end of the follow-up period, 804 incident HUA cases were diagnosed among the 8559 subjects, resulting in an incidence rate of 93.94/1000 person-years. The stacking ensemble model, with an AUC of 0.846, outperformed the other three models, SVM, C5.0, and XGBoost, which had AUCs of 0.839, 0.835, and 0.840, respectively, as shown in Fig. 6A. The calibration curves and other metrics also showed that the ensemble model performed favorably, as shown in Fig. 6B and Table 4.

Table 3

Baseline characteristics of the extra-validation set in different groups. Values are n (%) for categorical variables and mean (SD) for continuous variables.
	Non-HUA (N = 7755)	HUA (N = 804)	P-value
Gender
Female	3529 (45.5%)	288 (35.8%)	< 0.001
Male	4226 (54.5%)	516 (64.2%)
Age	50.1 (14.2)	48.4 (14.3)	0.00127
BMI	24.4 (3.37)	26.3 (3.67)	< 0.001
SBP	129 (18.2)	132 (17.3)	< 0.001
DBP	78.0 (11.5)	80.8 (11.6)	< 0.001
ALT	18.7 (16.2)	25.0 (32.6)	< 0.001
AST	18.7 (8.86)	21.5 (21.0)	< 0.001
GGT	24.3 (25.6)	34.9 (54.3)	< 0.001
TBil	12.2 (5.60)	12.6 (5.52)	0.0344
TP	72.6 (3.99)	73.7 (3.83)	< 0.001
Alb	46.6 (2.56)	47.3 (2.49)	< 0.001
BUN	4.79 (1.24)	5.12 (1.26)	< 0.001
Cr	73.2 (14.9)	78.1 (14.2)	< 0.001
EGFR	102 (16.0)	99.0 (15.5)	< 0.001
TG	1.30 (0.782)	1.71 (0.989)	< 0.001
TC	4.83 (0.927)	5.06 (0.984)	< 0.001
HDL	1.35 (0.293)	1.25 (0.277)	< 0.001
LDL	2.87 (0.760)	3.08 (0.840)	< 0.001
FBG	5.24 (1.34)	5.22 (1.13)	0.735
WBC	6.20 (1.54)	6.64 (1.54)	< 0.001
NEUT	3.39 (1.16)	3.62 (1.15)	< 0.001
BUA	299 (65.3)	362 (56.4)	< 0.001
Fatty_liver
Non-Fatty_liver	4489 (57.9%)	263 (32.7%)	< 0.001
Fatty_liver	3266 (42.1%)	541 (67.3%)

Table 4

Other performance metrics of different models on the extra validation set.
Model	Accuracy	Sensitivity	Specificity	Positive Predictive Value	Negative Predictive Value	F1 Score
SVM	0.9059	0.8756	0.9090	0.4996	0.9860	0.6362
C5.0	0.9095	0.8818	0.9123	0.5104	0.9868	0.6466
XGBoost	0.8996	0.8706	0.9026	0.4811	0.9854	0.6197
Ensemble	0.9096	0.8955	0.9110	0.5106	0.9883	0.6504

Clinical Use of the Ensemble Model

To facilitate the use of our ensemble model in clinical practice, we built a dynamic risk calculator for HUA, as shown in Fig. 7. To use the calculator, select or type in the values for the corresponding options and click “Submit” to obtain the probability of developing HUA in the future. To further support the calculator’s value, its validity was analyzed using decision curve analysis (DCA). The threshold probability is the predicted risk of HUA at or above which a clinician would judge that a subject might benefit from intervention. The decision curve shows that using the calculator based on the ensemble model to predict the risk of HUA is clinically beneficial for threshold probabilities ranging from around 10% to 80%, and that it is more advantageous than the other three models, as shown in Fig. 8.
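
The quantity plotted in a decision curve is the net benefit at each threshold probability, which weighs true positives against false positives at the exchange rate implied by the threshold. The Python sketch below uses the standard net-benefit formula. The predicted risks and outcomes are hypothetical, and the function names are not taken from any package used in this study.

```python
def net_benefit(probabilities, labels, threshold):
    """Net benefit of intervening on subjects with predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(labels, threshold):
    """Reference strategy: intervene on everyone regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Hypothetical predicted risks and observed outcomes for ten subjects.
risks = [0.9, 0.7, 0.6, 0.3, 0.2, 0.1, 0.05, 0.4, 0.8, 0.15]
outcomes = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]

model_nb = net_benefit(risks, outcomes, threshold=0.2)
treat_all_nb = net_benefit_treat_all(outcomes, threshold=0.2)
print(round(model_nb, 3), round(treat_all_nb, 3))  # 0.2 0.125
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero by definition.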

Discussion

In this study, a stacking ensemble prediction model for the risk of HUA was developed using data obtained from a prospective health checkup population. Ensemble learning is a machine learning approach that attempts to improve prediction performance by blending predictions from different models, with the aim of reducing generalization error [27]. The main purpose of an ensemble model is to improve accuracy by combining several weak learners into one powerful learner [28]. Stacking is an ensemble method in which multiple first-level classifiers trained on the same dataset are combined by a meta-learner to improve model predictions [29]. Verma et al. proposed a new technique using six different machine learning models and then developed a stacking ensemble model that improved the performance of skin disease prediction, with a final accuracy of 99.67% [30]. Pal and Roy used different first-level learners to build ensemble models and reported that the stacking model could reach 100% accuracy in their prediction task [31]. Abdollahi and Nouri-Moghaddam used the stacking ensemble method to predict diabetes and achieved 98.8% accuracy in disease diagnosis [32]. These studies illustrate the advantages of the stacking ensemble strategy.

LASSO regression was used in our study to select important predictive features, and 15 of the 23 variables were eventually retained. LASSO regression is a variable selection algorithm suited to high-dimensional data; it simplifies the model by applying a penalty function that removes uninformative predictors. Compared with the traditional stepwise regression method, LASSO regression has the advantage of considering all independent variables simultaneously, which helps to address model overfitting and enhance model stability.

Our ensemble model was built on the 15 selected features and demonstrated favorable performance, with AUCs of 0.854 and 0.846 in the validation and extra-validation sets, respectively, which outperformed the three individual SVM, C5.0, and XGBoost models as well as the HUA prediction models mentioned above [7–10, 13, 17–19, 33]. Other metrics, including accuracy, sensitivity, PPV, NPV, F1 score, and calibration, likewise favored our model, although C5.0 had slightly higher specificity (Tables 2 and 4), making it a powerful tool for HUA prediction. Our ensemble model can be used by clinicians to identify individuals at risk of developing HUA in the health checkup population. Individuals at high risk can thus be encouraged to pay extra attention to their health status and to correct their existing risk factors appropriately, which could help prevent HUA and, in turn, improve their quality of life.

Our findings are consistent with the risk factors for HUA reported in established studies. Six randomly selected subjects were analyzed using the iBreakdown algorithm, which found that BUA, gender, age, GGT, EGFR, BMI, TP, TG, and Cr were associated with an increased risk of HUA. Chang et al. and Cao et al. both confirmed that age and gender are very important factors in the development of HUA [13, 34]. Age is a complex influencing factor because the amount of uric acid produced varies with age. In our study, being relatively younger increased the risk of developing HUA. The two studies mentioned above also showed that uric acid levels in males and females peak in their 20s or so and then decline with aging. Relatively younger people tend to have higher physical activity intensity, higher metabolic levels, and different dietary habits from elderly people, which might lead them to produce more uric acid and thereby increase the risk of developing HUA. We also found that being female increased the risk of developing HUA, which may seem counterintuitive. Because the diagnostic thresholds for HUA differ by gender, a female with a relatively low uric acid level may be diagnosed with HUA, whereas a male must have a considerably higher level to receive the same diagnosis; this may explain why being female appears as a risk factor for HUA in our model. To clarify the exact influence of gender on the risk of HUA, smoking, drinking, physical activity, and dietary habits need to be included in the analysis. Several other studies conducted in different countries have demonstrated significant associations between HUA and BMI, TP, and TG levels [8–12, 33]. Other studies have shown that smoking, alcohol consumption, and a sedentary lifestyle, which our study did not cover, can contribute to the development of HUA [35–38]. Besides the indicators examined in previous studies, we found that relatively higher GGT and FBG values can increase the risk of HUA.

Our study has several advantages. Firstly, this cohort study included a large sample, which can reduce the risk of bias. Secondly, the stacking ensemble strategy was employed, which brought high predictive performance with fair robustness. Thirdly, we developed a dynamic risk calculator to predict the risk of HUA. The calculator is clear and intuitive and could be used to quickly identify individuals at high risk of HUA.

Our study also has certain limitations. Firstly, our results were all based on one-time measurements, which may not reflect the status of the subjects accurately and may overestimate the incidence rate of HUA. Secondly, our HUA risk prediction model was extra-validated using data from the same hospital in a different timespan, whereas validation data from other centers are still needed. Thirdly, more variables, such as smoking, drinking, and dietary habits, need to be included in our analysis. In the future, we will cooperate with multiple centers to obtain more external datasets and add more variables to improve our HUA risk prediction model.

Conclusions

In conclusion, we have developed and validated a stacking ensemble prediction model for the risk of HUA using a prospective health checkup population and identified the main contributing risk factors associated with HUA. For ease of clinical practice, we developed a dynamic risk calculator and assessed its validity. To the best of our knowledge, the ensemble model performed better than the three machine learning models that compose it and than the existing HUA prediction models. The ensemble model could help to identify groups at high risk of HUA and encourage them to pay attention to their health status and unhealthy lifestyles, and ultimately to prevent HUA and its complications.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Authors’ contributions

GZ conceived and designed the research. YZ performed the statistical analysis and drafted the manuscript. HL and DL revised the manuscript and took part in the discussion. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by grants from the Natural Science Foundation of Shandong Province (ZR2020MF026) and the cultivation Foundation of National Natural Science Foundation of Shandong Provincial Qianfoshan Hospital (QYPY2020NSFC0603).

Ethics approval and consent to participate

This work was approved by the Ethics Committee of the First Affiliated Hospital of Shandong First Medical University (2021S128) and written informed consent was obtained from all participants.

Availability of data and materials

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

References

1. Liu R, Han C, Wu D, Xia X, Gu J, Guan H, Shan Z, Teng W: Prevalence of Hyperuricemia and Gout in Mainland China from 2000 to 2014: A Systematic Review and Meta-Analysis. Biomed Res Int 2015, 2015:762820.
2. Maloberti A, Giannattasio C, Bombelli M, Desideri G, Cicero AFG, Muiesan ML, Rosei EA, Salvetti M, Ungar A, Rivasi G et al: Hyperuricemia and Risk of Cardiovascular Outcomes: The Experience of the URRAH (Uric Acid Right for Heart Health) Project. High Blood Press Cardiovasc Prev 2020, 27(2):121-128.
3. Wang LM, Deng Q, Wang LH: The Prevalence and Risk Factors of Acute Cardiovascular Events in China: Findings from China Chronic Disease Risk Factor Surveillance 2010. Heart 2013, 99:E121-E121.
4. Zhou ZH: Ensemble learning. In: Machine Learning. Singapore: Springer; 2021: 181-210.
5. Sugiyama M: Ensemble learning. In: Introduction to Statistical Machine Learning. Elsevier; 2016: 343-354.
6. Rokach L: Introduction to ensemble learning. In: Ensemble Learning: Pattern Classification Using Ensemble Methods. World Scientific; 2019: 51-104.
7. Yu S, Yang H, Guo X, Zhang X, Zhou Y, Ou Q, Zheng L, Sun Y: Prevalence of hyperuricemia and its correlates in rural Northeast Chinese population: from lifestyle risk factors to metabolic comorbidities. Clin Rheumatol 2016, 35(5):1207-1215.
8. Qiu L, Cheng XQ, Wu J, Liu JT, Xu T, Ding HT, Liu YH, Ge ZM, Wang YJ, Han HJ et al: Prevalence of hyperuricemia and its related risk factors in healthy adults from Northern and Northeastern Chinese provinces. BMC Public Health 2013, 13:664.
9. McAdams-DeMarco MA, Law A, Maynard JW, Coresh J, Baer AN: Risk factors for incident hyperuricemia during mid-adulthood in African American and white men and women enrolled in the ARIC cohort study. BMC Musculoskelet Disord 2013, 14:347.
10. Ryu S, Chang Y, Zhang Y, Kim SG, Cho J, Son HJ, Shin H, Guallar E: A cohort study of hyperuricemia in middle-aged South Korean men. Am J Epidemiol 2012, 175(2):133-143.
11. Lyu X, Du Y, Liu G, Mai T, Li Y, Zhang Z, Bei C: Prevalence and influencing factors of hyperuricemia in middle-aged and older adults in the Yao minority area of China: a cross-sectional study. Sci Rep 2023, 13(1):10185.
12. Wang J, Chen Y, Chen S, Wang X, Zhai H, Xu C: Prevalence and risk factors of hyperuricaemia in non-obese Chinese: a single-centre cross-sectional study. BMJ Open 2022, 12(6):e048574.
13. Cao J, Wang C, Zhang G, Ji X, Liu Y, Sun X, Yuan Z, Jiang Z, Xue F: Incidence and Simple Prediction Model of Hyperuricemia for Urban Han Chinese Adults: A Prospective Cohort Study. Int J Environ Res Public Health 2017, 14(1).
14. Zeng J, Zhang J, Li Z, Li T, Li G: Prediction model of artificial neural network for the risk of hyperuricemia incorporating dietary risk factors in a Chinese adult study. Food Nutr Res 2020, 64.
15. Lee S, Choe EK, Park B: Exploration of Machine Learning for Hyperuricemia Prediction Models Based on Basic Health Checkup Tests. J Clin Med 2019, 8(2).
16. Huang G, Li M, Mao Y, Li Y: Development and internal validation of a risk model for hyperuricemia in diabetic kidney disease patients. Front Public Health 2022, 10:863064.
17. Gao Y, Jia S, Li D, Huang C, Meng Z, Wang Y, Yu M, Xu T, Liu M, Sun J et al: Prediction model of random forest for the risk of hyperuricemia in a Chinese basic health checkup test. Biosci Rep 2021, 41(4).
18. Zheng Z, Si Z, Wang X, Meng R, Wang H, Zhao Z, Lu H, Wang H, Zheng Y, Hu J et al: Risk Prediction for the Development of Hyperuricemia: Model Development Using an Occupational Health Examination Dataset. Int J Environ Res Public Health 2023, 20(4).
19. Chen S, Han W, Kong L, Li Q, Yu C, Zhang J, He H: The development and validation of a non-invasive prediction model of hyperuricemia based on modifiable risk factors: baseline findings of a health examination population cohort. Food Funct 2023, 14(13):6073-6082.
20. Endocrinology C: Guideline for the diagnosis and management of hyperuricemia and gout in China (2019). Chinese Journal of Endocrinology and Metabolism 2020, 36:1-13.
21. Sauerbrei W, Boulesteix AL, Binder H: Stability investigations of multivariable regression models derived from low- and high-dimensional data. J Biopharm Stat 2011, 21(6):1206-1231.
22. Friedman JH, Hastie T, Tibshirani R: Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 2010, 33(1):1-22.
23. Lyu J, Li Z, Wei H, Liu D, Chi X, Gong DW, Zhao Q: A potent risk model for predicting new-onset acute coronary syndrome in patients with type 2 diabetes mellitus in Northwest China. Acta Diabetol 2020, 57(6):705-713.
24. Chen Y, Huang S, Chen T, Liang D, Yang J, Zeng C, Li X, Xie G, Liu Z: Machine Learning for Prediction and Risk Stratification of Lupus Nephritis Renal Flare. Am J Nephrol 2021, 52(2):152-160.
25. Lunardon N, Menardi G, Torelli N: ROSE: a Package for Binary Imbalanced Learning. R Journal 2014, 6:82-92.
26. Gosiewska A, Biecek P: Do Not Trust Additive Explanations. 2020.
27. Harangi B: Skin lesion classification with ensembles of deep convolutional neural networks. J Biomed Inform 2018, 86:25-32.
28. Zaini NAM, Awang MK: Hybrid Feature Selection Algorithm and Ensemble Stacking for Heart Disease Prediction. International Journal of Advanced Computer Science and Applications 2023, 14(2):158-165.
29. Hera SY, Amjad M, Saba MK: Improving heart disease prediction using multi-tier ensemble model. Network Modeling and Analysis in Health Informatics and Bioinformatics 2022, 11(1).
30. Verma AK, Pal S, Tiwari BB: Skin disease prediction using ensemble methods and a new hybrid feature selection technique. Iran Journal of Computer Science 2020, 3(4):207-216.
31. Pal M, Roy BR: Evaluating and Enhancing the Performance of Skin Disease Classification Based on Ensemble Methods. In: 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT): 28-29 Nov. 2020; 2020: 439-443.
32. Abdollahi J, Nouri-Moghaddam B: Hybrid stacked ensemble combined with genetic algorithms for diabetes prediction. Iran Journal of Computer Science 2022, 5(3):205-220.
33. Nakanishi N, Tatara K, Nakamura K, Suzuki K: Risk factors for the incidence of hyperuricaemia: a 6-year longitudinal study of middle-aged Japanese men. Int J Epidemiol 1999, 28(5):888-893.
34. Chang HY, Pan WH, Yeh WT, Tsai KS: Hyperuricemia and gout in Taiwan: results from the Nutritional and Health Survey in Taiwan (1993-96). J Rheumatol 2001, 28(7):1640-1646.
35. Kim JY, Yang Y, Sim YJ: Effects of smoking and aerobic exercise on male college students' metabolic syndrome risk factors. J Phys Ther Sci 2018, 30(4):595-600.
36. Nakamura K, Sakurai M, Miura K, Morikawa Y, Yoshita K, Ishizaki M, Kido T, Naruse Y, Suwazono Y, Nakagawa H: Alcohol intake and the risk of hyperuricaemia: a 6-year prospective study in Japanese men. Nutr Metab Cardiovasc Dis 2012, 22(11):989-996.
37. Nishida Y, Iyadomi M, Higaki Y, Tanaka H, Hara M, Tanaka K: Influence of physical activity intensity and aerobic fitness on the anthropometric index and serum uric acid concentration in people with obesity. Intern Med 2011, 50(19):2121-2128.
38. He H, Guo P, He J, Zhang J, Niu Y, Chen S, Guo F, Liu F, Zhang R, Li Q et al: Prevalence of hyperuricemia and the population attributable fraction of modifiable risk factors: Evidence from a general population cohort in China. Front Public Health 2022, 10:936717.

No competing interests reported.
