Metabolic syndrome prediction models using only lifestyle information based on nationwide survey data

doi:10.21203/rs.3.rs-4386349/v1

This work is licensed under a CC BY 4.0 License

Introduction: Metabolic syndrome (MetS) is increasingly prevalent worldwide and under-addressed, with emerging interest in using mobile technology for health management. Effective interventions hinge on reliable prediction models.

Objectives: This study aims to develop an algorithm to estimate MetS risk using only lifestyle factors and assess its impact on patient screening and quality of life enhancement.

Methods: Utilizing data from the Korean National Health and Nutrition Examination Survey (2010–2018), we trained three non-invasive classifier models—artificial neural network (ANN), XGBoost, and LightGBM—for binary classification. We evaluated model performance using sensitivity, specificity, AUROC, and AUPRC metrics and explored quality-adjusted life years (QALYs) improvements.

Results: Machine learning models demonstrated superiority over traditional logistic regression, with LightGBM achieving the highest AUROC and accuracy (AUROC 0.84; accuracy 0.74). Decision curve analysis highlighted significant differences in external datasets. MetS severity was strongly associated with QALYs (p < 0.0001), predicting substantial QALY gains across MetS categories.

Conclusion: The developed model enhances MetS risk assessment accuracy and underscores the importance of incorporating gender-specific factors in predictive models.

Metabolic syndrome (MetS) is a cluster of conditions with significant medical ^1,2, economic ³, and public health ⁴ implications. It includes risk factors such as high blood pressure, high blood sugar, excess body fat around the waist, and abnormal cholesterol levels ⁵. These factors heighten the risk of heart disease, stroke, and type 2 diabetes, which can severely impact an individual's quality of life ⁶. Furthermore, the economic burden of managing these conditions and the social consequences of living with chronic disease underscore the importance of addressing MetS ⁷.

Nevertheless, the prevalence of MetS is steadily increasing worldwide, particularly in countries such as the United States ⁸, South Korea ⁵, China ⁹, and those in Europe¹⁰. Factors contributing to this growth include increasing sedentary lifestyles, unhealthy diets and aging populations ^11,12. As a result, MetS has become a significant public health concern in these countries, requiring effective strategies for prevention, early detection, and management ⁴.

Despite its significance, MetS remains undertreated ^13,14. A primary issue is the inadequate screening for its components. The awareness rates of obesity ²⁵, hypertension ²⁶, and diabetes ^27,28 remain low, highlighting the need for routine screening. This is particularly concerning because early detection and intervention can help prevent or mitigate the serious consequences associated with MetS ²⁹.

Several factors contribute to the undertreatment of MetS. First, many individuals with MetS are asymptomatic, making them less likely to seek testing. Second, the invasive nature of the blood tests required for diagnosis discourages people from undergoing screening, especially those without symptoms. Finally, even when MetS is identified, it often is not adequately managed because it does not directly interfere with daily life. Consequently, many patients with MetS remain untreated until the condition worsens, leading to diabetes and cardiac complications.

Recently, there has been a growing trend towards using mobile technology for health management ³⁰. The integration of non-invasive tests to assess the risk of MetS could significantly increase awareness and encourage proactive measures, as discussed in a recent study ³¹. By providing individuals with easily accessible information about their risk, mobile health applications can encourage them to take preventive or management actions regarding the condition ³².

However, the efficacy of mobile health management technologies in daily routines depends on an appropriate prediction model. If a model requires information typically obtained through blood tests, it will not effectively elevate awareness of MetS. Conversely, if the model can assess MetS using easily obtainable physical metrics such as height, weight, and lifestyle factors, its utilization and the subsequent improvement in awareness will likely increase. Thus, a parsimonious model is well-suited for health management on mobile platforms.

Parsimonious models have already been applied to a wide range of diseases, including COVID-19 ^15–19, artery disease ²⁰, acute kidney injury ²¹, osteoporosis ²², and hepatobiliary diseases ^23,24. Various modeling techniques, such as logistic regression ^15,18, deep learning ^16,19,22–24, and machine learning-based genetic evolution algorithms ^20,21, have been employed in these contexts. Applying a parsimonious model to MetS could likewise enable appropriate screening.

Therefore, the purpose of this study is twofold: 1) to develop a parsimonious algorithm that estimates the risk of MetS using only lifestyle variables, and 2) to analyze the potential impact of using this algorithm to screen patients with MetS on improving their quality of life. By providing a non-invasive and easily accessible method for estimating MetS risk, this study aims to contribute to better detection, management, and prevention of this critical public health issue.

Baseline characteristics

Table 1 outlines the demographic and clinical characteristics of the 50,428 participants in the study, categorized by gender and divided into training/validation sets. The average age was 50.75 (± 16.30) for females and 50.43 (± 16.37) for males in the training set, with females making up 58.40% (n = 29,542) of this group. Diagnostic criteria for MetS such as waist circumference, systolic and diastolic blood pressure, and HDL cholesterol levels were generally higher in men. Triglycerides were notably higher in men, averaging 161.23 (± 136.88), compared to 116.15 (± 33.35) in women. There was a significant gender disparity in smoking rates, with 38.59% of men and only 5.15% of women smoking. More men reported higher education levels, and more women reported a family history of hypertension (39.88% vs. 33.77%). The prevalence of MetS was similar between genders in the training set, at 28.08% for women and 27.92% for men. These characteristics remained consistent across both the training and validation sets, validating the random sampling method used.

Table 1

Baseline characteristics of the study population
	MetS (n = 14,585)	Non-MetS (n = 35,840)	P-value
Age, years	59.1(13.6)	47.8(16.2)	0.00
Sex, male (%)	6,948(47.6)	14,251(40.6)	0.00
Height, cm	161.6(9.9)	162.8(9.0)	0.00
Weight, kg	68.7(12.8)	60.7(10.8)	0.00
Body mass index	26.2(3.3)	22.8(3.0)	0.00
Waist circumference, cm	89.6(8.5)	78.5(8.8)	0.00
Systolic blood pressure, mm Hg	129(16.4)	115.1(15.7)	0.00
Diastolic blood pressure, mm Hg	79.2(11.3)	73.9(9.6)	0.00
Heartbeat per 15 seconds	17.8(2.3)	17.6(2.1)	0.00
Triglyceride, mg/dL	201(144.4)	105.8(71.3)	0.00
HDL cholesterol, mg/dL	43.4(10.0)	53.8(12.0)	0.00
LDL cholesterol, mg/dL	115.3(35.4)	114.8(31.7)	0.36
Education level (%)			0.00
Below elementary school	5,432(37.2)	6,181(17.6)
Middle school	2,043(14.0)	3,249(9.3)
High school	3,999(27.4)	12,085(34.5)
College	3,099(21.3)	13,540(38.6)
Marital status (%)			0.00
Married or living as married	13,790(94.6)	28,678(81.8)
Unmarried	794(5.4)	6,385(18.2)
Current smoker (%)	3,010(20.6)	6,496(18.5)	0.00
Family history of hypertension (%)	4,558(31.3)	11,254(32.1)	0.07
Family history of diabetes mellitus (%)	2,371(16.3)	5,585(15.9)	0.36
Data are presented as mean (SD) or n (%). BMI: body mass index; SBP: systolic blood pressure; DBP: diastolic blood pressure; HDL: high-density lipoprotein; LDL: low-density lipoprotein

Indicators of MetS risk

As shown in Table 2, after using backward elimination logistic regression, significant variables (p ≤ 0.05) were retained to build multiple logistic regression models for each gender-specific training set. These included four continuous variables—age, body mass index (BMI), waist circumference, and heart rate (heartbeats per 15 seconds)—all significant predictors of metabolic syndrome (MetS).

For categorical variables, significance varied by gender. Alcohol consumption per session did not significantly alter MetS risk in women, except for abstainers. In men, only high consumption levels (5–6 shots or more) or abstention showed significant effects on MetS risk compared to the reference group.

Educational attainment also influenced MetS risk differently between genders. For men, education level had no significant effect on MetS risk. For women, possessing at least a high school diploma significantly reduced MetS risk, highlighting a gender-specific disparity in the impact of education on health outcomes. These findings emphasize the need for gender-specific approaches in predictive modeling and health interventions for MetS.

Table 2

Multivariate logistic regression analysis of MetS
	Coefficient	P-value
Age	0.8	0.00
Body mass index	0.4	0.00
Waist circumference	1.2	0.00
Heartbeat per 15 seconds	0.2	0.00
Sex, female	0.3	0.00
Alcohol consumption frequency
Less than once a month	-1.4	0.00
About once a month	-1.5	0.00
Two to four times a month	-1.5	0.00
Two to three times a week	-1.4	0.00
More than four times a week	-1.5	0.00
Non-drinker	0.1	0.00
Amount consumed per session
3–4 shots	0.1	0.24
5–6 shots	0.2	0.00
7–9 shots	0.4	0.00
More than 10 shots	0.5	0.00
Non-drinker	-1.4	0.00
Smoking status
Smoke every day	0.1	0.01
Smoke once in a while	0.0	0.61
Have smoked in the past	-0.3	0.00
Number of walking days per week
1 day	0.2	0.00
2 days	0.1	0.01
3 days	0.1	0.01
4 days	0.1	0.20
5 days	0.1	0.21
6 days	0.1	0.05
Every day	0.0	0.43
Number of strength exercise days per week
1 day	0.0	0.59
2 days	0.0	0.80
3 days	0.0	0.78
4 days	0.1	0.40
More than 5 days	-0.2	0.01
Household income
50% percentile	0.0	0.36
75% percentile	0.0	0.55
100% percentile	0.0	0.95
Number of houses owned
More than 1	0.0	0.29
More than 2	0.0	0.97
Marital status, No	-0.3	0.00
Education level
Junior high school	0.0	0.46
High school	-0.2	0.00
College and above	-0.4	0.00
Family history of hypertension	0.3	0.00
Family history of diabetes mellitus	0.3	0.00
Family history of ischemic heart disease	0.1	0.02
Family history of stroke	0.1	0.01
Region of residence
Busan	-0.2	0.00
Daegu	-0.2	0.00
Incheon	0.0	0.77
Gwangju	0.0	0.67
Daejeon	-0.2	0.04
Ulsan	-0.5	0.00
Sejong	0.0	0.59
Gyeonggi	0.0	0.49
Gangwon	0.0	0.63
Chungbuk	-0.2	0.03
Chungnam	-0.1	0.15
Jeonbuk	-0.3	0.00
Jeonnam	-0.3	0.00
Gyeongbuk	-0.2	0.00
Gyeongnam	-0.2	0.00
Jeju	-0.2	0.11
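
Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient from Table 2 gives the odds ratio for a one-unit change in that predictor (or for that category relative to its reference level). The minimal sketch below illustrates the conversion for a few rows copied from Table 2. The dictionary keys and function name are illustrative, and the text does not state whether continuous predictors were standardized, so the unit behind each odds ratio is an open assumption.

```python
import math

# Coefficients copied from Table 2; the keys are illustrative labels only.
TABLE2_COEFFICIENTS = {
    "waist_circumference": 1.2,
    "age": 0.8,
    "education_college_and_above": -0.4,
    "family_history_of_hypertension": 0.3,
}


def odds_ratio(coefficient):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coefficient)


# Odds ratios rounded to two decimals for readability.
odds_ratios = {
    name: round(odds_ratio(beta), 2)
    for name, beta in TABLE2_COEFFICIENTS.items()
}
```

An odds ratio above 1 corresponds to a positive coefficient (higher odds of MetS), and a value below 1 to a negative coefficient.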

Classification models

Three binary classification models were trained: an artificial neural network (ANN), XGBoost, and LightGBM. The ANN was a feedforward multilayer perceptron with three hidden layers of 64 nodes each, using ReLU activation functions and dropout regularization at a 20% rate to prevent overfitting. The output layer used a sigmoid function for risk probability output, and learning was optimized with binary cross-entropy loss and the Adam optimizer.
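
The architecture described above can be sketched as a plain-Python forward pass. This is an illustration under stated assumptions, not the study's implementation: the weights are random rather than trained, the initialization scheme and all function names are hypothetical, and dropout is omitted because it is only active during training.

```python
import math
import random


def relu(value):
    return value if value > 0.0 else 0.0


def sigmoid(value):
    # Numerically stable logistic function.
    if value >= 0.0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)


def init_layer(n_inputs, n_outputs, rng):
    # He-style random initialization (an assumption; the text names none).
    scale = math.sqrt(2.0 / n_inputs)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    biases = [0.0] * n_outputs
    return weights, biases


def build_mlp(n_features, hidden_sizes=(64, 64, 64), seed=0):
    # Three hidden layers of 64 nodes and one output node, as in the text.
    rng = random.Random(seed)
    sizes = [n_features, *hidden_sizes, 1]
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(sizes, sizes[1:])]


def dense(inputs, layer):
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def predict_risk(features, layers):
    # ReLU on every hidden layer, sigmoid on the single output node.
    activations = list(features)
    for layer in layers[:-1]:
        activations = [relu(v) for v in dense(activations, layer)]
    return sigmoid(dense(activations, layers[-1])[0])


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Loss for one example; the probability is clipped to avoid log(0).
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))
```

For example, `predict_risk([0.5, -1.0, 0.2, 1.3], build_mlp(4))` returns a value between 0 and 1 that would be read as a risk probability once the weights have been trained.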

The XGBoost model included up to 1000 boosting trees with early stopping at 50 rounds to prevent overtraining. It used a maximum depth of 5 for interpretability, subsampled half the training data each round, utilized a binary loss function, and had a learning rate of 0.1.

LightGBM utilized the GBDT method for up to 2000 boosting rounds with early stopping after 100 rounds. The learning rate was set at 0.01. Both bagging and feature fraction were set at 0.8, with maximum depth and number of leaves per weak learner capped at 4 and 16, respectively, ensuring a balance between model complexity and interpretability.
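
For reference, the hyperparameters described above could be written as parameter dictionaries along the following lines. This is a hedged sketch: the parameter names follow the conventions of the XGBoost and LightGBM Python packages, but the exact objective strings and the `bagging_freq` value are assumptions, since the text only states "a binary loss function" and does not mention a bagging frequency.

```python
# XGBoost settings from the text (scikit-learn style parameter names).
XGBOOST_PARAMS = {
    "n_estimators": 1000,            # up to 1000 boosting trees
    "early_stopping_rounds": 50,     # stop after 50 rounds without improvement
    "max_depth": 5,
    "subsample": 0.5,                # half of the training data per round
    "objective": "binary:logistic",  # assumption: text says only "binary loss"
    "learning_rate": 0.1,
}

# LightGBM settings from the text.
LIGHTGBM_PARAMS = {
    "boosting_type": "gbdt",
    "objective": "binary",           # assumption, as above
    "n_estimators": 2000,            # up to 2000 boosting rounds
    "learning_rate": 0.01,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,               # assumption: bagging is inactive when this is 0
    "feature_fraction": 0.8,
    "max_depth": 4,
    "num_leaves": 16,
}

# Early stopping for LightGBM (100 rounds in the text) is usually supplied
# at fit time; how it is passed depends on the library version.
LIGHTGBM_EARLY_STOPPING_ROUNDS = 100
```

A depth limit of 4 allows at most 2^4 = 16 leaves per tree, so the two LightGBM caps reported in the text are mutually consistent.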

Predictive performance

Table 3 presents the performance metrics for each prediction model across genders. Sensitivity, specificity, area under the receiver operating characteristic curve (AUROC), and average precision were evaluated among the three machine learning models by sex. The LightGBM model excelled in sensitivity, AUROC, and average precision for both sexes. In contrast, the multiple logistic regression (MLR) model demonstrated superior specificity, effectively distinguishing true negative cases (individuals without MetS) from false positives across both genders. However, the sensitivity of the MLR model was significantly lower than that of the other two models (0.604 vs. 0.670 and 0.709 for female models; 0.489 vs. 0.599 and 0.628 for male models). This disparity suggests that the high specificity of the MLR model may result from its failure to adequately differentiate between positive and negative cases, primarily predicting negative outcomes. Notably, models for females generally outperformed those for males in all metrics except specificity, which aligns with our findings.

Table 3

Performance comparison of various models
Model	AUROC	AUPRC	Sensitivity	Specificity	Accuracy
Logistic Regression	0.83 (0.83–0.83)	0.62 (0.61–0.62)	0.82 (0.79–0.85)	0.68 (0.66–0.71)	0.72 (0.71–0.73)
Logistic Regression with L1 penalty	0.83 (0.83–0.83)	0.62 (0.61–0.62)	0.82 (0.79–0.85)	0.68 (0.66–0.71)	0.72 (0.71–0.73)
Logistic Regression with L2 penalty	0.83 (0.83–0.83)	0.62 (0.61–0.62)	0.82 (0.79–0.85)	0.68 (0.66–0.71)	0.72 (0.71–0.73)
XGBoost Regression	0.84 (0.84–0.84)	0.65 (0.64–0.65)	0.81 (0.77–0.86)	0.71 (0.66–0.76)	0.74 (0.72–0.76)
XGBoost RandomForest	0.84 (0.84–0.84)	0.65 (0.65–0.65)	0.82 (0.80–0.84)	0.71 (0.68–0.73)	0.74 (0.73–0.75)
LightGBM Regression	0.84 (0.84–0.84)	0.65 (0.65–0.65)	0.81 (0.76–0.86)	0.72 (0.67–0.77)	0.74 (0.72–0.77)
Multilayer Perceptron	0.84 (0.84–0.85)	0.65 (0.64–0.66)	0.81 (0.77–0.85)	0.72 (0.68–0.76)	0.74 (0.73–0.76)

External validation

These models were initially trained and validated using the KNHANES dataset; however, not all datasets conform to the KNHANES criteria. To establish the robustness of our trained models, it is imperative to use an external validation dataset. The criteria for smoking, drinking alcohol, and exercising in the external validation dataset differ from those in the KNHANES dataset, and some do not provide explicit thresholds to categorize frequency levels. Consequently, the alcohol consumption frequency categories from the external validation dataset were used without modification. This dataset comprises 877 individuals.

As shown in Supplementary Table 1, the performance outcomes from the external validation dataset were comparable to those from the internal validation dataset. Both with and without regularization, the logistic regression models exhibited marginally superior results on the receiver operating characteristic (ROC) curve and the precision-recall curve (PRC) compared to the machine learning models.

However, the machine learning models surpassed the logistic regression models in performance on the decision curve analysis (DCA). Despite achieving high scores on the ROC curve, the logistic regression models provided almost no net benefit compared to the "treat all" strategy, as indicated by the DCA curve (Fig. 1).
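
Decision curve analysis compares models by net benefit at a chosen threshold probability, weighing true positives against false positives. The sketch below uses the standard net-benefit formula. The function names and the toy data in the usage note are illustrative and not taken from the study.

```python
def net_benefit(y_true, y_score, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    n = len(y_true)
    true_pos = sum(1 for y, s in zip(y_true, y_score)
                   if s >= threshold and y == 1)
    false_pos = sum(1 for y, s in zip(y_true, y_score)
                    if s >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)
    return true_pos / n - (false_pos / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the "treat all" strategy at the same threshold."""
    prevalence = sum(y_true) / len(y_true)
    odds = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * odds
```

With toy labels `[1, 1, 0, 0, 1, 0]` and scores `[0.9, 0.6, 0.4, 0.2, 0.3, 0.7]` at a threshold of 0.5, the model's net benefit is 1/6, while "treat all" yields 0. A model adds value at a given threshold only when its curve lies above both the "treat all" and "treat none" (net benefit 0) lines.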

Statistical analysis

Figure 2 displays the SHAP feature importance graph for each model. In these graphs, features represented by red dots to the left of the origin line indicate a negative relationship with metabolic syndrome (MetS), while those to the right suggest a positive relationship. Across all models, body mass index (BMI) and age are ranked as the two most crucial features for determining MetS. Other significant factors include the frequency of smoking, alcohol consumption, gender, and exercise class, though their impacts are less pronounced compared to BMI and age.

Life quality improvement

To demonstrate the improvement in quality of life from detecting and treating hidden patients with MetS, it is essential to statistically measure the trend in quality of life across MetS levels. Cuzick's nonparametric test, a statistical method for assessing trends across ordered groups, was employed for this purpose. Using MetS groups ranked by Z-score, Cuzick's test was applied to evaluate the trend of quality-adjusted life years (QALY) across these groups. The resulting p-value indicates a statistically significant association between MetS severity and QALY.

Supplementary Table 2 presents the mean ranks from Cuzick’s test for trend across the EQ5D and its individual categories. Patients without MetS are categorized as class 0, those with a Z-score lower than 0.699395 as class 1, between 0.700 and 1.127 as class 2, between 1.127 and 1.612 as class 3, and those higher than 1.612 as class 4. From class 0 to class 4, the mean ranks for all categories were analyzed using Cuzick’s test for trend. Although there are some minor reverse ranks between class 1 and class 2 in the AD category, the mean ranks generally increase and the p-value for the trend is less than 1e-6, substantiating the strong linkage between MetS and various measures of quality of life.
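
The class assignment and trend test described above can be sketched as follows. This is a minimal illustration: the handling of values that fall exactly on a cut-off is an assumption, the test statistic follows Cuzick's published formula without a tie correction in the variance, and all function names are hypothetical.

```python
import math


def mets_severity_class(has_mets, z_score):
    """Map a participant to class 0-4 using the cut-offs quoted in the text."""
    if not has_mets:
        return 0
    if z_score < 0.700:    # text: "lower than 0.699395"
        return 1
    if z_score < 1.127:
        return 2
    if z_score <= 1.612:   # boundary handling is an assumption
        return 3
    return 4


def midranks(values):
    """Rank observations from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def cuzick_trend(groups, scores=None):
    """Cuzick's test for trend across ordered groups.

    groups: list of observation lists, ordered by severity class.
    Returns (z, two_sided_p) from the normal approximation.
    """
    if scores is None:
        scores = list(range(1, len(groups) + 1))
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)
    ranks = midranks(pooled)

    t_stat = weighted_n = weighted_sq = 0.0
    start = 0
    for score, group in zip(scores, groups):
        rank_sum = sum(ranks[start:start + len(group)])
        start += len(group)
        t_stat += score * rank_sum
        weighted_n += score * len(group)
        weighted_sq += score * score * len(group)

    expected = (n_total + 1) / 2 * weighted_n
    variance = (n_total + 1) / 12 * (n_total * weighted_sq - weighted_n ** 2)
    z = (t_stat - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A positive z indicates that ranks tend to rise with class number, and a negative z indicates the opposite direction.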

To calculate the quality-adjusted life years (QALY) gain from preventing metabolic syndrome (MetS), a multivariate linear regression model was employed to determine the beta coefficient of MetS severity in relation to EQ5D scores. The beta coefficients derived from this model are detailed in Supplementary Table 3. After establishing the prevalence of MetS within the entire KNHANES dataset, Supplementary Table 4 presents the calculated QALY gains per 100,000 persons per year, achieved by treating patients across different MetS classes.
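
The text does not spell out the arithmetic behind the per-100,000 figures. One plausible reading, offered here only as an assumption, is that the gain for a class equals the number of people in that class per 100,000 multiplied by the utility each would regain over one year if restored to the class 0 level. The function name and the input values below are hypothetical placeholders, not values from the study or its supplementary tables.

```python
def qaly_gain_per_100k(class_prevalence, utility_gain_per_person):
    """QALYs gained per 100,000 persons per year for one MetS class.

    class_prevalence: share of the population in the class (0..1).
    utility_gain_per_person: EQ-5D utility regained per treated person,
        e.g. the absolute beta coefficient for that class (assumption).
    """
    return 100_000 * class_prevalence * utility_gain_per_person


# Hypothetical placeholder inputs, NOT results from the study.
example_gain = qaly_gain_per_100k(class_prevalence=0.05,
                                  utility_gain_per_person=0.02)
```

Under this reading, the total gain is the sum of the per-class gains.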

Predictive models have garnered significant attention and undergone substantial enhancements over the past decade, thanks to advancements in machine learning and computer technology. Historically, some classic prediction models served primarily as tools for approximating disease risk and aiding medical decision-making^15,16,17. Unlike most diagnostic criteria, which typically consider one or two risk factors with a fixed cutoff for risk categorization, predictive models incorporate a wide array of novel parameters. These models generate an individualized risk score on a continuous scale (ranging from 0 to 1), providing a nuanced assessment rather than a binary outcome. The systematic integration of multiple risk factors into diagnoses may only be feasible through the adoption of mathematical decision models. Vickers et al. discuss the challenges of implementing these models in routine medical practice¹⁸. Furthermore, the cost-effectiveness of predictive models represents a significant advantage for healthcare systems. Unlike vaccines or medications that require manufacturing and distribution, the scalability of predictive models is virtually limitless once they are developed. Employing risk prediction models in diagnostics offers a rapid and non-invasive method to assess the risk of metabolic syndrome (MetS) without the need for biochemical or laboratory testing.

A simple risk prediction model offers numerous advantages. The required input for our MetS risk prediction model can be gathered without the need for health screenings or laboratory tests, which can take days or weeks to yield results. Importantly, our model does not rely on variables that necessitate invasive procedures, such as blood tests, for risk assessment. The scalability of the machine learning model is virtually unlimited, potentially applicable to patients worldwide, provided it can be adapted to accommodate the diverse characteristics of various ethnic groups, though it was initially trained only on Korean adults.

MetS is recognized as a condition stemming from insulin resistance, which often precedes diabetes mellitus and various cardiovascular diseases²⁴. A simple risk prediction model acts as a preliminary preventive measure by aiding decisions on whether to undergo further physical examinations to assess metabolic health more closely. Our MetS risk prediction model has demonstrated superior performance compared to other models for cardiovascular disease risks, such as those predicting diabetes mellitus, hypertension, and dyslipidemia, particularly in distinguishing individuals with and without these conditions. Given that each diagnostic criterion for MetS may be highly responsive to straightforward variables such as body mass index or lifestyle factors, a simple model is optimally suited for predicting MetS risk.

The three machine learning models were developed using data from Korean individuals aged over 20, supported by the well-established medical infrastructure that provides access to a wealth of public data sources. These include the Korean National Health Insurance Service cohorts, the Korean National Health and Nutrition Examination Survey (KNHANES), and electronic health records from major hospitals in Korea. Various prediction models for cancers and cardiovascular diseases, including MetS, have been developed from these comprehensive datasets. Our model is distinct in that it utilizes data up to the latest KNHANES 2019 release and incorporates only simple risk factors for MetS, enhancing its practicality and applicability.

The discrimination performance of our model in identifying individuals with and without MetS was benchmarked against the leading machine learning models dedicated to MetS risk prediction. In a comparable context, Wang and colleagues⁴¹ developed an artificial neural network model to predict the risk of type 2 diabetes mellitus (T2DM) among rural adults in China, achieving sensitivity, specificity, and an area under the receiver operating characteristic curve (AUROC) of 0.869, 0.791, and 0.891, respectively. These metrics are closely matched by the performance of our LightGBM model for women, which posted sensitivity, specificity, and AUROC values of 0.709, 0.876, and 0.897. This comparison underscores the competitive accuracy and reliability of our model in the landscape of predictive health modeling.

Limitations

A significant limitation of this study is the reliance on self-reported lifestyle information, which could lead to misclassification or recall bias. The KNHANES dataset, being a cross-sectional survey that includes medical examinations, does not permit the analysis of causal relationships between risk factors and MetS or the progression of these events due to the absence of a temporal dimension. Consequently, our developed model only provides an immediate probability estimate of an individual having MetS, based on available medical examination results.

This limitation restricts the model's use primarily to pre-examination screening. In contrast, models trained with longitudinal, time-series data could additionally facilitate post-examination follow-up by incorporating changes over time. Including more dynamic features such as blood lipid levels, blood pressure, and other laboratory test results could significantly enhance the model’s performance.

Moreover, the impact of risk factors on individuals with severe MetS conditions might be overstated during model training because MetS status is treated as a binary variable—either present or absent. This binary approach may result in an overly conservative model, especially in cases with extremely high blood pressure, lipid levels, or body mass index, leading to predictions that err on the side of diagnosing MetS.

The absence of data on family history of cardiovascular disease, a known risk factor for metabolic syndrome (MetS), in the earlier years of the KNHANES dataset presents a notable limitation. Until 2010, the survey did not include questions regarding direct family history of various diseases, which could significantly impact the understanding and modeling of MetS risk factors. Additionally, exercise habit, which influences glucose tolerance and is an established lifestyle risk factor⁴², suffers from extensive missing data in the dataset. The exclusion of individuals with missing exercise data resulted in a substantial reduction in sample size, with nearly two-thirds of potential data points lost.

The accuracy of lifestyle-related variables, which are predominantly self-reported in surveys, poses another challenge. If these variables could be captured more accurately through alternative methods, it would greatly enhance the predictive power of models assessing the risk of incident diseases. Improved data collection techniques, such as objective monitoring devices or more detailed and frequent surveys, could provide more reliable and comprehensive data, thereby refining the predictive models and their outcomes.

The KNHANES dataset, being exclusively representative of the Korean population, highlights Korea's limited ethnic diversity. Consequently, the predictive accuracy of the MetS risk model might be compromised when applied to populations with different ethnic backgrounds. This recognition forms the basis for one of our future initiatives: external validation of our risk model using data from individuals whose ethnicity varies from that of native Koreans. This step is crucial for ensuring the model's applicability and reliability across diverse populations.

Despite these limitations, the KNHANES dataset also has significant strengths. As a comprehensive national survey of the entire Korean population, it is particularly well-suited for developing models aimed at a general, "healthy" population rather than a hospital-based cohort, which often consists of individuals already seeking medical care. Such data can skew toward higher disease prevalence and might not accurately represent the general population's health status.

Moreover, it is important to note that in our current analysis, variables or characteristics of the individuals have not been adjusted for age or other demographic factors. This oversight could affect the interpretations and conclusions drawn from the model, as age and other factors often significantly influence disease risk profiles. Future iterations of the model could benefit from incorporating these adjustments to enhance its accuracy and relevance.

Future directions

The advent of smart wearables has significantly advanced the capability for real-time blood pressure monitoring, yet these devices still require calibration with a baseline measurement from a conventional blood pressure monitor, which must be updated periodically. Improvements in the accuracy of wearable blood pressure measurements could facilitate the inclusion of systolic and diastolic blood pressure readings as "simple" variables in MetS risk prediction models.

Blood pressure is a critical risk factor for MetS and can substantially enhance a model's predictive accuracy. For instance, incorporating blood pressure readings into the LightGBM model for women, which is already our top-performing model, could potentially elevate its performance further. With the inclusion of these blood pressure variables, the model's area under the receiver operating characteristic curve (AUROC) could rise to 0.911, and its average precision could increase to 0.754. Such enhancements would not only bolster the model’s reliability but also its utility in clinical and preventative settings, offering more precise assessments for at-risk individuals.

Exploring MetS in the non-obese population presents a valuable extension to our study, particularly given the unique context in Korea. Despite the lower prevalence of obesity in Korea—partly due to the different diagnostic criteria for MetS that include waist circumference—the prevalence of MetS remains comparatively high. This discrepancy suggests a notable incidence of MetS among non-obese individuals, who might experience more severe forms of the syndrome. For non-obese individuals, typical first-line interventions for MetS, such as significant weight loss and strict dietary controls, are not applicable.

Non-obese individuals may face greater challenges with the other four MetS criteria—blood pressure, fasting glucose, triglycerides, and HDL cholesterol—which are often harder to manage than central obesity. In such cases, more invasive treatments may be necessary, and the potential for improvement through lifestyle modification alone may be limited.

Out of 50,428 individuals surveyed, 33,662 were non-obese (body mass index ≤ 25 kg/m^2), with a MetS prevalence of approximately 16% (n = 5,375) within this group. Developing a targeted prediction model for this demographic could significantly enhance awareness and early detection of MetS among non-obese individuals, who may otherwise perceive themselves as at lower risk. Such a model would not only adjust for the absence of obesity but also emphasize the importance of monitoring other metabolic factors that contribute to the syndrome. This focused approach could lead to better tailored interventions and ultimately improve health outcomes for this specific population.

The prevalence of MetS in our training and validation set was notably higher than in other studies, underscoring the sensitivity of the MetS diagnostic criteria chosen for labeling individuals. This decision is crucial as it directly influences the training and performance of the predictive model. For instance, the criteria used by Hirose and colleagues²⁷, which excluded the non-obese, resulted in a reported low prevalence of MetS. Such discrepancies highlight that differences in MetS prevalence may stem from methodological choices in defining MetS, in addition to potential ethnic variations.

Furthermore, expanding the model to include high-risk drinking status, alongside the frequency of alcohol consumption or the amount consumed per occasion, could enhance the prediction of MetS risk. Hong et al.⁴³ utilized the Alcohol Use Disorder Identification Test (AUDIT) to categorize individuals from the KNHANES 2010–2012 data into three groups based on their AUDIT scores. The analysis showed statistically significant differences in most characteristics among the groups defined by high-risk drinking status. Incorporating such nuanced alcohol consumption data could refine the model’s accuracy by providing a more detailed picture of lifestyle factors that contribute to MetS risk. This approach would be particularly beneficial for tailoring interventions and preventative measures to specific risk profiles within the population.

The overall research scheme is shown in Fig. 3.

Dataset

We utilized data from the Korean National Health and Nutrition Examination Survey (KNHANES) spanning from 2010 to 2018. KNHANES is a cross-sectional sample survey of the entire Korean population. The profile and reliability of the dataset have been described in previous studies ^33,34. Data from 72,751 individuals were initially gathered.

The objective of this study was to develop a model to predict the immediate risk of MetS in Korean adults. Diagnostic criteria specific to adults led to the exclusion of individuals under 20 years, reducing the initial cohort from 56,371 to 49,668 after further excluding those with missing data on key factors like smoking habits or marital status (Fig. 4).

Age was considered as an independent numerical variable. Body Mass Index (BMI) was calculated from height and weight, measured to the nearest 0.1 cm and kg, respectively. Waist circumference was also measured to the same precision. Alcohol consumption frequency and quantity were self-reported, with consumption measured in shots of soju, a traditional Korean spirit. Smoking status categories included "smoking every day," "smoking occasionally," "former smoker," and "non-smoker." Physical activity was assessed by counting the number of walking days per week (for walks lasting over 10 minutes) and the number of days engaged in strength exercises like push-ups and weight-lifting.

MetS was defined using the US NCEP ATP III criteria, adjusted for the Asia-Pacific population, including thresholds for hypertriglyceridemia (≥ 150 mg/dL), low HDL cholesterol (≤ 40 mg/dL for men, ≤ 50 mg/dL for women), high blood pressure (≥ 135/85 mmHg), and high fasting glucose (≥ 110 mg/dL). Waist circumference cutoffs were set at ≥ 90 cm for men and ≥ 80 cm for women, with a MetS diagnosis requiring meeting three or more criteria. Lifestyle-related variables were prioritized for model predictions based on their feature importance.
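The labeling rule can be written as a short function. The sketch below uses the thresholds exactly as stated in the paragraph above; the class and function names are hypothetical, and reading the blood pressure criterion as "systolic at or above the first value or diastolic at or above the second" is an assumption about how the combined threshold is applied.

```python
from dataclasses import dataclass


@dataclass
class Measurements:
    sex: str  # "M" or "F"
    waist_cm: float
    triglycerides_mg_dl: float
    hdl_mg_dl: float
    systolic_mmhg: float
    diastolic_mmhg: float
    fasting_glucose_mg_dl: float


def count_mets_criteria(m: Measurements) -> int:
    """Count how many of the five criteria (thresholds as given in the text) are met."""
    if m.sex not in ("M", "F"):
        raise ValueError("sex must be 'M' or 'F'")
    is_male = m.sex == "M"
    criteria = [
        m.waist_cm >= (90 if is_male else 80),    # abdominal obesity
        m.triglycerides_mg_dl >= 150,             # hypertriglyceridemia
        m.hdl_mg_dl <= (40 if is_male else 50),   # low HDL cholesterol
        # Assumption: either component reaching its threshold counts.
        m.systolic_mmhg >= 135 or m.diastolic_mmhg >= 85,
        m.fasting_glucose_mg_dl >= 110,           # high fasting glucose
    ]
    return sum(criteria)


def has_mets(m: Measurements) -> bool:
    """MetS is diagnosed when three or more criteria are met."""
    return count_mets_criteria(m) >= 3


# Made-up example values, not survey data.
man = Measurements("M", 95.0, 160.0, 45.0, 120.0, 80.0, 115.0)
woman = Measurements("F", 75.0, 100.0, 60.0, 110.0, 70.0, 90.0)
man_label = has_mets(man)
woman_label = has_mets(woman)
```

The binary label produced this way corresponds to the dependent variable (0 for negative, 1 for positive) used by the classifiers.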

Training and validation datasets

Gender differences in MetS have been observed in previous studies ^36,37. Nonetheless, a single universal model was employed to keep the prediction model parsimonious, because deep learning can capture non-linear relationships among predictors. The data were randomly divided into training and validation sets at a 3:1 ratio. For continuous variables, the mean and standard deviation are presented for each set. For categorical variables, the number and proportion within each set are displayed.

To assess the equivalence of the two sets, an independent two-sample t-test was conducted to compare the means of continuous variables. For categorical variables, a chi-squared test was used to determine whether the distribution of levels was consistent across the sets. These checks help confirm that the training and validation sets are comparable, which supports the reliability of the model predictions.
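A minimal sketch of the split and the two comparability tests is shown below, assuming NumPy and SciPy are available. The data are synthetic stand-ins generated for the example, the variable names are hypothetical, and the random seed is arbitrary; none of this reproduces the study's actual split.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Synthetic stand-in data; these are not KNHANES values.
n = 2000
age = rng.normal(50, 15, size=n)       # a continuous variable
smoker = rng.integers(0, 2, size=n)    # a binary categorical variable

# Random 3:1 split into training and validation indices.
idx = rng.permutation(n)
n_train = int(n * 0.75)
train_idx, valid_idx = idx[:n_train], idx[n_train:]

# Continuous variable: independent two-sample t-test on the means.
t_stat, p_age = stats.ttest_ind(age[train_idx], age[valid_idx])

# Categorical variable: chi-squared test on the set-by-level contingency table.
table = np.array([
    np.bincount(smoker[train_idx], minlength=2),
    np.bincount(smoker[valid_idx], minlength=2),
])
chi2, p_smoker, dof, expected = stats.chi2_contingency(table)

print(f"age: p = {p_age:.3f}; smoker: p = {p_smoker:.3f}")
```

A large p-value here indicates no detectable difference between the sets for that variable; it does not by itself prove that the sets are equivalent.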

Model selection and structure

Four models were developed to predict MetS, using diagnosis as the binary dependent variable (0 for negative, 1 for positive) and simple patient information as independent variables. The performance of each model was evaluated using sensitivity, specificity, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). All models were built using Python 3.8.5. Comparing the models on these metrics allowed us to identify the most effective model for practical use.
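For readers unfamiliar with these metrics, the sketch below computes them from first principles using only the Python standard library. The function names are hypothetical, the 0.5 decision threshold is an arbitrary choice for the example, and average precision is used as one common estimator of AUPRC; the study may have used a different estimator or a library implementation.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each positive (ties not handled)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy labels and predicted probabilities, not study results.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
preds = [1 if p >= 0.5 else 0 for p in probs]  # arbitrary 0.5 threshold

sens, spec = sensitivity_specificity(labels, preds)
roc = auroc(labels, probs)
ap = average_precision(labels, probs)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUROC and AUPRC summarize performance across all thresholds, which is why the four metrics are reported together.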

Evaluation of the effectiveness of the developed models in terms of quality of life

The quality-adjusted life-year (QALY) combines life expectancy with quality of life to measure health outcomes. KNHANES includes the EQ-5D-3L instrument, which assesses five dimensions (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression), each at three levels; an established algorithm converts the responses into an EQ-5D-3L score. These scores allow QALY gains between a MetS group and a control group to be quantified. To demonstrate potential QALY gains from screening for MetS with a prevalence risk prediction model, MetS patients were stratified into four severity categories based on Z-scores calculated from fasting plasma glucose, blood pressure, waist circumference, HDL, and triglyceride levels, resulting in classifications from no MetS to category 4.
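The basic QALY arithmetic can be illustrated as follows: a QALY is a health-state utility weight multiplied by the time spent in that state, so a gain is the difference between two such products. The utility values and time horizon below are invented for the example and are not results from this study, and the sketch does not reproduce the Z-score severity stratification or the EQ-5D-3L scoring algorithm.

```python
def qaly(utility: float, years: float) -> float:
    """QALYs accrued = utility weight (0 = death, 1 = full health) x years lived."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return utility * years


def qaly_gain(utility_before: float, utility_after: float, years: float) -> float:
    """QALY difference over the same time horizon for two utility levels."""
    return qaly(utility_after, years) - qaly(utility_before, years)


# Hypothetical utilities for illustration only.
gain = qaly_gain(utility_before=0.85, utility_after=0.90, years=10)
print(f"QALY gain over 10 years: {gain:.2f}")
```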

Conclusion

The development of a straightforward model proves beneficial for preliminary screening of metabolic syndrome (MetS) risk, facilitating early intervention before more extensive medical evaluation becomes necessary. An intriguing discovery within our study was the prominence of resting heart rate (measured as heartbeats per 15 seconds) as a significant continuous feature in predictive models for both men and women. This underscores its potential as a simple, non-invasive measure in assessing MetS risk.

Given the non-additive nature of MetS risk factors between genders and the varying prevalence of MetS across sex groups, we opted to train separate models for men and women. This approach acknowledges and addresses the biological and possibly lifestyle-driven disparities influencing MetS risk between the sexes. The most effective model emerged as the LightGBM regression model tailored for women, highlighting its superior performance in predicting MetS when compared to other models and approaches used in our study.

This tailored approach not only enhances the precision of risk assessment but also underscores the importance of considering gender-specific factors in the development of predictive health models. As we move forward, these insights can guide more personalized, effective preventive measures and interventions for MetS, potentially reducing the burden of this syndrome on the healthcare system.

Declarations

Author Contributions

All authors contributed to the conception and design of the study. SY and HL performed the data collection and analysis. SY and HL drafted the manuscript, and IY, SJ, and HH provided critical revisions. All authors reviewed and approved the final version of the manuscript to be published and agree to be accountable for all aspects of the work.

Data Availability Statement

The datasets generated and analyzed during the current study involve confidential patient data and are not publicly available due to privacy and ethical restrictions. Access to the data is restricted and can be provided by the corresponding author upon reasonable request, subject to necessary approvals from the relevant ethics and data protection authorities.

Additional Information

Competing Interests Statement: The authors declare no competing interests.

References

Galassi, A., Reynolds, K. & He, J. Metabolic syndrome and risk of cardiovascular disease: a meta-analysis. The American journal of medicine 119, 812-819 (2006).
Shi, T.H., Wang, B. & Natarajan, S. The influence of metabolic syndrome in predicting mortality risk among US adults: importance of metabolic syndrome even in adults with normal weight. Preventing chronic disease 17, E36 (2020).
Hammond, R.A. & Levine, R. The economic impact of obesity in the United States. Diabetes, metabolic syndrome and obesity: targets and therapy, 285-295 (2010).
Zimmet, P., Magliano, D., Matsuzawa, Y., Alberti, G. & Shaw, J. The metabolic syndrome: a global public health problem and a new definition. Journal of atherosclerosis and thrombosis 12, 295-300 (2005).
Huh, J.H., Kang, D.R., Kim, J.Y. & Koh, K.K. Metabolic syndrome fact sheet 2021: executive report. CardioMetabolic Syndrome Journal 1, 125-134 (2021).
Ford, E.S. & Li, C. Metabolic syndrome and health-related quality of life among US adults. Annals of epidemiology 18, 165-171 (2008).
Gheorghe, A., et al. The economic burden of cardiovascular disease and hypertension in low-and middle-income countries: a systematic review. BMC public health 18, 1-11 (2018).
Hirode, G. & Wong, R.J. Trends in the prevalence of metabolic syndrome in the United States, 2011-2016. Jama 323, 2526-2528 (2020).
Li, R., et al. Prevalence of metabolic syndrome in Mainland China: a meta-analysis of published studies. BMC public health 16, 1-10 (2016).
van Vliet-Ostaptchouk, J.V., et al. The prevalence of metabolic syndrome and metabolically healthy obesity in Europe: a collaborative analysis of ten large cohort studies. BMC endocrine disorders 14, 1-13 (2014).
Park, Y.-W., et al. The metabolic syndrome: prevalence and associated risk factor findings in the US population from the Third National Health and Nutrition Examination Survey, 1988-1994. Archives of internal medicine 163, 427-436 (2003).
Gennuso, K.P., Gangnon, R.E., Thraen-Borowski, K.M. & Colbert, L.H. Dose–response relationships between sedentary behaviour and the metabolic syndrome and its components. Diabetologia 58, 485-492 (2015).
Nádas, J., Putz, Z., Jermendy, G. & Hidvégi, T. Public awareness of the metabolic syndrome. Diabetes research and clinical practice 76, 155-156 (2007).
Ripsin, C. The metabolic syndrome: underdiagnosed and undertreated. Southern medical journal 102, 1194-1195 (2009).
Famiglini, L., Campagner, A., Carobene, A. & Cabitza, F. A robust and parsimonious machine learning method to predict ICU admission of COVID-19 patients. Med Biol Eng Comput, 1-13 (2022).
Kim, J., et al. Optimal Triage for COVID-19 Patients Under Limited Health Care Resources With a Parsimonious Machine Learning Prediction Model and Threshold Optimization Using Discrete-Event Simulation: Development Study. JMIR Med Inform 9, e32726 (2021).
Mizani, M.A., et al. Using national electronic health records for pandemic preparedness: validation of a parsimonious model for predicting excess deaths among those with COVID-19-a data-driven retrospective cohort study. J R Soc Med 116, 10-20 (2023).
Murri, R., et al. A machine-learning parsimonious multivariable predictive model of mortality risk in patients with Covid-19. Sci Rep 11, 21136 (2021).
Singh, V., et al. A deep learning approach for predicting severity of COVID-19 patients using a parsimonious set of laboratory markers. iScience 24, 103523 (2021).
Jalali, S.M.J., et al. Parsimonious Evolutionary-based Model Development. 2019 IEEE International Conference on Industrial Technology (ICIT), 800-805 (2019).
Sandokji, I., et al. A Time-Updated, Parsimonious Model to Predict AKI in Hospitalized Children. J Am Soc Nephrol 31, 1348-1357 (2020).
Jang, M., et al. Opportunistic Osteoporosis Screening Using Chest Radiographs With Deep Learning: Development and External Validation With a Cohort Dataset. J Bone Miner Res 37, 369-377 (2022).
Xiao, W., et al. Screening and identifying hepatobiliary diseases through deep learning using ocular images: a prospective, multicentre study. Lancet Digit Health 3, e88-e97 (2021).
Pickhardt, P.J. Value-added Opportunistic CT Screening: State of the Art. Radiology 303, 241-254 (2022).
Eisenberg, D., et al. Rates, Variability, and Predictors of Screening for Obesity: Are Individuals with Spinal Cord Injury Being Overlooked? Obesity Facts 15, 451-457 (2022).
Kim, H.C., et al. Korea hypertension fact sheet 2020: analysis of nationwide population-based data. Clinical hypertension 27, 1-4 (2021).
Shirani, S., et al. Awareness, treatment and control of hypertension, dyslipidaemia and diabetes mellitus in an Iranian population: the IHHP study. EMHJ-Eastern Mediterranean Health Journal 15, 1455-1463 (2009).
Bae, J.H., et al. Diabetes fact sheet in Korea 2021. Diabetes & Metabolism Journal 46, 417-426 (2022).
Handelsman, Y., et al. Early intervention and intensive management of patients with diabetes, cardiorenal, and metabolic diseases. Journal of Diabetes and its Complications, 108389 (2023).
Steinhubl, S.R., Muse, E.D. & Topol, E.J. Can mobile health technologies transform health care? Jama 310, 2395-2396 (2013).
Mantena, S., Celi, L.A., Keshavjee, S. & Beratarrechea, A. Improving community health-care screenings with smartphone-based AI technologies. The Lancet Digital Health 3, e280-e282 (2021).
Lee, J.-A., Choi, M., Lee, S.A. & Jiang, N. Effective behavioral intervention strategies using mobile health applications for chronic disease management: a systematic review. BMC medical informatics and decision making 18, 1-18 (2018).
Kweon, S., et al. Data resource profile: the Korea national health and nutrition examination survey (KNHANES). International journal of epidemiology 43, 69-77 (2014).
Kim, Y. The Korea National Health and Nutrition Examination Survey (KNHANES): current status and challenges. Epidemiology and health 36 (2014).
Rezaianzadeh, A., Namayandeh, S.-M. & Sadr, S.-M. National cholesterol education program adult treatment panel III versus international diabetic federation definition of metabolic syndrome, which one is associated with diabetes mellitus and coronary artery disease? International journal of preventive medicine 3, 552 (2012).
Iyer, A., Kauter, K. & Brown, L. Gender differences in metabolic syndrome-a Key research issue. Endocrine, Metabolic & Immune Disorders-Drug Targets (Formerly Current Drug Targets-Immune, Endocrine & Metabolic Disorders) 11, 182-188 (2011).
Rochlani, Y., Pothineni, N.V. & Mehta, J.L. Metabolic syndrome: does it differ between women and men? Cardiovascular drugs and therapy 29, 329-338 (2015).

SupplementaryTables.docx

Reviews received at journal
27 Jun, 2024
Reviewers agreed at journal
13 Jun, 2024
Reviewers agreed at journal
11 Jun, 2024
Reviewers invited by journal
29 May, 2024
Editor assigned by journal
29 May, 2024
Editor invited by journal
11 May, 2024
Submission checks completed at journal
10 May, 2024
First submitted to journal
07 May, 2024
