Predictive models have attracted considerable attention and improved substantially over the past decade, thanks to advances in machine learning and computing. Historically, some classic prediction models served primarily as tools for approximating disease risk and aiding medical decision-making15,16,17. Unlike most diagnostic criteria, which typically consider one or two risk factors with a fixed cutoff for risk categorization, predictive models incorporate a wide array of novel parameters and generate an individualized risk score on a continuous scale (from 0 to 1), providing a graded assessment rather than a binary outcome. The systematic integration of multiple risk factors into diagnosis may only be feasible through mathematical decision models; Vickers et al. discuss the challenges of implementing such models in routine medical practice18. Cost-effectiveness is a further advantage for healthcare systems: unlike vaccines or medications, which require manufacturing and distribution, a predictive model can be scaled almost without limit once it has been developed. Used in diagnostics, risk prediction models offer a rapid, non-invasive way to assess the risk of metabolic syndrome (MetS) without biochemical or laboratory testing.
A simple risk prediction model offers several advantages. The inputs required by our MetS risk prediction model can be gathered without health screenings or laboratory tests, which can take days or weeks to yield results. Importantly, the model does not rely on variables that require invasive procedures, such as blood tests. In principle, the model could be scaled to patients worldwide; however, it was trained only on Korean adults and would first need to be adapted to the characteristics of other ethnic groups.
MetS is recognized as a condition stemming from insulin resistance, which often precedes diabetes mellitus and various cardiovascular diseases24. A simple risk prediction model acts as a preliminary preventive measure by helping individuals decide whether to undergo further physical examinations to assess their metabolic health more closely. Our MetS risk prediction model discriminated between individuals with and without the condition better than other models for cardiovascular disease risks, such as those predicting diabetes mellitus, hypertension, and dyslipidemia. Given that each diagnostic criterion for MetS may be highly responsive to straightforward variables such as body mass index or lifestyle factors, a simple model is well suited to predicting MetS risk.
The three machine learning models were developed using data from Korean individuals aged over 20. Korea's well-established medical infrastructure provides access to a wealth of public data sources, including the Korean National Health Insurance Service cohorts, the Korean National Health and Nutrition Examination Survey (KNHANES), and electronic health records from major hospitals in Korea. Various prediction models for cancers and cardiovascular diseases, including MetS, have been developed from these comprehensive datasets. Our model is distinct in that it uses data up to the latest KNHANES 2019 release and incorporates only simple risk factors for MetS, which enhances its practicality and applicability.
The discrimination performance of our model in identifying individuals with and without MetS was benchmarked against leading machine learning models for MetS risk prediction. In a comparable context, Wang and colleagues41 developed an artificial neural network model to predict the risk of type 2 diabetes mellitus (T2DM) among rural adults in China, achieving sensitivity, specificity, and an area under the receiver operating characteristic curve (AUROC) of 0.869, 0.791, and 0.891, respectively. Our LightGBM model for women achieved sensitivity, specificity, and AUROC values of 0.709, 0.876, and 0.897, respectively: the AUROC is closely matched and the specificity is higher, although the sensitivity is lower. This comparison suggests that our model is competitive in overall discrimination within the landscape of predictive health modeling.
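To make the three reported metrics concrete, the following minimal Python sketch computes sensitivity, specificity, and AUROC from binary labels and predicted risk scores. It is illustrative only and is not the evaluation code used in the study; the function names, the toy data, and the cutoffs of 0.5 and 0.25 are assumptions. It shows that sensitivity and specificity depend on the chosen cutoff, whereas AUROC does not, which is why AUROC is the more direct basis for comparing models reported at different operating points.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_score: Sequence[float],
                            threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called positive.

    Assumes both classes are present in y_true (otherwise a division by zero occurs).
    """
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case is scored above
    a random negative case (ties count as one half)."""
    positives = [s for label, s in zip(y_true, y_score) if label == 1]
    negatives = [s for label, s in zip(y_true, y_score) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 4 individuals with the condition, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

sens_a, spec_a = sensitivity_specificity(y_true, y_score, threshold=0.5)
sens_b, spec_b = sensitivity_specificity(y_true, y_score, threshold=0.25)
area = auroc(y_true, y_score)

print(f"cutoff 0.50: sensitivity={sens_a:.3f}, specificity={spec_a:.3f}")
print(f"cutoff 0.25: sensitivity={sens_b:.3f}, specificity={spec_b:.3f}")
print(f"AUROC (cutoff-independent): {area:.3f}")
```

Lowering the cutoff in this toy example trades specificity for sensitivity while the AUROC is unchanged.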
Limitations
A significant limitation of this study is the reliance on self-reported lifestyle information, which could lead to misclassification or recall bias. The KNHANES dataset, being a cross-sectional survey that includes medical examinations, does not permit the analysis of causal relationships between risk factors and MetS or the progression of these events due to the absence of a temporal dimension. Consequently, our developed model only provides an immediate probability estimate of an individual having MetS, based on available medical examination results.
This limitation restricts the model's use primarily to pre-examination screening. In contrast, models trained with longitudinal, time-series data could additionally facilitate post-examination follow-up by incorporating changes over time. Including more dynamic features such as blood lipid levels, blood pressure, and other laboratory test results could significantly enhance the model’s performance.
Moreover, because MetS status is treated as a binary variable (either present or absent), the impact of risk factors in individuals with severe MetS might be overstated during model training. This binary approach may result in an overly conservative model that errs on the side of predicting MetS, especially for individuals with extremely high blood pressure, lipid levels, or body mass index.
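The loss of severity information under binary labeling can be illustrated with a short sketch. It assumes the common rule that MetS is present when at least three of five components are met; the study's exact diagnostic thresholds are not restated here, so each component is represented only as a hypothetical true/false flag, and the class and function names are illustrative.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class MetSComponents:
    """Hypothetical per-person flags; the cutoffs that set them are not modeled here."""
    central_obesity: bool
    elevated_blood_pressure: bool
    elevated_fasting_glucose: bool
    elevated_triglycerides: bool
    low_hdl_cholesterol: bool


def component_count(c: MetSComponents) -> int:
    """Number of components met (0 to 5)."""
    return sum(astuple(c))


def binary_label(c: MetSComponents, min_components: int = 3) -> int:
    """1 if MetS is considered present under the assumed 'at least three of five' rule."""
    return int(component_count(c) >= min_components)


# Two hypothetical individuals: one just over the rule, one meeting every component.
borderline = MetSComponents(True, True, True, False, False)
severe = MetSComponents(True, True, True, True, True)

for name, person in [("borderline", borderline), ("severe", severe)]:
    print(name, "components:", component_count(person), "label:", binary_label(person))
```

Both individuals receive the same training label of 1, so the classifier never sees the difference in severity between them.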
The absence of data on family history of cardiovascular disease, a known risk factor for MetS, in the earlier years of the KNHANES dataset is a notable limitation. Until 2010, the survey did not include questions regarding direct family history of various diseases, which could significantly affect the understanding and modeling of MetS risk factors. Additionally, exercise habit, which influences glucose tolerance and is an established lifestyle risk factor42, has extensive missing data in the dataset. Excluding individuals with missing exercise data reduced the sample size substantially, with nearly two-thirds of potential data points lost.
The accuracy of lifestyle-related variables, which are predominantly self-reported in surveys, poses another challenge. If these variables could be captured more accurately through alternative methods, it would greatly enhance the predictive power of models assessing the risk of incident diseases. Improved data collection techniques, such as objective monitoring devices or more detailed and frequent surveys, could provide more reliable and comprehensive data, thereby refining the predictive models and their outcomes.
Because the KNHANES dataset is representative exclusively of the Korean population, it reflects Korea's limited ethnic diversity. Consequently, the predictive accuracy of the MetS risk model might be compromised when it is applied to populations with different ethnic backgrounds. This recognition forms the basis for one of our future initiatives: external validation of our risk model using data from individuals of other ethnic backgrounds. This step is crucial for ensuring the model's applicability and reliability across diverse populations.
Despite these limitations, the KNHANES dataset also has significant strengths. As a comprehensive national survey of the Korean population, it is particularly well suited to developing models aimed at a general, "healthy" population. A hospital-based cohort, by contrast, often consists of individuals already seeking medical care; such data can skew toward higher disease prevalence and might not accurately represent the general population's health status.
Moreover, it is important to note that in our current analysis, variables or characteristics of the individuals have not been adjusted for age or other demographic factors. This oversight could affect the interpretations and conclusions drawn from the model, as age and other factors often significantly influence disease risk profiles. Future iterations of the model could benefit from incorporating these adjustments to enhance its accuracy and relevance.
Future directions
The advent of smart wearables has significantly advanced the capability for real-time blood pressure monitoring, yet these devices still require calibration with a baseline measurement from a conventional blood pressure monitor, which must be updated periodically. Improvements in the accuracy of wearable blood pressure measurements could facilitate the inclusion of systolic and diastolic blood pressure readings as "simple" variables in MetS risk prediction models.
Blood pressure is a critical risk factor for MetS and can substantially enhance a model's predictive accuracy. For instance, incorporating blood pressure readings into the LightGBM model for women, which is already our top-performing model, could improve its performance further. With the inclusion of these blood pressure variables, the model's area under the receiver operating characteristic curve (AUROC) could rise to 0.911, and its average precision could increase to 0.754. Such enhancements would bolster not only the model’s reliability but also its utility in clinical and preventative settings, offering more precise assessments for at-risk individuals.
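Average precision, mentioned alongside AUROC above, summarizes the precision-recall curve and is sensitive to class prevalence. A minimal sketch of its step-wise definition (the mean of the precision values at the rank of each true positive) is given below. It assumes distinct scores and at least one positive case, uses hypothetical toy data, and does not reproduce the values of 0.911 or 0.754.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean of precision@rank taken at the rank of every positive case.

    Assumes scores are distinct and y_true contains at least one positive.
    """
    # Rank individuals from highest to lowest predicted risk.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` scores
    return precision_sum / hits


# Hypothetical toy data: 4 of 10 individuals have the condition.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

ap = average_precision(y_true, y_score)
prevalence = sum(y_true) / len(y_true)

print(f"average precision: {ap:.3f}")
print(f"prevalence: {prevalence:.3f}")
```

The prevalence is printed as a reference point because a model that ranks individuals at random is expected to score close to it, which makes average precision a useful complement to AUROC when the condition is relatively uncommon.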
Exploring MetS in the non-obese population would be a valuable extension of our study, particularly given the context in Korea. Although the prevalence of obesity in Korea is lower (partly because of the different diagnostic criteria for MetS, which include waist circumference), the prevalence of MetS remains comparatively high. This discrepancy suggests a notable incidence of MetS among non-obese individuals, who might experience more severe forms of the syndrome. For non-obese individuals, typical first-line interventions for MetS, such as significant weight loss and strict dietary controls, are not applicable.
Non-obese individuals may face greater challenges with the other four MetS criteria—blood pressure, fasting glucose, triglycerides, and HDL cholesterol—which are often harder to manage than central obesity. In such cases, more invasive treatments may be necessary, and the potential for improvement through lifestyle modification alone may be limited.
Out of 50,428 individuals surveyed, 33,662 were non-obese (body mass index ≤ 25 kg/m²), with a MetS prevalence of approximately 16% (n = 5,375) within this group. Developing a targeted prediction model for this demographic could significantly enhance awareness and early detection of MetS among non-obese individuals, who may otherwise perceive themselves as at lower risk. Such a model would not only adjust for the absence of obesity but also emphasize the importance of monitoring other metabolic factors that contribute to the syndrome. This focused approach could lead to better tailored interventions and ultimately improve health outcomes for this specific population.
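The proportions quoted above can be checked directly from the reported counts. The short sketch below uses only the three figures given in the text; the variable names are illustrative.

```python
# Counts as reported in the text.
total_surveyed = 50_428
non_obese = 33_662
non_obese_with_mets = 5_375

# Share of the surveyed sample that is non-obese.
non_obese_share = non_obese / total_surveyed

# MetS prevalence within the non-obese group.
mets_prevalence_non_obese = non_obese_with_mets / non_obese

# Remaining individuals, i.e. those not classified as non-obese.
obese = total_surveyed - non_obese

print(f"non-obese share of sample: {non_obese_share:.1%}")
print(f"MetS prevalence among non-obese: {mets_prevalence_non_obese:.1%}")
print(f"individuals not classified as non-obese: {obese}")
```

The computed prevalence of about 16.0% agrees with the rounded figure in the text, and about two-thirds of the surveyed sample falls into the non-obese group.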
The prevalence of MetS in our training and validation set was notably higher than in other studies, which underscores how sensitive prevalence is to the MetS diagnostic criteria chosen for labeling individuals. This choice is crucial because it directly influences the training and performance of the predictive model. For instance, the criteria used by Hirose and colleagues27, which excluded the non-obese, resulted in a low reported prevalence of MetS. Such discrepancies highlight that differences in MetS prevalence may stem from methodological choices in defining MetS, in addition to potential ethnic variations.
Furthermore, expanding the model to include high-risk drinking status, alongside the frequency of alcohol consumption or the amount consumed per occasion, could enhance the prediction of MetS risk. Hong et al.43 used the Alcohol Use Disorders Identification Test (AUDIT) to categorize individuals from the KNHANES 2010–2012 data into three groups based on their AUDIT scores. The analysis showed statistically significant differences in most characteristics among the groups defined by high-risk drinking status. Incorporating such nuanced alcohol consumption data could refine the model’s accuracy by providing a more detailed picture of lifestyle factors that contribute to MetS risk. This approach would be particularly beneficial for tailoring interventions and preventative measures to specific risk profiles within the population.