The association of lifestyle with cardiovascular and all-cause mortality based on machine learning: A Prospective Study from the NHANES

doi:10.21203/rs.3.rs-4664906/v1


Research Article


This work is licensed under a CC BY 4.0 License

Abstract

Background

It is currently unclear whether machine learning-based methods using lifestyle factors can effectively predict all-cause mortality and cardiovascular disease mortality.

Method

A prospective cohort study was conducted using a nationally representative sample of adults aged 40 years or older, drawn from the US National Health and Nutrition Examination Survey from 2007 to 2010. The participants underwent a comprehensive in-person interview and medical laboratory examinations, and subsequently, their records were linked with the National Death Index for further analysis.

Result

In a cohort of 7921 participants with an average follow-up of 9.75 years, a total of 1911 deaths were recorded, including 585 cardiovascular deaths. The model predicted all-cause and cardiovascular mortality with areas under the receiver operating characteristic curve (AUC) of 0.848 and 0.829, respectively. Stratifying participants into distinct risk groups based on ML scores proved effective. All lifestyle behaviors were inversely associated with all-cause and cardiovascular mortality. As age increased, the effects of dietary score and sedentary time became more apparent, whereas the opposite trend was observed for physical activity.

Conclusion

We developed an ML model based on lifestyle behaviors to predict all-cause and cardiovascular mortality. The model offers valuable insights for assessing individual lifestyle-related risks and can help individuals, healthcare professionals, and policymakers make informed decisions.

Keywords

Cardiovascular mortality, All-cause mortality, Lifestyle behavior, Risk stratification, Mortality prediction, Machine learning

Introduction

Cardiovascular disease (CVD) poses a formidable challenge to global health, contributing significantly to non-communicable diseases (NCDs) and representing a leading cause of mortality worldwide^{1, 2}. According to data from the World Health Organization (WHO), cardiovascular disease contributes to the deaths of nearly one million people in the United States, accounting for 30% of total annual mortality³. The escalating prevalence of CVD over the past few decades underscores the urgency of identifying effective preventive measures. Extensive research has established a link between modifiable lifestyles and cardiovascular mortality^{4-7}. Because lifestyle is modifiable, it has considerable practical value as a predictor. Model predictions can inform people of the risk associated with their current lifestyle and encourage a transition to healthier behaviors. However, traditional statistical methods have limitations when used to build predictive models, as they struggle to handle the intricate interactions among numerous variables.

Machine learning (ML), with its ability to analyze vast and complex datasets, offers a way around the limitations of traditional methods in unraveling the multifaceted associations between lifestyle choices and mortality outcomes⁸. Unlike conventional statistical models that rely on predefined hypotheses and assumptions, ML algorithms can identify intricate patterns and nonlinear relationships within data, offering a more holistic and data-driven perspective^{9, 10}. An increasing number of recent studies have applied ML in the field of cardiovascular disease^{11, 12}. This is particularly important in cardiovascular health, where the effects of diverse lifestyle factors may be subtle and interconnected.

The NHANES dataset has a distinct advantage in its comprehensive coverage of health, lifestyle, and biochemical information, providing a rich data source for analysis^{13-15}. High-quality, standardized collection and testing procedures reduce the potential for measurement bias and support the reliability of the data. This data quality, coupled with the breadth of information, provides a reliable foundation for exploring the relationship between lifestyle and both cardiovascular and all-cause mortality.

This study aims to establish a lifestyle-based predictive model for mortality and to examine the role of individual lifestyle factors using ML models.

Methods

The prospective cohort was derived from the National Health and Nutrition Examination Survey (NHANES), a nationwide survey conducted biennially since 1999.

Study Population

The sample population was derived from the NHANES cycles of 2007-2008 and 2009-2010. We selected participants aged over 40 who completed in-person interviews, physical examinations, and laboratory tests in a mobile examination center. The screening process is shown in Supplementary Figure 1.

Study outcomes

The follow-up data were obtained from the National Center for Health Statistics (NCHS), which links the NHANES survey population with the death records of the National Death Index (NDI). Cardiovascular mortality was determined using the International Statistical Classification of Diseases, 10th Revision (ICD-10), and the NCHS classification of cardiovascular diseases (codes 054-068 and 070). We linked participants with the 2019 mortality data records and excluded individuals whose follow-up time or survival status could not be ascertained.
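As a minimal sketch of how the outcome labels described above could be derived, the following Python function maps a mortality record to the two binary outcomes. The function name, its arguments, and the way the cause code is stored are assumptions made for illustration; only the code ranges 054-068 and 070 come from the text.

```python
# Cause-of-death recode values treated as cardiovascular in the text:
# 054-068 and 070. Everything else counts only toward all-cause mortality.
CVD_RECODES = set(range(54, 69)) | {70}


def classify_outcome(deceased, cause_recode=None):
    """Return (all_cause_death, cardiovascular_death) as 0/1 flags.

    `deceased` and `cause_recode` are hypothetical inputs; the recode may be
    an int or a zero-padded string such as "054".
    """
    if not deceased:
        return 0, 0
    is_cvd = cause_recode is not None and int(cause_recode) in CVD_RECODES
    return 1, int(is_cvd)
```

For example, a death with recode "054" would count toward both outcomes, whereas a death with recode 69 would count toward all-cause mortality only.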

Model features

The model included the following features: age, gender, race, BMI, education level, income, hypertension, diabetes, family history of disease, non-HDL cholesterol, C-reactive protein, diet score, physical activity level, sedentary minutes, sleep quality, alcohol consumption, and smoking status. Age, gender, race, education level, income, and history of disease in close relatives were obtained directly from interview data. BMI was derived from physical examination data, while non-HDL cholesterol and C-reactive protein values were obtained from laboratory test data. Sedentary minutes were obtained from the physical activity questionnaire.

NHANES contains a wealth of nutrition information gathered through health interviews, health examinations, and laboratory testing. Participants completed a first 24-hour dietary recall interview (Day 1) as part of their health examination at the mobile examination center. They were then asked to complete a second 24-hour dietary recall interview (Day 2) 3 to 10 days after the first. Dietary patterns were rated in three steps. First, each reported food was linked to the Food Patterns Equivalents Database (FPED) of the US Department of Agriculture (USDA) using its USDA food code. Second, the daily nutritional intake of each participant was estimated from the Day 1 and Day 2 recalls. Third, dietary patterns were scored with reference to the US Dietary Guidelines 2020-2025 and the scoring rules of the Healthy Eating Index (HEI).
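The two numerical steps in this procedure, averaging the two recalls and converting an intake into a bounded component score, can be sketched as follows. This is only an illustration: the function names are hypothetical, the fallback to Day 1 when Day 2 is missing is an assumption not stated in the text, and the thresholds passed in the usage note are placeholders that would need to be replaced by the official HEI scoring standards.

```python
def mean_daily_intake(day1, day2=None):
    """Average two 24-hour recalls.

    Falling back to day 1 alone when day 2 is missing is an assumption.
    """
    if day2 is None:
        return day1
    return (day1 + day2) / 2


def component_score(density, zero_at, full_at, max_points):
    """Score a dietary component linearly between two standards.

    `zero_at` is the intake density that earns 0 points and `full_at` the
    density that earns `max_points`. Adequacy components have
    full_at > zero_at; moderation components have full_at < zero_at.
    """
    fraction = (density - zero_at) / (full_at - zero_at)
    return max_points * min(1.0, max(0.0, fraction))
```

With placeholder standards of 0.0 and 0.8 and a maximum of 5 points, `component_score(0.4, 0.0, 0.8, 5)` gives half of the available points; a total diet score would then be the sum over all components.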

Physical activity was obtained from the NHANES physical activity questionnaire, which records the weekly exercise intensity and corresponding duration reported by participants. Participants were classified into four groups based on the 2nd edition of the Physical Activity Guidelines for Americans. The "Inactive" group comprised individuals not involved in any moderate- or vigorous-intensity physical activity beyond basic daily life movements. Those deemed "Insufficiently active" engaged in some moderate- or vigorous-intensity physical activity but did not reach the threshold of 150 minutes of moderate-intensity activity per week, 75 minutes of vigorous-intensity activity per week, or the equivalent combination. The "Active" category comprised participants achieving the equivalent of 150 to 300 minutes of moderate-intensity physical activity weekly, which meets the key guideline target range for adults. Lastly, the "Highly active" group included individuals undertaking more than 300 minutes of moderate-intensity physical activity weekly, which exceeds the key guideline target range for adults.
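The four categories can be expressed as a small classification function. The sketch below assumes that one minute of vigorous-intensity activity counts as two minutes of moderate-intensity activity, which is how the "equivalent combination" is usually computed; the function and argument names are hypothetical.

```python
def activity_group(moderate_min_week, vigorous_min_week):
    """Classify weekly activity into the four groups described in the text.

    Assumption: 1 vigorous minute = 2 moderate-equivalent minutes.
    """
    equivalent = moderate_min_week + 2 * vigorous_min_week
    if equivalent <= 0:
        return "Inactive"
    if equivalent < 150:
        return "Insufficiently active"
    if equivalent <= 300:
        return "Active"
    return "Highly active"
```

Under this assumption, a participant reporting 60 moderate and 60 vigorous minutes per week accumulates 180 moderate-equivalent minutes and falls in the "Active" group.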

Due to the J-shaped association between sleep duration and all-cause mortality, participants were divided into three groups based on sleep duration: optimal (6-8 hours/day), intermediate (5-5.9 or 8.1-10 hours/day), and poor (<5 or >10 hours/day)^{6, 16}.
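A minimal sketch of this three-level grouping is given below. The function name is hypothetical, and durations that fall between the stated ranges (for example 5.95 hours) are assigned to the intermediate group, which is an assumption about how the boundaries are handled.

```python
def sleep_group(hours_per_day):
    """Map self-reported sleep duration to the three groups in the text."""
    if hours_per_day < 5 or hours_per_day > 10:
        return "poor"
    if 6 <= hours_per_day <= 8:
        return "optimal"
    # Remaining values lie in 5 to <6 hours or >8 to 10 hours.
    return "intermediate"
```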

Smoking status was categorized into three groups based on responses to the cigarette use questionnaire: non-smokers, former smokers, and current smokers. Data on alcohol consumption were derived from the alcohol use questionnaire, in which participants reported the frequency and quantity of drinks consumed. Average daily alcohol consumption was used to measure the level of alcohol consumption among participants.
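One way to turn a reported drinking frequency and quantity into an average daily consumption is sketched below. The input format (a count, a time unit, and the number of drinks on a drinking day) is an assumption about how the questionnaire responses are coded, and the names are hypothetical.

```python
# Number of days covered by each reporting unit (assumed coding).
DAYS_PER_UNIT = {"week": 7, "month": 365 / 12, "year": 365}


def average_daily_drinks(frequency, unit, drinks_per_drinking_day):
    """Average drinks per day = share of days with drinking x drinks per such day."""
    share_of_days = frequency / DAYS_PER_UNIT[unit]
    return share_of_days * drinks_per_drinking_day
```

For instance, drinking on 7 days per week with 2 drinks on each drinking day corresponds to an average of 2 drinks per day.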

Variables related to mental health with more than 40% missing values were excluded from the analysis. The random forest (RF) algorithm was then used to impute missing values in the remaining dataset. To reduce the influence of differing measurement scales and improve modeling efficiency, continuous variables were rescaled and standardized. The data distribution before imputation is presented in Supplementary Tables 1 and 2.
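The exclusion rule and the standardization step can be illustrated with the standard library alone. This is a sketch under assumptions: missing values are represented as `None`, the helper names are hypothetical, and the random forest imputation itself is not reproduced here because it would normally rely on an external library.

```python
from statistics import fmean, pstdev


def missing_fraction(values):
    """Share of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def exceeds_missing_limit(values, limit=0.40):
    """True when a variable has more than `limit` missing values."""
    return missing_fraction(values) > limit


def standardize(values):
    """Rescale a complete (already imputed) variable to mean 0 and SD 1."""
    mean = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]
```

A variable with exactly 40% missing values would be kept under this reading of the rule, since only variables exceeding the limit are dropped.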

Model development and Risk stratification

A binary classification model was constructed from follow-up data and participant features to predict mortality. Several ML classifiers were evaluated, including logistic regression, ridge regression, support vector machines, random forest, and Extreme Gradient Boosting (XGBoost). First, cross-validation was run on the selected models to determine the approximate range of optimal values for each parameter. A grid search with 10-fold cross-validation was then used to select the best model. To assess the performance of each model, the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) were computed. The model output was calibrated using Platt scaling, and the effect of calibration was assessed by comparing the Brier scores of the uncalibrated and calibrated outputs.
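The three evaluation quantities named here (AUC, Brier score, and Platt scaling) can be written out in a few lines of dependency-free Python. The sketch below is illustrative rather than a reproduction of the study's pipeline: the function names, the learning rate, and the number of iterations are assumptions, and in practice an ML library would normally be used.

```python
import math


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def fit_platt(scores, y_true, lr=0.1, epochs=5000):
    """Platt scaling: fit p = sigmoid(a * score + b) by gradient descent
    on the log-loss. Returns the pair (a, b)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, y_true):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(scores, a, b):
    """Convert raw scores to calibrated probabilities."""
    return [1.0 / (1.0 + math.exp(-(a * s + b))) for s in scores]
```

Comparing `brier_score` on the raw and on the Platt-scaled outputs of a held-out set gives the kind of before-and-after comparison described above; a lower Brier score indicates better calibrated probabilities.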

Participants were stratified into three groups based on the tertiles of the ten-year survival probability predicted by the model. The discriminative ability of the model was further validated by employing the log-rank test to compare the survival curves among these groups.
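The tertile-based grouping can be sketched as a rank-based split of the predicted survival probabilities. The group labels and the function name are hypothetical, and the log-rank comparison of the resulting survival curves is not shown.

```python
def stratify_by_tertile(survival_probs):
    """Assign each participant to a risk group by tertile of predicted
    survival probability (lowest third = highest risk)."""
    labels_by_tertile = ("high risk", "medium risk", "low risk")
    n = len(survival_probs)
    # Indices ordered from lowest to highest predicted survival.
    order = sorted(range(n), key=lambda i: survival_probs[i])
    groups = [None] * n
    for rank, idx in enumerate(order):
        tertile = min(3 * rank // n, 2)
        groups[idx] = labels_by_tertile[tertile]
    return groups
```

Splitting by rank rather than by fixed probability cut-offs keeps the three groups the same size, give or take one participant when the sample size is not divisible by three.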

Feature importance based on machine learning models

SHAP (SHapley Additive exPlanations) was used to estimate the feature importance ranking, the main effect of each feature, and the interaction effects between features. SHAP is a well-established method for calculating the marginal contribution of each feature to the model's output. It provides insight from both global and local perspectives and is particularly useful for interpreting "black-box" models.
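To make the notion of a marginal contribution concrete, the following toy example computes exact Shapley values by enumerating every feature subset. It is a conceptual sketch only: replacing absent features with a fixed baseline is a simplifying assumption, the names are hypothetical, and this brute-force approach is not how tree-specific SHAP implementations work internally.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance `x`.

    Features outside a coalition take their value from `baseline`.
    The cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi
```

For an additive toy model such as `lambda v: 2 * v[0] + 3 * v[1]`, each feature receives exactly its own term, while for a pure interaction such as `lambda v: v[0] * v[1]` the contribution is split equally between the two features. In both cases the values sum to the difference between the prediction for `x` and the prediction for the baseline.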

Results

Baseline characteristics

The cohort consisted of 7921 participants with a mean age of 60.79±12.18 years, of whom 3866 (48.81%) were male. During an average follow-up of 9.75 years, there were 1911 deaths (24.13%), 585 of which were attributed to cardiovascular disease. Detailed information is shown in Table 1. Lifestyle characteristics differed between participants who were alive at follow-up and those who died from any cause or from cardiovascular disease.

Performance of models

Table 2 presents the AUC scores of all models for predicting all-cause mortality and cardiovascular disease mortality. XGBoost was the top-performing model, achieving an AUC of 0.848 for all-cause mortality and 0.829 for cardiovascular disease mortality. The grid search parameter dictionary and the optimal parameter values are given in Supplementary Table 2. Calibration improved the Brier scores; details are given in Supplementary Table 3. Figure 1 shows the calibrated and uncalibrated AUC scores of the XGBoost model. The calibrated score was 0.884, indicating that the model fits the data well.

Machine learning-based risk stratification

Based on the calibrated output, participants were divided into three groups. The survival curve of each group is shown in Figure 2. Supplementary Table 4 shows that the survival curves differ significantly between groups, demonstrating that the model effectively distinguishes individuals with different mortality risks.

Feature importance and the role of features in the model

In the prediction of both all-cause mortality and cardiovascular disease mortality, age, gender, and diabetes status contributed substantially to the predictions (Figure 3). Among lifestyle factors, smoking, alcohol consumption, and physical activity had a substantial impact on the prediction of all-cause mortality. The model also indicates that reduced sedentary time, higher dietary scores, and increased physical activity lower individual risk scores.

Feature interaction effects

Given the prominent role of age in the model predictions, it is essential to further explore the interaction between age and various lifestyle factors. As shown in Figure 4, the impact of diet score and sedentary time on outcome prediction becomes more pronounced with advancing age, while the impact of physical activity level exhibits an opposite trend.

Discussion

In this prospective cohort with an average follow-up of 9.75 years, a model was developed and validated to predict both all-cause and cardiovascular mortality from a comprehensive dataset of lifestyle data and basic characteristics. In addition, the effects of lifestyle factors on all-cause and cardiovascular mortality, and their interactions with age, were estimated using SHAP. These estimates indicate that lifestyle factors affect outcome predictions to varying degrees and show diverse patterns of interaction with age.

Interpreting multiple risk factors simultaneously for an individual outcome is a challenge for the general public as well as for healthcare professionals and policymakers. Using ML algorithms, we established a lifestyle-based predictive model and explored the contributions of diverse factors to survival outcomes. The results indicate that the model performs well and can reveal the roles of less influential predictors. The potential impact of complex and subtle interactions among predictors is often overlooked. The inherent advantages of tree models, combined with SHAP, allowed us to explore interactions among the predictors.

According to the Physical Activity Guidelines for Americans, 2nd edition, there is a positive correlation between sedentary time and all-cause mortality¹⁷. A prospective study from NHANES showed that the risk of all-cause mortality increases as sedentary time lengthens¹⁸. Similarly, a longitudinal study conducted in China identified an association between sedentary behavior and all-cause mortality¹⁹. In our model, sedentary behavior pushes predictions toward adverse events, consistent with previous research. Furthermore, we found that sedentary behavior has a stronger impact on cardiovascular mortality, where it ranks higher in the feature importance analysis. The relationships of lifestyle factors such as physical activity²⁰, diet²¹, and sleep²² with all-cause and cardiovascular mortality have been described in detail in previous literature and are consistent with our findings. Overall, machine learning models and traditional models have reached similar conclusions regarding the relationship between lifestyle factors and mortality.

Beyond lifestyle factors, age and gender, two fundamental demographic characteristics, play a significant role in the model. Although a minority of studies suggest that the role of age in their models is not statistically significant, most research, including our findings, indicates that age plays a non-negligible role in outcome prediction^{23, 24}. Studies in various countries and regions consistently indicate that females tend to have lower mortality rates or death risks than males^{25-27}. This is also reflected in our model, where male gender shifts the prediction toward a higher likelihood of death. This may be attributed to a higher proportion of females adopting healthier lifestyles compared with males²⁸. Additionally, relatively higher estrogen levels in females may help maintain healthy vascular function²⁹. Moreover, females might be more inclined to proactively address health issues and seek early treatment³⁰.

Leveraging the advantages of tree-based models in exploring interactions³¹, we conducted additional analyses of the interactions between lifestyle factors and age using SHAP. As age increases, the impact of diet and sedentary behavior on adverse outcomes gradually strengthens, while the effect of physical activity diminishes. Specifically, the gap between recommended and non-recommended levels of diet and sedentary behavior widens with age, while the gap between recommended and non-recommended levels of physical activity gradually narrows. Given the limited literature on the interaction between lifestyle and age, more research is needed to confirm this finding. One possible explanation is insufficient granularity in the categorization of physical activity. We classified physical activity as a categorical variable with four levels. As participants age, their physical activity decreases, yet they may still fall within the same category as younger individuals, which attenuates the apparent effect. This can be observed in Supplementary Table 7: as age increases, the average exercise time within the same physical activity group gradually decreases. However, this does not imply that the role of physical activity can be disregarded in the elderly population. One reason is that low physical activity levels can exacerbate the adverse effects of sedentary behavior^{32, 33}.

Countless factors are associated with mortality outcomes, and it is not practical to include all relevant variables in a predictive model. Although lifestyle may not be the most significant predictor among the many related variables, it has one valuable property: modifiability. Policymakers and healthcare professionals can raise public awareness and guide individuals toward healthier lifestyles through means such as education and outreach. Our model enables users to predict mortality based on their current conditions, serving as a warning and reminder that can support a move toward healthier lifestyles. This is why we chose to establish a predictive model for lifestyle-related mortality.

Strengths and limitations

This research has several strengths and limitations that need to be acknowledged. We used a sufficiently large dataset and applied measures such as 10-fold cross-validation to validate the stability of the model. However, our data are derived from a single cohort, and the model has not been externally validated. Because this is a prospective cohort study, the reliability of causal inference is relatively strong. However, a small fraction of participants were lost to follow-up (LTFU) or withdrew from the study for various reasons, so some outcome events may not have been captured. To the best of our knowledge, this study represents the first attempt to apply ML algorithms to explore the relationship between lifestyle and mortality. Additionally, we leveraged the advantages of tree models to investigate interactions in this context. However, inferences about the role of features based on ML only describe the features' impact on outcome prediction within the model and may not reflect their real-world effects. The actual effects require further assessment in conjunction with domain expertise.

Conclusion

By employing modifiable lifestyle factors and readily available indicators, we effectively predicted all-cause mortality and cardiovascular disease mortality using the XGBoost model. This model can serve as a valuable predictive tool to encourage individuals to modify unhealthy lifestyles and prevent adverse events.

Ethics approval

All NHANES protocols received approval from the National Center for Health Statistics ethics review board, and written informed consent was obtained from all participants. The modeling survey was deemed exempt from further review.

Data availability

The data analysed during the current study are available from the National Health and Nutrition Examination Survey [https://www.cdc.gov/nchs/nhanes/index.htm]. We conducted our analyses using the open-source software Python (version 3.9.0).

Supplementary data

Supplementary data are available at IJE online.

Funding

This study is supported by Collaborative Innovation System Research on Drug Intervention&Non-drug Intervention in Proactive Health Context (20220518A), Zhengzhou University Education Reform Research and Practice Project (2023ZZUJGXM222), Henan Zhongyuan Medical Science and Technology Innovation and Development Foundation (23YCG1006) and the 2024 Graduate Independent Innovation Project of Zhengzhou University (20240332).

Conflict of interest

None declared.

Authors' contributions

Xinghong Guo, Jian Wu and Beizhu Ye conceptualized and designed the research. Clifford Silver Tarimo and Mingze Ma led the data collection. Xinghong Guo, Fengyi Fei and Lipei Zhao performed the statistical analyses and drafted the first version of the manuscript. All authors contributed to drafting the manuscript, approved the final version, and agree with the order of presentation of the authors.

Acknowledgements

We would like to thank NHANES for providing publicly available data, and most importantly, we thank all participants for their participation.

References

1. World Health Organization. Call for public comments on the draft WHO Guidelines: Saturated fatty acid and trans-fatty intake for adults and children. [cited 2018-02-06]; Available from: https://www.who.int/news-room/articles-detail/call-for-public-comments-on-the-draft-who-guidelines--saturated-fatty-acid-and-trans-fatty-intake-for-adults-and-children
2. Roth GA, Mensah GA, Johnson CO, et al. Global Burden of Cardiovascular Diseases and Risk Factors, 1990-2019: Update From the GBD 2019 Study. J Am Coll Cardiol 2020; 76: 2982-3021.
3. World Health Organization. WHO Mortality Database: Interactive platform visualizing mortality data. [cited 2023-12-25]; Available from: https://platform.who.int/mortality/themes/theme-details/mdb/noncommunicable-diseases
4. Khraishah H, Alahmad B, Ostergard RL, Jr., et al. Climate change and cardiovascular disease: implications for global health. Nat Rev Cardiol 2022; 19: 798-812.
5. Li Y, Pan A, Wang DD, et al. Impact of Healthy Lifestyle Factors on Life Expectancies in the US Population. Circulation 2018; 138: 345-55.
6. Lu Q, Zhang Y, Geng T, et al. Association of Lifestyle Factors and Antihypertensive Medication Use With Risk of All-Cause and Cause-Specific Mortality Among Adults With Hypertension in China. JAMA Netw Open 2022; 5: e2146118.
7. Sotos-Prieto M, Bhupathiraju SN, Mattei J, et al. Association of Changes in Diet Quality with Total and Cause-Specific Mortality. N Engl J Med 2017; 377: 143-53.
8. Kline A, Wang H, Li Y, et al. Multimodal machine learning in precision health: A scoping review. NPJ Digit Med 2022; 5: 171.
9. Ngiam KY, Khor IW. Big data and machine learning algorithms for health-care delivery. Lancet Oncol 2019; 20: e262-e73.
10. Cui F, Yue Y, Zhang Y, Zhang Z, Zhou HS. Advancing Biosensors with Machine Learning. ACS Sens 2020; 5: 3346-64.
11. Al'Aref SJ, Anchouche K, Singh G, et al. Clinical applications of machine learning in cardiovascular disease and its relevance to cardiac imaging. Eur Heart J 2019; 40: 1975-86.
12. Deo RC. Machine Learning in Medicine. Circulation 2015; 132: 1920-30.
13. Centers for Disease Control and Prevention. About the National Health and Nutrition Examination Survey. [cited 2023-12-26]; Available from: https://www.cdc.gov/nchs/nhanes/about_nhanes.htm
14. Zhang YB, Chen C, Pan XF, et al. Associations of healthy lifestyle and socioeconomic status with mortality and incident cardiovascular disease: two prospective cohort studies. BMJ 2021; 373: n604.
15. Liu J, Micha R, Li Y, Mozaffarian D. Trends in Food Sources and Diet Quality Among US Children and Adults, 2003-2018. JAMA Netw Open 2021; 4: e215262.
16. Wang C, Bangdiwala SI, Rangarajan S, et al. Association of estimated sleep duration and naps with mortality and cardiovascular events: a study of 116 632 people from 21 countries. Eur Heart J 2019; 40: 1620-9.
17. Office of Disease Prevention and Health Promotion. Physical Activity Guidelines for Americans, 2nd edition; 2018.
18. Cao C, Friedenreich CM, Yang L. Association of Daily Sitting Time and Leisure-Time Physical Activity With Survival Among US Cancer Survivors. JAMA Oncol 2022; 8: 395-403.
19. Lin Y, Liu Q, Liu F, et al. Adverse associations of sedentary behavior with cancer incidence and all-cause mortality: A prospective cohort study. J Sport Health Sci 2021; 10: 560-9.
20. Mok A, Khaw KT, Luben R, Wareham N, Brage S. Physical activity trajectories and mortality: population based cohort study. BMJ 2019; 365: l2323.
21. Naghshi S, Sadeghi O, Willett WC, Esmaillzadeh A. Dietary intake of total, animal, and plant proteins and risk of all cause, cardiovascular, and cancer mortality: systematic review and dose-response meta-analysis of prospective cohort studies. BMJ 2020; 370: m2412.
22. Svensson T, Saito E, Svensson AK, et al. Association of Sleep Duration With All- and Major-Cause Mortality Among Adults in Japan, China, Singapore, and Korea. JAMA Netw Open 2021; 4: e2122837.
23. Huang J, Liao LM, Weinstein SJ, Sinha R, Graubard BI, Albanes D. Association Between Plant and Animal Protein Intake and Overall and Cause-Specific Mortality. JAMA Intern Med 2020; 180: 1173-84.
24. Davis JS, Banfield E, Lee HY, Peng HL, Chang S, Wood AC. Lifestyle behavior patterns and mortality among adults in the NHANES 1988-1994 population: A latent profile analysis. Prev Med 2019; 120: 131-9.
25. Zhou T, Yuan Y, Xue Q, et al. Adherence to a healthy sleep pattern is associated with lower risks of all-cause, cardiovascular and cancer-specific mortality. J Intern Med 2022; 291: 64-71.
26. Yun JE, Won S, Kimm H, Jee SH. Effects of a combined lifestyle score on 10-year mortality in Korean men and women: a prospective cohort study. BMC Public Health 2012; 12: 673.
27. Tamakoshi A, Tamakoshi K, Lin Y, Yagyu K, Kikuchi S. Healthy lifestyle and preventable death: findings from the Japan Collaborative Cohort (JACC) Study. Prev Med 2009; 48: 486-92.
28. Zhu N, Yu C, Guo Y, et al. Adherence to a healthy lifestyle and all-cause and cause-specific mortality in Chinese adults: a 10-year prospective study of 0.5 million people. Int J Behav Nutr Phys Act 2019; 16: 98.
29. Morselli E, Santos RS, Criollo A, Nelson MD, Palmer BF, Clegg DJ. The effects of oestrogens and their receptors on cardiometabolic health. Nat Rev Endocrinol 2017; 13: 352-64.
30. Courtenay WH. Constructions of masculinity and their influence on men's well-being: a theory of gender and health. Soc Sci Med 2000; 50: 1385-401.
31. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785-94.
32. Bull FC, Al-Ansari SS, Biddle S, et al. World Health Organization 2020 guidelines on physical activity and sedentary behaviour. Br J Sports Med 2020; 54: 1451-62.
33. Li S, Lear SA, Rangarajan S, et al. Association of Sitting Time With Mortality and Cardiovascular Events in High-Income, Middle-Income, and Low-Income Countries. JAMA Cardiol 2022; 7: 796-807.

Table 1 Baseline characteristics of participants

	All (n=7921)	Alive (n=6010)	Dead (n=1911)	Cardiovascular death (n=585)
Age (years)	60.79±12.18	57.55±10.95	70.97±10.05^‡	72.75±9.52^‡
Male	3866 (48.81%)	2800 (46.59%)	1066 (55.78%)	322 (55.04%)
BMI (kg/m²)	28.57±7.13	29.01±6.83	27.20±7.86^‡	27.28±8.26^‡
Education level			^†
Less than 9th grade	1287 (16.25%)	929 (15.46%)	358 (18.73%)	106 (18.12%)
9-11th grade	1318 (16.64%)	931 (15.49%)	387 (20.25%)	127 (21.71%)
High school graduate/GED or equivalent	1853 (23.39%)	1381 (22.98%)	472 (24.70%)	153 (26.15%)
Some college or AA degree	1937 (24.45%)	1513 (25.17%)	424 (22.19%)	122 (20.85%)
College graduate or above	1526 (19.27%)	1256 (20.90%)	270 (14.13%)	77 (13.16%)
Ethnicity			^‡	^‡
Mexican American	1256 (15.86%)	1095 (18.22%)	161 (8.42%)	34 (5.81%)
Other Hispanic	825 (10.42%)	689 (11.46%)	136 (7.12%)	46 (7.86%)
Non-Hispanic White	3961 (50.01%)	2786 (46.36%)	1175 (61.49%)	372 (63.59%)
Non-Hispanic Black	1537 (19.40%)	1165 (19.38%)	372 (19.47%)	110 (18.80%)
Other race, including multi-racial	342 (4.32%)	275 (4.58%)	67 (3.51%)	23 (3.93%)
Ratio of household income to poverty line	2.33±1.72	2.45±1.78	1.94±1.47^‡	1.93±1.43^‡
Hypertension	4453 (56.22%)	3048 (50.72%)	1405 (73.52%)^‡	456 (77.95%)^‡
Diabetes status			^‡	^‡
No diabetes	5856 (73.93%)	4678 (77.84%)	1178 (61.64%)	346 (59.15%)
Diabetes	1557 (19.66%)	977 (16.26%)	580 (30.35%)	190 (32.48%)
Prediabetes	508 (6.41%)	355 (5.91%)	153 (8.01%)	49 (8.38%)
Close relative with heart attack	1322 (16.69%)	946 (15.74%)	376 (19.68%)^‡	125 (21.37%)^‡
Close relative with diabetes	3455 (43.62%)	2688 (44.73%)	767 (40.14%)^‡	227 (38.80%)^‡
Non-HDL cholesterol (mmol/L)	3.57±1.31	3.69±1.27	3.20±1.36^‡	3.17±1.35^‡
CRP	0.42±0.86	0.38±0.73	0.53±1.17	0.53±1.09
Diet score	47.55±16.59	48.37±16.07	44.99±17.88^‡	44.08±18.51^‡
Physical activity level			^‡	^‡
Inactive	2825 (35.66%)	1820 (30.28%)	1005 (52.59%)	316 (54.02%)
Insufficiently active	1129 (14.25%)	863 (14.36%)	266 (13.92%)	80 (13.68%)
Active	949 (11.98%)	765 (12.73%)	184 (9.63%)	66 (11.28%)
Highly active	3018 (38.10%)	2562 (42.63%)	456 (23.86%)	123 (21.03%)
Sedentary minutes	331.98±411.55	317.34±403.55	378.01±432.64^‡	374.89±194.81^‡
Sleep level			^‡	^‡
Poor	1586 (20.02%)	1123 (18.69%)	463 (24.23%)	134 (22.91%)
Intermediate	396 (5.00%)	265 (4.41%)	131 (6.86%)	48 (8.21%)
Optimal	5939 (74.98%)	4622 (76.91%)	1317 (68.92%)	403 (68.89%)
Smoking status			^‡	^‡
Non-smoker	3946 (49.82%)	3175 (52.83%)	771 (40.35%)	269 (45.98%)
Former smoker	2495 (31.50%)	1718 (28.59%)	777 (40.66%)	230 (39.32%)
Current smoker	1480 (18.68%)	1117 (18.59%)	363 (19.00%)	86 (14.70%)
Alcohol consumption per day	3.99±14.33	4.31±12.83	2.96±18.24^‡	3.36±23.74^‡

Continuous variables are presented as mean ± SD, categorical variables as n (%).

†: p<0.01; ‡: p<0.001 vs. participants alive at follow-up (unpaired Student's t-test or Mann-Whitney U test for continuous variables; chi-squared or Fisher's exact test for categorical variables).

Table 2 AUC scores of all models (optimal parameter combination)

	Training cohort AUC		Testing cohort AUC
	All-cause mortality	Cardiovascular mortality	All-cause mortality	Cardiovascular mortality
Logistic Regression	0.851	0.829	0.841	0.822
Ridge Regression	0.850	0.826	0.841	0.829
SVM	0.854	0.800	0.845	0.792
Random Forest	0.854	0.853	0.842	0.827
XGBoost	0.875	0.848	0.856	0.829

No competing interests reported.


Editorial decision: Revision requested
02 Jul, 2024
Editor assigned by journal
01 Jul, 2024
Submission checks completed at journal
01 Jul, 2024
First submitted to journal
30 Jun, 2024
