The prospective cohort was derived from the National Health and Nutrition Examination Survey (NHANES), a nationwide survey conducted biennially since 1999.
Study population
The sample population was drawn from the 2007-2008 and 2009-2010 NHANES cycles. We selected participants aged over 40 years who completed the in-person interview, physical examination, and laboratory tests at a mobile examination center. The screening process is shown in Supplementary Figure 1.
Study outcomes
Follow-up data were obtained from the National Center for Health Statistics (NCHS), which links NHANES participants to death records in the National Death Index (NDI). Cardiovascular mortality was defined using the International Statistical Classification of Diseases, 10th Revision (ICD-10), according to the NCHS classification of cardiovascular diseases (codes 054-068 and 070). We linked participants to the 2019 mortality records and excluded individuals whose follow-up years and survival status could not be ascertained.
Model features
The model included the following features: age, gender, race, body mass index (BMI), education level, income, hypertension, diabetes, family history of disease, non-HDL cholesterol, C-reactive protein, diet score, physical activity level, sedentary minutes, sleep quality, alcohol consumption, and smoking status. Age, gender, race, education level, income, and family history of disease were obtained directly from the interview data. BMI was derived from the physical examination data, while non-HDL cholesterol and C-reactive protein values were obtained from the laboratory data. Sedentary minutes were taken from the physical activity questionnaire.
NHANES collects extensive nutrition information through health interviews, health examinations, and laboratory testing. Participants completed a first 24-hour dietary recall interview (Day 1) as part of their health examination at the mobile examination center and were asked to complete a second 24-hour dietary recall interview (Day 2) 3 to 10 days later. Dietary patterns were scored in three steps. First, each reported food was linked to the Food Patterns Equivalents Database (FPED) of the US Department of Agriculture (USDA) through its USDA food code. Second, the daily nutritional intake of each participant was estimated from the Day 1 and Day 2 recalls. Third, dietary patterns were rated according to the Dietary Guidelines for Americans, 2020-2025 and the scoring rules of the Healthy Eating Index (HEI).
Physical activity was assessed with the NHANES physical activity questionnaire, in which participants reported the intensity and duration of their weekly exercise. Participants were classified into four groups based on the 2nd edition of the Physical Activity Guidelines for Americans. The "Inactive" group comprised individuals who did not engage in any moderate- or vigorous-intensity physical activity beyond basic movement from daily life activities. The "Insufficiently active" group engaged in some moderate- or vigorous-intensity physical activity but did not reach 150 minutes of moderate-intensity activity per week, 75 minutes of vigorous-intensity activity per week, or an equivalent combination. The "Active" group achieved the equivalent of 150 to 300 minutes of moderate-intensity physical activity per week, meeting the key guideline target range for adults. The "Highly active" group exceeded the equivalent of 300 minutes of moderate-intensity physical activity per week, surpassing the key guideline target range for adults.
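The four-group classification can be illustrated with a minimal Python sketch. It assumes that one minute of vigorous-intensity activity counts as two minutes of moderate-intensity activity when forming the "equivalent combination" (the usual rule of thumb in the guidelines); the function and argument names are hypothetical and are not taken from the study code.

```python
def classify_physical_activity(moderate_min_per_week: float,
                               vigorous_min_per_week: float) -> str:
    """Assign one of the four activity groups described in the text.

    Assumption: vigorous minutes are converted to moderate-equivalent
    minutes with a factor of 2.
    """
    if moderate_min_per_week < 0 or vigorous_min_per_week < 0:
        raise ValueError("activity minutes must be non-negative")

    # Moderate-equivalent minutes per week (assumed 1 vigorous = 2 moderate).
    equivalent = moderate_min_per_week + 2.0 * vigorous_min_per_week

    if equivalent == 0:
        return "Inactive"
    if equivalent < 150:
        return "Insufficiently active"
    if equivalent <= 300:
        return "Active"
    return "Highly active"
```

For example, a participant reporting 75 minutes of vigorous activity and no moderate activity would reach 150 moderate-equivalent minutes and fall in the "Active" group under this assumption.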
Given the J-shaped association between sleep duration and all-cause mortality, participants were divided into three groups based on sleep duration: optimal (6-8 hours/day), intermediate (5-5.9 or 8.1-10 hours/day), and poor (<5 or >10 hours/day) [6, 16].
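A minimal sketch of this grouping in Python follows. The stated ranges leave small gaps (between 5.9 and 6, and between 8 and 8.1 hours), so the sketch assumes continuous boundaries: values from 5 up to but not including 6 hours, and values above 8 up to 10 hours, are treated as intermediate. The function name is hypothetical.

```python
def classify_sleep(hours_per_day: float) -> str:
    """Map self-reported sleep duration to the three groups in the text."""
    if not 0 <= hours_per_day <= 24:
        raise ValueError("sleep duration must be between 0 and 24 hours")

    if 6 <= hours_per_day <= 8:
        return "optimal"
    # Assumed continuous boundaries for the intermediate group.
    if 5 <= hours_per_day < 6 or 8 < hours_per_day <= 10:
        return "intermediate"
    return "poor"
```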
Smoking status was categorized into three groups based on responses to the cigarette use questionnaire: never smokers, former smokers, and current smokers. Data on alcohol consumption were derived from the alcohol use questionnaire, in which participants reported the frequency and quantity of drinks consumed. Average daily alcohol consumption was used as the measure of alcohol intake.
Variables related to mental health with more than 40% missing values were excluded from the analysis. The random forest (RF) algorithm was then used to impute missing values in the remaining dataset. To reduce the influence of differing measurement scales and improve modeling efficiency, continuous variables were rescaled and standardized. The data distribution before imputation is presented in Supplementary Tables 1 and 2.
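The exclusion and standardization steps can be sketched as follows. This is an illustration under stated assumptions, not the study code: missing values are represented as None, standardization is taken to be a z-score using the population standard deviation, and the RF imputation step is not shown. The helper names and the column names `bmi` and `phq` are hypothetical.

```python
from statistics import mean, pstdev


def drop_sparse_columns(rows, max_missing=0.40):
    """Drop columns whose share of missing (None) values exceeds max_missing."""
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(r[col] is None for r in rows) / n <= max_missing
    ]
    return [{col: r[col] for col in keep} for r in rows]


def standardize(values):
    """Z-score the observed values; missing entries stay None."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]


# Toy data: "phq" is 60% missing and is dropped; "bmi" is 20% missing and kept.
rows = [
    {"bmi": 22.0, "phq": None},
    {"bmi": None, "phq": None},
    {"bmi": 27.5, "phq": 3},
    {"bmi": 31.0, "phq": None},
    {"bmi": 24.0, "phq": 5},
]
cleaned = drop_sparse_columns(rows)
bmi_z = standardize([r["bmi"] for r in cleaned])
```

In the workflow described above, imputation would be run after the exclusion step and before standardization, so no None values would remain at the standardization stage.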
Model development and risk stratification
A binary classification model was constructed from the follow-up data and participant features to predict mortality. Several machine learning (ML) classifiers were evaluated, including logistic regression, ridge regression, support vector machines, random forest, and Extreme Gradient Boosting (XGBoost). Cross-validation was first applied to each candidate model to determine the approximate range of optimal values for each hyperparameter; a grid search with 10-fold cross-validation was then used to select the best model. To assess the performance of each model, receiver operating characteristic (ROC) curves and the corresponding area under the curve (AUC) values were computed. The model output was calibrated using Platt scaling, and the effect of calibration was assessed by comparing the Brier score of the uncalibrated and calibrated outputs.
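To make the calibration step concrete, the following Python sketch fits a Platt-style sigmoid to uncalibrated scores and compares Brier scores before and after. It is a simplified illustration under assumptions: the sigmoid parameters are fitted by plain gradient descent on the log loss (without the target smoothing of Platt's original method), the toy scores and labels are invented for the example, and all function names are hypothetical. In practice the calibrator would be fitted on data held out from model training.

```python
import math


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        preds = [_sigmoid(a * s + b) for s in scores]
        grad_a = sum((p - y) * s for p, y, s in zip(preds, labels, scores)) / n
        grad_b = sum(p - y for p, y in zip(preds, labels)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities."""
    return [_sigmoid(a * s + b) for s in scores]


# Toy example: an overconfident model (scores of 0.9 and 0.1) whose
# observed event rates in the two score groups are 0.75 and 0.25.
raw = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
died = [1, 1, 1, 0, 0, 0, 0, 1]

a, b = fit_platt(raw, died)
calibrated = apply_platt(raw, a, b)

brier_before = brier_score(raw, died)
brier_after = brier_score(calibrated, died)
```

In this toy case the calibrated probabilities move toward the observed event rates in each score group, so the Brier score decreases; a lower Brier score indicates better-calibrated predictions.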
Participants were stratified into three groups based on the tertiles of the ten-year survival probability predicted by the model. The discriminative ability of the model was further validated by employing the log-rank test to compare the survival curves among these groups.
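The tertile stratification can be sketched in a few lines of Python. The cut points here come from `statistics.quantiles` with its default interpolation, which is an assumption; the group labels, the function name, and the toy probabilities are hypothetical, and the log-rank comparison of the resulting groups is not shown.

```python
from statistics import quantiles


def stratify_by_tertile(survival_probs):
    """Label each participant by tertile of predicted survival probability."""
    lower_cut, upper_cut = quantiles(survival_probs, n=3)
    groups = []
    for p in survival_probs:
        if p <= lower_cut:
            groups.append("low")      # lowest predicted survival
        elif p <= upper_cut:
            groups.append("middle")
        else:
            groups.append("high")     # highest predicted survival
    return groups


# Toy predicted ten-year survival probabilities (invented for illustration).
probs = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
groups = stratify_by_tertile(probs)
```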
Feature importance based on machine learning models
To estimate the feature importance ranking, the main effect of each feature, and the interaction effects between features, we used SHapley Additive exPlanations (SHAP). SHAP is a widely used method for calculating the marginal contribution of each feature to the model's output. It provides both global and local explanations, which is particularly useful for interpreting "black-box" models.
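The idea of a marginal contribution can be illustrated by computing exact Shapley values for a toy model. The sketch below is not the SHAP library and not the study model: the three-feature risk function, the baseline values, and all names are invented for illustration, and "absent" features are simply set to a single baseline value, whereas SHAP implementations typically average over background data and use faster algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values; value_fn maps a frozenset of present features to a model output."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that excludes feature f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


# Hypothetical toy risk model with a smoker-by-diabetes interaction.
def toy_model(x):
    return 0.01 * x["age"] + 0.2 * x["smoker"] + 0.1 * x["smoker"] * x["diabetes"]


person = {"age": 70, "smoker": 1, "diabetes": 1}
baseline = {"age": 60, "smoker": 0, "diabetes": 0}


def value_fn(present):
    # Features not in the coalition are replaced by their baseline value.
    x = {k: (person[k] if k in present else baseline[k]) for k in person}
    return toy_model(x)


phi = shapley_values(value_fn, list(person))
```

The contributions sum to the difference between the prediction for this person and the baseline prediction, and the interaction term is split equally between the two interacting features.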