A machine learning-based data mining in medical examination data: a biological features-based biological age prediction model

doi:10.21203/rs.3.rs-1722550/v1

Research Article

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Background: Biological age (BA) has been recognized as a more accurate indicator of aging than chronological age (CA). However, current limitations include insufficient attention to the incompleteness of medical data used to construct BA; lack of machine learning-based BA (ML-BA) on the Chinese population; neglect of the influence of the degree of model overfitting on the stability of the correlation results.

Methods and Results: Based on the medical examination data of the Chinese population (45-90 years), we first evaluated the most suitable missing-value interpolation method, then constructed 14 ML-BAs based on biomarkers, and finally explored the associations between ML-BAs and health statuses (health risk indicators and disease). We found that round-robin linear regression (RRLR) interpolation performed best, while the AutoEncoder (AE) showed the highest stability at high missing rates. We further illustrated the potential overfitting problem in ML-BAs, which affected the stability of ML-BAs' associations with health statuses. We then proposed a composite ML-BA based on the Stacking method with a simple meta-model (STK-BA), which overcame the overfitting problem and was associated more strongly with CA (r = 0.66, P < 0.001), health risk indicators, disease counts, and six types of disease.

Conclusion: We provided an improved aging measurement method for middle-aged and elderly groups in China, which can more stably capture aging characteristics other than CA, supporting the emerging application potential of machine learning in aging research.

biological age

biological features

machine learning

interpolation

Stacking

health status

In the context of global aging, exploring the representation methods, evaluation indicators, and influencing factors of aging based on big medical data has become an important social issue and a new research hotspot [1]. Aging is an organismal phenomenon manifested by an increased health risk (e.g. the likelihood of disease or death) or decreased function over time [2]. The introduction of biological age (BA) is a critical step in aging research. BA is an ideal indicator to provide evidence on aging independent of chronological age (CA), and it measures the rate of human aging associated with functional decline more accurately than CA [3, 4]. Besides, by measuring the cumulative level of impairment, BA is closely related to health characteristics such as physical function, cognition, morbidity, and mortality [5]. Effective BA construction methods and quantitative assessments of the associations between BA and health status will contribute to a further understanding of aging and provide effective risk stratification [6, 7].

Current BA is mainly based on statistical models of a series of biological features [8]. These features include clinical indicators [9–11], instrumental parameters [12, 13], and molecular genetic measures [14, 15]. The methods commonly used in BA models are mostly based on univariate or multivariate regression methods [7], such as PCA [16], MLP [17], and KDM [18]. Although these classical methods perform well in predicting adverse aging outcomes, they have limitations in handling multidimensional data and biomarker interactions [19–21]. Recently, new approaches applying machine learning (ML) algorithms have shown considerable accuracy and efficiency in BA prediction [22, 23] and have attracted wide attention [24]. In addition, stacking and bagging algorithms perform better in distinguishing significant features [25] and in revealing the complicated non-linear relationships between biomarkers and the target condition [26], but they have seen few applications in ML-BA construction.

The Pearson correlation, MAE, and RMSE between BA and CA are the preferred and most commonly used indicators for comparing BA estimation algorithms, and they are usually computed on the test set [27, 28]. In addition, exploring the associations of ML-BA with epidemiological variables (e.g. health risk indicators, mortality), genetic and environmental factors, and common age-related chronic diseases (e.g. heart disease, kidney disease) can further examine its potential as a biomarker of aging in the general population [6, 29, 30]. In previous ML-BA studies, we found that the correlation between BA and CA differed markedly between the full data and the test data [19, 31]. This might be because the model trained on the training set was used to predict BA for the full dataset, introducing interference from parameter tuning and training overfitting. Whether the degree of overfitting affects the resulting associations has not been specifically explored.

Valid BA and reliable conclusions are usually based on large population data. In public health research, however, incomplete datasets are a major problem in data mining [32], and values in medical databases are often missing for various reasons [33, 34]. Ignoring samples with missing values greatly limits the performance of downstream analysis [19]. Using interpolation methods to estimate the missing entries of incomplete datasets helps to improve the performance of subsequent analysis [35, 36]. The most common interpolation approach is to fill missing entries with the mean, median, mode, zero, or random values [19]. Some machine learning-based (ML-based) methods have exhibited great application potential in recent years [37–40]. However, most current studies on BA used relatively complete datasets or dealt with missing values using only the most common methods. Insufficient attention has been paid to the complexity and incompleteness of medical data. Therefore, working with actual physical examination data and starting from data preprocessing to construct the BA model can more effectively simulate the real application scenarios of BA.

Additionally, most current ML-BA studies were based on European and American populations [41, 42], and ML-BA based on large Chinese population data (more than 30,000 people) is still very limited [19]. The correlation of ML-BA with CA varies with differences in populations and biomarkers [43]. Constructing ML-BA with a large Chinese population from different sources and linking ML-BA with important health statuses will help to further explore the validity and application potential of ML-BA in the Chinese population.

In this study, we focused on four aspects: comparing the applicability of different interpolation methods to medical examination data (e.g. round-robin linear regression, AutoEncoder), constructing ML-BA from a large Chinese population sample with several machine learning algorithms, examining the correlations of ML-BA with health statuses (e.g. health risk indicators, disease status), and exploring the influence of overfitting on the stability of the association results. We used medical examination data (45–90 years) from Zhejiang Province, China, and Fig. 1 illustrates our analysis flow.

Comparison of missing value interpolation methods

Fig. 2A-D presents the interpolation results of mean, KNN, AE, RRLR, and MICE for continuous variables on the MNAR and MCAR simulation datasets. MSE and R² were used to compare the accuracy and validity of interpolation, respectively. The parameter selection process for KNN and MICE is presented in Table S1 (MCAR) and Table S2 (MNAR). Specifically, the number of nearest neighbors for KNN was chosen from 5, 10, 20, 30, and 50, and the number of mean-matching candidates for MICE was chosen from 1, 3, 5, and 10. The optimal parameter of both models varied with the missing proportion. The AE hyper-parameters considered were encoder layers, epochs, activation function, batch size, and learning rate. The optimized parameters of AE and RRLR are presented in Table S3.

The results showed that RRLR outperformed the other methods under both MCAR and MNAR (Fig. 2A-D). The MSE and R² of MICE and RRLR increased significantly with the increase in missing ratio (Fig. 2A-B), whereas AE showed greater stability: as the missing ratio grew from 5% to 30%, its MSE increased by only 1.76%. The lower the missing rate, the greater the advantage of RRLR, while AE was more suitable for cases with a high missing rate. The results in the MNAR simulation dataset (Fig. 2C-D) were similar to those in MCAR (Fig. 2A-B). RRLR reduced MSE by 45.06% compared to MICE, the second-most accurate interpolation method in MNAR. R² followed the same trend as MSE, and the RRLR interpolation results displayed the best correlation. In addition to interpolation performance, the time spent on interpolation should also be considered. RRLR exhibited a time cost similar to AE and mean, while the time consumed by KNN mainly depended on the missing ratio. MICE needed the most time to complete interpolation. Overall, RRLR was therefore used to fill the missing values; for binary variables, predicted values greater than 0.5 were marked as 1 and all others as 0. BA was then predicted on the imputed dataset.
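The evaluation protocol described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the authors' pipeline: the three-column synthetic data, the 10% MCAR mask, the ten sweeps, and the helper names (`ols_fit`, `impute_mean`, `impute_round_robin`, `masked_mse`, `binarize`) are all hypothetical. Values are hidden at random, each column is regressed in turn on the others, and the MSE is computed only on the hidden cells.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def impute_mean(data, mask):
    """Replace hidden cells with the mean of the observed cells in that column."""
    n_cols = len(data[0])
    means = []
    for j in range(n_cols):
        seen = [row[j] for row, m in zip(data, mask) if not m[j]]
        means.append(sum(seen) / len(seen))
    return [[means[j] if m[j] else row[j] for j in range(n_cols)]
            for row, m in zip(data, mask)]

def impute_round_robin(data, mask, n_iter=10):
    """Start from mean imputation, then regress each column on the others in turn."""
    filled = impute_mean(data, mask)
    n_rows, n_cols = len(data), len(data[0])
    for _ in range(n_iter):
        for j in range(n_cols):
            others = [k for k in range(n_cols) if k != j]
            obs = [i for i in range(n_rows) if not mask[i][j]]
            mis = [i for i in range(n_rows) if mask[i][j]]
            if not mis:
                continue
            beta = ols_fit([[filled[i][k] for k in others] for i in obs],
                           [filled[i][j] for i in obs])
            for i in mis:
                filled[i][j] = beta[0] + sum(
                    b * filled[i][k] for b, k in zip(beta[1:], others))
    return filled

def masked_mse(truth, filled, mask):
    """MSE restricted to the cells that were hidden."""
    errs = [(t - f) ** 2
            for tr, fr, mr in zip(truth, filled, mask)
            for t, f, m in zip(tr, fr, mr) if m]
    return sum(errs) / len(errs)

def binarize(predicted):
    """Threshold rule for binary variables: above 0.5 becomes 1, otherwise 0."""
    return 1 if predicted > 0.5 else 0

# Synthetic, linearly related features (an assumption made for illustration).
rng = random.Random(0)
truth = []
for _ in range(300):
    a = rng.gauss(0, 1)
    b = 2 * a + rng.gauss(0, 0.3)
    c = a - b + rng.gauss(0, 0.3)
    truth.append([a, b, c])
mask = [[rng.random() < 0.10 for _ in row] for row in truth]  # MCAR, about 10% hidden

mse_mean = masked_mse(truth, impute_mean(truth, mask), mask)
mse_rrlr = masked_mse(truth, impute_round_robin(truth, mask), mask)
print(f"mean imputation MSE: {mse_mean:.3f}, round-robin regression MSE: {mse_rrlr:.3f}")
```

Because the hidden cells are never read by the imputers, the comparison mimics the simulated-missingness design; an MNAR variant would only change how `mask` is drawn.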

Feature selection and BA predictor construction

A total of 22 potential biological features were considered for this study. Fig. S1 showed the optimized lambda and feature selection process in Lasso regression. Urine sugar, urine occult blood, and urine acetone bodies were excluded (Table S4 and Fig. S1). Figure 2E presented the correlation between variables, with an 'X' mark indicating no significant correlation (P > 0.05). Notably, all features were significant (P < 0.05). These two steps yielded 19 features for estimating BA.

Among the machine learning and neural network models explored, ML-BA predicted by XGBoost showed the highest correlation with CA (Pearson’s r = 0.64 in the test set), while CatBoost, LGBM, GBDT, and Extra Trees showed similar results (Table 1). Among these five models, test-set R² ranged from 0.32 to 0.41, and test-set MAE ranged from 4.49 to 4.89. The parameters of all the above models are detailed in Table S5. However, the evaluation metrics of these five models differed markedly between the training and test sets (Table 1), because the choice of model parameters strongly affected the fit during training. If overfitting on the training set is ignored and the model obtained from the training set is used to predict BA for the entire dataset, the overfitting is carried into the final result, making BA less stable. Thus, in addition to selecting the optimal model by its test-set results, overfitted predictions should be kept out of the final prediction.
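As a concrete reference for the four evaluation metrics and the training/test comparison discussed here, the following sketch computes RMSE, R², MAE, and Pearson's r from paired ages and reports the test-minus-training gap. The function names and the toy ages are hypothetical, and R² is taken as 1 − SSE/SST, which is the usual definition but is not spelled out in the text. The Extra Trees figures are copied from Table 1.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, R², MAE and Pearson's r for paired age values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    return {
        "rmse": math.sqrt(sq_err / n),
        "r2": 1 - sq_err / ss_tot,  # assumed definition: 1 - SSE/SST
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "pearson_r": cov / math.sqrt(ss_tot * ss_pred),
    }

def train_test_gap(train, test):
    """Test-minus-training difference per metric; large gaps hint at overfitting."""
    return {name: test[name] - train[name] for name in train}

# Toy chronological ages and predictions (hypothetical values).
toy = regression_metrics([50, 60, 70, 80], [55, 60, 65, 80])
print(toy)

# Extra Trees row of Table 1: perfect fit in training, much weaker in testing.
extra_trees_train = {"rmse": 0.000, "r2": 1.000, "mae": 0.000, "pearson_r": 1.000}
extra_trees_test = {"rmse": 6.319, "r2": 0.321, "mae": 4.889, "pearson_r": 0.566}
gap = train_test_gap(extra_trees_train, extra_trees_test)
print(gap)
```

A gap near zero on every metric, as for the Stacking (GAM) row of Table 1, is what the text treats as evidence that overfitting has been kept out of the final prediction.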

To this end, we applied the Stacking approach to fuse the models, with the parameters inherited from the single models. This method could further improve prediction accuracy while effectively reducing the interference of overfitting. Considering the training time, complexity, and fitting effects of the meta-model, GAM (spline regression) was finally selected to fuse the above five models. The RMSE in the training and test sets was 5.78 and 5.77, respectively, and the R² was 0.43 in both. Therefore, we used the fusion model with 19 biological features to obtain STK-BA. The STK-BA of the entire study population ranged from 44 to 89 years (Table 2), with a mean of 67.8 (SD = 5.0). For females, BA ranged from 44 to 88 years, with a mean of 67.2 (SD = 5.6). For males, BA ranged from 47 to 89 years, with a mean of 68.5 (SD = 4.2). Compared with males, BA in the female population was significantly lower (P < 0.001) and tended to be more normally distributed (Fig. 3A). Table 2 shows that STK-BA was significantly correlated with CA (r = 0.660 ~ 0.668, P < 0.001).

To further highlight the advantages of STK-BA and the influence of overfitting, we constructed two XGB-BAs with similar performance in the test set (the results and parameters are shown in Table S6). Although XGB-BA2 and XGB-BA1 had similar results on the test set (0.4% MAE difference), XGB-BA2 fitted the training set more closely and showed a higher correlation with CA (a 13.1% increase). As shown in Table 2 and Fig. 3, compared with STK-BA, the XGB-BAs showed poorer results in the test set, but both had higher correlations with CA in the whole sample (XGB-BA1: 0.738 ~ 0.745; XGB-BA2: 0.822 ~ 0.827). The effect of gender on the XGB-BAs was similar to that on STK-BA, but the XGB-BAs exhibited a wider BA range (Table 2 and Fig. 3). Taking XGB-BA2 as an example, the BA range was 42.9% wider than that of STK-BA.

Table 1

RMSE, R², MAE, and Pearson’s correlation of ML-BA models.
Model	Training set (80%)				Test set (20%)			
	RMSE	R²	MAE	Pearson’s correlation	RMSE	R²	MAE	Pearson’s correlation
Stacking (SVM)	5.765	0.438	4.349	0.661	5.776	0.435	4.352	0.659
Stacking (GAM)	5.777	0.434	4.409	0.658	5.774	0.433	4.403	0.658
Stacking (MLR)	5.788	0.431	4.418	0.657	5.786	0.431	4.414	0.656
Stacking (RF)	2.786	0.900	2.094	0.949	5.828	0.422	4.444	0.650
XGBoost	4.988	0.578	3.780	0.760	5.869	0.414	4.489	0.643
CatBoost	3.674	0.771	2.739	0.878	5.893	0.409	4.494	0.640
LGBM	4.128	0.711	3.097	0.843	5.926	0.403	4.538	0.634
GBDT	5.513	0.484	4.239	0.696	5.951	0.397	4.579	0.630
Extra Trees	0.000	1.000	0.000	1.000	6.319	0.321	4.889	0.566
DNN	6.251	0.341	4.869	0.584	6.419	0.299	5.014	0.547
CNN	5.918	0.409	4.583	0.640	6.467	0.289	5.016	0.537
GAM	6.516	0.279	5.094	0.529	6.509	0.280	5.072	0.529
MLR	6.692	0.240	5.238	0.490	6.691	0.239	5.224	0.489
AdaBoost	6.986	0.172	5.499	0.414	6.994	0.168	5.501	0.409

Table 2

Distribution of BA in male and female study populations.
BA		Min	Max	Median	Mean (SD)	Correlation with CA (P value)
STK-BA	Male	47.23	88.57	68.17	68.51 (4.16)	0.604 ~ 0.617 (< 0.001)
	Female	43.59	88.39	67.02	67.16 (5.58)	0.682 ~ 0.692 (< 0.001)
	Total	43.59	88.57	67.61	67.77 (5.03)	0.660 ~ 0.668 (< 0.001)
XGB-BA1	Male	43.48	90.94	68.17	68.47 (4.39)	0.695 ~ 0.706 (< 0.001)
	Female	36.45	99.75	66.99	67.18 (5.68)	0.756 ~ 0.764 (< 0.001)
	Total	36.45	99.75	67.60	67.76 (5.16)	0.738 ~ 0.745 (< 0.001)
XGB-BA2	Male	44.39	92.43	68.08	68.48 (4.82)	0.791 ~ 0.799 (< 0.001)
	Female	35.37	99.66	66.96	67.17 (6.23)	0.836 ~ 0.842 (< 0.001)
	Total	35.37	99.66	67.54	67.76 (5.67)	0.822 ~ 0.827 (< 0.001)

The importance of Features for the Stacking model

Table S7 recorded the feature importance values of the sub-models in the Stacking model, and Fig. S2 showed the average feature importance value for the Stacking model. DBP, height, SBP, gender, and platelet content were the top 5 biometric characteristics in the Stacking model. Furthermore, weight, SGPT, waist, and SGOT also showed above-average importance. Conversely, the presence or absence of urinary protein was the least essential marker.

The associations between health risk indicators and STK-BA, XGB-BAs

In this evaluation, we chose ABSI and WHtR as health risk indicators. Previous studies have pointed out that WHtR is a better measure of an individual's health than BMI [6, 44]. ABSI, which is based on physical characteristics, appears to be an indicator of premature death in the general population, predicting mortality risk across age, gender, and weight [45]. The three BAs were of the same type and therefore numerically comparable. As shown in Table S8-10 and Fig. 4, all three ML-BAs exhibited significant positive correlations with ABSI and WHtR (P < 0.001). The results did not change after adjusting for the covariates CA, BMI, and family disease (P < 0.001). Moreover, the correlation strength increased from the first quintile to the fifth quintile, showing a consistent trend. This suggested that the association between ML-BAs and health risk was stable. However, not all ML-BAs in the literature show such consistent trends. In an anthropometrically constructed DNN model, the log-rank test for SBSI and WHtR quartiles found that the X² statistic increased from Q1 to Q2 and then decreased from Q2 to Q3, although the overall trend (Q1 ~ Q4) was increasing [6]. It is worth noting that from STK-BA to XGB-BA1 and XGB-BA2, the strength and significance of the association of BAs with the two health risk indicators declined steadily according to the model coefficients and t-statistics (Fig. 4 and Table S8-12). Compared with the Q1 group (Model 2) with the lowest ABSI (WHtR) value (Table S9-12), STK-BA, XGB-BA1, and XGB-BA2 in the Q5 group were higher by 2.67 (4.04), 2.31 (3.47), and 1.81 (2.73), respectively. Therefore, an increased degree of overfitting reduced the association between BAs and health risk indicators. It could be inferred that when the association strength is small or the degree of overfitting is too high, ML-BA may no longer be correlated with health risk indicators.
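For readers unfamiliar with the two indicators, the sketch below computes WHtR and ABSI from the anthropometric fields used in this study and assigns rank-based quintile groups. It assumes the commonly cited definition ABSI = WC / (BMI^(2/3) × height^(1/2)) with waist circumference and height in metres; the text does not restate the formula, and the function names, the example values, and the rank-based grouping rule are hypothetical.

```python
def whtr(waist_cm, height_cm):
    """Waist-to-height ratio (unitless)."""
    return waist_cm / height_cm

def absi(waist_cm, height_cm, weight_kg):
    """A Body Shape Index, assuming WC / (BMI^(2/3) * height^(1/2)) in metric units."""
    height_m = height_cm / 100
    waist_m = waist_cm / 100
    bmi = weight_kg / height_m ** 2
    return waist_m / (bmi ** (2 / 3) * height_m ** 0.5)

def quintile_labels(values):
    """Assign Q1..Q5 by rank so that each group holds about one fifth of the sample."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, i in enumerate(order):
        labels[i] = "Q" + str(rank * 5 // len(values) + 1)
    return labels

# Hypothetical participant: waist 85 cm, height 170 cm, weight 65 kg.
print(round(whtr(85, 170), 3), round(absi(85, 170, 65), 4))
print(quintile_labels([0.47, 0.52, 0.61, 0.44, 0.58, 0.50, 0.55, 0.49, 0.63, 0.46]))
```

The quintile labels can then serve as the grouping variable when BA is compared between Q1 and Q5, as in the regression models summarized above.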

The associations between disease statuses and STK-BA, XGB-BAs

The increases in STK-BA and XGB-BAs for each disease count, relative to disease-free participants, are shown in Fig. 5A-B, Table 3 and Table S11. Overall, participants with disease had higher STK-BA and XGB-BAs, and the results remained significant after adjusting for CA and family disease (P < 0.01). In Model 1, XGB-BA2 had the largest BA response to a change in disease count, while STK-BA had the smallest. Compared with those without disease, for STK-BA, XGB-BA1, and XGB-BA2, those with 1 disease were 0.998, 1.053, and 1.240 years older, and those with 2 or more diseases were 2.422, 2.623, and 3.047 years older. Interestingly, after adjusting for covariates (CA and family disease status), the ordering was reversed: those with 1 disease were 0.170, 0.100, and 0.069 years older than those without disease, and those with 2 or more diseases were 0.461, 0.372, and 0.284 years older. The significance of the association between disease counts and BA (Model 2) also changed: it was weakest for XGB-BA2 (P = 0.024 for 1 disease; P = 0.001 for 2 or more) and strongest for STK-BA (P < 0.001 for both).

Poisson regression models were used to examine the associations between BAs and disease counts in the full sample (Table 3). Both STK-BA and XGB-BAs were significantly associated with disease counts (P < 0.001). Consistent results (P < 0.01) were observed after adjusting for CA and family diseases, although the absolute values decreased. Consistent with the trend in the linear regression model, STK-BA showed the strongest association with disease counts (Model 2: Coef = 0.008, SE = 0.001), while XGB-BA2 was the weakest (Model 2: Coef = 0.005, SE = 0.002).

Table 3

Associations of STK-BA and XGB-BAs with disease counts.
	Model 1 *			Model 2 **
	Coef (SE)	t-value	P	Coef (SE)	t-value	P
STK-BA	0.025 (0.001)	24.20	< 0.001	0.008 (0.001)	5.981	< 0.001
XGB-BA1	0.025 (0.001)	25.08	< 0.001	0.006 (0.002)	4.130	< 0.001
XGB-BA2	0.023 (0.001)	26.76	< 0.001	0.005 (0.002)	3.205	0.001
* Model 1 was a crude model. ** Model 2 was adjusted for CA, and family disease status.
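To make the Table 3 coefficients easier to read, the sketch below converts a Poisson regression coefficient into a rate ratio, assuming the usual log link, so that exp(coef) is the multiplicative change in the expected disease count per one-year increase in BA. The helper names and the Wald-type interval (coef ± 1.96 × SE) are assumptions, and because the table reports rounded coefficients and standard errors, the resulting figures are approximate.

```python
import math

def rate_ratio(coef, years=1.0):
    """Multiplicative change in expected disease count per `years` of BA (log link assumed)."""
    return math.exp(coef * years)

def wald_ci(coef, se, z=1.96):
    """Approximate 95% interval for the one-year rate ratio from a rounded coef and SE."""
    return math.exp(coef - z * se), math.exp(coef + z * se)

# Model 2 (adjusted) coefficients and standard errors from Table 3.
model2 = {"STK-BA": (0.008, 0.001), "XGB-BA1": (0.006, 0.002), "XGB-BA2": (0.005, 0.002)}
for name, (coef, se) in model2.items():
    low, high = wald_ci(coef, se)
    print(f"{name}: per year {rate_ratio(coef):.4f} "
          f"({low:.4f} to {high:.4f}), per 10 years {rate_ratio(coef, 10):.4f}")
```

On this reading, the adjusted STK-BA coefficient of 0.008 corresponds to roughly a 0.8% higher expected disease count per additional year of BA.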

To gain further insight into the relations between the BAs and disease counts, the associations between each disease and STK-BA and the XGB-BAs were explored (Fig. 5C and Table S12). As expected, STK-BA showed a significant positive correlation (P < 0.05) with almost all diseases (except for vascular disease, P = 0.190). XGB-BA1 showed no significant association with vascular disease, kidney disease, or nervous system disease. Notably, in addition to being unrelated to kidney, eye, and nervous system disease, XGB-BA2 was significantly negatively correlated with vascular disease (OR: 0.96, 95%CI: 0.93 ~ 0.99). Furthermore, the z-scores and P values in Table S12 showed that, compared with XGB-BA1, the associations between XGB-BA2 and diseases (except vascular disease) were further weakened. This illustrated that overfitting led to obvious instability in the results. It also explained why, after adjusting for CA and family disease, the XGB-BAs showed weaker associations with disease counts as the degree of overfitting increased. In contrast, our proposed STK-BA showed consistently positive associations. After adjusting for covariates, each 1-year increase in STK-BA was associated with a 7% increase in the risk of developing kidney disease (OR: 1.07, 95%CI: 1.00 ~ 1.13), 2% for heart disease (OR: 1.02, 95%CI: 1.00 ~ 1.05), 5% for cerebrovascular disease (OR: 1.05, 95%CI: 1.03 ~ 1.06), 9% for eye disease (OR: 1.09, 95%CI: 1.04 ~ 1.17), 8% for nervous system disease (OR: 1.08, 95%CI: 1.02 ~ 1.15) and 1% for other system diseases (OR: 1.01, 95%CI: 1.01 ~ 1.01). The results were similar to previous studies. BA has been shown to be a strong indicator and predictor of multiple morbidities, especially chronic diseases [46, 47]. This might be because diseases are closely related to aging. One study showed a stronger association of all-cause morbidity, including stroke, dementia, Alzheimer’s disease, cancer, coronary heart disease, and diabetes mellitus, with BA than with CA or the traditional biomarkers of age-related diseases (hazard ratio 1.06 vs. 1.05 and 1.03) [48].

There is no universally best missing-value interpolation method, only the most appropriate one for a given dataset. We compared five classical but effective methods on the Chinese physical examination data and found that RRLR performed best under the same missing ratio as the original data. However, the superior performance of RRLR was not universal, and it was more suitable for low missing ratios (e.g. less than 30%). This is because the strategy of RRLR is to build regression models that predict and impute the missing features from the other, complete samples in an iterative loop [49]. Although this strategy allows RRLR to utilize as many observations as possible during interpolation, regression typically requires many samples with non-missing data to produce stable results [50]. AE interpolation showed the best stability. As a common artificial neural network in deep learning, a deep AE can perform representation learning on the input information, form a higher-level feature map, and then reconstruct the data at the output, reducing sensitivity to higher missing rates [51, 52]. Likewise, Yu L et al. pointed out that multiple regression imputation was suitable for filling in missing values in the WHO ARI Multicentre Study of clinical signs and etiologic agent dataset [53].

More importantly, we found an interesting phenomenon in the previous Chinese population-based ML-BA, which had not been discussed before. When the correlation or R² between BA and CA was taken as the criterion, the results on the test set were quite different from the final prediction of BA on the full dataset. Taking the previous XGB-BA as an example, the R² of the model in the test set was 0.27, while the correlation between BA and CA was 0.75 in the final results (BA to CA regression belonged to simple linear regression, so R = cor = 0.75, R² = 0.56) [19]. The same was also found in the XGB-BA based on the Dongfeng-Tongji cohort [31]. This might be because the model trained on the training set predicted BA on the full dataset, which introduced interference from parameter tuning and training overfitting. However, this still requires further confirmation, as previous studies did not explicitly state how the model was obtained when it finally predicted BA. In any case, the consistency of the test set with the final results is what we would expect.

The correlation between BA and CA is usually regarded as an indispensable index for evaluating BA prediction models. However, after selecting the best model, how to obtain stable results when correlating BA with other variables in the whole sample is also of high value. Two commonly used health statuses (health risk indicators and disease status) were used as different evaluation aspects to illustrate the influence of different degrees of overfitting on correlation strength and significance in ML-BAs. We found that even with similar test set results, as the degree of overfitting increased, XGB-BA2 exhibited weaker associations with health risk indicators (ABSI, WHtR), disease counts, nervous system disease, and eye disease. This finding suggested that the results of association analysis vary with parameter selection and other factors. This can be attributed to the fact that the core purpose of BA is to capture aging features beyond CA, while overfitting causes the model to over-learn the CA of the training set. Xingqi Cao et al. adopted default model parameters to overcome this problem [39], but this did not resolve it fundamentally.

To prevent overfitting from affecting the stability of the association results between BA and health outcomes in the entire dataset, we propose three possible solutions. The first is to make the model show essentially the same fitting results on the training set and the test set, which is the most convenient and least expensive. The second is to predict by cross-modeling, such as LOOCV or K-fold, which always keeps the samples being predicted out of model construction; however, this produces multiple models that are not exactly the same, and the prediction accuracy of each model usually varies with its parameters and training samples. Therefore, this method presents new challenges for practical application and for keeping the time cost low. The third is to use only the samples in the test set for further analysis, but this does not meet the principle of maximizing the use of data and reduces the reliability of the results.
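The second option can be illustrated with a deliberately extreme learner. The sketch below uses a 1-nearest-neighbour regressor, which memorizes its training data in the same way as the Extra Trees row of Table 1 (training RMSE 0.000), and contrasts in-sample predictions with K-fold out-of-fold predictions. The synthetic features, the "age" formula, the choice of five folds, and the function names are all hypothetical; they are not the models or data of this study.

```python
import random

def nearest_neighbor_predict(fit_X, fit_y, x):
    """1-nearest-neighbour regression: copy the target of the closest fitted row."""
    best = min(range(len(fit_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(fit_X[i], x)))
    return fit_y[best]

def out_of_fold_predictions(X, y, k=5, seed=0):
    """Predict every sample with a model that never saw it during fitting."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    preds = [None] * len(X)
    for fold in range(k):
        held_out = idx[fold::k]
        held = set(held_out)
        fit_idx = [i for i in idx if i not in held]
        fit_X = [X[i] for i in fit_idx]
        fit_y = [y[i] for i in fit_idx]
        for i in held_out:
            preds[i] = nearest_neighbor_predict(fit_X, fit_y, X[i])
    return preds

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic features and a noisy "age" target (assumptions for illustration).
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [65 + 4 * a + 2 * b + rng.gauss(0, 3) for a, b in X]

in_sample = [nearest_neighbor_predict(X, y, x) for x in X]  # model saw every sample
oof = out_of_fold_predictions(X, y)                          # model saw none of them

print("in-sample MAE:", mae(y, in_sample))
print("out-of-fold MAE:", round(mae(y, oof), 2))
```

The in-sample predictions reproduce the target exactly and therefore carry no information beyond CA, whereas the out-of-fold predictions behave like test-set predictions for the whole sample, at the cost of fitting k models.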

In this situation, our proposed STK-BA could improve the correlation between BA and CA while maintaining the consistency of the model results (the correlations in the training set and the test set are the same to three decimal places). Moreover, the positive association of STK-BA with health risk indicators, disease counts, and specific diseases was also more pronounced, suggesting that it better captures the aging-related features behind diseases. This may be because the biological features we considered represent different physiological functions or dimensions: immune system (e.g. platelet count, white blood cell), cardio-metabolic system (e.g. HDL, DBP), liver function (e.g. SGPT, SGOT), phenotypic dimension (e.g. height, waist), and kidney injury (e.g. urine protein). Additionally, the associations we considered included eye disease and kidney disease, which were not covered in previous Chinese population studies [28].

The Stacking method we adopted is a mechanism for combining several trained models into one, consisting of base models and a meta-model [54]. Instead of selecting one model from multiple models or simply averaging them, Stacking uses a meta-model to weigh the features (the outputs of the base models) and make the prediction [50], which is somewhat like a two-layer neural network. Cross-validation within Stacking and a simple meta-model are the keys to overcoming the overfitting of complex base models, and they make the correlation between the model-predicted value and the target value essentially consistent between the final training set and test set. However, an overly complex meta-model will also lead to overfitting. We observed this when using RF as the meta-model (Table 1). More importantly, the Stacking method is equally applicable to BA based on a single model. A simple meta-model could enable various BA models to obtain stable predicted BA on the whole sample for correlation analysis, and this approach could be extended further.
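The mechanism can be sketched as follows. This is a minimal illustration under loud assumptions, not the configuration used in this study: a memorizing 1-nearest-neighbour learner and a linear regression stand in for the five tree-based base models, an ordinary linear regression stands in for the GAM meta-model, and the training-set predictions are built from out-of-fold base outputs, which is one plausible way to obtain matching training and test figures. All names and the synthetic data are hypothetical.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def ols_predict(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def knn1_predict(fit_X, fit_y, x):
    """Memorizing base learner: copy the target of the closest fitted row."""
    best = min(range(len(fit_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(fit_X[i], x)))
    return fit_y[best]

def base_predictions(fit_X, fit_y, X_new):
    """Outputs of the two base models for each row of X_new: [1-NN, linear]."""
    beta = ols_fit(fit_X, fit_y)
    return [[knn1_predict(fit_X, fit_y, x), ols_predict(beta, x)] for x in X_new]

def stack_fit_predict(train_X, train_y, test_X, k=5):
    """Fit a linear meta-model on out-of-fold base outputs; predict train and test."""
    n = len(train_X)
    oof = [None] * n
    for fold in range(k):
        held = list(range(fold, n, k))
        held_set = set(held)
        fit_idx = [i for i in range(n) if i not in held_set]
        preds = base_predictions([train_X[i] for i in fit_idx],
                                 [train_y[i] for i in fit_idx],
                                 [train_X[i] for i in held])
        for i, p in zip(held, preds):
            oof[i] = p
    meta = ols_fit(oof, train_y)  # simple meta-model on base-model outputs
    train_pred = [ols_predict(meta, p) for p in oof]
    test_pred = [ols_predict(meta, p)
                 for p in base_predictions(train_X, train_y, test_X)]
    return train_pred, test_pred

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(2)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [65 + 4 * a + 2 * b + rng.gauss(0, 3) for a, b in X]
train_X, train_y, test_X, test_y = X[:300], y[:300], X[300:], y[300:]

stk_train, stk_test = stack_fit_predict(train_X, train_y, test_X)
knn_train = [knn1_predict(train_X, train_y, x) for x in train_X]
knn_test = [knn1_predict(train_X, train_y, x) for x in test_X]
print("1-NN     train/test MAE:", mae(train_y, knn_train), round(mae(test_y, knn_test), 2))
print("stacking train/test MAE:", round(mae(train_y, stk_train), 2), round(mae(test_y, stk_test), 2))
```

The memorizing base learner has zero training error on its own, while the stacked predictions for the training samples are never produced by a model that saw them, which is why a simple meta-model can keep training and test performance aligned.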

The correlation between our STK-BA and CA (r = 0.66) on the test set was better than previously published BA (r = 0.52) based on 19 blood biomarkers [19] but weaker than BA (r = 0.74) which considered 44 biomarkers including lung function. This phenomenon is plausible, depending on the population-specific and age-related biosignatures in different datasets [31]. However, it is worth noting that we showed better CA correlations with the same number of biomarkers in the Chinese population. Additionally, Mamoshina et al. found that models trained in a given population declined in correlation when tested across ethnicities (given population: R² ranged from 0.49 to 0.69; different populations: R² ranged from 0.24 to 0.34) [43]. ML-BA would exhibit different correlations with CA due to differences in population and biometrics [43]. Therefore, we constructed ML-BA using Chinese populations from different sources, and this helped to further confirm the applicability of ML-BA in the Chinese population by associating aging measures with important health conditions and outcomes.

DBP, height, SBP, gender, and platelet content were the five most essential variables screened out in the Stacking model, which may play a vital role in assessing BA differences in different populations. In fact, DBP, SBP, and PC have been widely found to be biomarkers closely related to biological aging. Pinto E [55] noted that elevated pulse pressure due to decreased DBP and increased SBP was the most potent risk predictor in older adults and was associated with older age. In epidemiological studies, aging populations were more likely to exhibit features of lower PC and higher platelet activity, which are associated with higher rates of cardiovascular disease [56–58]. The link between gender or height and aging was also frequently mentioned [59, 60]. In a study of conscripts from Italian inland villages, short people (height less than 161.1 cm) generally had higher survival rates than tall peers [61]. This may be related to caloric restriction, cell replication potential, telomere shortening, and cardiac pumping efficiency [61, 62]. What’s more, the gender-driven characteristics of aging have become the focus of current attention, with gender differences in life expectancy, biological aging, and frailty indices [63]. Of these, women are generally biologically younger than men, consistent with a lower BA assessed by molecular biomarkers [9].

Overall, the BA measurement model we developed integrated multidimensional biosignatures that more systematically reflected human aging. This line of evidence reinforces our findings and suggests that the variable screening results of the Stacking model are biologically interpretable. Besides, although fewer biological features are considered in the model, this facilitates the generalization and practical application of the model and its workflow.

The large sample of Chinese medical examination data enabled us to explore the influence of fitting on the stability of correlation results and to develop a new composite BA prediction model after identifying the most suitable interpolation method. Nonetheless, several limitations need to be discussed. First, although the interpolation methods explored in this study are convenient and practical, newer missing-value imputation methods could also be transferred to medical examination datasets [40], such as the variational AE applied to genomic data imputation [39]. Second, our data lacked information on outcome variables (e.g., death) needed to link BA with survival analysis. We, therefore, associated BA with a health risk indicator that predicts mortality risk instead. Third, the training and test sets of the BA prediction model are both from the same dataset. Testing with external datasets will further evaluate the generalization ability of the ML-BAs [64]. Finally, the biological features used in the study were mostly limited to biochemical indicators, and some known aging-related indicators, such as mean corpuscular volume, are not included in our data. These may weaken the interpretability of predicted BA and prevent further validation against existing results [19, 65]. However, with the popularization of big medical data, phenotype information (e.g. cognitive level, gait speed [66, 67]), methylation data (e.g. CpG sites [68, 69]), and metabolomic features and pathways (e.g. C-glycosyl tryptophan, α-ketoglutarate and TCA cycle [70–72]) will become more accessible, which will assist in predicting and explaining the aging process more systematically. Therefore, as more dimensions of individual indicators are taken into account, our composite BA and its construction process will have a broader reference value.

Overall, we found RRLR best suited for interpolation on our medical examination dataset, while AE exhibited the highest stability at high missing rates. We pointed out a potential over-fitting problem that affects the association results of recently proposed ML-BAs. After comparing machine learning methods, we constructed two XGB-BAs with different degrees of fit on the training set (but similar performance on the test set) to show, through the associations between ML-BAs and health statuses, that the degree of fit affects the stability of BA applications. To address this problem, we proposed a composite ML-BA based on the Stacking method with a simple meta-model (STK-BA), which overcame the overfitting problem and associated more strongly with CA (r = 0.66, P < 0.001), healthy risk indicators, disease counts, and six types of disease. Furthermore, we found that DBP, height, SBP, gender, and platelet content were the top five important biological features in STK-BA. These findings support the application of machine learning in geriatric research and suggest improvements to existing ML-based BA methods. This new aging measurement method captures aging characteristics beyond CA more stably and has great potential for application in risk stratification and aging-protective intervention research.

Study Population and assessment of physical examination measurements

Based on the electronic health records of residents in Zhejiang Province, China, this study conducted a representative physical examination survey among different age groups. In accordance with the national code for basic public health services, the records were established by primary-level medical and health institutions, including township health centers and community health service centers, in 23 cities and districts of Zhejiang Province. A total of 487,003 participants were enrolled in the first follow-up and completed the other two follow-up visits from 2015 to 2020.

This study retained potential age-related features with less than 80% missing values and observations with less than 20% of features missing. Out of the 418,161 participants aged 30–100 years, we excluded observations that were outliers relative to data of the same age and sex (N = 30,935) and those with more than 20% missing data on variables (N = 309,416), leaving an analytic sample of 77,810 adults. Middle age starts around age 45, while the very old are vulnerable to NCDs and socially disadvantaged [19, 73]. Additionally, due to the relatively small size of the oldest-old group and the differences between participants aged 45–90 and others, we excluded participants aged under 45 or over 90 (N = 666). A total of 77,144 participants with 17 biochemical indicators (i.e., systolic blood pressure (SBP), diastolic blood pressure (DBP), hemoglobin, white blood cell, platelets, fasting serum glucose (FSG), serum glutamic pyruvate transaminase (SGPT), serum glutamic oxaloacetic transaminase (SGOT), serum bilirubin, total cholesterol (TC), triglycerides (TG), total bilirubin, low-density lipoprotein (LDL), high-density lipoprotein (HDL), urine protein, urine sugar, urine ketone body, urine occult blood) and 5 physical indicators (i.e., gender, height, weight, waist, body mass index (BMI)) were included in the study. The above indicators were obtained from regular physical examinations. The characteristics of the biological features in the study population are shown in Table S13. BMI was calculated as weight in kilograms divided by height in meters squared. Urine protein, urine sugar, urine ketone body, and urine occult blood were coded as positive or negative.
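The sample flow and the BMI definition above can be checked with a few lines of arithmetic. This is a minimal sketch: the counts are taken from the text, while the function name `bmi` and the example weight and height are illustrative assumptions.

```python
# Counts reported in the text
aged_30_100 = 418_161          # participants aged 30-100 years
outliers = 30_935              # excluded as outliers for their age and sex
over_20pct_missing = 309_416   # excluded for >20% missing variables
outside_45_90 = 666            # excluded for age <45 or >90

analytic_sample = aged_30_100 - outliers - over_20pct_missing
final_sample = analytic_sample - outside_45_90


def bmi(weight_kg: float, height_m: float) -> float:
    """BMI = weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


print(analytic_sample, final_sample)
print(round(bmi(65.0, 1.70), 2))  # hypothetical participant, not from the data
```

The two subtractions reproduce the analytic sample and the final sample sizes stated in the paragraph.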

Comparison of interpolation methods for missing values

Interpolating the missing values helps improve the model's predictive power. Nevertheless, no single interpolation method is universally best. We compared mean-value, k-Nearest Neighbor (KNN), multiple imputation by chained equations (MICE), AutoEncoder (AE), and round-robin linear regression (RRLR) interpolation under missing completely at random (MCAR) and missing not at random (MNAR) conditions to choose the method that best fitted our data.

Because the true values at the missing positions could not be obtained, and the accuracy of the filled values could not be evaluated directly, simulation experiments were conducted on complete data. The primary process was as follows: (1) A total of 37,320 samples without missing variables were selected to form a simulation dataset, of which 80% were used for training and tuning the core parameters of the models (such as K in the KNN method), and 20% were used for testing and comparing results. (2) The mean and variance of each variable in the training set were calculated. (3) Missing values were introduced into the simulation dataset, either according to the observed missing ratio of each variable in the dataset (MNAR; only variables with a missing ratio > 2% were considered) or at a random missing ratio (MCAR; 5%, 10%, 20%, 30%), and the missing positions were recorded at the same time. (4) After interpolation, the imputed values in the test set were compared with the true values for each method using mean squared error (MSE) and R² (because the variables differ in scale, the results from (2) were used to standardize them).
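The simulation loop in steps (1) to (4) can be sketched as follows for the MCAR case. This is an illustration under stated assumptions rather than the study's pipeline: the data are synthetic, only mean-value interpolation is shown, missing values are introduced into the test split only, and all function names are hypothetical.

```python
import random
import statistics


def make_complete_data(n_rows, n_cols, rng):
    """Synthetic complete dataset (a stand-in for the complete records)."""
    return [[rng.gauss(50.0 + 10 * j, 5.0 + j) for j in range(n_cols)]
            for _ in range(n_rows)]


def column_stats(rows):
    """Step (2): per-variable mean and standard deviation on the training set."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return means, sds


def mask_mcar(rows, rate, rng):
    """Step (3): blank each cell with probability `rate`, recording positions."""
    masked, positions = [], []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j in range(len(row)):
            if rng.random() < rate:
                new_row[j] = None
                positions.append((i, j))
        masked.append(new_row)
    return masked, positions


def impute_mean(rows, means):
    """Mean-value interpolation using training-set means."""
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def standardized_mse(truth, imputed, positions, means, sds):
    """Step (4): MSE at the masked positions after standardizing each variable."""
    errs = [((imputed[i][j] - truth[i][j]) / sds[j]) ** 2 for i, j in positions]
    return statistics.fmean(errs)


rng = random.Random(0)
data = make_complete_data(2000, 4, rng)
split = int(0.8 * len(data))              # step (1): 80% train, 20% test
train, test = data[:split], data[split:]
means, sds = column_stats(train)

results = {}
for rate in (0.05, 0.10, 0.20, 0.30):
    masked, positions = mask_mcar(test, rate, rng)
    filled = impute_mean(masked, means)
    results[rate] = standardized_mse(test, filled, positions, means, sds)
    print(f"MCAR {rate:.0%}: standardized MSE = {results[rate]:.3f}")
```

Because each variable is standardized with training-set statistics, mean-value interpolation gives a standardized MSE close to 1, which serves as a natural baseline for the other methods.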

Feature selection and BA calculation

To avoid the redundancy of latent features, lasso regression was used for feature selection first (the data was standardized to avoid dimensional effects). In the second step, Pearson's correlation was applied to evaluate the correlation of each feature with CA, and features that did not show significant correlations with age (P > 0.05) were excluded.
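The second screening step, the correlation filter against CA, can be sketched as below. The sketch makes several assumptions: the data are synthetic, the feature names are hypothetical, the lasso step is omitted, and the P-value uses the Fisher z-transform with a normal approximation, which is reasonable only for large samples.

```python
import math
import random
from statistics import NormalDist, fmean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided P-value via the Fisher z-transform (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


rng = random.Random(1)
n = 2000
age = [rng.uniform(45, 90) for _ in range(n)]        # synthetic CA
features = {
    # hypothetical feature that rises with age
    "sbp_like": [110 + 0.5 * a + rng.gauss(0, 15) for a in age],
    # hypothetical feature unrelated to age
    "noise_like": [rng.gauss(0, 1) for _ in range(n)],
}

kept = []
for name, values in features.items():
    r = pearson_r(values, age)
    p = approx_p_value(r, n)
    if p <= 0.05:                 # features with P > 0.05 are excluded
        kept.append(name)
    print(f"{name}: r = {r:.3f}, P = {p:.3g}")
```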

Similar to that described in previous publications [19], a total of 19 selected biological features were used as independent variables to construct ML-BA. Our work considered machine learning methods (Multiple Linear Regression (MLR), Generalized Additive Models (GAM), Support Vector Machine (SVM), Adaboost, Gradient Boosting Decision Tree (GBDT), Light Gradient Boosting Machine (LGBM), Catboost, Xgboost, Extra Trees) and neural network methods (Deep Neural Networks (DNN), Convolutional Neural Network (CNN)) that can be used for regression analysis.

Finally, stacking model fusion was performed using the five best-performing models as base models to calculate the final BA in years. The candidate meta-models were MLR, GAM (spline regression), SVM, and random forest (RF). In addition, two Xgboost-based BAs were calculated: one used the parameters from the Stacking model (XGB-BA1), and the other amplified the fit on the training set while keeping the test set results approximately unchanged (XGB-BA2). Both models were trained on the training set, used to predict the full dataset, and compared to show the effect of training set overfitting on BA.

A schematic diagram of the Stacking method is presented in Fig. 6. Specifically, the data were divided into training and test sets at an 8:2 ratio, with CA as the response variable. Each base model underwent 10-fold cross-validation on the training set, i.e., it was trained 10 times (each time with 9 folds as training data and 1 fold as held-out data). The held-out predictions, merged across the 10 folds, formed the training input that the base model provided to the meta-model. Each of the 10 fitted models also predicted the test set, and the mean of the 10 predictions formed the test input provided to the meta-model. This completed the process for one base model. After repeating these steps for the five selected models, the matrix combining the held-out predictions of the different models was the final training set of the meta-model, and the matrix combining their averaged test set predictions was its final test set (the response variable was inherited from the initial training and test sets). The meta-model was then trained in the same way as a single model.
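The out-of-fold procedure described above can be illustrated with a small, dependency-free sketch. It is not the study's implementation: the data are synthetic, the base models are ordinary least-squares regressions on hypothetical feature subsets (stand-ins for the five tree ensembles), folds are assigned by row index, and the meta-model is a plain linear regression.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(p)]
    return solve(A, b)


def predict(coef, X):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]


def take(X, cols):
    return [[row[c] for c in cols] for row in X]


def base_outputs(cols, X_tr, y_tr, X_te, k=10):
    """Out-of-fold training predictions and fold-averaged test predictions."""
    n = len(X_tr)
    oof = [0.0] * n
    test_mean = [0.0] * len(X_te)
    for fold in range(k):
        val_idx = [i for i in range(n) if i % k == fold]
        fit_idx = [i for i in range(n) if i % k != fold]
        coef = fit_linear(take([X_tr[i] for i in fit_idx], cols),
                          [y_tr[i] for i in fit_idx])
        val_pred = predict(coef, take([X_tr[i] for i in val_idx], cols))
        for i, p in zip(val_idx, val_pred):
            oof[i] = p                       # held-out prediction for row i
        for j, p in enumerate(predict(coef, take(X_te, cols))):
            test_mean[j] += p / k            # mean over the k fitted models
    return oof, test_mean


def rmse(pred, truth):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5


rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
y = [65 + 5 * r[0] - 3 * r[1] + 2 * r[2] + rng.gauss(0, 2) for r in X]  # synthetic CA
cut = int(0.8 * len(X))                      # 8:2 split
X_tr, y_tr, X_te, y_te = X[:cut], y[:cut], X[cut:], y[cut:]

subsets = [[0, 1], [1, 2], [0, 2]]           # hypothetical base models
outs = [base_outputs(cols, X_tr, y_tr, X_te) for cols in subsets]
meta_train = [list(t) for t in zip(*(o[0] for o in outs))]
meta_test = [list(t) for t in zip(*(o[1] for o in outs))]

meta_coef = fit_linear(meta_train, y_tr)     # response inherited from the split
stk_pred = predict(meta_coef, meta_test)

print("stacked RMSE:", round(rmse(stk_pred, y_te), 3))
for cols, o in zip(subsets, outs):
    print("base", cols, "RMSE:", round(rmse(o[1], y_te), 3))
```

Because the meta-model only ever sees held-out predictions for the training rows, it is fitted on inputs that resemble what the base models produce on unseen data, which is the property that makes stacking less prone to inheriting base-model overfitting.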

Associations with general health statuses

We assessed how well the estimated BAs captured health risk by examining their relationships with known health risk indicators and by testing whether they differentiated subjects with known diseases from those without.

Health risk indicators describe the general health state of an individual, such as A Body Shape Index (ABSI) [45], Surface-based body shape index (SBSI) [74], waist-to-height ratio (WHtR), and waist-to-hip ratio (WHR). These indicators are associated with various mortality risks. Considering the biological features covered in the dataset, we used ABSI and WHtR as health risk indicators and further adjusted for BMI, CA, and family disease status. WHtR was calculated as the ratio of waist circumference to height. ABSI was obtained by adjusting waist circumference (WC) for height and weight:

$$\text{ABSI} = \frac{\text{WC}}{\text{BMI}^{2/3}\,\text{Height}^{1/2}}$$
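The two indicators can be computed directly from waist circumference, weight, and height. In this minimal sketch the function names and the example participant are hypothetical, and the units are an assumption, since the text does not state them: waist circumference and height in meters and weight in kilograms.

```python
def whtr(waist_m: float, height_m: float) -> float:
    """Waist-to-height ratio (both lengths in the same unit)."""
    return waist_m / height_m


def absi(waist_m: float, weight_kg: float, height_m: float) -> float:
    """ABSI = WC / (BMI^(2/3) * Height^(1/2)); meters and kilograms assumed."""
    bmi = weight_kg / height_m ** 2
    return waist_m / (bmi ** (2 / 3) * height_m ** 0.5)


# Hypothetical participant, not taken from the dataset
w = whtr(0.85, 1.70)
a = absi(0.85, 65.0, 1.70)
print(f"WHtR = {w:.3f}, ABSI = {a:.4f}")
```

For fixed weight and height, ABSI is proportional to waist circumference, so a larger waist at the same BMI gives a higher value.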

For an effective BA model, when BA increases, the health risk indicator should show a corresponding upward trend. Rahman et al. found a clear separation of BA acceleration by WHtR and SBSI categories (quartiles) in different BA predictive models [6].

We also analyzed whether BA differed between healthy subjects and subjects with known chronic diseases [6, 30]. Individuals with more chronic diseases should have higher mean BA levels than people without any chronic diseases. Seven types of disease were diagnosed after physical examination: cerebrovascular disease, kidney disease, heart disease, vascular disease, eye disease, nervous system disease, and other system diseases. We created a binary variable for each type of disease, coded 1 if the disease was present and 0 otherwise. We then summed the disease types of each individual to obtain a disease count variable (ranging from 0 to 7). After accounting for the population distribution, we created a three-category variable for disease counts: no disease, 1 disease, and 2 or more diseases.
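The coding of the disease variables can be sketched as follows. The dictionary keys and function names are hypothetical; only the seven disease types and the three categories come from the text.

```python
DISEASE_TYPES = [
    "cerebrovascular", "kidney", "heart", "vascular",
    "eye", "nervous_system", "other_system",
]


def disease_count(flags: dict) -> int:
    """Sum the seven binary disease indicators (1 = present, 0 = absent)."""
    return sum(int(flags.get(d, 0)) for d in DISEASE_TYPES)


def count_category(count: int) -> str:
    """Collapse the 0-7 count into the three categories used in the analysis."""
    if count == 0:
        return "no disease"
    if count == 1:
        return "1 disease"
    return "2 or more diseases"


# Hypothetical participant with heart and eye disease
participant = {"heart": 1, "eye": 1}
n = disease_count(participant)
print(n, count_category(n))
```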

Statistical Analysis

We trained and optimized BA using training data (80%) and compared the different model results with RMSE, R², and MAE on test data (20%). The Stacking method with a simple meta-model, built on GBDT, LGBM, Catboost, Xgboost, and Extra Trees as base models, was selected to calculate the optimal BA (STK-BA) in years. To emphasize the advantages of the Stacking fusion model, two Xgboost-based BAs (XGB-BAs) with different degrees of over-fitting on the training set were also introduced. Furthermore, to assess the importance of features to BA, the feature importance values (FIVs) of the five base models in the Stacking model were converted to weights and added together [19].
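The three evaluation metrics and one possible reading of the FIV aggregation are sketched below. The weighting scheme is an assumption, since the text does not give the exact conversion: here each model's FIVs are rescaled to sum to 1 and the rescaled values are added across models. All numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def combine_fiv(fiv_by_model):
    """Rescale each model's FIVs to weights summing to 1, then add across models."""
    combined = {}
    for fiv in fiv_by_model.values():
        total = sum(fiv.values())
        for feature, value in fiv.items():
            combined[feature] = combined.get(feature, 0.0) + value / total
    return dict(sorted(combined.items(), key=lambda kv: kv[1], reverse=True))


# Toy ages in years (not study data)
ca = [50.0, 60.0, 70.0]
ba = [52.0, 60.0, 68.0]
print(round(rmse(ca, ba), 3), round(mae(ca, ba), 3), round(r2(ca, ba), 3))

# Toy importances on different raw scales (not study data)
fiv_by_model = {
    "model_a": {"DBP": 30.0, "height": 10.0},
    "model_b": {"DBP": 0.2, "height": 0.8},
}
combined = combine_fiv(fiv_by_model)
print(combined)
```

Rescaling before summing keeps a model whose importances happen to be on a larger raw scale from dominating the combined ranking.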

As shown in Fig. 1, we performed two primary analyses: one for health risk indicators, and one for disease counts and specific diseases. To account for confounding effects and to perform further subgroup analyses, we considered the following covariates: chronological age, family disease status, and BMI. Details are provided in the supplementary material (Tables S13 and S14).

The associations between ML-BAs and health risk indicators were analyzed by MLR. The health risk indicators were further classified into quintiles (Q1–Q5) to examine whether BA changed consistently as the quintile increased (Model 1 was a crude model; Model 2 was adjusted for CA, BMI, and family disease status).
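The quintile comparison can be sketched as below. This is an unadjusted illustration on synthetic data (corresponding loosely to the crude model); the variable names, the rank-based quintile assignment, and the simulated relationship between WHtR and BA are all assumptions.

```python
import random
from statistics import fmean


def quintile_labels(values):
    """Assign Q1..Q5 by rank so that each group holds about 20% of subjects."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, i in enumerate(order):
        labels[i] = f"Q{rank * 5 // len(values) + 1}"
    return labels


rng = random.Random(2)
whtr_values = [rng.gauss(0.52, 0.05) for _ in range(1000)]
# Synthetic BA that rises with WHtR, plus noise
ba_values = [60 + 40 * (w - 0.52) + rng.gauss(0, 3) for w in whtr_values]

labels = quintile_labels(whtr_values)
mean_ba = {
    q: fmean([b for b, lab in zip(ba_values, labels) if lab == q])
    for q in ("Q1", "Q2", "Q3", "Q4", "Q5")
}
for q, m in mean_ba.items():
    print(f"{q}: mean BA = {m:.2f}")
```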

To assess the associations of ML-BAs with full-sample disease counts, we first built MLRs with ML-BAs as the dependent variable. Based on the results of the regression, we estimated BA increments for each disease count category compared with disease-free participants. Subsequently, we used Poisson regression models to examine the associations between ML-BAs and disease counts (the dependent variable). Moreover, logistic regression models (with the presence or absence of disease as the dependent variable) were used to assess the association of specific diseases with BAs. We considered two models: Model 1 was a crude model; Model 2 was adjusted for CA and family disease status.

For linear and Poisson regression models, we recorded coefficients, standard errors (SE), z-score, and P-values; for logistic regression models, we recorded odds ratios (ORs), corresponding 95% confidence intervals (95% CI), z-score, and P-values. Statistical analysis and visualization of all data were performed using R Version 4.1.2, Python Version 3.8.8, and Prism 8. Continuous variables were presented as mean ± SD, while categorical variables were presented as numbers (proportions). P < 0.05 (two-tailed) was considered statistically significant.
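The reported quantities follow from a fitted coefficient and its standard error in the usual Wald fashion. In this minimal sketch the coefficient and SE are hypothetical values, not results from the study, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def wald_z_p(beta: float, se: float):
    """z-score and two-tailed P-value for a regression coefficient."""
    z = beta / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


def odds_ratio_ci(beta: float, se: float, level: float = 0.95):
    """Odds ratio and Wald confidence interval from a logistic coefficient."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    return (math.exp(beta),
            math.exp(beta - z_crit * se),
            math.exp(beta + z_crit * se))


# Hypothetical log-odds increase per one-year increase in BA
beta, se = 0.05, 0.01
z, p = wald_z_p(beta, se)
or_, ci_low, ci_high = odds_ratio_ci(beta, se)
print(f"z = {z:.2f}, P = {p:.2g}")
print(f"OR = {or_:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```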

Abbreviations

BA, biological age; CA, chronological age; ML-BA, machine learning-based BA; STK-BA, BA based on Stacking method with a simple meta-model; XGB-BA, xgboost-based BA; PCA, principal component analysis; MLP, multilayer perceptron; KDM, the Klemera and Doubal method; ML, machine learning; SD, standard deviation; XGBoost, extreme gradient boosting; BMI, body mass index; KNN, k-nearest neighbor; MICE, multiple imputation by chained equations; AE, autoencoder; RRLR, round-robin linear regression; MCAR, missing completely at random; MNAR, missing not at random; MSE, mean squared error; MLR, multiple linear regression; GAM, generalized additive models; SVM, support vector machine; GBDT, gradient boosting decision tree; LGBM, light gradient boosting machine; DNN, deep neural networks; CNN, convolutional neural network; RMSE, root mean squared error; MAE, mean absolute error; DBP, diastolic blood pressure; SBP, systolic blood pressure; SGPT, serum glutamic-pyruvic transaminase; SGOT, serum glutamic-oxaloacetic transaminase; CI, confidence interval; TC, total cholesterol; TG, triglycerides; LDL, low-density lipoprotein; HDL, high-density lipoprotein; PC, platelets count; TCA cycle, tricarboxylic acid cycle; ABSI, A Body Shape Index; SBSI, Surface-based body shape index; WHtR, waist-to-height ratio; WHR, waist-to-hip ratio; FIV, feature importance value; ORs, odds ratios; NCDs, non-communicable chronic diseases; LOOCV, Leave-One-Out Cross-Validation.

Ethics approval and consent to participate

This paper has passed the ethics review of the Ethics Committee of Zhejiang Provincial Center for Disease Control and Prevention (Approval No.: 2022-033-01). All the data and methods involved in this paper are in line with the relevant regulations of ethical review.

The Ethics Committee of Zhejiang Provincial Center for Disease Control and Prevention decided that the project and the papers produced by the project are exempt from the requirement for signed informed consent. The ethics committee conducted an ethical review of the project and held that: 1. The project data are derived from existing medical data, so the project will not cause damage to any individual or bring harm or inconvenience to any individual's daily life and work. 2. The data do not contain personal information such as residents' names, telephone numbers, or addresses; the project researchers are unable to contact the residents and therefore objectively cannot obtain informed consent from the relevant individuals. 3. Once the working system developed in this project is technically mature, it will be provided free of charge to grass-roots community hospitals throughout the province and will not involve commercial interests. The decision was made according to the relevant provisions of the "Measures for Ethical Review of Biomedical Research Involving Humans".

Consent for publication

Not applicable.

Availability of Data and Materials

The data that support the findings of this study are available from the Center for Disease Control of Zhejiang Province, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. However, data are available from the authors upon reasonable request and with permission of the Center for Disease Control of Zhejiang Province.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by the 151 Talent Project of Zhejiang Province (first level); the National Natural Sciences Foundation of China [32070677]; Health technology Plan of Zhejiang Province [2021PY004]; Jiangsu Collaborative Innovation Center for Modern Crop Production and Collaborative Innovation Center for Modern Crop Production cosponsored by the province and ministry.

Authors' Contributions

Resources, Project administration, Funding acquisition: QY, JL, YZ, WW, TL, MC; Investigation, Conceptualization, Methodology, Data analysis, Formal analysis: SG, KL; Writing and visualization: SG, KL; Writing - review & editing, Supervision: SG, KL, QY, ZW, MC. All authors read and approved the final manuscript.

Acknowledgment

We gratefully acknowledge all the people who helped in the establishment of the medical examination data set.

References

Zhang B, Trapp A, Kerepesi C, Gladyshev VN: Emerging rejuvenation strategies—Reducing the biological age. Aging Cell 2022, 21(1):e13538. http://dx.doi.org/https://doi.org/10.1111/acel.13538.
Galkin F, Zhang B, Dmitriev SE, Gladyshev VN: Reversibility of irreversible aging. Ageing Research Reviews 2019, 49:104–114. https://www.sciencedirect.com/science/article/pii/S156816371830254X
Hannum G, Guinney J, Zhao L, Zhang L, Hughes G, Sadda S, Klotzle B, Bibikova M, Fan J-B, Gao Y et al: Genome-wide Methylation Profiles Reveal Quantitative Views of Human Aging Rates. Molecular cell 2013, 49(2):359–367. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3780611/
Jylhävä J, Pedersen NL, Hägg S: Biological Age Predictors. EBioMedicine 2017, 21:29–36. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5514388/
Levine ME: Modeling the rate of senescence: can estimated biological age predict mortality more accurately than chronological age? J Gerontol A Biol Sci Med Sci 2013, 68(6):667–674. http://dx.doi.org/10.1093/gerona/gls233.
Ashiqur Rahman S, Giacobbi P, Pyles L, Mullett C, Doretto G, Adjeroh DA: Deep learning for biological age estimation. Brief Bioinform 2021, 22(2):1767–1781. http://dx.doi.org/10.1093/bib/bbaa021.
Gialluisi A, Di Castelnuovo A, Donati MB, de Gaetano G, Iacoviello L, Moli-sani Study I: Machine Learning Approaches for the Estimation of Biological Aging: The Road Ahead for Population Studies. Front Med (Lausanne) 2019, 6:146–146. http://dx.doi.org/10.3389/fmed.2019.00146.
Jia L, Zhang W, Jia R, Zhang H, Chen X: Construction Formula of Biological Age Using the Principal Component Analysis. BioMed Research International 2016, 2016:e4697017. https://www.hindawi.com/journals/bmri/2016/4697017/
Jylhävä J, Pedersen NL, Hägg S: Biological Age Predictors. EBioMedicine 2017, 21:29–36. https://doi.org/10.1016/j.ebiom.2017.03.046.
Park J, Cho B, Kwon H, Lee C: Developing a biological age assessment equation using principal component analysis and clinical biomarkers of aging in Korean men. Archives of Gerontology and Geriatrics 2009, 49(1):7–12. https://doi.org/10.1016/j.archger.2008.04.003.
Tzemah-Shahar R, Hochner H, Iktilat K, Agmon M: What can we learn from physical capacity about biological age? A systematic review. Ageing Research Reviews 2022, 77:101609. https://doi.org/10.1016/j.arr.2022.101609.
di Giuseppe R, Arcari A, Serafini M, Di Castelnuovo A, Zito F, De Curtis A, Sieri S, Krogh V, Pellegrini N, Schünemann HJ et al: Total dietary antioxidant capacity and lung function in an Italian population: a favorable role in premenopausal/never smoker women. European Journal of Clinical Nutrition 2012, 66(1):61–68. http://dx.doi.org/10.1038/ejcn.2011.148.
Russoniello CV, Zhirnov YN, Pougatchev VI, Gribkov EN: Heart Rate Variability and Biological Age: Implications for Health and Gaming. Cyberpsychology, Behavior, and Social Networking 2013, 16(4):302–308. http://dx.doi.org/10.1089/cyber.2013.1505.
Hannum G, Guinney J, Zhao L, Zhang L, Hughes G, Sadda S, Klotzle B, Bibikova M, Fan J-B, Gao Y et al: Genome-wide Methylation Profiles Reveal Quantitative Views of Human Aging Rates. Molecular Cell 2013, 49(2):359–367. https://doi.org/10.1016/j.molcel.2012.10.016.
Zhang W-G, Zhu S-Y, Bai X-J, Zhao D-L, Jiang S-M, Li J, Li Z-X, Fu B, Cai G-Y, Sun X-F et al: Select aging biomarkers based on telomere length and chronological age to build a biological age equation. AGE 2014, 36(3):9639. http://dx.doi.org/10.1007/s11357-014-9639-y.
Nakamura E, Miyao K: A Method for Identifying Biomarkers of Aging and Constructing an Index of Biological Age in Humans. The Journals of Gerontology: Series A 2007, 62(10):1096–1105. https://doi.org/10.1093/gerona/62.10.1096.
Bae C-Y, Kang YG, Kim S, Cho C, Kang HC, Yu BY, Lee S-W, Cho KH, Lee DC, Lee K et al: Development of models for predicting biological age (BA) with physical, biochemical, and hormonal parameters. Archives of Gerontology and Geriatrics 2008, 47(2):253–265. https://www.sciencedirect.com/science/article/pii/S016749430700180X.
Klemera P, Doubal S: A new approach to the concept and computation of biological age. Mechanisms of Ageing and Development 2006, 127(3):240–248. https://www.sciencedirect.com/science/article/pii/S0047637405002654.
Cao X, Yang G, Jin X, He L, Li X, Zheng Z, Liu Z, Wu C: A Machine Learning-Based Aging Measure Among Middle-Aged and Older Chinese Adults: The China Health and Retirement Longitudinal Study. Front Med (Lausanne) 2021, 8:698851–698851. http://dx.doi.org/10.3389/fmed.2021.698851.
Jin X, Xiong S, Ju S-Y, Zeng Y, Yan LL, Yao Y: Serum 25-Hydroxyvitamin D, Albumin, and Mortality Among Chinese Older Adults: A Population-based Longitudinal Study. The Journal of Clinical Endocrinology & Metabolism 2020, 105(8):2762–2770. http://dx.doi.org/10.1210/clinem/dgaa349.
Klemera P, Doubal S: A new approach to the concept and computation of biological age. Mechanisms of Ageing and Development 2006, 127(3):240–248. https://doi.org/10.1016/j.mad.2005.10.004.
Putin E, Mamoshina P, Aliper A, Korzinkin M, Moskalev A, Kolosov A, Ostrovskiy A, Cantor C, Vijg J, Zhavoronkov A: Deep biomarkers of human aging: Application of deep neural networks to biomarker development. Aging (Albany NY) 2016, 8(5):1021–1033. http://dx.doi.org/10.18632/aging.100968.
Bobrov E, Georgievskaya A, Kiselev K, Sevastopolsky A, Zhavoronkov A, Gurov S, Rudakov K, Del Pilar Bonilla Tobar M, Jaspers S, Clemann S: PhotoAgeClock: deep learning algorithms for development of non-invasive visual biomarkers of aging. Aging (Albany NY) 2018, 10(11):3249–3259. http://dx.doi.org/10.18632/aging.101629.
Zhavoronkov A, Mamoshina P, Vanhaelen Q, Scheibye-Knudsen M, Moskalev A, Aliper A: Artificial intelligence for aging and longevity research: Recent advances and perspectives. Ageing Research Reviews 2019, 49:49–66. https://doi.org/10.1016/j.arr.2018.11.003.
Chen L-K: Machine Learning Improves Analysis of Multi-Omics Data in Aging Research and Geroscience. Archives of Gerontology and Geriatrics 2021, 93:104360. https://www.sciencedirect.com/science/article/pii/S0167494321000236
Pyrkov TV, Slipensky K, Barg M, Kondrashin A, Zhurov B, Zenin A, Pyatnitskiy M, Menshikov L, Markov S, Fedichev PO: Extracting biological age from biomedical data via deep learning: too much of a good thing? Scientific Reports 2018, 8(1):5210. https://www.nature.com/articles/s41598-018-23534-9
Bae C-Y, Im Y, Lee J, Park C-S, Kim M, Kwon H, Kim B, Park Hr, Lee C-K, Kim I et al: Comparison of Biological Age Prediction Models Using Clinical Biomarkers Commonly Measured in Clinical Practice Settings: AI Techniques Vs. Traditional Statistical Methods. Frontiers in Analytical Science 2021, 1. http://dx.doi.org/10.3389/frans.2021.709589.
Liu Z: Development and Validation of 2 Composite Aging Measures Using Routine Clinical Biomarkers in the Chinese Population: Analyses From 2 Prospective Cohort Studies. J Gerontol A Biol Sci Med Sci 2021, 76(9):1627–1632. http://dx.doi.org/10.1093/gerona/glaa238.
Finkel D, Sternäng O, Wahlin Å: Genetic and Environmental Influences on Longitudinal Trajectories of Functional Biological Age: Comparisons Across Gender. Behav Genet 2017, 47(4):375–382. http://dx.doi.org/10.1007/s10519-017-9851-5.
Rahman SA, Adjeroh DA: Deep Learning using Convolutional LSTM estimates Biological Age from Physical Activity. Sci Rep 2019, 9(1):11425–11425. http://dx.doi.org/10.1038/s41598-019-46850-0.
Wang C, Guan X, Bai Y, Feng Y, Wei W, Li H, Li G, Meng H, Li M, Jie J et al: A machine learning–based biological aging prediction and its associations with healthy lifestyles: the Dongfeng–Tongji cohort. Annals of the New York Academy of Sciences 2022, 1507(1):108–120. https://doi.org/10.1111/nyas.14685.
Austin PC, White IR, Lee DS, van Buuren S: Missing Data in Clinical Research: A Tutorial on Multiple Imputation. The Canadian journal of cardiology 2021, 37(9):1322–1331. http://dx.doi.org/10.1016/j.cjca.2020.11.010.
Srivastava S, Soman S, Rai A, Srivastava PK: Deep learning for health informatics: Recent trends and future directions. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI): 13–16 Sept. 2017. 1665–1670.
Beaulieu-Jones BK, Lavage DR, Snyder JW, Moore JH, Pendergrass SA, Bauer CR: Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis. JMIR medical informatics 2018, 6(1):e11-e11. http://dx.doi.org/10.2196/medinform.8960.
Peralta M, Jannin P, Haegelen C, Baxter JSH: Data imputation and compression for Parkinson's disease clinical questionnaires. Artificial Intelligence in Medicine 2021, 114:102051. https://doi.org/10.1016/j.artmed.2021.102051.
Das S, Datta S, Chaudhuri BB: Handling data irregularities in classification: Foundations, trends, and future challenges. Pattern Recognition 2018, 81:674–693. https://doi.org/10.1016/j.patcog.2018.03.008.
Zahid FM, Heumann C: Multiple imputation with sequential penalized regression. Statistical Methods in Medical Research 2018, 28(5):1311–1327. http://dx.doi.org/10.1177/0962280218755574.
Lee JY, Styczynski MP: NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics: Official journal of the Metabolomic Society 2018, 14(12):153–153. http://dx.doi.org/10.1007/s11306-018-1451-8.
Qiu YL, Zheng H, Gevaert O: Genomic data imputation with variational auto-encoders. GigaScience 2020, 9(8). http://dx.doi.org/10.1093/gigascience/giaa082.
Silva Hd, Perera AS: Missing data imputation using Evolutionary k- Nearest neighbor algorithm for gene expression data. In: 2016 Sixteenth International Conference on Advances in ICT for Emerging Regions (ICTer): 1–3 Sept. 2016. 141–146.
Lin H, Lunetta KL, Zhao Q, Mandaviya PR, Rong J, Benjamin EJ, Joehanes R, Levy D, van Meurs JBJ, Larson MG et al: Whole Blood Gene Expression Associated With Clinical Biological Age. The Journals of Gerontology: Series A 2019, 74(1):81–88. http://dx.doi.org/10.1093/gerona/gly164.
Pyrkov TV, Slipensky K, Barg M, Kondrashin A, Zhurov B, Zenin A, Pyatnitskiy M, Menshikov L, Markov S, Fedichev PO: Extracting biological age from biomedical data via deep learning: too much of a good thing? Sci Rep 2018, 8(1):5210–5210. http://dx.doi.org/10.1038/s41598-018-23534-9.
Mamoshina P, Kochetov K, Putin E, Cortese F, Aliper A, Lee W-S, Ahn S-M, Uhn L, Skjodt N, Kovalchuk O et al: Population Specific Biomarkers of Human Aging: A Big Data Study Using South Korean, Canadian, and Eastern European Patient Populations. J Gerontol A Biol Sci Med Sci 2018, 73(11):1482–1490. http://dx.doi.org/10.1093/gerona/gly005.
Mørkedal B, Romundstad PR, Vatten LJ: Informativeness of indices of blood pressure, obesity and serum lipids in relation to ischaemic heart disease mortality: the HUNT-II study. Eur J Epidemiol 2011, 26(6):457–461. http://dx.doi.org/10.1007/s10654-011-9572-7.
Krakauer NY, Krakauer JC: A new body shape index predicts mortality hazard independently of body mass index. PLoS One 2012, 7(7):e39504-e39504. http://dx.doi.org/10.1371/journal.pone.0039504.
Rivero-Segura NA, Bello-Chavolla OY, Barrera-Vázquez OS, Gutierrez-Robledo LM, Gomez-Verjan JC: Promising biomarkers of human aging: In search of a multi-omics panel to understand the aging process from a multidimensional perspective. Ageing Research Reviews 2020, 64:101164. https://doi.org/10.1016/j.arr.2020.101164.
Yoo J, Kim Y, Cho ER, Jee SH: Biological age as a useful index to predict seventeen-year survival and mortality in Koreans. BMC Geriatr 2017, 17(1):7–7. http://dx.doi.org/10.1186/s12877-016-0407-y.
Waziry R, Gras L, Sedaghat S, Tiemeier H, Weverling GJ, Ghanbari M, Klap J, de Wolf F, Hofman A, Ikram MA et al: Quantification of biological age as a determinant of age-related diseases in the Rotterdam Study: a structural equation modeling approach. European Journal of Epidemiology 2019, 34(8):793–799. http://dx.doi.org/10.1007/s10654-019-00497-3.
Song Q, Shepperd M: Missing Data Imputation Techniques. International Journal of Business Intelligence and Data Mining 2007, 2(3):261–291. http://dx.doi.org/10.1504/IJBIDM.2007.015485.
Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O: A survey on missing data in machine learning. Journal of Big Data 2021, 8(1):140. http://dx.doi.org/10.1186/s40537-021-00516-9.
LeCun Y, Bengio Y, Hinton G: Deep learning. Nature 2015, 521(7553):436–444. http://dx.doi.org/10.1038/nature14539.
Pereira RC, Santos M, Rodrigues P, Henriques Abreu P: Reviewing Autoencoders for Missing Data Imputation: Technical Trends, Applications and Outcomes. Journal of Artificial Intelligence Research 2020, 69:1255–1285. http://dx.doi.org/10.1613/jair.1.12312.
Yu L, Liu L, Peace KE: Regression multiple imputation for missing data analysis. Statistical Methods in Medical Research 2020, 29(9):2647–2664. http://dx.doi.org/10.1177/0962280220908613.
Chen Y, Wong M-L, Li H: Applying Ant Colony Optimization to configuring stacking ensembles for data mining. Expert Systems with Applications 2014, 41(6):2688–2702. https://doi.org/10.1016/j.eswa.2013.10.063.
Pinto E: Blood pressure and ageing. Postgraduate medical journal 2007, 83(976):109–114. http://dx.doi.org/10.1136/pgmj.2006.048371.
Le Blanc J, Lordkipanidzé M: Platelet Function in Aging. Frontiers in Cardiovascular Medicine 2019, 6
Segal JB, Moliterno AR: Platelet Counts Differ by Sex, Ethnicity, and Age in the United States. Annals of Epidemiology 2006, 16(2):123–130. https://doi.org/10.1016/j.annepidem.2005.06.052.
Troussard X, Vol S, Cornet E, Bardet V, Couaillac J-P, Fossat C, Luce J-C, Maldonado E, Siguret V, Tichet J et al: Full blood count normal reference values for adults in France. Journal of Clinical Pathology 2014, 67(4):341. http://dx.doi.org/10.1136/jclinpath-2013-201687.
Krakauer JC, Franklin B, Kleerekoper M, Karlsson M, Levine JA: Body Composition Profiles Derived From Dual-Energy X-Ray Absorptiometry, Total Body Scan, and Mortality. Preventive Cardiology 2004, 7(3):109–115. https://doi.org/10.1111/j.1520-037X.2004.3326.x.
Samaras TT: Should we be concerned over increasing body height and weight? Experimental Gerontology 2009, 44(1):83–92. https://doi.org/10.1016/j.exger.2008.02.002.
Salaris L, Poulain M, Samaras TT: Height and Survival at Older Ages among Men Born in an Inland Village in Sardinia (Italy), 1866–2006. Biodemography and Social Biology 2012, 58(1):1–13. http://dx.doi.org/10.1080/19485565.2012.666118.
Maier AB, van Heemst D, Westendorp RGJ: RELATION BETWEEN BODY HEIGHT AND REPLICATIVE CAPACITY OF HUMAN FIBROBLASTS IN NONAGENARIANS. The Journals of Gerontology: Series A 2008, 63(1):43–45. http://dx.doi.org/10.1093/gerona/63.1.43.
Hägg S, Jylhävä J: Sex differences in biological aging with a focus on human studies. eLife 2021, 10:e63425. http://dx.doi.org/10.7554/eLife.63425.
Li J, Guasch-Ferré M, Chung W, Ruiz-Canela M, Toledo E, Corella D, Bhupathiraju SN, Tobias DK, Tabung FK, Hu J et al: The Mediterranean diet, plasma metabolome, and cardiovascular disease risk. European Heart Journal 2020, 41(28):2645–2656. http://dx.doi.org/10.1093/eurheartj/ehaa209.
Lam AP, Gundabolu K, Sridharan A, Jain R, Msaouel P, Chrysofakis G, Yu Y, Friedman E, Price E, Schrier S et al: Multiplicative interaction between mean corpuscular volume and red cell distribution width in predicting mortality of elderly patients with and without anemia. American journal of hematology 2013, 88(11):E245-E249. http://dx.doi.org/10.1002/ajh.23529.
Passarino G, Montesanto A, De Rango F, Garasto S, Berardelli M, Domma F, Mari V, Feraco E, Franceschi C, De Benedictis G: A cluster analysis to define human aging phenotypes. Biogerontology 2007, 8(3):283–290. http://dx.doi.org/10.1007/s10522-006-9071-5.
Guida JL, Ahles TA, Belsky D, Campisi J, Cohen HJ, DeGregori J, Fuldner R, Ferrucci L, Gallicchio L, Gavrilov L et al: Measuring Aging and Identifying Aging Phenotypes in Cancer Survivors. JNCI: Journal of the National Cancer Institute 2019, 111(12):1245–1254. http://dx.doi.org/10.1093/jnci/djz136.
Bell CG, Lowe R, Adams PD, Baccarelli AA, Beck S, Bell JT, Christensen BC, Gladyshev VN, Heijmans BT, Horvath S et al: DNA methylation aging clocks: challenges and recommendations. Genome Biology 2019, 20(1):249. http://dx.doi.org/10.1186/s13059-019-1824-y.
Salameh Y, Bejaoui Y, El Hajj N: DNA Methylation Biomarkers in Aging and Age-Related Diseases. Frontiers in Genetics 2020, 11. http://dx.doi.org/10.3389/fgene.2020.00171.
Menni C, Kastenmüller G, Petersen AK, Bell JT, Psatha M, Tsai P-C, Gieger C, Schulz H, Erte I, John S et al: Metabolomic markers reveal novel pathways of ageing and early development in human populations. International journal of epidemiology 2013, 42(4):1111–1119. http://dx.doi.org/10.1093/ije/dyt094.
Srivastava S: Emerging Insights into the Metabolic Alterations in Aging Using Metabolomics. Metabolites 2019, 9(12):301. http://dx.doi.org/10.3390/metabo9120301.
Shahmirzadi AA, Edgar D, Liao C-Y, Hsu Y-M, Lucanic M, Shahmirzadi AA, Wiley C, Riley R, Kaplowitz B, Gan G et al: Alpha-ketoglutarate, an endogenous metabolite, extends lifespan and compresses morbidity in aging mice. bioRxiv 2019:779157. http://dx.doi.org/10.1101/779157.
Prineas RJ, Le A, Soliman EZ, Zhang Z-M, Howard VJ, Ostchega Y, Howard G: United States National Prevalence of Electrocardiographic Abnormalities in Black and White Middle-Age (45- to 64-Year) and Older (≥ 65-Year) Adults (from the Reasons for Geographic and Racial Differences in Stroke Study). The American Journal of Cardiology 2012, 109(8):1223–1228. http://dx.doi.org/10.1016/j.amjcard.2011.11.061.
Rahman SA, Adjeroh D: Surface-Based Body Shape Index and Its Relationship with All-Cause Mortality. PLoS One 2015, 10(12):e0144639. http://dx.doi.org/10.1371/journal.pone.0144639.

No competing interests reported.

SupplementaryMaterials0602.docx

Editorial decision: Major revision (24 Aug, 2022)
Reviews received at journal (02 Aug, 2022)
Reviews received at journal (14 Jul, 2022)
Reviews received at journal (27 Jun, 2022)
Reviewers agreed at journal (26 Jun, 2022)
Reviewers agreed at journal (26 Jun, 2022)
Reviewers invited by journal (25 Jun, 2022)
Editor assigned by journal (20 Jun, 2022)
Editor invited by journal (20 Jun, 2022)
Submission checks completed at journal (20 Jun, 2022)
First submitted to journal (03 Jun, 2022)
