There is no general missing value interpolation method, but only the most appropriate. We compared five classical but effective methods for the Chinese physical examination data and found that RRLR performed best under the same missing ratio of the original data. However, the superior performance of the RRLR method was not universal, and it was more suitable for low missing ratios (e.g. less than 30%). This is because the strategy of RRLR is to build regression models to predict and impute the missing features according to other complete samples in an iterative loop [49]. Although this strategy allows RRLR to utilize as many observations as possible during interpolation, regression typically requires many samples with non-missing data to produce stable results [50]. AE interpolation showed the best stability. As a common artificial neural network in deep learning, deep AE can perform representation learning on the input information, form a higher-level feature map, and then reconstruct the data at the output, reducing sensitivity to higher missing rates [51, 52]. Likewise, Yu L et al. pointed out that multiple regression imputation was suitable for filling in the missing in the WHO ARI Multicentre Study of clinical signs and etiologic agent dataset [53].
More importantly, we found an interesting phenomenon in the previous Chinese population-based ML-BA, which had not been discussed before. When the correlation or R2 between BA and CA was taken as the criterion, the results on the test set were quite different from the final prediction of BA on the full dataset. Taking the previous XGB-BA as an example, the R2 of the model in the test set was 0.27, while the correlation between BA and CA was 0.75 in the final results (BA to CA regression belonged to simple linear regression, so R = cor = 0.75, R2 = 0.56) [19]. The same was also found in the XGB-BA based on the Dongfeng-Tongji cohort [31]. This might be because the model trained on the training set predicted BA on the full dataset, which introduced interference from parameter tuning and training overfitting. However, this still requires further confirmation, as previous studies did not explicitly state how the model was obtained when it finally predicted BA. In any case, the consistency of the test set with the final results is what we would expect.
The correlation between BA and CA was usually regarded as an indispensable index to evaluate BA prediction models. However, after selecting the best model, how to obtain stable correlation analysis results with BA in the whole sample is also of high value. Two generally used health statuses (health risk indicators and disease status) were used as different evaluation aspects to illustrate the influence of different overfitting degrees on correlation strength and significance in ML-BAs. We found that even with similar test set results, as the overfitting degree increased, XGB-BA2 exhibited less obvious associations with health risk indicators (ABSI, WHtR), disease counts, nervous system disease, and eye disease. This finding suggested that the results of association analysis would vary due to parameter selection and other reasons. This can be attributed to the fact that the core purpose of BA is to capture aging features beyond CA, while overfitting causes the model to over-learn the CA feature of the training set. Xingqi Cao et al. adopted default parameters in the model to overcome this problem [39], but it did not work fundamentally.
To avoid overfitting affecting the stability of the association results between BA and health outcomes in the entire dataset, we propose three possible solutions. The first is to let the model show basically the same fitting results on the training set and test set, which is the most convenient and least expensive. Secondly, the method of using cross-modeling to predict, such as LOOCV or K-fold, always keeps the final predicted samples from participating in the construction of the model, but this will produce multiple models that are not exactly the same. The prediction accuracy of each model also usually varies due to parameters and different training samples. Therefore, this method presents new challenges for practical application and less time cost. The third is to use only the sample results on the test set for further analysis, but this does not meet the principle of maximizing the use of data and reduces the reliability of the results.
For this case, our proposed STK-BA could improve the correlation between BA and CA while maintaining the consistency of the model results (the correlation of the training set and the test set are the same in three decimal places). What’s more, the positive association of STK-BA with health risk indicators, disease counts, and specific diseases was also more pronounced, suggesting that it better captures the aging-related features behind diseases. This may be attributed to the biological features we considered to represent different physiological functions or dimensions: immune system (e.g. platelet count, white blood cell), cardio-metabolic system (e.g. HDL, DBP), liver function (e.g. SGPT, SGOT), phenotypic dimension (e.g. height, waist), kidney injury (e.g. urine protein). Additionally, the associations we considered included eye disease and kidney disease, which were also not covered in previous Chinese population studies [28].
The Stacking method we adopted is a mechanism to combine the learned types of models into one, consisting of base models and a meta-model [54]. Instead of selecting a model from multiple models for generalization or simple averaging, Stacking uses a meta-model to balance the features (the output of the base model) and predict [50], which is somewhat like a two-layer neural network. Cross-validation in Stacking and simple meta-model are the keys to overcoming the overfitting of the complex base model, which makes the correlation between model-predicted value and target value basically consistent on the final training set and test set. However, an overly complex meta-model will also lead to overfitting. We observed this when utilizing RF as a meta-model (Table 1). More importantly, the Stacking method is equally applicable to the BA based on a single model. A simple meta-model could enable various BA models to obtain stable prediction BA on the whole sample for correlation analysis, which could be further extended.
The correlation between our STK-BA and CA (r = 0.66) on the test set was better than previously published BA (r = 0.52) based on 19 blood biomarkers [19] but weaker than BA (r = 0.74) which considered 44 biomarkers including lung function. This phenomenon is plausible, depending on the population-specific and age-related biosignatures in different datasets [31]. However, it is worth noting that we showed better CA correlations with the same number of biomarkers in the Chinese population. Additionally, Mamoshina et al. found that models trained in a given population declined in correlation when tested across ethnicities (given population: R2 ranged from 0.49 to 0.69; different populations: R2 ranged from 0.24 to 0.34) [43]. ML-BA would exhibit different correlations with CA due to differences in population and biometrics [43]. Therefore, we constructed ML-BA using Chinese populations from different sources, and this helped to further confirm the applicability of ML-BA in the Chinese population by associating aging measures with important health conditions and outcomes.
DBP, height, SBP, gender, and platelet content were the five most essential variables screened out in the Stacking model, which may play a vital role in assessing BA differences in different populations. In fact, DBP, SBP, and PC have been widely found to be biomarkers closely related to biological aging. Pinto E [55] noted that elevated pulse pressure due to decreased DBP and increased SBP was the most potent risk predictor in older adults and was associated with older age. In epidemiological studies, aging populations were more likely to exhibit features of lower PC and higher platelet activity, which are associated with higher rates of cardiovascular disease [56–58]. The link between gender or height and aging was also frequently mentioned [59, 60]. In a study of conscripts from Italian inland villages, short people (height less than 161.1 cm) generally had higher survival rates than tall peers [61]. This may be related to caloric restriction, cell replication potential, telomere shortening, and cardiac pumping efficiency [61, 62]. What’s more, the gender-driven characteristics of aging have become the focus of current attention, with gender differences in life expectancy, biological aging, and frailty indices [63]. Of these, women are generally biologically younger than men, consistent with a lower BA assessed by molecular biomarkers [9].
Overall, the BA measurement model we developed integrated multidimensional biosignatures that more systematically reflected human aging. This line of evidence reinforces our findings and suggests that the variable screening results of the Stacking model are biologically interpretable. Besides, although fewer biological features are considered in the model, this facilitates the generalization and practical application of the model and its workflow.
The large sample data of Chinese medical examination data enables us to explore the influence of fitting on the stability of correlation results and develop a new composite BA prediction model after comparing the most suitable interpolation methods. Nonetheless, several limitations need to be discussed. First, although the interpolation methods explored in this study are convenient and practical, more novel missing value imputation methods can be further attempted to be transferred to the medical examination dataset [40], such as the variational AE applied to Genomic data imputation [39]. Second, our data lacked information on outcome variables (e.g., death) to establish a link between BA and survival analysis. We, therefore, associated BA with a health risk indicator that predicted mortality risk instead. Third, the training and test sets of the BA prediction model are both from the same dataset. Testing with external datasets will further evaluate the generalization ability of the ML-BAs [64]. Finally, the biological features used in the study were mostly limited to biochemical indicators, and aging-related indicators that have been discovered, such as mean corpuscular volume, are not included in our data. These may weaken the interpretability of predicted BA and fail to supplement the validation of more existing results [19, 65]. However, with the popularization of big medical data, phenotype information (e.g. cognitive level, gait speed [66, 67]), methylation data (e.g. CpG sites [68, 69]), metabolomic features and pathways (e.g. C-glycosyl tryptophan, α-ketoglutarate and TCA cycle [70–72]) will be more convenient, which assists in predicting and explaining the aging process more systematically. Therefore, as more dimensions of individual indicators are taken into account, our composite BA and its construction process will have a broader reference value.