Data collection and study design. An overview of the workflow is shown in Figure 1. After selection of a set of blood variables15 and two metabolic indices (i.e., body weight [BW] and fasting blood glucose [FBG]), a deep neural network (DNN) was trained (80% of the data) and tested (20% of the data) to predict blood age, and then used to calculate aging acceleration and its impact on all-cause mortality. Initial work used the SLAM dataset alone, which consisted of 10,463 measurements from 1,997 mice of both sexes and two strains (i.e., C57BL/6J and HET3) housed at the National Institute on Aging vivarium (Table T1, cohorts C1–C10).
Training and testing the clock on the SLAM dataset yielded a low mean error and a high correlation between blood-predicted age and chronological age (MAE = 14.12, RMSE = 18.52, r = 0.82, p-value < 0.001; all results on the testing dataset).
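As a minimal sketch of how such a split-and-score evaluation can be set up (the network size, hyperparameters, and function names below are illustrative assumptions, not those of the study):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def clock_metrics(pred, true):
    """MAE, RMSE, and Pearson r between predicted blood age and chronological age."""
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    mae = np.mean(np.abs(pred - true))
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    r, p = pearsonr(pred, true)
    return mae, rmse, r, p

def train_and_evaluate(X, y, seed=0):
    """80:20 split, fit a small feed-forward network, score the 20% hold-out."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=seed)
    net.fit(X_tr, y_tr)
    return clock_metrics(net.predict(X_te), y_te)
```

The same three metrics (MAE, RMSE, Pearson r) are reported throughout the section, so separating the scoring step makes the different evaluations directly comparable.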
Longitudinal changes in blood variables. Given these promising results, we next assessed whether the longitudinal structure of the data could provide additional biologically relevant information and improve the prediction of blood age. To do so, the neural network was trained and tested with the longitudinal changes between timepoints for each blood variable added as features (up to ten samples collected over the lifespan of each animal, with an average of 4.8 blood tests per animal). Including this information consistently improved model performance, reducing the mean error by more than 16% and increasing the correlation by almost 7.5%, from 0.82 to 0.88 (MAE = 12.45, RMSE = 15.95, r = 0.88, p-value < 0.001). This result may be relevant when translating this blood-based clock to humans, as more than one blood draw would be required for the clock to perform optimally.
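The delta features described above amount to per-animal differences between consecutive draws; a sketch (column names are hypothetical):

```python
import pandas as pd

def add_longitudinal_deltas(df, id_col="mouse_id", time_col="age_weeks", features=()):
    """Append each animal's change between consecutive blood draws as extra
    features; the delta is NaN for an animal's first sample."""
    df = df.sort_values([id_col, time_col]).copy()
    for col in features:
        df[f"d_{col}"] = df.groupby(id_col)[col].diff()
    return df
```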
The hematological clock shows consistent age prediction in an independent sample of genetically diverse mice. One of the main issues when constructing prediction tools is overfitting. Several approaches can be implemented to protect models from this limitation and improve their generalizability. Here, the data were pre-processed separately after an 80:20 (training:testing) split; the training process included regularization techniques such as elastic net penalization, dropout, and early stopping; and all samples from the same animal were kept together, in either the training or the testing dataset, to prevent information from the hold-out set leaking into training of the DNN.
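Keeping all of an animal's repeated measures on one side of the split can be done with a group-aware splitter; a sketch using scikit-learn (variable names are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

def animal_level_split(X, y, animal_ids, test_size=0.2, seed=0):
    """80:20 split that keeps every sample from the same animal on one side,
    so repeated measures cannot leak from training into the hold-out set."""
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(gss.split(X, y, groups=animal_ids))
    return train_idx, test_idx
```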
To directly assess the statistical generalizability of our approach, we trained and validated the clock, using this same set of features, on an independent longitudinal dataset encompassing 1,535 samples from 563 mice. This second dataset comprised Diversity Outbred (DO) mice of both sexes, with up to three blood samples per animal (Table T1, cohorts G7–G11). In this case, the DO mice were maintained on a different diet and housed at The Jackson Laboratory’s animal facility.
Although the initial version of the clock included all blood variables derived from the ADVIA® 2120i Hematology System15, only those shared between the SLAM project and the JAX study were retained as features in the clock. Additionally, highly correlated variables (pairwise correlation above 0.80) were removed from the feature set to avoid redundancy. After these steps, a group of 14 blood variables and two metabolic indices comprised the core set of features used to build the final version of the hematological clock (Supplementary Table S1). The training, testing, and assessment of clock performance were carried out in two ways: i) merging the SLAM and DO datasets, and ii) using only blood samples from the DO mice to train and test the model. In both cases the pipeline had the same structure, so the results were comparable. A low error and a high correlation between blood age and chronological age were achieved when merging the two datasets (MAE = 11.95, RMSE = 15.41, r = 0.87, p-value < 0.001). Using the DO dataset alone yielded a further improvement (MAE = 4.28, RMSE = 6.98, r = 0.95, p-value < 0.001). We hypothesize that part of this improvement could be explained by the simpler longitudinal structure of the JAX dataset, which has observations at only three timepoints (i.e., 12, 18, and 24 months) compared with the 10 timepoints in the SLAM study. Nevertheless, the good results achieved by the clock in the DO strain when the analysis was considered cross-sectionally (MAE = 10.93, RMSE = 14.45, r = 0.78, p-value < 0.001) suggest that factors beyond the specific longitudinal structure of the sample contribute to the improved prediction in the JAX dataset relative to SLAM.
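The redundancy filter can be sketched as a greedy pass that keeps a feature only if it is not too correlated with any feature already kept (a simplification; the study's exact selection order is not specified):

```python
import pandas as pd

def drop_correlated(df, threshold=0.80):
    """Greedily keep each column only if its absolute pairwise Pearson
    correlation with every previously kept column is at most the threshold."""
    corr = df.corr().abs()
    keep = []
    for col in corr.columns:
        if all(corr.loc[col, k] <= threshold for k in keep):
            keep.append(col)
    return df[keep]
```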
Given that the DO mice are genetically more heterogeneous than the B6/HET3 strains and show higher CBC variability, as reflected in their phenome database16, we performed an intraclass correlation coefficient (ICC) analysis to assess whether blood variability was associated with clock performance. To determine blood variability among HET3/B6 mice while keeping the size and longitudinal structure comparable to those of the DO dataset, we filtered the SLAM dataset to include only the three timepoints shared with the JAX study (blood samples collected at 12, 18, and 24 months of age). To avoid sample bias, the ICC was calculated over multiple iterations, each using 70% of SLAM samples (approximately the size of the JAX dataset), and the results were averaged. The calculations were done on harmonized datasets to remove batch effects. As anticipated, higher ICC was observed for most of the 16 features used in the clock when comparing the heterogeneous DO strain with HET3 or B6, and when comparing HET3 with B6 (Supplementary Table S2). In this study, the genetic heterogeneity of the DO strain was associated with higher CBC variability and could play a role in the superior performance of the hematologic clock. However, the HET3/B6 comparison showed higher blood variability but less accurate predictions in HET3 than in B6 mice, making it difficult to ascertain any correlation between blood variability and clock performance. We also found that the hematological clock had slightly better predictive performance when processing the B6 strain alone (MAE = 11.17, RMSE = 14.82, r = 0.89, p-value < 0.001) than when based on HET3 mice (MAE = 13.21, RMSE = 17.54, r = 0.85, p-value < 0.001). The clock also made slightly better predictions in males (MAE = 11.2, RMSE = 14.7, r = 0.89, p-value < 0.001) than in females (MAE = 12.6, RMSE = 16.5, r = 0.86, p-value < 0.001) (Figure 2).
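For reference, a one-way random-effects ICC of the kind used to gauge between-animal variability can be computed from the between- and within-subject mean squares; a sketch for a balanced (subjects × repeats) array (the study's exact ICC form is not specified, so this is an assumption):

```python
import numpy as np

def icc_oneway(measurements):
    """ICC(1,1) for a (subjects x repeated measures) array. Values near 1 mean
    most variance lies between animals rather than within repeated draws."""
    m = np.asarray(measurements, dtype=float)
    n, k = m.shape
    subj_means = m.mean(axis=1)
    msb = k * np.sum((subj_means - m.mean()) ** 2) / (n - 1)      # between-subject MS
    msw = np.sum((m - subj_means[:, None]) ** 2) / (n * (k - 1))  # within-subject MS
    return (msb - msw) / (msb + (k - 1) * msw)
```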
Processing the SLAM and JAX studies together led to a ~2% reduction in MAE when the validation-set performance was compared with the clock's average performance in 10-fold cross-validation (10FCV). Despite the difference in the longitudinal structure of the SLAM and JAX datasets, clock performance in the JAX dataset was strong, with less than half the error and a clearly improved correlation coefficient (more than a 7% increase) compared with predictions in the SLAM dataset.
The fact that similar results were obtained at two sites, in three strains, and in both sexes when all the information was processed together in training can be taken as a promising indicator of sound performance on other datasets. We surmise that this hematologic clock may be applicable to other strains and different settings, such as drug testing or aging interventions, and may pave the way for translation into age-related human studies.
Clock performance by age range. To compare clock performance at different ages and determine which blood samples carried the greatest marginal importance for predicting age, we performed further sensitivity analyses, training and testing the neural network with blood samples from animals in three age ranges: under 52 weeks (young), between 52 and 82 weeks (adult), and above 82 weeks (old). We modified the structure of the 10FCV (10 iterations per analysis to avoid sample bias) and averaged the results. Importantly, while the predictive power of the clock was lower when data from only one age range were provided, the overall performance was still very strong. The best performance in terms of correlation coefficient was obtained when the network processed samples from old animals (r = 0.83 [0.81–0.86], RMSE = 8.17, MAE = 6.00), followed by young (r = 0.81 [0.78–0.83], RMSE = 6.90, MAE = 4.81) and adult animals (r = 0.76 [0.73–0.79], RMSE = 6.96, MAE = 5.70), whereas the lowest error was achieved with blood samples from young individuals.
Longitudinal feature importance. To investigate the relevance of each feature within the core set of blood variables and assess whether the ranking of feature importance remained stable during the life of each mouse, we performed a feature importance analysis by age range, stratifying the dataset into three groups split at the 33rd and 66th percentiles (young [tercile 1], adult [tercile 2], and old [tercile 3]). A method based on the random forest algorithm17 was used to determine feature importance. This method not only ranks the variables, using depth in the tree as a metric, but also flags features whose apparent contribution to the model could have been achieved by chance. The results revealed that feature importance was not stable throughout life. Platelet count (adjusted for clumps), one of the most important features according to this method, appeared to be a strong predictor of age early in life, but its relevance dropped markedly during midlife and remained low through the last stage of life. Mean corpuscular hemoglobin and red blood cells displayed similar importance trajectories, with a slight increase late in life. FBG levels followed the opposite course, being a weak predictor of age in young animals and progressively gaining importance with age until reaching a maximum in the last third of life. Lymphocytes (%), eosinophils (adjusted for clumps), white blood cells, and neutrophils steadily lost importance as predictors of age throughout life (Figure 3 and Supplementary Figure S3).
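The chance-discrimination idea can be sketched in the Boruta style: permuted "shadow" copies of the features are added before fitting a random forest, and a real feature counts as informative only if it outranks the best shadow (scikit-learn's random forest is used here as a stand-in for the cited method17):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shadow_importance(X, y, seed=0):
    """Fit a random forest on the real features plus permuted shadow copies;
    flag a real feature only if its importance beats the strongest shadow."""
    rng = np.random.default_rng(seed)
    shadows = rng.permuted(X, axis=0)   # shuffle each column: destroys link to y
    rf = RandomForestRegressor(n_estimators=300, random_state=seed)
    rf.fit(np.hstack([X, shadows]), y)
    n = X.shape[1]
    real, shadow = rf.feature_importances_[:n], rf.feature_importances_[n:]
    return real, real > shadow.max()
```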
One key aspect of the experimental design was that mice were bred, delivered, and tested in 15 batches or cohorts (i.e., 10 cohorts from SLAM and 5 from the JAX study), owing to the inherent difficulty of handling such a large number of animals. We tested for a batch effect (BE) by fitting a linear mixed-effects model (LMM) to each blood variable, adjusted for the available covariates. Up to 14% of the total variance was attributable to batch, implying a modest BE that was more pronounced for some features than others (Supplementary Table S4). To harmonize the dataset and remove the BE, each blood variable was adjusted by subtracting the random-effect estimates assigned by the LMM to each cohort, and those residuals were then used in all downstream analyses (Supplementary Table S4, rightmost column).
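A simplified stand-in for this harmonization step (batch-mean centering of age-detrended residuals in place of the LMM's random-intercept estimates; column names are hypothetical):

```python
import numpy as np
import pandas as pd

def harmonize_batch(df, var, batch_col="cohort", age_col="age_weeks"):
    """Remove an additive batch offset from one blood variable: fit a linear
    age trend, take each cohort's mean residual as its offset (a crude proxy
    for the LMM random intercept), and subtract it from the raw values."""
    slope, intercept = np.polyfit(df[age_col], df[var], deg=1)
    resid = df[var] - (slope * df[age_col] + intercept)
    offsets = resid.groupby(df[batch_col]).transform("mean")
    return df[var] - offsets
```

Unlike the LMM, this sketch does not shrink small cohorts' offsets toward zero, which is one reason the study's mixed-model approach is preferable in practice.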
This structure of data in subpopulations or batches may artificially inflate model performance, even after batch harmonization, if the batches are not well balanced in terms of the outcome. An essential premise in the development of tools for biological age prediction is the removal of any chronological feature that distorts the tool's true ability to predict age. The inclusion of the categorical variable “batch” as a feature in the clock improved the neural network's prediction by more than 5%. Initially included as an ordinal label whose only purpose was to indicate the order in which batches of animals were delivered by the supplier, the variable “batch” in fact carries age-related information relevant to the clock. SLAM animals arrived at the National Institute on Aging (NIA) vivarium in cohorts of approximately 200 animals every 3 months, balanced in number between strains and sexes (cohorts C1 to C10 in Table T1). Blood samples were typically collected every three months from baseline, but COVID-19 pandemic restrictions left some batches of SLAM mice with missing observations at specific timepoints, leading to unbalanced representation of the outcome “age” across batches. The JAX study (cohorts G07 to G11 in Table T1), on the other hand, had far fewer timepoints, with blood collected at 12, 18, and 24 months of age. Hence, depending on the specific batch number and its associated age frequency distribution, the supposedly inert categorical “batch” label was in fact providing an important chronological hint that the model could use to predict age and improve performance. Therefore, variables such as the blood test number (i.e., the sequential visit number in epidemiological studies), the study (i.e., SLAM vs. JAX), and the batch label were excluded from the set of features used to predict age.
This concept is applicable to any other categorical variable that is not well balanced between levels or classes in terms of the outcome representation.
We also tested the null hypothesis by permuting age as the dependent variable (50 iterations) while maintaining the values of all explanatory variables unaltered. Low correlations between predicted and chronological age were found from this analysis, indicating that the highly correlated predictions made above were unlikely to have been achieved by chance (Supplementary Table S5).
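A sketch of this permutation null (ridge regression serves as a lightweight stand-in for the DNN; sizes and names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def permutation_null(X, y, n_perm=50, seed=0):
    """Refit the model after shuffling age across samples; the resulting
    hold-out correlations approximate what could be achieved by chance."""
    rng = np.random.default_rng(seed)
    null_r = []
    for _ in range(n_perm):
        y_perm = rng.permutation(y)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y_perm, test_size=0.2,
                                                  random_state=seed)
        pred = Ridge().fit(X_tr, y_tr).predict(X_te)
        null_r.append(pearsonr(pred, y_te)[0])
    return np.array(null_r)
```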
Age acceleration. Age acceleration was calculated as the residuals after modeling blood age (predicted by the clock) on chronological age using a local polynomial regression, to account for the nonlinear relationship between the two (Figure 2B). To assess sample bias, we trained and tested the clock and calculated the age acceleration associated with each blood sample over 50 iterative random splits, altering the 10FCV each time. The results are visualized as a heatmap, with each row representing one of the 50 re-samplings and each column one of the 2,381 individual blood samples, sorted by age and labeled as young, adult, or old (Figure 4A, left panel). Age acceleration appeared to be relatively stable across the 50 re-samplings, yielding similar predictions in all of them, as evidenced by clear vertical stripes and the absence of horizontal bands. To test for possible associations between age acceleration and any available phenotypic trait or batch, we then performed unsupervised hierarchical clustering of the same matrix generated in the 50-split process and labeled the samples according to the animal's age group (i.e., young, adult, or old). The algorithm identified a series of clusters, none of which was associated with a particular sex, strain, cohort, or site (NIA vs. The Jackson Laboratory). Furthermore, the age-range labels of each mouse were strikingly unsorted across the band (Figure 4A, right panel, upper band), indicating adequate fitting of blood-predicted age vs. chronological age by the nonlinear regression.
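The residual-based definition of age acceleration can be sketched as follows (a global cubic fit stands in here for the paper's local polynomial regression; positive residuals mark faster-than-expected agers):

```python
import numpy as np

def age_acceleration(chrono_age, blood_age, degree=3):
    """Age acceleration = blood age minus the age expected from the fitted
    trend of blood age on chronological age. A cubic polynomial is used here
    as a simple stand-in for local polynomial (loess) regression."""
    coefs = np.polyfit(chrono_age, blood_age, deg=degree)
    return np.asarray(blood_age, dtype=float) - np.polyval(coefs, chrono_age)
```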
Taken together, these results show that age acceleration, calculated from the residuals, was a relatively stable variable, yielding similar predictions regardless of the sampling process, although no significant association was detected between age acceleration and any specific phenotypical or categorical group in the two studies.
Fast agers show reduced lifespan. Probably one of the most complicated aspects of developing an aging clock is establishing an association between aging acceleration and its clinical implications. Even model selection is challenging, as the assumptions required by the initially selected model are not always met (e.g., proportional hazards, linearity), or the sample size is not large enough to detect significant differences. The process is no simpler when longitudinal data are examined. To determine whether the discrepancies between predicted age and chronological age were biologically relevant and did not stem from a lack of model fit, the association between age acceleration and lifespan was examined using four different approaches.
We initially performed a cross-sectional examination of the relationship between age acceleration and lifespan and found very low correlations between the two variables (Figure 4B). Comparable lifespans were observed whether mice aged faster (positive acceleration) or slower (negative acceleration) (fast agers and lifespan: r = -0.01, p = 0.9035; slow agers and lifespan: r = -0.12, p < 0.001). Collecting longitudinal data allows age prediction at multiple timepoints. Age acceleration can vary within the same animal: rapid age acceleration, with its associated mortality risk, may be observed at a given timepoint and become slower thereafter (Figure 4C), or the converse. Thus, it is paramount to explore the association between trajectories of age acceleration and mortality from a longitudinal perspective. Initially, we analyzed mortality using Cox regressions, modeling time to death as the outcome and age acceleration as a time-dependent risk factor (i.e., a covariate that changes at each timepoint). The results showed that mice with higher age acceleration had a higher mortality risk than those with lower acceleration. Indeed, mice within the highest tercile for age acceleration had nearly 1.3 times the mortality rate of those within the lowest tercile (entire dataset: HR = 1.28, 95% confidence interval [CI] = 1.14–1.44, p ≤ 0.001, concordance index [C] = 0.57, standard error [SE] = 0.008, number of events [n] = 1,780; test dataset only: HR = 1.27, 95% CI = 0.98–1.63, p = 0.06, C = 0.56, SE = 0.14, n = 351), with the 33rd and 66th percentiles used as cut points, after adjustment for cohort, strain, sex, and age (Figure 5A [entire dataset]).
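Fitting age acceleration as a time-dependent covariate requires expanding the data into (start, stop] risk intervals, one per blood draw; a sketch of that reshaping step (column names are hypothetical, and a fitter such as lifelines' CoxTimeVaryingFitter would then consume this long format):

```python
import pandas as pd

def to_counting_process(df, id_col="mouse_id", time_col="age_weeks",
                        death_col="lifespan", accel_col="age_accel"):
    """One row per blood draw in, one (start, stop] interval per draw out;
    only the animal's final interval carries the death event (assumes death
    occurs after the last draw)."""
    rows = []
    for mouse, g in df.sort_values(time_col).groupby(id_col):
        times = list(g[time_col]) + [g[death_col].iloc[0]]
        for i, (_, r) in enumerate(g.iterrows()):
            rows.append({"id": mouse, "start": times[i], "stop": times[i + 1],
                         "age_accel": r[accel_col],
                         "event": int(i == len(g) - 1)})
    return pd.DataFrame(rows)
```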
Since the choice of distribution can be particularly consequential when fitting survival models in long follow-up studies like SLAM, we verified the results of the Cox survival analysis by fitting Gompertz regressions to the data and comparing subpopulation survivorship. This second approach produced results similar to those of the Cox regression (entire dataset: Gompertz coef. = 0.34, relative risk = 1.40, CI = 1.28–1.51, n = 1,780; test dataset only: Gompertz coef. = 0.25, relative risk = 1.28, CI = 1.02–1.53, n = 351). Next, the profiles of the acceleration curves were examined to evaluate whether the blood clock could predict mortality. Linear mixed regression was used to rank the slopes of the age acceleration trajectories and classify individuals as slow or fast agers (Figure 5B). The maximum lifespan in each group was then calculated and compared to detect whether steeper slopes were associated with shorter lifespan. The QT3 method of Wang and Allison18 clearly showed that animals with higher acceleration slopes (above the 90th percentile, i.e., fast agers or highly accelerated) had a significantly lower proportion of long-lived individuals, defined as living above the 90th percentile of maximum lifespan, than the group with lower slopes (below the 10th percentile, i.e., slow agers or highly decelerated) (Table T2; Figure 5B).
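The quantile comparison at the heart of the Wang-Allison test can be sketched with a 2x2 table and Fisher's exact test (a simplified reading of the QT3 method; the study's exact implementation may differ):

```python
import numpy as np
from scipy.stats import fisher_exact

def wang_allison(lifespans_fast, lifespans_slow, quantile=0.90):
    """Compare the proportion of animals surviving past the pooled 90th
    percentile lifespan between fast- and slow-aging groups."""
    pooled = np.concatenate([lifespans_fast, lifespans_slow])
    cutoff = np.quantile(pooled, quantile)
    table = [[int(np.sum(lifespans_fast > cutoff)), int(np.sum(lifespans_fast <= cutoff))],
             [int(np.sum(lifespans_slow > cutoff)), int(np.sum(lifespans_slow <= cutoff))]]
    odds_ratio, p = fisher_exact(table)
    return cutoff, table, p
```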
Different age acceleration trajectories in long- versus short-lived mice. The final analysis examined the age acceleration trajectories of two subpopulations: animals with long lifespans and those with shorter lives. Instead of classifying acceleration and then testing its association with lifespan, we first classified animals by lifespan and then analyzed whether they were predicted to be fast or slow agers using linear mixed regressions. To do so, the lifespan of the entire population was segregated into terciles (33rd and 66th percentiles as cut points) and the clock was run on the blood samples of animals in the two tails of the distribution (Figure 6A). As anticipated, the age acceleration trajectories of long-lived mice lay below those of short-lived animals throughout almost the entire lifespan (Figures 6B and 6C), indicating the clock's ability to link longitudinal trajectories of blood variables with aging acceleration. Together, these approaches provide a biological foundation for this computational tool and may help refine our understanding of aging by identifying biological age and quantifying a highly complex process.
Almost 12,000 observations on approximately 2,500 mice were used to build the hematological clock. It is a highly controlled study with good representation of mice at different ages, based on three different strains and both sexes. By considering not only blood cell counts at specific timepoints but also the longitudinal changes across time, this method appears to measure the aging process and be able to transform the set of blood variables into a metric of aging. Used as the clock’s output, age accelerations were computed for each individual blood collection and a single longitudinal trajectory per animal was generated (Example of trajectories in Figure 4C). Part of its ability to predict age is the fact that the hematological clock goes beyond measuring the abundance of specific biomolecules; it directly quantifies blood cell populations, and this might help to better gauge fluctuations in fundamental biological mechanisms associated with the rate of aging.
As a final consideration, several interesting hypotheses about this blood-based method of measuring aging come to mind when planning future analyses, beyond the need to translate the method to other species such as primates. These include changes in acceleration trajectories in response to interventions that either shorten lifespan (e.g., radiation, high-fat diets) or extend longevity, such as caloric restriction, rapamycin, or senolytic drugs. Examination of the blood clock's behavior in heterochronic and isochronic parabiotic pairings, and in mice with specific mutations that promote blood-related diseases such as leukemia or anemia, would also be of great interest. Another avenue for this work would be further data mining, for example using the whole raw ADVIA output of more than 500 variables (not all of them biologically relevant) to determine which features best predict age. Moreover, machine learning tools could help identify putative interactions among these and other less studied variables, as well as predict time-to-event outcomes other than mortality, such as tumor onset or the inception of cognition-associated morbidities. It would also be very interesting to assess whether other sets of variables outperform the one evaluated in this work.
This study has several strengths and limitations worth noting. As the first large longitudinal study of normative aging in mice, SLAM assesses many phenotypes, biological metrics, physical variables, and pathologies, resulting in a very rich dataset with a large sample size. Unlike the data used to generate other aging clocks, the blood features used in this analysis are routinely collected in clinical settings and could provide a unique opportunity to assess aging progression in human populations. Our initial analysis used data from two strains of mice of both sexes, including both inbred and genetically heterogeneous populations. We then expanded the analysis with data from a study conducted at a different facility in an even more genetically diverse mouse population, further underscoring the robustness of our findings. Beyond the inherently controlled environment of preclinical studies, one particular limitation is that the animals were virgin mice housed at sub-thermoneutral temperatures, conditions that differ from those experienced by human populations.
In conclusion, this model can serve as a viable and powerful alternative to epigenetic clocks and a valuable tool to help drive forward the aging field. We have demonstrated that a biological clock based on routinely collected blood variables can provide reliable predictions of biological age. The difference between blood age and chronological age appears to be associated with mortality risk, as mice with an older biological versus chronological age have nearly a 30% higher death rate. Translation of these findings to clinical application in humans will require further study to account for the myriad of uncontrolled variables and challenges that study participants face daily.