In this study, we developed six machine learning models to predict femur fractures and major osteoporotic fractures at 3-, 5-, and 10-year prediction intervals in women aged 45-79. All algorithms obtained a high AUROC, consistently high sensitivity (>0.80), and moderate to high specificity (0.522 to 0.726) for fracture predictions, with the strongest performance in the three-year prediction window for both the major osteoporotic and femur fracture prediction models. This suggests that accurate and interpretable predictions can be made in populations where a 10-year window may not be meaningful or clinically useful. Predictions were made using commonly available EHR information and six months of patient data. History of bisphosphonate use was also determined to be an important indicator of fracture risk in all models. These results suggest that the algorithms developed in this study may be able to help guide clinical decision-making surrounding individual-level fracture risk in women, even in those who have undergone treatment for fracture prevention. We included only women in our study because osteoporotic risk factors are more complex in women, particularly in the postmenopausal state, than they are in men.39,40 Therefore, women may gain the most benefit from these prediction models, as fractures in older women constitute a serious health risk.
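For readers less familiar with the metrics reported above, the following minimal Python sketch shows how sensitivity, specificity, and AUROC are conventionally computed from binary outcomes and model scores. It is illustrative only: the function names, the toy data, and the 0.5 decision threshold are assumptions made for this example and are not drawn from the study's pipeline or results.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case outranks a
    random negative case (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = fracture, 0 = no fracture); not study data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]

sens, spec = sensitivity_specificity(y_true, scores, threshold=0.5)
auc = auroc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUROC={auc:.3f}")
```

Lowering the threshold in this sketch raises sensitivity at the cost of specificity, which is the trade-off reflected in a profile of high sensitivity with moderate specificity.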
Though individuals who are being treated with therapeutics for osteoporosis are known to be at higher risk for fractures than the general population, this tool maintains its clinical utility from a cost/benefit perspective. These assessments would allow clinicians to understand a person's fracture risk while on osteoporosis medication and to inform decisions about future medication use. They may also help a clinician to determine which patients may be candidates for a drug holiday while ensuring that those who are at higher risk for fractures continue treatment.41
Explicitly including osteoporotic medications ensures that the algorithm is valid for individuals with a history of pharmacological fracture prevention treatment, a population for whom risk stratification is imperative but for whom FRAX is not validated. Though the clinical efficacy of these therapies in modifying fracture rates has been demonstrated in previous clinical trials, fewer studies have examined the long-term fracture risk effects of these drugs, particularly in combination with other clinical risk factors.41 Therefore, predicting long-term osteoporotic fracture risk (even after the osteoporosis diagnosis has been established and treatment has begun) has clinical utility for identifying individuals who remain at higher fracture risk due to additional factors.
The algorithms demonstrated better performance for prediction of major osteoporotic fractures in comparison to the AUCs that are reported for FRAX by Crandall et al.21 and met or exceeded previously reported performance of the FRAX tool in other studies, including those conducted on comparable populations.21–23,28,31
In addition to traditional risk assessment tools, machine learning has been explored for its utility as a risk stratification tool for osteoporotic fractures.26,28,29 ML-based tools for this purpose hold the potential for individualized fracture prediction. ML-based tools that use EHR data may draw upon routinely assessed high-dimensional variables (vital signs, demographic measurements, and comorbidities) that influence the risk of osteoporotic fracture to make personalized assessments. In other health applications, it has been demonstrated that MLAs are particularly suitable for identifying interacting risk factors in high-dimensional EHR data that are otherwise difficult to capture with conventional statistical models.42–44
Several studies have evaluated the ability of ML to stratify osteoporotic fracture risk using various data sources and ML methods. Almog et al. employed natural language processing (NLP) methods to assess fracture risk within a one- to two-year lookahead period.26 Sequential, longitudinal ICD data drawn from an 11-year window were used for analysis, and eligible patients were >50 years of age with two years of data available.26 ICD code vectorization with long short-term memory (LSTM) achieved an AUROC of 0.812.26 Kong et al. developed and compared a novel gradient-boosted machine learning model, CatBoost, with two additional common ML methods for fracture predictions. Non-traditional risk factors, such as lifestyle or economic status, were incorporated into the analysis.28 CatBoost was the best performing of the three models; however, its highest area under the curve (AUC) value was only 0.688 for total fracture prediction when all available data were incorporated.28 Performance decreased slightly for prediction of fractures in the hip and vertebrae.28 For this study, we selected the XGBoost algorithm due to its ability to handle missing data and imbalanced classes and because its predictions can be explained with SHAP plots. We additionally explored more sophisticated models that had greater flexibility with regard to inputs from patients' time series data.
Limitations of this study are as follows. The performance of the fracture prediction algorithms was not assessed in prospective settings due to the retrospective nature of the dataset. Osteoporotic fracture risk factors were identified solely via EHR data. Some relevant information, such as menopause status and the distinction between type 1 and type 2 diabetes, could not be obtained from our dataset. Other relevant variables, such as fall history, are not well documented in EHRs. The exclusion of disorders that impair neurological function and increase fracture risk (eg, dementia, stroke) is also a limitation of the present work.45 Furthermore, we cannot guarantee that the database includes all patient-related events occurring during the 10.5-year follow-up period. Due to dataset limitations, we were not able to consider the therapy dose or duration, recency of fractures, or androgen depletion therapy or hormone antagonist therapy. Our dataset does not allow us to determine exact age as it includes birth years only. Because of this, the ages we used are within +/- 1 year of the actual age. We cannot predict how our algorithm may perform in other patient populations or in populations with different data availability. Although we did not impose an upper age limit, all participants who were 80 years or older, who are known to be at high fracture risk, were filtered out because they did not have the required data for the 10-year follow-up period. It is possible that this introduced bias into the sample, which should be addressed in future work. While the exclusion of subjects without at least 10 years of follow-up was required for the performance evaluation of the long-term prediction model, it also created a biased selection of healthier, younger subjects by excluding individuals who died after the start of the follow-up.46 By not adjusting for competing mortality, we may have overestimated the 10-year fracture probability.
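To illustrate why ignoring competing mortality tends to inflate estimated fracture probability, the following minimal Python sketch contrasts a naive estimate (one minus a Kaplan-Meier curve that treats deaths as censored) with a cumulative incidence estimate of the Aalen-Johansen type. This is an assumption-laden illustration of one common non-parametric approach, not the method used in this study; the function name, event coding, and toy records are hypothetical.

```python
from typing import List, Tuple


def fracture_risk_estimates(
    records: List[Tuple[float, int]], horizon: float
) -> Tuple[float, float]:
    """Return (naive, adjusted) fracture probability by `horizon`.

    Each record is (time, event) with event codes assumed for this sketch:
    0 = censored, 1 = fracture, 2 = death without fracture.
    naive:    1 - Kaplan-Meier, with deaths treated as censoring.
    adjusted: cumulative incidence that accounts for death as a competing event.
    """
    event_times = sorted({t for t, e in records if e != 0 and t <= horizon})
    km_fracture_free = 1.0  # Kaplan-Meier curve that ignores deaths
    event_free = 1.0        # probability of neither fracture nor death so far
    cumulative_incidence = 0.0
    for t in event_times:
        at_risk = sum(1 for u, _ in records if u >= t)
        fractures = sum(1 for u, e in records if u == t and e == 1)
        deaths = sum(1 for u, e in records if u == t and e == 2)
        # Weight the fracture hazard by event-free survival just before t.
        cumulative_incidence += event_free * fractures / at_risk
        km_fracture_free *= 1 - fractures / at_risk
        event_free *= 1 - (fractures + deaths) / at_risk
    return 1 - km_fracture_free, cumulative_incidence


# Hypothetical toy cohort (time in years, event code); not study data.
records = [(1, 2), (2, 1), (3, 2), (4, 1), (5, 2),
           (6, 1), (7, 0), (8, 1), (9, 0), (10, 0)]
naive, adjusted = fracture_risk_estimates(records, horizon=10)
print(f"naive={naive:.3f} adjusted={adjusted:.3f}")
```

On this toy cohort the naive estimate is about 0.59 and the adjusted estimate about 0.43, because subjects who die can no longer fracture yet the naive calculation treats them as if they remained at risk.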
In future studies, non-parametric statistical methods can be used to adjust for the higher risk of mortality within the observation window that exists for older individuals.46 Further bias may have been introduced through the use of a six-month window of data collection, as this may have led to the exclusion of healthier individuals who did not require any medical interventions in that six-month window. There was a slight discrepancy between the predicted and observed probabilities (see calibration curves), indicating that our model may overestimate fracture risk in some individuals or subgroups due to noise in the training data and differences in patient characteristics between the training and test datasets. Finally, our study included a long prediction window, during which external factors, such as regulatory clearance for new pharmaceutical treatments for osteoporosis, may have had an impact on patient outcomes in ways not fully captured by algorithm performance. Future directions include using a longer window of patient data to generate predictions, validating the MLA in populations of men and individuals aged >80 to determine how performance is impacted, validating the algorithm in multiple geographic locations to account for localized risk factors, conducting a prospective validation, and examining possible confounding factors that may influence the performance and accuracy of the algorithm.
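The calibration discrepancy noted among the limitations above can be quantified by grouping predictions into probability bins and comparing the mean predicted risk with the observed event rate in each bin. The minimal Python sketch below shows that comparison; the function name, the equal-width binning, and the toy data are assumptions for illustration and do not reproduce the calibration analysis performed in this study.

```python
from typing import List, Sequence, Tuple


def calibration_bins(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted risk, observed event rate, count) for each
    non-empty equal-width probability bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    summary = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for _, p in members) / n
            observed = sum(y for y, _ in members) / n
            summary.append((mean_pred, observed, n))
    return summary


# Hypothetical toy predictions (1 = fracture, 0 = no fracture); not study data.
y_true = [0, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]

for mean_pred, observed, n in calibration_bins(y_true, y_prob, n_bins=2):
    print(f"predicted={mean_pred:.2f} observed={observed:.2f} n={n}")
```

In this toy example the upper bin has a mean predicted risk of about 0.80 against an observed rate of 0.50, the pattern that corresponds to overestimation of risk on a calibration curve.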