ACP informs end-of-life care to respect patient preferences, ensure quality of life, and avoid costly, unnecessary, and unwanted interventions.[2], [27] Mortality prediction models may help spur ACP conversations. Timely predictions may strike the right balance between sufficient clinical urgency and an adequately long lead time to allow for these often time-consuming discussions.[4], [28] These predictions may be especially useful in mixed-rurality populations, where rural patients have relatively reduced access to healthcare compared with urban patients.
This work was inspired by studies out of NYU Langone demonstrating the performance and impact of their 60-day mortality prediction model, which was intended to encourage ACP discussions [14] as well as appropriate patient referrals to supportive and palliative care.[13] NYU Langone’s model performance, with an AUC-PR of 28%, was sufficient to achieve good rates of physician agreement with the alerts and greater use of ACPs.[14] We therefore hoped to achieve a similar level of performance with our model in our mixed-rurality population and to maintain that performance over time despite changing conditions. COVID-19 created significant systemic change in healthcare, and systemic change often leads to performance degradation in machine learning models.[16] Our predictor demonstrated consistent performance and resistance to concept drift, achieving an AUC-PR of 29% on both the pre-COVID and during-COVID datasets.
NYU Langone selected a cutoff designed to achieve a precision of 75% to identify likely appropriate referrals to supportive and palliative care. The tradeoff for high precision was a recall of just 4.6%.[13] Since our intended use was solely to encourage ACP discussions, we evaluated two cutoffs designed to provide higher recall at the cost of reduced precision. On the full pre-COVID dataset, our model achieved 58% recall and 25% precision at a 12.5% cutoff, and 12% recall and 44% precision at a 37.5% cutoff. Model performance on the full during-COVID dataset did not significantly differ from that of the full pre-COVID dataset for any of those measures, demonstrating resistance to concept drift and performance degradation.
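To make these operating points concrete, the following is a minimal sketch of how precision and recall can be computed at fixed probability cutoffs. The labels, scores, and variable names are synthetic and illustrative; this is not the study’s evaluation pipeline.

```python
# Illustrative sketch: precision and recall at fixed probability cutoffs.
# y_true and y_prob are synthetic stand-ins, not study data.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.10, size=5000)                           # ~10% outcome prevalence (toy)
y_prob = np.clip(rng.beta(2, 8, size=5000) + 0.30 * y_true, 0, 1)   # toy predicted risks

def precision_recall_at_cutoff(y_true, y_prob, cutoff):
    """Binarize predicted risk at `cutoff`, then score precision and recall."""
    y_pred = (np.asarray(y_prob) >= cutoff).astype(int)
    return (precision_score(y_true, y_pred, zero_division=0),
            recall_score(y_true, y_pred, zero_division=0))

for cutoff in (0.125, 0.375):   # the two operating points discussed above
    p, r = precision_recall_at_cutoff(y_true, y_prob, cutoff)
    print(f"cutoff={cutoff:.3f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the cutoff admits more alerts, raising recall while diluting precision; raising it does the reverse.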
Previous work suggests that racial differences exist in the relationship between physiologic and socioeconomic parameters and mortality prediction.[29] Many recommend accounting for potentially differing machine learning model performance among demographic groups.[30]–[32] The COVID-19 pandemic has disrupted healthcare, particularly affecting patients with low socioeconomic status.[33], [34] The timing and effectiveness of ACPs can be affected by socioeconomic circumstances, race, and geographic location.[35], [36] Given these considerations, we assessed model performance in different subgroups including rurality, level of socioeconomic disadvantage, gender, ethnicity, and race. We also assessed performance during a lull and a peak in COVID case rates. Finally, we assessed the importance of fresh data to the model’s performance.
Significant performance differences were not seen for most comparisons, with some notable exceptions and caveats. Fresh data seems important for model performance, at least at the higher cutoff, likely because a recent physiologic change cannot be recognized if those data are not available to the model. Recall was significantly lower than that of the overall pre-COVID population for White non-Hispanic patients and patients from rural areas. During COVID, the Other Race/Ethnicity subgroup and the female-only subset of that subgroup had lower precision than the overall population. For a substantial minority of comparisons that were neither significantly different nor adequately powered, conclusions cannot be drawn and further research is warranted. For the majority of comparisons, however, model performance was comparable to that of the overall population.
As expected, precision tended to be lower in subgroups having a lower prevalence of 5–90-day mortality (Fig. 2). In the two instances for which precision was statistically significantly lower than the overall group, prevalence of 5–90-day mortality was among the lowest of any subgroup. Since most precision comparisons were underpowered at the 37.5% cutoff, the 0.64 correlation at that cutoff may be underestimated. This analysis shows that differences among subgroups in predicted risk at a particular cutoff are associated with actual differences in risk.
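As an illustration of how that association can be quantified, the sketch below computes per-subgroup precision at a fixed cutoff and correlates it with per-subgroup prevalence of 5–90-day mortality. The data frame, column names, subgroups, and cutoff are hypothetical placeholders rather than study data.

```python
# Hypothetical sketch: correlate per-subgroup precision at a fixed cutoff
# with per-subgroup outcome prevalence. All data and names are illustrative.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def subgroup_precision_vs_prevalence(df, cutoff):
    """Expects columns 'subgroup', 'y_true' (0/1 outcome), and 'y_prob' (predicted risk)."""
    rows = []
    for name, g in df.groupby("subgroup"):
        alerts = g["y_prob"] >= cutoff
        rows.append({"subgroup": name,
                     "prevalence": g["y_true"].mean(),
                     "precision": g.loc[alerts, "y_true"].mean() if alerts.any() else np.nan})
    table = pd.DataFrame(rows).dropna()
    r, p = pearsonr(table["prevalence"], table["precision"])
    return table, r, p

# Toy usage with four hypothetical subgroups
rng = np.random.default_rng(1)
df = pd.DataFrame({"subgroup": rng.choice(list("ABCD"), size=8000),
                   "y_true": rng.binomial(1, 0.08, size=8000)})
df["y_prob"] = np.clip(0.05 + 0.40 * df["y_true"] + rng.normal(0, 0.10, len(df)), 0, 1)
table, r, p = subgroup_precision_vs_prevalence(df, cutoff=0.375)
print(table.to_string(index=False), f"Pearson r = {r:.2f} (p = {p:.3f})", sep="\n")
```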
For subgroups having significant differences in predictor performance, the cutoffs for those subgroups could be adjusted to equalize performance. However, changing the cutoff typically improves either precision or recall at the cost of worsening the other, so predictor performance cannot be simultaneously equalized for both metrics across subgroups; one must select a metric to equalize. In our scenario, selecting cutoffs that equalize precision across subgroups would increase the likelihood that all who receive an alert have a similar risk of near-term death. However, subgroups with a lower prevalence of near-term death (e.g., females in our study populations) would then be less likely to receive an alert and therefore may be less likely to have an ACP. Alternatively, cutoffs could be selected to equalize sensitivity across subgroups so that an equal fraction of patients who actually suffer a near-term death receive an alert. However, patients in subgroups with a lower prevalence of near-term death would then be more likely to receive an alert despite having a lower risk of death. This may lead to alert fatigue and/or mistrust of the predictor,[18] and the magnitude of variation in cutoffs among demographic groups that would lead to predictor distrust in this context is not known. In addition, if clinician capacity for ACPs is limited, patients with a lower risk of death may get an ACP at the expense of those with greater urgency and need. Cutoffs could also be selected to equalize the frequency of positive alerts across subgroups, thereby equalizing the predictor’s impact on ACPs. As with equalizing sensitivity, however, this benefit may be lost if the resulting alerts on lower-risk patients lead to alert fatigue and/or mistrust of the predictor, and those in greatest need of an ACP may be less likely to get one if clinician bandwidth for ACPs is constrained. Other approaches may be taken, but all involve tradeoffs.
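One way to operationalize the first of these options is sketched below: for each subgroup, choose the lowest cutoff whose precision meets a common target, then observe how recall diverges across subgroups. This is a hypothetical illustration of the tradeoff discussed above, not the approach used in this study; the column names and target precision are assumptions.

```python
# Hypothetical sketch: per-subgroup cutoffs chosen to equalize precision.
# Equalizing precision forces recall (and alert rates) to differ by subgroup.
import numpy as np
import pandas as pd

def per_subgroup_cutoffs(df, target_precision, grid=np.linspace(0.05, 0.95, 91)):
    """Expects columns 'subgroup', 'y_true', 'y_prob'. Returns cutoff and recall per subgroup."""
    results = {}
    for name, g in df.groupby("subgroup"):
        for c in grid:                                   # lowest cutoff meeting the target
            alerts = g["y_prob"] >= c
            if alerts.any() and g.loc[alerts, "y_true"].mean() >= target_precision:
                recall = (g.loc[g["y_true"] == 1, "y_prob"] >= c).mean()
                results[name] = {"cutoff": c, "recall": recall}
                break
    return pd.DataFrame(results).T

# Reusing the toy data frame `df` from the previous sketch:
# print(per_subgroup_cutoffs(df, target_precision=0.44))
```

In such a scheme, subgroups with lower outcome prevalence generally require higher cutoffs to reach the target precision, which lowers their recall and alert frequency, mirroring the concern described above.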
Existing literature suggests that how best to equalize the performance of a Boolean predictor among different subgroups is use-case dependent.[17], [18] For our use case, we suspect that equalizing precision across subgroups may best serve the clinical need by reducing the risk of alert fatigue and mistrust and by prioritizing alerts to those with the greatest predicted need. However, since only a few statistically significant performance differences were seen among subgroups, and the statistical significance of those differences was inconsistent across the studied time periods, it may be wisest not to draw firm conclusions about whether or how to adjust cutoffs until the pandemic further stabilizes and the study can be repeated.
Our use of the ADI to assess predictive model equity across levels of economic disadvantage, together with our assessment of equity across levels of rurality, may be unique. A PubMed search on “ADI prediction equity” or “area deprivation index prediction equity”[37], [38] returned only one relevant result examining the equity of a prediction model across levels of ADI, and that study did not assess equity across levels of rurality.[39]
Limitations
Although assessments were designed to avoid “future leakage” (use of data that will not be available at the time of prediction), complete avoidance cannot be guaranteed in this retrospective study. Other confounders related to the retrospective nature of this study may have affected results. This study was performed at one multi-hospital health system serving a predominantly White and Midwestern population, potentially limiting generalizability. Some demographic data may be inaccurate, affecting results. We grouped RUCA codes based on published approaches,[22], [23] but different published groupings might have led to different rurality results.[23], [40] The ADI may not accurately represent a patient’s socioeconomic status, and our use of an average ADI for the five-digit zip code may not represent the actual ADI for the patient’s census tract. Some demographic groups were aggregated to avoid small group sizes, and the predictor may perform differently across the aggregated groups. Use of current code status as a proxy for code status on admission may have affected results, but we believe patients are more likely to change from null or full code status to something else than the reverse. Finally, our study was limited to analysis of model performance, not the resulting impact on clinical care. These limitations represent fruitful areas for future research.