In this study we demonstrate a novel CoDoC system that can learn to decide when to rely on a diagnostic AI system and when to defer to clinical experts or workflows. We evaluated CoDoC in multiple simulated clinical workflows screening for breast cancer or TB and showed that combined AI-clinician performance using CoDoC exceeds that currently possible through either AI or clinicians alone. CoDoC is highly configurable to meet the requirements of specific clinical deployments, and does not require access to the inner workings of the target standalone AI diagnostic model. We believe CoDoC represents a step towards harnessing the complementarity possible between AI and clinical experts, to improve accuracy, trust, and safety in real-world clinical deployments.
It is becoming increasingly apparent that clinicians and AI systems fundamentally assess images differently10, and that each has different strengths and weaknesses29,30. It is therefore intuitive that systems designed to combine aspects of both should lead to improvements in both performance and safety. However, in practice, there is an unmet need to help users of medical AI systems decide which opinion should prevail when their own differs from that of an AI tool. Furthermore, the ability of an AI system to say "I'm not sure" or "I do not know" is an important capability for ensuring safe clinical deployment of this technology31.
A recent study demonstrated that the paradigm of deferral using threshold search is a promising approach for managing this unmet need16. However, this prior work explored the solution in only one medical condition (breast cancer), one diagnostic AI tool and one clinical workflow (breast cancer screening in a German dataset). It has hitherto remained unclear whether the promise of deferral might be applicable to multiple medical AI applications, how a deferral algorithm might generalise to diagnostic AI tools from multiple manufacturers, whether performance would be robust across multiple clinical workflows, and whether a deferral algorithm could adapt to new AI tools or clinical settings with very limited data for site-specific training (as is common in medicine). CoDoC validates the hypothesis that algorithmically driven deferral between AI and clinical experts can improve composite performance in a wide variety of medical AI applications, screening for cancers and TB alike, with rigorous evaluation in multiple countries for multiple AI systems from different manufacturers. Our method generalises with limited retraining data, and our code is openly shared to support reproducibility and advancement of this field (as demonstrated in Section 4). A key finding of our work is that human-AI complementarity is not always present (as was seen in 2 of 5 commercially available TB systems), and our work shows that in that setting confidence-based deferral methods will not improve composite performance. In particular, the results from Section 5 demonstrate the limitations of confidence-based deferral strategies and provide a useful tool to determine, given a particular dataset for training the deferral AI, whether one could expect any improvement from any confidence-based deferral strategy. In real-world scenarios, such an analysis could provide clear guidance on whether to use CoDoC.
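The core mechanism discussed above, deciding from the diagnostic AI's confidence score whether to accept its prediction or defer to a clinician, can be sketched in a few lines. This is an illustrative simplification, not the published CoDoC algorithm: the two-threshold rule, the fixed 0.5 decision cut and the exhaustive grid search are all assumptions made for clarity.

```python
def deferral_decision(ai_score, low, high):
    """Illustrative two-threshold rule: scores in the uncertain band
    (low, high) are deferred to the clinical expert; decisive scores
    are handled by the AI alone."""
    return "ai" if (ai_score <= low or ai_score >= high) else "clinician"


def composite_predictions(ai_scores, clinician_labels, low, high):
    """Combine AI and clinician opinions under the deferral policy."""
    preds = []
    for score, human in zip(ai_scores, clinician_labels):
        if deferral_decision(score, low, high) == "ai":
            preds.append(int(score >= 0.5))  # assumed AI decision cut
        else:
            preds.append(human)              # defer to the clinician
    return preds


def search_thresholds(ai_scores, clinician_labels, ground_truth, steps=20):
    """Grid-search (low, high) maximising composite accuracy on a
    tuning set; returns (low, high, best_accuracy)."""
    best = (0.0, 1.0, -1.0)
    for i in range(steps + 1):
        low = i / steps
        for j in range(i, steps + 1):
            high = j / steps
            preds = composite_predictions(ai_scores, clinician_labels, low, high)
            acc = sum(p == t for p, t in zip(preds, ground_truth)) / len(ground_truth)
            if acc > best[2]:
                best = (low, high, acc)
    return best
```

Because the search only consumes the diagnostic AI's output scores, the diagnostic model itself can remain a frozen black box, which is the property the rest of this discussion relies on.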
For breast cancer screening in a large representative UK mammography dataset, CoDoC was superior in sensitivity to double-reading while maintaining specificity. Double-reading is regarded as the "gold standard" for performance in the UK and much of Europe32,33,34,35, and had never previously been exceeded using AI36,37,38. The same system maintained superior accuracy to both clinicians and the same diagnostic AI model even when the diagnostic AI was deployed out-of-distribution on a large US mammography dataset, with only the deferral AI tuned on a small amount of out-of-distribution (OOD) data. Improvements in sensitivity and specificity were sustained for a wholly separate diagnostic AI tool for US mammography screening (from a different manufacturer), despite access to only 26 positive cancer cases for tuning.
CoDoC also conferred significant improvements in the resource-limited setting of TB screening in Bangladesh. For 3 of 5 commercially available AI systems, CoDoC reduced the utilisation of Xpert tests by determining when Xpert test utilisation should be decided by AI and when the decision should be deferred to a radiologist. This workflow has high real-world applicability, as many TB screening centres using AI software already have the ability to route a subset of cases for radiology interpretation, while some countries specify that radiologists must be present at the time of CXR acquisition39. For the other 2 commercial AI systems, our CoDoC analysis demonstrated that confidence-based deferral would not improve performance over the AI systems alone. In settings where radiologist interpretation is nevertheless considered mandatory for AI quality assurance, such CoDoC analysis might enable more cost-effective monitoring by highlighting situations in which radiologists performing quality assurance of AI systems would be least likely to identify AI errors.
The breadth of clinical modalities demonstrates that CoDoC is highly clinically applicable, because the deferral component is easily adaptable to multiple clinical workflows39. Even within a single modality such as mammography, our results were robust when deferring to either single-reading or double-reading practice. We demonstrated that a variety of operating points could be chosen depending on the goals of the healthcare system, with statistically superior performance in clinically applicable operating-point regions. For example, a mammography centre might wish to optimise for either cancer detection rate or recall-to-assessment rate, and different CoDoC configurations can be invoked depending on the balance between those goals and the desired efficiencies in clinicians' time. Indeed, our results suggest that deferral to a single reader might enable a screening programme to attain performance exceeding the gold standard of double-reading while requiring only a fraction of a single reader's time. Prospective and health-economic outcome studies will be required to confirm and quantify this potential benefit. Replacing the first reader (within an AI-enabled double-reader workflow) with a CoDoC system superior to the whole traditional double-reading workflow could also have a profound downstream effect on the overall performance of AI-enabled double-reading. Future reader studies will be required to quantify this effect.
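Choosing among configurations as described above amounts to evaluating each candidate threshold pair on tuning data and selecting the one that matches the centre's goals. The sketch below is illustrative only: the two-threshold deferral rule, the 0.5 decision cut and the specificity-floor selection criterion are assumptions, not the published tuning procedure.

```python
def evaluate_config(ai_scores, clinician_labels, truth, low, high):
    """Sensitivity, specificity and deferral rate of one candidate
    configuration (illustrative two-threshold deferral rule)."""
    tp = fp = tn = fn = deferred = 0
    for score, human, label in zip(ai_scores, clinician_labels, truth):
        if low < score < high:            # uncertain band: defer
            pred = human
            deferred += 1
        else:                             # decisive: accept AI opinion
            pred = int(score >= 0.5)
        if label == 1:
            tp += pred
            fn += 1 - pred
        else:
            fp += pred
            tn += 1 - pred
    sens = tp / max(tp + fn, 1)
    spec = tn / max(tn + fp, 1)
    return sens, spec, deferred / len(truth)


def pick_operating_point(configs, metrics, min_specificity):
    """Among candidate (low, high) configs, choose the best sensitivity
    subject to a specificity floor, e.g. a centre prioritising cancer
    detection rate while capping recall-to-assessment rate."""
    feasible = [(m[0], c) for c, m in zip(configs, metrics)
                if m[1] >= min_specificity]
    return max(feasible)[1] if feasible else None
```

The deferral rate returned alongside the two accuracy metrics is what links each operating point to the clinician-time savings discussed above.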
CoDoC performed well despite stress testing under multiple types of distribution shift that commonly cause failures of medical AI in real-world settings. Particularly notable were results under two forms of shift that are common in the real world: shift in clinician performance and shift in population or site. Clinicians' performance has been shown to vary significantly, both in overall accuracy and in the trade-off between sensitivity and specificity32,40. Reassuringly, CoDoC was able to generalise to multiple previously unseen readers in the UK mammography screening programme without any requirement for per-reader personalisation.
The variation in screening programmes between different hospitals or health systems is often sizeable41; our experiments therefore exposed CoDoC to multiple shifts between health systems, including changes in demographics, acquisition equipment, disease presentation and local clinical pathways. Despite significant differences from the diagnostic AI system's training data and an associated performance drop, the deferral AI was able to generalise to a previously unseen US hospital with minimal and realistic local training-data needs. In particular, when we tuned the deferral AI using only 40 cases from a new population/site, CoDoC improved upon the diagnostic accuracy of both the standalone AI and the clinician. In this setting, the deferral AI deferred a greater proportion of the cases where the diagnostic AI was less reliable than clinicians, suggesting that this paradigm could provide a valuable "safety net" for AI-enabled healthcare. This may enable local expert clinicians to mitigate concerns about failures of standalone diagnostic AI during deployment in new environments.
Comparison to relevant literature in AI
There is a long history of machine learning literature on selective prediction systems that can refrain from making predictions on certain instances. This line of work traces back to Chow42, who derived theoretically optimal algorithms in this setting. More recent reviews of this area can be found in Wiener et al43. Connections between selective prediction and active learning44 have also been studied. These works differ from the deferral setting considered in this paper, since selective prediction ignores the accuracy of the human expert when the AI system abstains. The deferral setting was studied in Sontag et al14, where the authors proposed a novel statistically consistent estimator for simultaneously learning a deferral model and the underlying prediction model; this was further extended to settings with multiple experts in subsequent work45. Optimising the performance of a human-AI team without restrictions on the deferral rate has also been studied46. Other works have proposed frameworks for AI models to defer to a domain expert in cases where the AI has low confidence in its inference47, but these require the ability to learn the classifier and the deferral system simultaneously. Others have proposed a model48 to characterise human-AI (or human-human, AI-AI) complementarity, demonstrating that complementarity may or may not exist in human-AI settings, with its existence and degree depending on a number of factors: the independence of human and AI decisions, the availability of confidence scores for the predictions provided, and the baseline individual performance of the human and the AI. CoDoC extends and grounds these previous observations in the safety-critical medical AI domain, showing varying degrees of extractable complementarity between AI models and human experts, and proposing a reliable method for extracting it when available.
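The distinction drawn above between selective prediction and deferral can be made concrete: in selective prediction, abstained cases simply drop out of the accuracy computation, whereas in deferral they are scored by whether the human expert gets them right. The function names below are hypothetical and the objectives are deliberately minimal.

```python
def selective_prediction_metrics(ai_correct, abstain):
    """Selective prediction: report accuracy only on non-abstained
    cases, plus coverage. What happens after abstention (e.g. expert
    accuracy) never enters the objective."""
    kept = [c for c, a in zip(ai_correct, abstain) if not a]
    coverage = len(kept) / len(ai_correct)
    accuracy = sum(kept) / max(len(kept), 1)
    return accuracy, coverage


def deferral_accuracy(ai_correct, human_correct, defer):
    """Learning-to-defer: deferred cases are scored by whether the
    human expert is correct, so the expert's error profile directly
    shapes the optimal deferral policy."""
    total = sum(h if d else a
                for a, h, d in zip(ai_correct, human_correct, defer))
    return total / len(ai_correct)
```

This is why a deferral policy can exploit complementarity (routing exactly the cases where the expert outperforms the AI) while a selective predictor, blind to `human_correct`, cannot.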
Many of the approaches above require co-training the deferral and diagnostic AI, which is not possible in medical AI settings where diagnostic classification tools are deployed in a "frozen" configuration by regulatory requirement and where access to the training pipeline for the diagnostic tools is unusual. Our work was inspired by this research, but we limited ourselves to deferral based on the confidence estimates of pre-trained diagnostic AI models. This constraint, that deferral systems work with "black-box" fixed diagnostic AI models, also enables deferral to be studied in a wider variety of settings, since it removes the requirement for access to the training pipeline and data of the diagnostic AI systems in each setting (requirements that present significant practical hurdles to deferral paradigms in which the deferral and diagnostic AI must be co-trained). We found that our approach sufficed to obtain statistically significant improvements in performance with the CoDoC system, and that it decoupled the training of the deferral AI from the diagnostic AI, which is highly advantageous when the diagnostic AI is available only as a black box that cannot be modified (for example, owing to IP or regulatory constraints). In future work, it would be valuable to explore the additional gains in diagnostic accuracy that could be obtained by co-training the diagnostic AI and deferral model, which might be possible for individual manufacturers in medical settings, even if not practicable in our setting of developing a single deferral wrapper for multiple different medical AI systems.
Limitations
In this study we evaluated performance under the assumption that clinicians and the diagnostic AI model perform independent case interpretation, as is approved in some clinical settings such as TB screening. However, in many settings clinicians use diagnostic AI models as an assistive tool; prospective research will be required to establish the impact of CoDoC there, and work orthogonal to CoDoC will be required to maximise its benefits. For example, it has been shown that the complementarity of AI tools for human experts also depends on factors such as the operator's mental model, cognitive load and trust4, which could be optimised independently of the CoDoC paradigm in a manner specific to each diagnostic AI tool. In particular, there is also evidence that providing AI decision support can introduce systematic but unconscious biases into a clinician's decision-making process4.
Our research demonstrated that the CoDoC system improved accuracy while saving clinicians' time compared with a standard AI-enabled workflow. The CoDoC framework already supports tunable constraints or penalties on the deferral rate, which could be adjusted to trade desired savings in clinician time against composite accuracy. However, further health-economic research and detailed per-hospital considerations, beyond the scope of the present work, would be needed to determine the right trade-offs.
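A tunable penalty on the deferral rate, as mentioned above, can be expressed as a single scalarised objective. This is a sketch under assumptions: a linear penalty with weight `lam` is one simple form such a constraint could take, not necessarily the exact formulation used in the CoDoC framework.

```python
def penalised_objective(ai_correct, human_correct, defer, lam):
    """Composite accuracy minus a linear penalty on the deferral rate.

    `lam` encodes how costly each unit of clinician workload is relative
    to accuracy: lam = 0 ignores workload entirely, while large lam
    pushes the tuned system towards never deferring.
    """
    n = len(defer)
    accuracy = sum(h if d else a
                   for a, h, d in zip(ai_correct, human_correct, defer)) / n
    deferral_rate = sum(defer) / n
    return accuracy - lam * deferral_rate
```

Sweeping `lam` during threshold tuning traces out the accuracy-versus-workload frontier from which a hospital could pick its preferred trade-off.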
While our mammography test set was representative of UK practice, the US mammography dataset 2 was enriched for cancer prevalence compared with national practice. We simulated deployment scenarios for CoDoC with retrospective datasets, but quantifying the performance gains that result from clinician-AI interaction would require prospective reader studies and exploration of other aspects of human-AI complementarity orthogonal to the deferral decision, for example AI onboarding, trust and mental models.
A related limitation applies to the CXR examinations used to triage Xpert tests for TB screening, in which multiple non-TB pathologies may be noticed by a radiologist but are not classified by the AI tools used to screen for TB; incorporating these tasks would require further research. Furthermore, although Xpert is regarded by WHO guidelines as an acceptable reference standard for evaluating AI systems, the gold standard would be full sputum culture for all participants. However, this was true for both the AI and the human radiologists in the dataset presented, so no selection or measurement bias was introduced, and our approach was consistent with prior published work. CoDoC did not achieve uniform performance gains across the whole ROC curve. In the datasets we considered, the ROC range in which superiority was demonstrated coincided with regions of clinical relevance (as illustrated by benchmarks of clinician sensitivity or specificity for screening decisions), but this may not be guaranteed for other applications of the CoDoC paradigm.
Beyond average diagnostic performance, variation among different population subgroups is an important concern, as it can amplify health inequalities. This is a significant challenge for both standalone diagnostic AI systems and expert clinicians, both of which have been shown to exhibit significant variation in subgroup performance across a range of medical applications49. Preliminary analysis suggests that CoDoC does reduce variability in performance between different subpopulations, but further work is required to validate this rigorously, alongside further distribution shifts important for real-world medical AI, such as variations in instrumentation, acquisition and imaging technology.