We explored the impact of introducing an AI-DSS on diagnostic decisions made by hospital optometrists when interpreting OCT scans. We expand on previous studies in other areas of medicine which have demonstrated a positive effect of human-AI collaboration when using a system of high diagnostic accuracy (4, 25); however, unlike previous work, we used a high proportion of cases (60%) in which the outputs of our AI system were incorrect (disagreed with the reference standard) or were ambiguous (more than one diagnosis proposed with high probability).
Overall, our participants made the most accurate diagnoses with respect to the reference standard when assessing the clinical cases without AI diagnostic support. This 'no AI' accuracy of 81% was very similar to the 80% mean diagnostic accuracy found by Jindal et al (26), where optometrists assessed retinal and optic nerve OCTs to determine whether either was 'diseased'.
The proportion of 'correct' responses decreased to 75% when the AI diagnosis was presented in our cohort. We deliberately selected our cases based on AI outputs because we aimed to explore how incorrect (whether stemming from a truly incorrect AI diagnosis or a disagreement with an imperfect reference standard) or uncertain AI diagnostic support, although infrequent, may affect human diagnostic performance. The difference between the 'no AI' and 'AI diagnosis' presentations in the proportion of practitioners' responses aligning with the reference standard was of borderline significance and became non-significant when the results from the three cases of ERM were excluded (supplementary material). A recent study by Tschandl et al (4) reported a negative effect of incorrect AI outputs on participants' diagnostic accuracy. That study, however, arbitrarily modified the output of an AI system to artificially produce incorrect results. We focussed instead on the (rare) actual cases where the AI system produced output inconsistent with the reference standard, which does not automatically equate with incorrect output.
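As a purely illustrative aside, a borderline difference of this kind can be examined by comparing the two accuracy proportions directly. The sketch below does this with a two-proportion z-test; the response counts are hypothetical and the choice of test is an assumption for illustration, not the analysis used in this study.

```python
# Illustrative only: comparing accuracy between the 'no AI' (81%) and
# 'AI diagnosis' (75%) presentations with a two-proportion z-test.
# The response counts are hypothetical; the study's actual analysis may differ.
from statsmodels.stats.proportion import proportions_ztest

n_responses = 300                                # hypothetical responses per presentation
correct_no_ai = round(0.81 * n_responses)        # 81% accuracy without AI support
correct_ai_dx = round(0.75 * n_responses)        # 75% accuracy with AI diagnosis shown

z_stat, p_value = proportions_ztest(
    count=[correct_no_ai, correct_ai_dx],
    nobs=[n_responses, n_responses],
)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")    # p near 0.05 would be 'borderline'
```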
Even fewer diagnostic responses agreed with the reference standard when both the AI diagnosis and AI segmentation were displayed (68%). Clinically ambiguous cases are likely to be the fundamental factor behind this result. Cases where participants may have based their decisions on innocuous, subtle details revealed in the segmentation overlays, rather than on the AI diagnosis, offer an interesting and informative perspective on human-AI interaction. Although the reference standard and the AI diagnosis were aligned in the examples identified, an alternative interpretation of the imaging in favour of an ERM being present (for set 1) and a CNV diagnosis (for set 2) could conceivably be made even by ophthalmology specialists.
These findings also highlight a conundrum regarding the value of presenting segmentation overlays to provide more information to clinicians, especially those less experienced in interpreting OCT scans. The diagnostic classification algorithm takes as its input the segmentation produced by the segmentation algorithm; however, it was trained using clinical labelling of segmentations by experts at MEH, who were able to differentiate nuanced presentations of pathological OCT features highlighted by the segmentation algorithm in the broader context of each case. This creates different 'reference standards' for pathology detection and thus discrepancies between the segmentation and diagnostic outputs. For any AI system in healthcare, a clear distinction is required between levels of 'detectable' and 'clinically significant' pathology, and care must be taken when showing visualisations of intermediate stages to users, as they may be misinterpreted. Considering also the positive effect that the visualisations had on participants' trust, the effect of the segmentation overlays observed in our study suggests it is important for any additional visualisation to be aligned with the AI diagnostic output.
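A minimal sketch of this two-stage design is given below. The function names, thresholds and toy data are assumptions for illustration, not the actual system; the point is simply how a segmentation overlay shown to users can flag 'detectable' pathology that the downstream classifier, trained against clinicians' notion of clinical significance, does not report.

```python
# Illustrative sketch of a two-stage OCT pipeline: a segmentation model feeds a
# diagnostic classifier. Names, thresholds and data are hypothetical. The
# intermediate segmentation can highlight 'detectable' pathology that the
# classifier, trained on expert labels of clinical significance, omits.
import numpy as np

DETECTABLE_AREA = 10      # hypothetical: ERM pixels needed to appear in the overlay
SIGNIFICANT_AREA = 500    # hypothetical: ERM pixels the classifier treats as disease

def segment(scan: np.ndarray) -> np.ndarray:
    """Stage 1: per-pixel tissue map (toy stub standing in for a trained network)."""
    return (scan > 0.5).astype(int)          # 1 marks an 'ERM' pixel in this toy example

def classify(seg_map: np.ndarray) -> str:
    """Stage 2: diagnosis from the segmentation (toy stub for the trained classifier)."""
    return "ERM" if seg_map.sum() >= SIGNIFICANT_AREA else "normal"

scan = np.zeros((256, 256))
scan[100:105, 100:110] = 1.0                 # a small trace of pathology: 50 pixels

seg_map = segment(scan)
overlay_shows_erm = seg_map.sum() >= DETECTABLE_AREA

# The overlay shows a trace of ERM (True) while the diagnosis remains 'normal',
# mirroring the discrepancy between the segmentation and diagnostic outputs.
print(overlay_shows_erm, classify(seg_map))
```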
There were no significant differences in the number of correct responses between the two groups based on level of experience. This is contrary to the findings of a previous study of ECG interpretation using a non-AI system (14). However, we again compare our findings to those of Tschandl et al (4), whose diagnostic task was similar to ours in that it used multi-class outputs and an AI-DSS. That study found an inverse relationship between the net gain from AI-based support and participant experience for an accurate AI system. Our combined findings suggest that less-experienced participants may benefit most from correct AI diagnostic support, but that all users are equally influenced by incorrect outputs.
In our study, AI did not increase optometrists' diagnostic confidence, either with or without segmentation overlays. Bond et al (14) reported that incorrect automated diagnostic support significantly reduced interpreters' confidence. Despite our selection of 60% of cases in which the AI was 'incorrect' or 'ambiguous', there was still no significant impact on diagnostic confidence for the full cohort. Future research should assess diagnostic confidence using the AI with its true diagnostic accuracy for clinical implementation (5).
While AI in ophthalmology offers great potential, the social and legal challenges cannot be ignored. The reliability and accountability of AI systems, and their impact on clinical decision-making, create a complicated dynamic with healthcare professionals. For AI to be accepted by clinicians, both personally and institutionally, the systems must be reliable and trusted (27). In this study, only one participant reported that they distrusted the AI diagnoses (without segmentation), with 16 neutral and 13 trusting. Given our case selection, we could have inadvertently introduced a bias against the system. Dietvorst et al (28) describe this as 'algorithm aversion': the reluctance to use algorithms known to be imperfect. Participants may have detected the AI's imperfect accuracy and uncertainty and calibrated their trust (29) based on this isolated experience of using the AI.
Another challenge of introducing AI into clinical practice is the well-known "opaque box" problem (27): many AI systems are non-transparent. Even though the accuracy of the AI was matched between the 'AI diagnosis' and 'AI diagnosis plus segmentation' presentations, the increased transparency offered by the segmentation overlays may explain the significantly higher level of trust in the AI when segmentations were displayed. This finding was particularly interesting because, although trust in the system increased when segmentations were displayed, participants agreed less on average with the AI diagnosis and the reference standard in this presentation format. Further research is required to explore how different elements of AI visualisations are utilised during clinical decision-making and which aspects most influence clinicians' OCT interpretation.
Limitations
We have identified four main limitations to this study. Firstly, because the study was run remotely, it was not possible to observe participants' decision-making processes. Future research with observations and/or detailed exit interviews would provide valuable insights into participants' interactions with AI systems.
Secondly, the AI segmentation model was trained by human graders who annotated thousands of OCT slices for features of ocular pathology according to grading protocols. These protocols mandated the annotation of any trace of features such as ERM, even if not clinically significant. In such cases of trace ERM, both 'ERM' and 'normal' can be considered acceptable diagnoses, given the different thresholds for detectable versus clinically significant pathology. In comparison, the reference standard clinical diagnosis would typically record pathology such as ERM only if it was considered clinically significant. As a result, the classification of both AI and participant diagnostic decisions as 'correct' or 'incorrect' relative to the reference standard is occasionally ambiguous.
Thirdly, our study involved matching cases across the three study conditions based on clinical case selection. Although our matched cases were confirmed by a medical retina specialist (KB), we recognise that individual cases are unique and that it would be impossible to find identical cases when matching for AI outputs, OCT appearance and clinical information.
Finally, while we aimed to maximise the ecological validity of the study, it was limited both in not reflecting a natural mix of cases and in including less patient information than would normally be available.