Analysis
We estimated three mixed-effects regression models, one for each dependent variable: 1) diagnostic accuracy, 2) advice quality rating, and 3) confidence in the diagnosis. Diagnostic accuracy was modeled with logistic regression because it was measured as a binary variable (accurate/inaccurate); linear regression models were used for the advice quality and confidence ratings. Each dependent variable was regressed on the explainability of the advice (annotated vs. non-annotated), the source of the advice (AI vs. human), task expertise (radiologists vs. IM/EM physicians), the interaction between explainability and source, and the control variables (professional identification, beliefs about professional autonomy, self-reported AI knowledge, attitude toward AI technology, and years of professional experience). All models included fixed effects for the variables mentioned above, a random intercept for the participant to account for the non-independence of observations and individual differences in skill, and a random intercept for the patient case to account for differences in case difficulty. One of the eight cases, taken unchanged from the previous study, showed no clinical abnormalities (diagnosis: normal) and consequently carried no annotations on the image. A second case was shown without annotations due to a technical issue. These two cases were excluded from the analysis because the explainability condition could not be unambiguously assigned to them.
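For readers who want to reproduce this model structure, the linear models above (fixed effects plus crossed random intercepts for participant and patient case) can be sketched in Python with statsmodels. This is a minimal illustration on synthetic stand-in data; all variable names, the number of participants and cases, and the simulated effect sizes are our own placeholders, not the study's data or code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: 30 participants x 6 cases (illustrative only)
rng = np.random.default_rng(42)
rows = [{"pid": p, "case": c,
         "annotated": (p + c) % 2,   # explainability condition (placeholder assignment)
         "ai_source": p % 2}         # advice source condition (placeholder assignment)
        for p in range(30) for c in range(6)]
df = pd.DataFrame(rows)
df["rating"] = (4.5 + 0.25 * df["annotated"] + 0.03 * df["ai_source"]
                + rng.normal(0, 0.9, len(df)))

# Crossed random intercepts in statsmodels: a single all-encompassing group
# plus one variance component per random factor (participant, case).
df["all"] = 1
model = smf.mixedlm(
    "rating ~ annotated * ai_source",   # fixed effects incl. the interaction
    df, groups="all",
    vc_formula={"pid": "0 + C(pid)", "case": "0 + C(case)"})
result = model.fit()
print(result.summary())
```

The control variables would enter the formula as additional fixed-effect terms; the binary accuracy outcome would require a mixed-effects logistic model instead of `mixedlm`.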
Non-Task Experts Benefited from Correct Explainable AI Advice
First, we tested whether the participants’ diagnostic accuracy was influenced by the experimental manipulations (see Table 1). Unsurprisingly, task experts (i.e., radiologists) performed significantly better than non-task experts (i.e., IM/EM physicians). Overall, participants showed higher diagnostic accuracy when they received advice with an explanation (i.e., an annotation). Surprisingly, they also performed better when the advice was labeled as coming from the AI rather than the human. When examining the results separately by task expertise (defined by discipline), we found that annotated advice improved the performance of the IM/EM physicians but had no significant effect on the radiologists’ diagnostic accuracy (see Fig. 2a). The source of the advice did not significantly affect performance in either expertise group (see Fig. 2b). Moreover, higher self-reported AI knowledge was generally associated with better task performance (see Table 1). Notably, task experts rated their AI knowledge significantly higher than non-task experts (Table S2), which might explain this result.
Table 1.
Logistic mixed-effects regression model for participants’ diagnostic accuracy.
| Predictors | Odds Ratios | SE | 95% CI | z | p |
|---|---|---|---|---|---|
| Intercept | 1.30 | 1.38 | 0.16 – 10.37 | 0.24 | 0.807 |
| Explainability [annotated] | 2.30 | 0.61 | 1.37 – 3.86 | 3.17 | 0.002 |
| Source [AI] | 2.09 | 0.57 | 1.22 – 3.59 | 2.69 | 0.007 |
| Task expertise [experts: radiologists] | 2.20 | 0.51 | 1.40 – 3.47 | 3.41 | 0.001 |
| Professional identification | 1.00 | 0.12 | 0.80 – 1.26 | 0.02 | 0.985 |
| Beliefs about professional autonomy | 1.20 | 0.14 | 0.95 – 1.52 | 1.57 | 0.117 |
| Self-reported AI knowledge | 1.34 | 0.19 | 1.01 – 1.77 | 2.03 | 0.043 |
| Attitude toward AI | 0.99 | 0.12 | 0.78 – 1.26 | -0.06 | 0.954 |
| Professional experience (years) | 0.99 | 0.01 | 0.97 – 1.01 | -1.11 | 0.268 |
| Explainability [annotated] × Source [AI] | 0.57 | 0.22 | 0.27 – 1.20 | -1.49 | 0.137 |
Note. SE = standard error; p = probability of committing a Type I error; random effects: σ² = 3.29, τ₀₀ ID = 0.60, τ₀₀ PATIENTID = 1.09, ICC = 0.34, N ID = 222, N PATIENTID = 6, observations = 1332; marginal R² = 0.086 / conditional R² = 0.397. OR > 1: the variable is associated with higher odds of a correct diagnosis; OR < 1: with lower odds; OR = 1: the variable does not affect the odds of the outcome.
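As the note explains, each odds ratio is the exponentiated coefficient from the logit scale. The arithmetic can be sketched for the Explainability row; the logit-scale values below (b ≈ 0.833, SE ≈ 0.263) are our own back-computation from the reported OR and z, not numbers taken from the paper.

```python
import math

# Approximate logit-scale coefficient and SE, back-computed from Table 1
b, se = 0.833, 0.263

odds_ratio = math.exp(b)            # OR = exp(b), ~2.30
z = b / se                          # Wald z statistic, ~3.17
ci_low = math.exp(b - 1.96 * se)    # lower 95% CI bound on the OR scale, ~1.37
ci_high = math.exp(b + 1.96 * se)   # upper 95% CI bound on the OR scale, ~3.85

print(f"OR = {odds_ratio:.2f}, z = {z:.2f}, "
      f"95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

The small discrepancy against the table's upper bound (3.86) comes from rounding in the back-computed inputs.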
Non-Task Experts Preferred Explainable Advice
Next, we examined whether the experimental manipulations influenced the participants’ advice quality ratings (see Table 2). Participants rated the quality of the advice higher on average if it was given with an explanation, i.e., with a visible annotation on the X-ray. When comparing the results by task expertise, annotated advice led to higher quality ratings among non-task experts but not among task experts (see Fig. 2c). The source of the advice had no effect on the quality rating, neither among IM/EM physicians nor among radiologists (see Fig. 2d). Task experts rated the quality of the advice significantly higher than non-task experts. Attitude toward AI technology was the only other significant (positive) predictor of the quality rating in the overall sample (see Table 2).
Table 2.
Linear mixed-effects regression model for advice quality ratings.
| Predictors | Estimate | SE | 95% CI | t | p |
|---|---|---|---|---|---|
| Intercept | 4.55 | 0.49 | 3.59 – 5.50 | 9.36 | <0.001 |
| Explainability [annotated] | 0.25 | 0.07 | 0.11 – 0.39 | 3.42 | 0.001 |
| Source [AI] | 0.03 | 0.11 | -0.19 – 0.25 | 0.29 | 0.771 |
| Task expertise [experts: radiologists] | 0.24 | 0.10 | 0.04 – 0.45 | 2.32 | 0.020 |
| Professional identification | 0.03 | 0.05 | -0.07 – 0.14 | 0.59 | 0.554 |
| Beliefs about professional autonomy | -0.09 | 0.05 | -0.20 – 0.01 | -1.70 | 0.089 |
| Self-reported AI knowledge | 0.07 | 0.07 | -0.06 – 0.20 | 1.08 | 0.279 |
| Attitude toward AI | 0.13 | 0.06 | 0.02 – 0.23 | 2.27 | 0.023 |
| Professional experience (years) | -0.00 | 0.01 | -0.02 – 0.01 | -0.90 | 0.367 |
| Explainability [annotated] × Source [AI] | -0.07 | 0.10 | -0.27 – 0.12 | -0.75 | 0.454 |
Note. SE = standard error; p = probability of committing a Type I error; random effects: σ² = 0.79, τ₀₀ ID = 0.42, τ₀₀ PATIENTID = 0.16, ICC = 0.42, N ID = 222, N PATIENTID = 6, observations = 1332; marginal R² = 0.042 / conditional R² = 0.449. The regression estimate indicates how much the mean quality rating changes given a one-unit shift in the predictor while holding the other predictors in the model constant.
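The ICC reported in the notes follows directly from the variance components: it is the share of total outcome variance attributable to the two random intercepts. A quick check with Table 2's values (the variable names are ours):

```python
# Variance components from the note of Table 2
resid_var = 0.79        # sigma^2, residual variance
tau_participant = 0.42  # tau00 for the participant (ID)
tau_case = 0.16         # tau00 for the patient case (PATIENTID)

# ICC = random-intercept variance / total variance
icc = (tau_participant + tau_case) / (resid_var + tau_participant + tau_case)
print(f"ICC = {icc:.2f}")  # matches the reported ICC = 0.42
```

The same computation reproduces the ICC of 0.34 reported for Tables 1 and 3 from their respective variance components.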
AI Advice Boosted Task Experts’ Confidence in their Diagnosis
Regarding participants’ confidence in their diagnostic decisions, task experts (i.e., radiologists) reported, as expected, being more confident in their diagnoses. Neither the explainability nor the source of the advice influenced the confidence rating in the combined sample. However, when comparing task experts with non-task experts, radiologists surprisingly reported higher confidence in their final diagnosis when the advice was labeled as coming from the AI rather than the human (see Fig. 2e). IM/EM physicians’ confidence ratings were not affected by the source of the advice. The only other variable associated with higher confidence was higher self-reported AI knowledge (see Table 3). As mentioned above, task experts rated their self-reported AI knowledge significantly higher than non-task experts (Table S2), which might explain this result.
Table 3.
Linear mixed-effects regression model for confidence in the diagnosis.
| Predictors | Estimate | SE | 95% CI | t | p |
|---|---|---|---|---|---|
| Intercept | 4.60 | 0.43 | 3.76 – 5.44 | 10.76 | <0.001 |
| Explainability [annotated] | 0.06 | 0.07 | -0.09 – 0.20 | 0.77 | 0.440 |
| Source [AI] | 0.19 | 0.10 | -0.01 – 0.39 | 1.91 | 0.056 |
| Task expertise [experts: radiologists] | 0.72 | 0.09 | 0.54 – 0.89 | 7.93 | <0.001 |
| Professional identification | -0.00 | 0.05 | -0.09 – 0.09 | -0.05 | 0.956 |
| Beliefs about professional autonomy | 0.01 | 0.05 | -0.09 – 0.10 | 0.16 | 0.874 |
| Self-reported AI knowledge | 0.19 | 0.06 | 0.08 – 0.30 | 3.39 | 0.001 |
| Attitude toward AI | -0.01 | 0.05 | -0.11 – 0.08 | -0.26 | 0.794 |
| Professional experience (years) | 0.01 | 0.00 | -0.00 – 0.02 | 1.79 | 0.074 |
| Explainability [annotated] × Source [AI] | -0.03 | 0.10 | -0.23 – 0.17 | -0.32 | 0.749 |
Note. SE = standard error; p = probability of committing a Type I error; random effects: σ² = 0.85, τ₀₀ ID = 0.27, τ₀₀ PATIENTID = 0.16, ICC = 0.34, N ID = 222, N PATIENTID = 6, observations = 1332; marginal R² = 0.130 / conditional R² = 0.424. The regression estimate indicates how much the mean confidence rating changes given a one-unit shift in the predictor while holding the other predictors in the model constant.
Performance Across Clinical Cases
Finally, we also examined participants’ task performance for each clinical case (see Fig. 3). Overall performance was high; however, it was much lower for case PT011 under both advice manipulations. This finding is consistent with our previous study using the same patient cases (1). While annotations on the X-rays generally had no benefit for task experts, in this more difficult case they appear to have had a positive effect even on radiologists. Among IM/EM physicians, receiving annotated advice was generally associated with higher diagnostic accuracy (except for case PT007, which showed the smallest difference between the two conditions). Interestingly, non-task experts’ performance in the annotated condition was on par with that of the task experts, which might indicate that non-task experts benefit more from explainable advice independent of case complexity. Across all cases, the source of the advice had little effect on radiologists’ performance. Non-task experts performed slightly better (to a varying degree) when receiving advice labeled as coming from an AI in all but one case (PT015).
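Case-level comparisons like these reduce to mean diagnostic accuracy per case and condition. A minimal pandas sketch with hypothetical trial records (the case IDs echo those in the text, but all accuracy values are placeholders, not study results):

```python
import pandas as pd

# Hypothetical trial-level records (placeholder values, not study data)
trials = pd.DataFrame({
    "case":      ["PT007", "PT007", "PT011", "PT011", "PT015", "PT015"],
    "annotated": [1, 0, 1, 0, 1, 0],       # explainability condition
    "accurate":  [1, 1, 1, 0, 1, 1],       # binary diagnostic accuracy
})

# Mean accuracy per case and explainability condition, cases as rows
acc = (trials.groupby(["case", "annotated"])["accurate"]
             .mean()
             .unstack("annotated"))
print(acc)
```

In the real analysis this table would be computed per expertise group and per advice source as well, yielding the per-case comparisons shown in Fig. 3.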