Analysis
We estimated three mixed-effects regression models, one for each dependent variable: 1) diagnostic accuracy, 2) advice quality rating, and 3) confidence in the diagnosis. Diagnostic accuracy was modeled with logistic regression because it was measured as a binary variable (accurate/inaccurate); linear regression models were used for the advice quality and confidence ratings. Each dependent variable was regressed on the explainability of the advice (annotated vs. non-annotated), the source of the advice (AI vs. human), task expertise (radiologists vs. IM/EM physicians), the interaction between explainability and source, and the control variables (professional identification, beliefs about professional autonomy, self-reported AI knowledge, attitude toward AI technology, and years of professional experience). All models included fixed effects for the variables mentioned above, a random intercept for the participant to account for the non-independence of observations and individual differences in skill, and a random intercept for the patient case to account for differences in case difficulty. One of the eight cases, taken unchanged from the previous study, showed no clinical abnormalities (diagnosis: normal) and consequently carried no annotations on the image. A second case was shown without annotations due to a technical issue. These two cases were excluded from the analysis because the explainability condition could not be unambiguously assigned to them.
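For readers who want to reproduce this model structure, the linear models above (fixed effects plus crossed random intercepts for participant and patient case) can be sketched in Python with statsmodels. This is a minimal illustration on synthetic stand-in data; all variable names, the number of participants and cases, and the simulated effect sizes are our own placeholders, not the study's data or code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: 30 participants x 6 cases (illustrative only)
rng = np.random.default_rng(42)
rows = [{"pid": p, "case": c,
         "annotated": (p + c) % 2,   # explainability condition (placeholder assignment)
         "ai_source": p % 2}         # advice source condition (placeholder assignment)
        for p in range(30) for c in range(6)]
df = pd.DataFrame(rows)
df["rating"] = (4.5 + 0.25 * df["annotated"] + 0.03 * df["ai_source"]
                + rng.normal(0, 0.9, len(df)))

# Crossed random intercepts in statsmodels: a single all-encompassing group
# plus one variance component per random factor (participant, case).
df["all"] = 1
model = smf.mixedlm(
    "rating ~ annotated * ai_source",   # fixed effects incl. the interaction
    df, groups="all",
    vc_formula={"pid": "0 + C(pid)", "case": "0 + C(case)"})
result = model.fit()
print(result.summary())
```

The control variables would enter the formula as additional fixed-effect terms; the binary accuracy outcome would require a mixed-effects logistic model instead of `mixedlm`.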
Non-Task Experts Benefited from Correct Explainable AI Advice
First, we tested whether the participants’ diagnostic accuracy was influenced by the experimental manipulations (see Table 1). Unsurprisingly, task experts (i.e., radiologists) performed significantly better than non-task experts (i.e., IM/EM physicians). Overall, participants showed higher diagnostic accuracy when they received advice with an explanation (i.e., an annotation). Surprisingly, they also performed better when the advice was labeled as coming from the AI rather than the human. When examining the results separately by task expertise (defined by discipline), we found that annotated advice improved the performance of the IM/EM physicians but had no significant effect on the radiologists’ diagnostic accuracy (see Fig. 2a). The source of the advice did not significantly affect performance in either expertise group (see Fig. 2b). Moreover, higher self-reported AI knowledge was generally associated with better task performance (see Table 1). Notably, task experts rated their AI knowledge significantly higher than non-task experts (Table S2), which might explain this result.
Table 1.
Logistic mixed-effects regression model for participants’ diagnostic accuracy.
| Predictors | Odds Ratios | SE | 95% CI | z | p |
|---|---|---|---|---|---|
| Intercept | 1.30 | 1.38 | 0.16 – 10.37 | 0.24 | 0.807 |
| Explainability [annotated] | 2.30 | 0.61 | 1.37 – 3.86 | 3.17 | 0.002 |
| Source [AI] | 2.09 | 0.57 | 1.22 – 3.59 | 2.69 | 0.007 |
| Task expertise [experts: radiologists] | 2.20 | 0.51 | 1.40 – 3.47 | 3.41 | 0.001 |
| Professional identification | 1.00 | 0.12 | 0.80 – 1.26 | 0.02 | 0.985 |
| Beliefs about professional autonomy | 1.20 | 0.14 | 0.95 – 1.52 | 1.57 | 0.117 |
| Self-reported AI knowledge | 1.34 | 0.19 | 1.01 – 1.77 | 2.03 | 0.043 |
| Attitude toward AI | 0.99 | 0.12 | 0.78 – 1.26 | -0.06 | 0.954 |
| Professional experience (years) | 0.99 | 0.01 | 0.97 – 1.01 | -1.11 | 0.268 |
| Explainability [annotated] × Source [AI] | 0.57 | 0.22 | 0.27 – 1.20 | -1.49 | 0.137 |
Note. SE = standard error; p = probability of committing a Type I error; random effects: σ² = 3.29, τ₀₀ ID = 0.60, τ₀₀ PATIENTID = 1.09, ICC = 0.34, N ID = 222, N PATIENTID = 6, observations = 1332; marginal R² = 0.086 / conditional R² = 0.397. OR > 1: the variable is associated with higher odds of a correct diagnosis; OR < 1: with lower odds; OR = 1: the variable does not affect the odds of the outcome.
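As the note explains, each odds ratio is the exponentiated coefficient from the logit scale. The arithmetic can be sketched for the Explainability row; the logit-scale values below (b ≈ 0.833, SE ≈ 0.263) are our own back-computation from the reported OR and z, not numbers taken from the paper.

```python
import math

# Approximate logit-scale coefficient and SE, back-computed from Table 1
b, se = 0.833, 0.263

odds_ratio = math.exp(b)            # OR = exp(b), ~2.30
z = b / se                          # Wald z statistic, ~3.17
ci_low = math.exp(b - 1.96 * se)    # lower 95% CI bound on the OR scale, ~1.37
ci_high = math.exp(b + 1.96 * se)   # upper 95% CI bound on the OR scale, ~3.85

print(f"OR = {odds_ratio:.2f}, z = {z:.2f}, "
      f"95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

The small discrepancy against the table's upper bound (3.86) comes from rounding in the back-computed inputs.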
Non-Task Experts Preferred Explainable Advice
Next, we examined whether the experimental manipulations influenced the participants’ advice quality ratings (see Table 2). Participants rated the quality of the advice higher on average if it was given with an explanation, i.e., with a visible annotation on the X-ray. When comparing the results by task expertise, annotated advice led to higher quality ratings among non-task experts but not among task experts (see Fig. 2c). The source of the advice had no effect on the quality rating, neither among IM/EM physicians nor among radiologists (see Fig. 2d). Task experts rated the quality of the advice significantly higher than non-task experts. Attitude toward AI technology was the only other significant (positive) predictor of the quality rating in the overall sample (see Table 2).
Table 2.
Linear mixed-effects regression model for advice quality ratings.
| Predictors | Estimate | SE | 95% CI | t | p |
|---|---|---|---|---|---|
| Intercept | 4.55 | 0.49 | 3.59 – 5.50 | 9.36 | <0.001 |
| Explainability [annotated] | 0.25 | 0.07 | 0.11 – 0.39 | 3.42 | 0.001 |
| Source [AI] | 0.03 | 0.11 | -0.19 – 0.25 | 0.29 | 0.771 |
| Task expertise [experts: radiologists] | 0.24 | 0.10 | 0.04 – 0.45 | 2.32 | 0.020 |
| Professional identification | 0.03 | 0.05 | -0.07 – 0.14 | 0.59 | 0.554 |
| Beliefs about professional autonomy | -0.09 | 0.05 | -0.20 – 0.01 | -1.70 | 0.089 |
| Self-reported AI knowledge | 0.07 | 0.07 | -0.06 – 0.20 | 1.08 | 0.279 |
| Attitude toward AI | 0.13 | 0.06 | 0.02 – 0.23 | 2.27 | 0.023 |
| Professional experience (years) | -0.00 | 0.01 | -0.02 – 0.01 | -0.90 | 0.367 |
| Explainability [annotated] × Source [AI] | -0.07 | 0.10 | -0.27 – 0.12 | -0.75 | 0.454 |
Note. SE = standard error; p = probability of committing a Type I error; random effects: σ² = 0.79, τ₀₀ ID = 0.42, τ₀₀ PATIENTID = 0.16, ICC = 0.42, N ID = 222, N PATIENTID = 6, observations = 1332; marginal R² = 0.042 / conditional R² = 0.449. The regression estimate indicates how much the mean quality rating changes given a one-unit shift in the predictor while holding the other predictors in the model constant.
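The ICC reported in the notes follows directly from the variance components: it is the share of total outcome variance attributable to the two random intercepts. A quick check with Table 2's values (the variable names are ours):

```python
# Variance components from the note of Table 2
resid_var = 0.79        # sigma^2, residual variance
tau_participant = 0.42  # tau00 for the participant (ID)
tau_case = 0.16         # tau00 for the patient case (PATIENTID)

# ICC = random-intercept variance / total variance
icc = (tau_participant + tau_case) / (resid_var + tau_participant + tau_case)
print(f"ICC = {icc:.2f}")  # matches the reported ICC = 0.42
```

The same computation reproduces the ICC of 0.34 reported for Tables 1 and 3 from their respective variance components.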
AI Advice Boosted Task Experts’ Confidence in their Diagnosis
Regarding participants’ confidence in their diagnostic decisions, task experts (i.e., radiologists) reported, as expected, being more confident in their diagnoses. Neither the explainability nor the source of the advice influenced the confidence rating in the combined sample. However, when comparing task experts with non-task experts, radiologists surprisingly reported higher confidence in their final diagnosis when the advice was labeled as coming from the AI rather than the human (see Fig. 2e). IM/EM physicians’ confidence ratings were not affected by the source of the advice. The only other variable associated with higher confidence was higher self-reported AI knowledge (see Table 3). As mentioned above, task experts rated their self-reported AI knowledge significantly higher than non-task experts (Table S2), which might explain this result.
Table 3.
Linear mixed-effects regression model for confidence in the diagnosis.
| Predictors | Estimate | SE | 95% CI | t | p |
|---|---|---|---|---|---|
| Intercept | 4.60 | 0.43 | 3.76 – 5.44 | 10.76 | <0.001 |
| Explainability [annotated] | 0.06 | 0.07 | -0.09 – 0.20 | 0.77 | 0.440 |
| Source [AI] | 0.19 | 0.10 | -0.01 – 0.39 | 1.91 | 0.056 |
| Task expertise [experts: radiologists] | 0.72 | 0.09 | 0.54 – 0.89 | 7.93 | <0.001 |
| Professional identification | -0.00 | 0.05 | -0.09 – 0.09 | -0.05 | 0.956 |
| Beliefs about professional autonomy | 0.01 | 0.05 | -0.09 – 0.10 | 0.16 | 0.874 |
| Self-reported AI knowledge | 0.19 | 0.06 | 0.08 – 0.30 | 3.39 | 0.001 |
| Attitude toward AI | -0.01 | 0.05 | -0.11 – 0.08 | -0.26 | 0.794 |
| Professional experience (years) | 0.01 | 0.00 | -0.00 – 0.02 | 1.79 | 0.074 |
| Explainability [annotated] × Source [AI] | -0.03 | 0.10 | -0.23 – 0.17 | -0.32 | 0.749 |
Note. SE = standard error; p = probability of committing a Type I error; random effects: σ² = 0.85, τ₀₀ ID = 0.27, τ₀₀ PATIENTID = 0.16, ICC = 0.34, N ID = 222, N PATIENTID = 6, observations = 1332; marginal R² = 0.130 / conditional R² = 0.424. The regression estimate indicates how much the mean confidence rating changes given a one-unit shift in the predictor while holding the other predictors in the model constant.
Performance Across Clinical Cases
Finally, we also examined participants’ task performance for each clinical case (see Fig. 3). Overall performance was high; however, it was much lower for case PT011 under both advice manipulations. This finding is consistent with our previous study using the same patient cases (1). While annotations on the X-rays generally had no benefit for task experts, in this more difficult case they appear to have had a positive effect even on radiologists. Among IM/EM physicians, receiving annotated advice was generally associated with higher diagnostic accuracy (except for case PT007, which showed the smallest difference between the two conditions). Interestingly, non-task experts’ performance in the annotated condition was on par with that of the task experts, which might indicate that non-task experts benefit more from explainable advice independent of case complexity. Across all cases, the source of the advice had little effect on radiologists’ performance. Non-task experts performed slightly better (to a varying degree) when receiving advice labeled as coming from an AI in all but one case (PT015).
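Case-level comparisons like these reduce to mean diagnostic accuracy per case and condition. A minimal pandas sketch with hypothetical trial records (the case IDs echo those in the text, but all accuracy values are placeholders, not study results):

```python
import pandas as pd

# Hypothetical trial-level records (placeholder values, not study data)
trials = pd.DataFrame({
    "case":      ["PT007", "PT007", "PT011", "PT011", "PT015", "PT015"],
    "annotated": [1, 0, 1, 0, 1, 0],       # explainability condition
    "accurate":  [1, 1, 1, 0, 1, 1],       # binary diagnostic accuracy
})

# Mean accuracy per case and explainability condition, cases as rows
acc = (trials.groupby(["case", "annotated"])["accurate"]
             .mean()
             .unstack("annotated"))
print(acc)
```

In the real analysis this table would be computed per expertise group and per advice source as well, yielding the per-case comparisons shown in Fig. 3.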