About the AI Software Device and Prior Evidence
qER (version 2.0 EU, manufactured by Qure.ai) is a CE (Conformité Européenne) class IIb and Food and Drug Administration (FDA) 510(k)-cleared AI software device for analysing non-contrast computed tomography head (NCCTH) scans in patients aged ≥ 18 years. The device can detect abnormalities such as intracranial haemorrhage (ICH), including its five subtypes of extradural (EDH), subdural (SDH), subarachnoid (SAH), intraparenchymal (IPH), and intraventricular (IVH) haemorrhages, as well as mass effect, midline shift, cranial fractures, hypodensities suggestive of infarct, and atrophy. The core component of each algorithm is a classification convolutional neural network (CNN) trained to detect a specific abnormality.8 The backend output of each algorithm is a scan-level probability score (a value between 0 and 1) for the corresponding abnormality. A threshold is applied to the probability score to determine the presence or absence of the target abnormality. A secondary capture (SC) image is then generated, informing the radiologist of any target abnormalities detected (Fig. 1). Scans containing artefacts, post-operative defects, and metal implants are known to cause inaccurate qER outputs; these are listed in the device warnings so that radiologists are aware of the possibility of inaccurate AI results.
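The thresholding step described above can be sketched as follows. This is a minimal illustration, not the vendor implementation: the abnormality names and the 0.5 cut-off are hypothetical placeholders for the device's internal per-abnormality thresholds.

```python
# Illustrative sketch of per-abnormality thresholding (NOT the qER
# implementation): each classification CNN emits a scan-level probability
# in [0, 1]; a threshold converts it into a binary present/absent call.
# Abnormality names and the 0.5 thresholds below are hypothetical.
THRESHOLDS = {"ICH": 0.5, "midline_shift": 0.5, "cranial_fracture": 0.5}

def flag_abnormalities(scores: dict) -> list:
    """Return abnormalities whose probability meets the threshold."""
    return [name for name, p in scores.items()
            if p >= THRESHOLDS.get(name, 0.5)]

# Example: only ICH crosses its threshold, so only ICH would be flagged
# (and, in the deployed workflow, surfaced on the SC image).
flags = flag_abnormalities({"ICH": 0.91, "midline_shift": 0.12})
```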
In a study conducted in Sweden using data from a stroke registry, qER was found to have approximately 97% sensitivity in detecting non-traumatic ICH, and 95% of the missed ICHs were < 1 mm in diameter.9 Another study, using an external validation dataset of 491 NCCTH scans collected from inpatient and outpatient settings in India, reported a sensitivity of 81.95% (95% CI: 75.99–86.99) in detecting ICH at a high specificity of 90%.8 In a study conducted for the purpose of qER regulatory clearance using 1320 NCCTH scans from multiple sites in the United States, the AUC was reported to be more than 97% for ICH, skull fracture, mass effect, and midline shift.10 For this evaluation, we focused on the ICH detection capability of qER.
Data Collection
Medica Group Limited is a large telemedicine practice with a UK arm, Medica Reporting Limited (MRL), offering teleradiology and telepathology services utilising more than 500 reporters globally. Emergency and urgent telereporting of imaging investigations is routinely provided, and the AI device (qER) has been deployed and successfully integrated with the acute teleradiology workflow to assist in reporting NCCTH scans since the end of 2020. During real-time reporting, the original reporting radiologists would see a prioritisation flag in their worklist against an NCCTH scan if the AI detected the presence of ICH in that scan.
A subset of all consecutive NCCTH scans successfully analysed by the AI device, from patients aged ≥ 18 years referred for emergency and urgent teleradiology reporting from 44 different hospital sites in the UK during a 4-month period (13 September 2023 to 16 January 2024), was selected randomly. A sample of 100 to 125 scans with radiologist-confirmed presence of ICH would enable us to estimate an anticipated minimum positive percent agreement of AI with radiologists of 85% with a precision (half-width of the 95% confidence interval) of about 7% to 6.25%.11 Assuming the prevalence of ICH to be about 8%, a sample of approximately 1400 NCCTH scans was desired for this evaluation, which would also allow us to estimate an anticipated minimum 90% overall agreement with a higher precision of about 1.58%. The original DICOM (Digital Imaging and Communications in Medicine) NCCTH images, the original radiology report, and the SC generated by AI were available in the historical electronic data for use in this evaluation.
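The precision figures above follow from the normal-approximation half-width of a 95% confidence interval for a proportion; a short sketch reproduces the stated values (this is our reconstruction of the arithmetic, not code from the study):

```python
from math import sqrt

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of the 95% CI for a proportion p
    estimated from n observations."""
    return z * sqrt(p * (1 - p) / n)

# PPA of 85% estimated from 100-125 ICH-positive scans:
ci_half_width(0.85, 100)   # ~0.070  (7%)
ci_half_width(0.85, 125)   # ~0.063  (6.25%)
# Overall agreement of 90% estimated from ~1400 scans:
ci_half_width(0.90, 1400)  # ~0.0157 (1.58%)
```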
Evaluation Process
The original DICOM NCCTH images, the original radiology report, and the AI-generated SC were retrospectively evaluated by a team of 30 auditing radiologists (5 neurospecialist radiologists and 25 general radiologists) with an average experience of 13 years (median 12 years) in radiological reporting. Each patient's data was evaluated by a single auditing radiologist, and none of the auditing radiologists evaluated their own previous original radiology report. This process of evaluation (auditing) is followed periodically in Medica as part of internal standard operating procedures. After thorough inspection of the original NCCTH image, the AI-generated SC, and the original radiology report, the auditing radiologist assigned each scan to one of five distinct categories, as listed below:
- Agree – Great Spot: The auditing radiologist agrees with the positive (presence of ICH) finding by AI and considers this positive finding a good subtle spot. These are true positive (TP) cases.
- Agree with positive finding: The auditing radiologist agrees with the positive finding by AI. These are TP cases.
- Agree with negative finding: The auditing radiologist agrees with the negative (absence of ICH) finding by AI. These are true negative (TN) cases.
- Disagree with positive finding: The auditing radiologist disagrees with the positive finding by AI. These are false positive (FP) cases.
- Disagree with negative finding: The auditing radiologist disagrees with the negative finding by AI. These are false negative (FN) cases.
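The mapping from audit category to confusion-matrix cell can be sketched as follows. This is an illustration of the tallying logic only; the category strings are shorthand for the labels above, and the auditing radiologist's impression serves as the non-reference standard.

```python
from collections import Counter

# Illustrative mapping from each audit category to its confusion-matrix
# cell (AI vs. auditing radiologist). Category strings are shorthand
# for the labels defined in the text.
CATEGORY_TO_CELL = {
    "Agree - Great Spot":             "TP",
    "Agree with positive finding":    "TP",
    "Agree with negative finding":    "TN",
    "Disagree with positive finding": "FP",
    "Disagree with negative finding": "FN",
}

def tally(categories: list) -> Counter:
    """Count TP/TN/FP/FN over a list of per-scan audit categories."""
    return Counter(CATEGORY_TO_CELL[c] for c in categories)
```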
If the auditing radiologist disagrees with the original radiology report, a discrepancy is raised so that the report is reviewed, as per standard operating procedure in Medica. However, that analysis was out of scope for this evaluation, in which we focused only on the auditing outcome (auditing radiologist impressions versus AI). If there is a disagreement with AI, the auditing radiologist also records the potential reason for the disagreement. A descriptive failure analysis is also reported.
Statistical Analysis
We use the terms overall agreement, positive percent agreement, and negative percent agreement (PPA and NPA) to denote the accuracy, sensitivity, and specificity, respectively, of the AI device in detecting ICH, based on the FDA guidance12 for reporting results from analysis of diagnostic devices when a non-reference standard is used. Point estimates and corresponding 95% exact binomial confidence intervals (95% CI) of overall agreement, PPA, NPA, positive predictive value (PPV), and negative predictive value (NPV) are reported. A descriptive analysis of the reasons for false results is also reported. Agreement between the AI tool and the auditing radiologist was also quantified by Gwet's AC1,13 Cohen's kappa, and Prevalence and Bias Adjusted Kappa (PABAK).14 A multivariable logistic regression model was also fitted with age, sex, and the interaction term between age and sex as independent variables to investigate whether age and sex are associated with incorrect AI results. The statistical analysis was conducted in R version 4.3.2, and only anonymised data, after removal of patient identifiers, was used for any statistical analysis.
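The point estimates above all derive from the TP/FP/FN/TN counts. The following Python sketch shows the standard formulas (the study itself used R); the exact binomial CIs are omitted here, and the counts in the usage note are hypothetical, not the study's results.

```python
def agreement_stats(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Point estimates of AI vs. auditing-radiologist agreement.
    Exact binomial CIs (as reported in the text) are omitted from
    this stdlib-only sketch."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n                      # overall agreement
    ai_pos, rad_pos = (tp + fp) / n, (tp + fn) / n
    # Cohen's kappa: chance agreement from the two raters' marginals
    pe = ai_pos * rad_pos + (1 - ai_pos) * (1 - rad_pos)
    # Gwet's AC1: chance agreement from the mean positive marginal
    pi = (ai_pos + rad_pos) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    return {
        "overall": po,
        "PPA": tp / (tp + fn), "NPA": tn / (tn + fp),
        "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
        "kappa": (po - pe) / (1 - pe),
        "PABAK": 2 * po - 1,                # prevalence/bias adjusted
        "AC1": (po - pe_ac1) / (1 - pe_ac1),
    }

# Hypothetical counts for illustration only:
stats = agreement_stats(tp=90, fp=20, fn=10, tn=880)
```

With a low-prevalence finding such as ICH, Cohen's kappa can be depressed by the skewed marginals even when raw agreement is high, which is why the paradox-resistant AC1 and PABAK are reported alongside it.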
Ethical Considerations
This was a retrospective evaluation of an auditing process already implemented in the teleradiology practice (Medica) as part of standard operating procedures. The Medica Reporting Limited Clinical Governance Committee, Chief Medical Officer, and Caldicott Guardian approved this evaluation. The data used for statistical analysis is non-identifiable and provides assurance of the expected performance of our system for CT head reporting prioritisation.