About the AI Software Device and Prior Evidence
qER (version 2.0 EU, manufactured by Qure.ai) is a CE (Conformité Européenne) class IIb and Food and Drug Administration (FDA) 510(k)-cleared AI software device for analysing non-contrast computed tomography head (NCCTH) scans in patients aged ≥ 18 years. The device can detect abnormalities such as intracranial haemorrhage (ICH), including its five subtypes of extradural (EDH), subdural (SDH), subarachnoid (SAH), intraparenchymal (IPH), and intraventricular (IVH) haemorrhages, as well as mass effect, midline shift, cranial fractures, hypodensities suggestive of infarct, and atrophy. The core component of each algorithm is a classification convolutional neural network (CNN) trained to detect a specific abnormality.8 The backend output of each algorithm is a scan-level probability score (a value between 0 and 1) for the corresponding abnormality. A threshold is applied to the probability score to determine the presence or absence of the target abnormality. A secondary capture (SC) image is then generated, informing the radiologist of any target abnormalities detected (Fig. 1). Scans containing artefacts, post-operative defects, and metal implants are known to cause inaccurate qER outputs; these are listed in the device warnings so that radiologists are aware of the possibility of inaccurate AI results.
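The thresholding step described above can be sketched as follows. This is a minimal illustration, not the vendor implementation: the abnormality names and the 0.5 cut-off are hypothetical placeholders for the device's internal per-abnormality thresholds.

```python
# Illustrative sketch of per-abnormality thresholding (NOT the qER
# implementation): each classification CNN emits a scan-level probability
# in [0, 1]; a threshold converts it into a binary present/absent call.
# Abnormality names and the 0.5 thresholds below are hypothetical.
THRESHOLDS = {"ICH": 0.5, "midline_shift": 0.5, "cranial_fracture": 0.5}

def flag_abnormalities(scores: dict) -> list:
    """Return abnormalities whose probability meets the threshold."""
    return [name for name, p in scores.items()
            if p >= THRESHOLDS.get(name, 0.5)]

# Example: only ICH crosses its threshold, so only ICH would be flagged
# (and, in the deployed workflow, surfaced on the SC image).
flags = flag_abnormalities({"ICH": 0.91, "midline_shift": 0.12})
```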
In a study conducted in Sweden using data from a stroke registry, qER was found to have approximately 97% sensitivity in detecting non-traumatic ICH, and 95% of the missed ICHs were < 1 mm in diameter.9 Another study, using an external validation dataset of 491 NCCTH scans collected from inpatient and outpatient settings in India, reported a sensitivity of 81.95% (95% CI: 75.99–86.99) in detecting ICH at a high specificity of 90%.8 In a study conducted for the purpose of qER regulatory clearance using 1320 NCCTH scans from multiple sites in the United States, the AUC was reported to be more than 97% for ICH, skull fracture, mass effect, and midline shift.10 For this evaluation, we focused on the ICH detection capability of qER.
Data Collection
Medica Group Limited is a large telemedicine practice with a UK arm, Medica Reporting Limited (MRL), offering teleradiology and telepathology services utilising more than 500 reporters globally. Emergency and urgent telereporting of imaging investigations is routinely provided, and the AI device (qER) has been deployed and successfully integrated with the acute teleradiology workflow to assist in reporting NCCTH scans since the end of 2020. During real-time reporting, the original reporting radiologists would see a prioritisation flag in their worklist against an NCCTH scan if the AI detected the presence of ICH in that scan.
A subset of all consecutive NCCTH scans successfully analysed by the AI device, from patients aged ≥ 18 years referred for emergency and urgent teleradiology reporting from 44 different hospital sites in the UK during a 4-month period (13 September 2023 to 16 January 2024), was selected randomly. A sample of 100 to 125 scans with radiologist-confirmed presence of ICH would enable us to estimate an anticipated minimum positive percent agreement of AI with radiologists of 85% with a precision (half-width of the 95% confidence interval) of about 7% to 6.25%.11 Assuming the prevalence of ICH to be about 8%, a sample of approximately 1400 NCCTH scans was desired for this evaluation, which would also allow us to estimate an anticipated minimum 90% overall agreement with a higher precision of about 1.58%. The original DICOM (Digital Imaging and Communications in Medicine) NCCTH images, the original radiology report, and the SC generated by AI were available in the historical electronic data for use in this evaluation.
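The precision figures above follow from the normal-approximation half-width of a 95% confidence interval for a proportion; a short sketch reproduces the stated values (this is our reconstruction of the arithmetic, not code from the study):

```python
from math import sqrt

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of the 95% CI for a proportion p
    estimated from n observations."""
    return z * sqrt(p * (1 - p) / n)

# PPA of 85% estimated from 100-125 ICH-positive scans:
ci_half_width(0.85, 100)   # ~0.070  (7%)
ci_half_width(0.85, 125)   # ~0.063  (6.25%)
# Overall agreement of 90% estimated from ~1400 scans:
ci_half_width(0.90, 1400)  # ~0.0157 (1.58%)
```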
Evaluation Process
The original DICOM NCCTH images, the original radiology report, and the AI-generated SC were retrospectively evaluated by a team of 30 auditing radiologists (5 neurospecialist radiologists and 25 general radiologists) with an average experience of 13 years (median 12 years) in radiological reporting. Each patient's data was evaluated by a single auditing radiologist, and none of the auditing radiologists evaluated their own previous original radiology report. This process of evaluation (auditing) is followed periodically in Medica as part of internal standard operating procedures. After thorough inspection of the original NCCTH image, the AI-generated SC, and the original radiology report, the auditing radiologist assigned each scan to one of five distinct categories, as listed below:
- Agree – Great Spot: The auditing radiologist agrees with the positive (presence of ICH) finding by AI and considers this positive finding a good subtle spot. These are true positive (TP) cases.
- Agree with positive finding: The auditing radiologist agrees with the positive finding by AI. These are TP cases.
- Agree with negative finding: The auditing radiologist agrees with the negative (absence of ICH) finding by AI. These are true negative (TN) cases.
- Disagree with positive finding: The auditing radiologist disagrees with the positive finding by AI. These are false positive (FP) cases.
- Disagree with negative finding: The auditing radiologist disagrees with the negative finding by AI. These are false negative (FN) cases.
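The mapping from audit category to confusion-matrix cell can be sketched as follows. This is an illustration of the tallying logic only; the category strings are shorthand for the labels above, and the auditing radiologist's impression serves as the non-reference standard.

```python
from collections import Counter

# Illustrative mapping from each audit category to its confusion-matrix
# cell (AI vs. auditing radiologist). Category strings are shorthand
# for the labels defined in the text.
CATEGORY_TO_CELL = {
    "Agree - Great Spot":             "TP",
    "Agree with positive finding":    "TP",
    "Agree with negative finding":    "TN",
    "Disagree with positive finding": "FP",
    "Disagree with negative finding": "FN",
}

def tally(categories: list) -> Counter:
    """Count TP/TN/FP/FN over a list of per-scan audit categories."""
    return Counter(CATEGORY_TO_CELL[c] for c in categories)
```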
If the auditing radiologist disagrees with the original radiology report, a discrepancy is raised so that the report is reviewed, as per standard operating procedure in Medica. However, that analysis was out of scope for this evaluation, in which we focused only on the auditing outcome (auditing radiologist impressions versus AI). If there is a disagreement with AI, the auditing radiologist also records the potential reason for the disagreement. A descriptive failure analysis is also reported.
Statistical Analysis
We use the terms overall agreement, positive percent agreement, and negative percent agreement (PPA and NPA) to denote the accuracy, sensitivity, and specificity, respectively, of the AI device in detecting ICH, based on the FDA guidance12 for reporting results from analysis of diagnostic devices when a non-reference standard is used. Point estimates and corresponding 95% exact binomial confidence intervals (95% CI) of overall agreement, PPA, NPA, positive predictive value (PPV), and negative predictive value (NPV) are reported. A descriptive analysis of the reasons for false results is also reported. Agreement between the AI tool and the auditing radiologist was also quantified by Gwet's AC1,13 Cohen's kappa, and Prevalence and Bias Adjusted Kappa (PABAK).14 A multivariable logistic regression model was also fitted with age, sex, and the interaction term between age and sex as independent variables to investigate whether age and sex are associated with incorrect AI results. The statistical analysis was conducted in R version 4.3.2, and only anonymised data, after removal of patient identifiers, was used for any statistical analysis.
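The point estimates above all derive from the TP/FP/FN/TN counts. The following Python sketch shows the standard formulas (the study itself used R); the exact binomial CIs are omitted here, and the counts in the usage note are hypothetical, not the study's results.

```python
def agreement_stats(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Point estimates of AI vs. auditing-radiologist agreement.
    Exact binomial CIs (as reported in the text) are omitted from
    this stdlib-only sketch."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n                      # overall agreement
    ai_pos, rad_pos = (tp + fp) / n, (tp + fn) / n
    # Cohen's kappa: chance agreement from the two raters' marginals
    pe = ai_pos * rad_pos + (1 - ai_pos) * (1 - rad_pos)
    # Gwet's AC1: chance agreement from the mean positive marginal
    pi = (ai_pos + rad_pos) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    return {
        "overall": po,
        "PPA": tp / (tp + fn), "NPA": tn / (tn + fp),
        "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
        "kappa": (po - pe) / (1 - pe),
        "PABAK": 2 * po - 1,                # prevalence/bias adjusted
        "AC1": (po - pe_ac1) / (1 - pe_ac1),
    }

# Hypothetical counts for illustration only:
stats = agreement_stats(tp=90, fp=20, fn=10, tn=880)
```

With a low-prevalence finding such as ICH, Cohen's kappa can be depressed by the skewed marginals even when raw agreement is high, which is why the paradox-resistant AC1 and PABAK are reported alongside it.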
Ethical Considerations
This was a retrospective evaluation of an auditing process already implemented in the teleradiology practice (Medica) as part of standard operating procedures. The Medica Reporting Limited Clinical Governance Committee, Chief Medical Officer, and Caldicott Guardian approved this evaluation. The data used for statistical analysis is non-identifiable and provides assurance of the expected performance of our system for CT head reporting prioritisation.