Almost all published literature relating to AI assessment of acute appendicular fractures in children is based on radiographic interpretation, with the upper limb (specifically the elbow) being the most commonly assessed body part. Nearly all articles used training, validation and testing data derived from a single centre. When AI tools were compared with the performance of human readers, the algorithms demonstrated comparable diagnostic accuracy rates, and in one study improved/augmented the diagnostic performance of a radiologist.
In this review we focussed on the assessment of computer-aided/artificial intelligence methods for paediatric appendicular fracture detection, given that these are the most commonly encountered fractures in an otherwise healthy paediatric population (accounting for approximately 70–99% of paediatric fractures[38–40], with less than 5% of fractures affecting the axial skeleton[41–43]). Publications relating to the application of computer-aided/AI algorithms to paediatric skull and spine fractures have also been described. One study developed an AI algorithm for the detection of skull fractures in children from plain radiographs[44] (using the CT head report as the reference standard) and reported high AUC values on both the internal test set (0.922) and the external validation set (0.870), with improvements in the accuracy of human readers when using AI assistance (compared to without). Whilst this demonstrates proof of concept, clinical applicability is limited since most radiology guidelines encourage the use of CT over radiographs for paediatric head trauma[45–47].
In two articles pertaining to spine fractures[48; 49], the authors applied commercially available, semi-automated software tools designed for adults to a paediatric population for the detection of vertebral fractures on plain radiography or dual-energy X-ray absorptiometry (DEXA). They reported low sensitivity for both software tools (36% and 26%), which was not sufficiently reliable for vertebral fracture diagnosis. This finding raises an important general issue regarding the need for adequate validation and testing of AI tools in specific patient populations, in this case children, prior to clinical application, to avoid potentially detrimental clinical consequences. Such testing was conducted in the current systematic review for one commercially available product (Rayvolve®, AZMed), which demonstrated high diagnostic accuracy rates, particularly for older children (sensitivity 97.1% for 5–18 year olds versus 91.6% for 0–4 year olds; p < 0.001). Whilst other fracture detection products are now commercially available (e.g. BoneView, Gleamer[50]), peer-reviewed publications on such products to date relate only to diagnostic accuracy rates in adults[51] (although paediatric outcomes are available as a conference abstract on the company website[52]).
Most studies in this review chose to develop and apply their AI algorithm to one specific body part, rather than to all bones of the paediatric skeleton. Taking the commonest body part assessed (i.e. the elbow), dedicated algorithms yielded higher diagnostic accuracy rates than the commercially available product for the same body part (which was trained to detect fractures across the entire appendicular skeleton): sensitivity was 89.5–90.7% on test data for the dedicated algorithms versus 88% for the generalised tool. Whilst the difference may be small, it could vary across other body parts, for which insufficient information on dedicated algorithms is available. It will therefore be important to better understand the epidemiology of fractures across different population groups, and whether algorithms with increased diagnostic accuracy for certain commonly fractured body parts would need to be additionally implemented at certain institutions.
Another aspect highlighted by the present study relates to patient selection, with variable inclusion and exclusion criteria amongst the different studies. Few, for example, assessed fractures in children under 2 years old (who are more likely to be investigated for suspected physical abuse[53]) or in those with inherited bone disorders (e.g. osteogenesis imperfecta). This could be due to fewer children within these categories attending emergency departments to provide the necessary imaging data for training AI models; however, the result is that specific paediatric populations may be unintentionally marginalised or poorly served by such new technologies. This raises potential ethical considerations about their future usage, particularly when performance characteristics are extrapolated beyond the population on which the tool was developed and validated[54]. An example would be an AI tool to help evaluate the particular aspects of fractures relating to suspected physical abuse as an adjunct to clinical practice, given that many practising paediatric radiologists do not feel appropriately trained or confident in this aspect of imaging assessment[55–58]. Whilst data are limited, one study did address the topic of using AI to identify suspected physical abuse through the detection of corner metaphyseal fractures (a specific marker of abuse)[59], with high diagnostic accuracy. Future studies addressing these patient populations, with details of the socioeconomic backgrounds of the cases used for training data, would help to develop more inclusive and clinically relevant tools. Expanding the topic of fracture assessment to address bone healing and post-orthopaedic complications may be another area for further development, given that most articles also excluded cases with healing fractures, casts or indwelling orthopaedic hardware.
With the exception of one study, all methods for developing artificial intelligence for fracture detection identified in this review relied on creating or retraining deep convolutional neural networks, which 'learn' features within an image to provide the most accurate output classification. Only one study exclusively adopted a more traditional machine learning approach, using stricter, rule-based computer-aided detection to identify bowing fractures of the forearm[36]. It is unclear whether a convolutional neural network was unsuitable or less accurate for the detection of these specific fractures, or was simply not attempted due to lack of capability. Differences between methods should therefore be compared within the same dataset, not only in relation to performance but also resource requirements/costs and other aspects such as the 'explainability' of features used by the algorithm. It is likely that future AI tools for paediatric fracture detection will rely on a single convolutional neural network, or an ensemble of such networks, to provide optimal performance. Nonetheless, simpler machine learning methods should not be completely disregarded, and consideration should be given to how they can be best employed, given the significant computational power, and thus carbon footprint, required to train deep learning solutions, especially in light of current global efforts to create a more sustainable environment[60].
Although there are fewer publications relating to AI applications for paediatric fractures than for adult imaging, these data demonstrate that several solutions are being developed and tested with children in mind. Given the current crisis in the paediatric radiology workforce and restricted access to specialist services[61–66], an immediate, accurate fracture reporting service could potentially confer cost savings[67] and reduce healthcare inequalities. Nevertheless, health economic analyses and studies assessing whether such algorithms translate into real improvements in patient outcomes are lacking, and it is unclear how generalisable many of the algorithms may be, given that most have been tested in a single centre without external validation. It should also be recognised that there may be great differences between optimised test performance in validation sets and the 'real-world' impact of implementing such a tool in routine clinical workflows, as a consequence of variations in input data as well as usability and the pragmatic ability to incorporate such tools into existing workflows. These factors raise questions regarding future widespread implementation and funding of AI solutions, as individual hospitals and healthcare systems will require a return on their investment at the level of clinical/operational impact rather than pure 'test performance'[68]. Improved methods of secure data sharing (possibly with public datasets of paediatric appendicular radiographs) and greater collaboration between hospitals and industrial and academic partners could help to develop and implement novel digital tools for paediatric imaging at lower cost, and future implementation studies are required.
There were several limitations to the present study. During the literature review, we included studies that specifically related to paediatric fracture detection; it is possible that some additional studies included children within their population dataset but did not make this explicit in their abstract or methodology, and were therefore excluded. Secondly, the AI literature is expanding at a rapid rate, and newer articles are likely to be available by the time of publication. To minimise this effect, an updated review of the literature using the same search strategy was performed immediately before article submission to ensure the timeliness of the findings. We also acknowledge that articles relating to AI applications may be published in open-access but non-peer-reviewed research sharing repositories (e.g. arXiv); these were not searched, since only adequately peer-reviewed articles were included. Finally, it proved difficult to consistently extract the required information from the available literature. When assessing for bias, we used a slight adaptation of the QUADAS-2 guideline (whilst future tools are developed[69]), and in some cases the study methodology appeared incomplete or incomprehensible, particularly in articles written prior to published AI reporting guidelines[70–72]. Accordingly, we included the AI algorithm methodology as a supplementary table, since wide variations in reporting made direct comparisons challenging.