These results generally support the hypothesis that an AI model developed using the aforementioned strategies for accommodating a small development set can perform comparably to senior pediatric emergency medicine physicians in pediatric elbow radiograph triage. This study complements the earlier study on binary classification of pediatric elbow fractures by Rayan et al [5]; our study employed model development strategies that account for a relatively smaller development set, and used human annotators rather than natural language processing for image curation.
Several points of this study merit discussion. First, we assessed the performance of the AI model and the clinical group using sensitivity rather than accuracy, as sensitivity was deemed the more important metric for the planned deployment in ER radiograph triage. Second, we used an EfficientNet B1 model with 240 x 240 input resolution. Recognizing the obvious differences in dataset characteristics, this lower-resolution model trained on a smaller dataset nonetheless achieved respectable results compared with studies using higher-resolution images and larger datasets: our model achieved an AUC of 0.804, compared with the 0.95 reported by Rayan et al [5]. The reasons for this would require further study, but it may be conceptually important that model performance depends on factors other than image resolution [15] or dataset size alone, with network architecture possibly also an important contributor to model performance. Among convolutional neural networks, the EfficientNet [6] family of models has demonstrated efficacy in terms of performance and speed on commercially available GPUs in the classification of skin lesions [16], computed tomography (CT) lung scans [17] and diabetic retinopathy [18], but this is probably one of the first papers employing this architecture on pediatric elbow radiographs. In this study, the lower-powered B1 version of the model was employed rather than the higher (i.e. B4 to B7) versions owing to limitations in processing power; directions for a future study may include comparing the EfficientNet model to existing architectures such as ResNet and InceptionNet, as well as assessing the relative performance of higher-powered versions of EfficientNet. Third, the method of administering the test to the reference clinical group should be carefully considered.
In our study, we took care not to prime the clinical group on the breakdown of normal and abnormal cases in the test set. We also had the physicians indicate the abnormal diagnosis, if present, and place a marker over the site where the abnormality was seen, in an attempt to have the physicians aim for accuracy of detection rather than sensitivity alone, as per their usual clinical practice. These factors should be taken into account in the design of trials comparing the performance of human raters to AI models.
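To make the first point above concrete, the distinction between accuracy and sensitivity in a triage setting can be illustrated with a short sketch. The counts below are hypothetical and for illustration only; they are not drawn from our test set.

```python
# Illustrative only: hypothetical confusion-matrix counts for a triage model.
# In a class-imbalanced triage setting, accuracy can look high even when
# many abnormal cases are missed, which is why sensitivity was preferred.

def triage_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # abnormal cases correctly flagged
    specificity = tn / (tn + fp)                 # normal cases correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall agreement
    return sensitivity, specificity, accuracy

# Hypothetical test set: 100 abnormal and 400 normal radiographs.
# The model misses 30 abnormal cases yet still posts 90% accuracy.
sens, spec, acc = triage_metrics(tp=70, fn=30, tn=380, fp=20)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
# -> sensitivity=0.70 specificity=0.95 accuracy=0.90
```

In this hypothetical case, a model reporting 90% accuracy would still miss nearly a third of abnormal radiographs, an outcome that accuracy alone obscures but sensitivity exposes directly.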
This study had a number of limitations. First, the cases were retrieved from a single institution, potentially limiting generalizability. Second, a multiview approach to classification by the AI (i.e. joint analysis of anteroposterior (AP) and lateral projection radiographs), as performed in earlier studies [5], was not possible in this study owing to resource limitations. Third, the input resolution of the model was below the native resolution of a radiograph, which may affect the model's sensitivity to subtle abnormalities [9, 15]; this may potentially be overcome by using more advanced versions of EfficientNet, i.e. the B6 and B7 models [19]. Fourth, despite the stated aims, the intrinsic limitations of an AI model developed on a small development dataset should not be viewed lightly. The model's specificity might have improved with a larger development set; this should, however, be weighed against the real-world challenges of obtaining large volumes of high-quality annotated data. Fifth, this study did not compare the EfficientNet model to existing architectures such as ResNet and InceptionNet; it would be useful to perform these analyses in future work to assess the merit of this compound scaling network architecture against other established models.
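For context on the compound scaling referred to above, the EfficientNet family [6] scales network depth, width and input resolution jointly through a single compound coefficient rather than scaling any one dimension alone. The formulation below follows the original paper [6]; the base constants are those reported there from grid search.

```latex
% Compound scaling (Tan and Le [6]): depth d, width w and resolution r
% are scaled jointly via a single compound coefficient \phi.
\begin{aligned}
d = \alpha^{\phi}, \quad w &= \beta^{\phi}, \quad r = \gamma^{\phi} \\
\text{s.t.}\quad \alpha \cdot \beta^{2} \cdot \gamma^{2} &\approx 2,
\qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
% Reported base constants: \alpha = 1.2, \beta = 1.1, \gamma = 1.15.
% Larger \phi yields the higher-powered B2 to B7 variants.
```

Under this rule, moving from B1 toward B7 increases depth, width and resolution together, which is why the higher variants demand correspondingly greater processing power.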
We note that our study faced limitations in development set size not seen in previous papers. However, such data-constrained environments are not rare in supervised learning, currently the standard approach for image classification tasks, as the burden of accurate data annotation is a perennial limiting factor. In summary, this study shows that an AI algorithm developed using strategies for overcoming small development sets has value in creating clinically relevant models.