ImageNet2,3 is a dataset comprising millions of images of the natural world. As an open-source dataset, ImageNet has been a central resource for developing sophisticated computer vision models. Transfer learning1 is a common deep learning approach in which a model developed for one problem is reused as the starting point for a different but related task. Because annotated images are scarce and training new models from scratch is computationally expensive, transfer learning has become a popular way for researchers to transfer knowledge from pre-trained models to a related problem, speeding up training, reducing the amount of input data required, and improving the performance and generalizability of a deep learning model11. Transfer learning with models trained on ImageNet has been extensively explored in medical imaging AI applications. VGG12, ResNet13–15, Inception networks16–19, MobileNet20, and DenseNet21 architectures pre-trained on ImageNet have been widely adopted in medical imaging applications such as COVID-19 diagnosis on chest CT22, classification of fibrotic lung disease23, and classification of skin cancer24. Despite the high performance of many medical imaging models pre-trained on ImageNet, successful transfer learning requires a reasonably large sample size, diversity of images, and similarity between the training image database and the target application images. While ImageNet meets the size and diversity criteria, the significant dissimilarity between the natural images in ImageNet and the medical images in the new task remains an important limitation. The development of transfer learning strategies to bridge that gap is an active area of medical imaging machine learning research.
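As a minimal illustration of this transfer-learning workflow (a sketch only, not the training configuration used in this study), the following Keras snippet loads an ImageNet-pre-trained ResNet50 backbone, freezes it, and attaches a new task-specific head; all hyperparameters shown are illustrative assumptions.

```python
# Minimal transfer-learning sketch (not the exact configuration used in this
# study): reuse an ImageNet-pre-trained ResNet50 backbone for a new binary
# classification task. All hyperparameters shown are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

# Load the backbone with pre-trained weights, dropping its 1,000-class head.
backbone = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
backbone.trainable = False  # freeze pre-trained features initially

# Attach a small task-specific head, e.g., disease present vs. absent.
model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```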
RadImageNet provides millions of annotated medical images across multiple advanced imaging modalities, demonstrating a wide range of pathologies, and can be used to develop pre-trained models for predictive tasks in image-based medicine. We propose that a pre-trained model based solely on medical radiographic features from a vast medical imaging database will provide more appropriate feature representations for image-based predictive problems in medicine than pre-trained models derived from the natural images in ImageNet.
In this study, we describe RadImageNet, a large-scale, diverse medical imaging dataset used to train convolutional neural networks solely on medical images, to serve as the basis of transfer learning for medical imaging applications. We compare pre-trained weights derived from RadImageNet and ImageNet on multiple medical imaging use cases, including target image modalities and anatomies not included in the RadImageNet training database. We show that networks pre-trained on RadImageNet exceed the performance of models pre-trained on ImageNet. Furthermore, we show how a medical image recognition problem with a small dataset can benefit from pre-trained weights derived from RadImageNet. This provides evidence that the pre-trained weights from RadImageNet are transferable across multiple modalities, anatomies, and pathologies. Figure 1 illustrates an overview of this study.
The RadImageNet Database
The RadImageNet dataset includes 5 million annotated CT, MRI, and ultrasound images of musculoskeletal, neurologic, oncologic, gastrointestinal, endocrine, and pulmonary pathology. For direct comparison with ImageNet (the initial ImageNet challenge comprised 1.4 million images), we collected the most frequent modalities and anatomies at the same scale. The RadImageNet dataset was collected between January 2005 and January 2020 from 131,872 patients at an outpatient radiology facility in New York City. Each study was annotated by a board-certified, fellowship-trained radiologist. As part of the interpretation of each study, the reading radiologist chose images representative of the pathology shown in each exam. The pathology demonstrated on each of these “key images” was annotated, and a region of interest was created to identify the imaging findings. These annotations were extracted from the key images and provided the basis for the RadImageNet classes. The portions of the RadImageNet database used for comparison with ImageNet consist of three radiologic modalities, eleven anatomies, and 165 pathologic labels (Fig. 2a and Extended Data Table 1). Inception-ResNet-v216–19, ResNet5013–15, DenseNet12121, and InceptionV316 convolutional neural network architectures were trained on the data in RadImageNet. We stratified the RadImageNet dataset by patient ID so that no patient appears in more than one of the training, validation, and test sets, which comprise 75%, 10%, and 15% of the data, respectively. The performance of the models is reported on the test set. Furthermore, we randomly sampled 2,016 images from the test set and compared model performance to that of three senior sub-specialized, fellowship-trained radiologists who were uninvolved in labeling the images in RadImageNet (Extended Data Fig. 1).
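A patient-level split of this kind can be sketched as follows; the manifest file name and the "patient_id" column are hypothetical, and scikit-learn's GroupShuffleSplit stands in for whatever splitting procedure the study actually used.

```python
# Sketch of a patient-level 75/10/15 split so that no patient contributes
# images to more than one partition. The manifest file name and the
# "patient_id" column are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("radimagenet_manifest.csv")  # hypothetical image manifest

# Hold out 75% of patients for training.
gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=42)
train_idx, rest_idx = next(gss.split(df, groups=df["patient_id"]))
train, rest = df.iloc[train_idx], df.iloc[rest_idx]

# Split the remaining 25% of patients into validation (10%) and test (15%).
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.4, random_state=42)
val_idx, test_idx = next(gss2.split(rest, groups=rest["patient_id"]))
val, test = rest.iloc[val_idx], rest.iloc[test_idx]

assert not set(train["patient_id"]) & set(test["patient_id"])  # no patient leakage
```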
The RadImageNet Inception-ResNet-v2, ResNet50, DenseNet121, and InceptionV3 models achieved top-1 accuracies of 78.86%, 75.04%, 74.47%, and 76.13%, respectively, and top-5 accuracies of 95.52%, 94.30%, 93.93%, and 95.23%, respectively (Fig. 2b).
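For reference, top-k accuracy counts a prediction as correct when the true class is among the model's k highest-scoring classes; a minimal NumPy sketch:

```python
# Top-k accuracy: a prediction counts as correct when the true class index
# is among the k classes with the highest predicted scores.
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """probs: (n, n_classes) scores; labels: (n,) integer class indices."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest scores
    hits = [labels[i] in top_k[i] for i in range(len(labels))]
    return float(np.mean(hits))
```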
Comparison of RadImageNet and ImageNet Pre-trained Models
The four aforementioned RadImageNet models were used as pre-trained models for five medical imaging applications to compare their performance to that of the ImageNet pre-trained models. We applied the pre-trained models to transfer learning problems using publicly available datasets, including bone age prediction on hand and wrist x-rays6; pneumonia detection in ICU patients on chest radiographs7; ACL tear detection on MRI8; SARS-CoV-2 detection on chest CT9; and hemorrhage detection on head CT10. These applications were selected to evaluate the capabilities of the RadImageNet models on applications with modalities, anatomies, and labels that were and were not contained in the RadImageNet database.
Bone Age Prediction
For bone age prediction on hand and wrist x-rays, the RadImageNet-trained models showed a significant reduction in mean absolute error compared to the ImageNet-trained models (Fig. 3a). For the RadImageNet-trained Inception-ResNet-v2, ResNet50, DenseNet121, and InceptionV3 networks, mean absolute errors were 10.42 months (mean absolute deviation (MAD) = 6.89; P = 0.0071), 11.12 months (MAD = 6.99; P < 0.0001), 10.97 months (MAD = 6.91; P < 0.0001), and 10.25 months (MAD = 6.54; P < 0.001), respectively, whereas for the ImageNet-trained Inception-ResNet-v2, ResNet50, DenseNet121, and InceptionV3 networks, mean absolute errors were 11.01 months (MAD = 7.03), 15.22 months (MAD = 10.18), 12.26 months (MAD = 7.87), and 11.29 months (MAD = 7.20), respectively. The RadImageNet models also demonstrated narrower standard deviations on modified Bland-Altman plots25,26 (Extended Data Fig. 2).
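The two regression error statistics reported above can be computed as in the sketch below; the exact definition of MAD used in the study is an assumption here (taken as the mean absolute deviation of the absolute errors about their mean).

```python
# Sketch of the two regression error statistics reported above. MAE is the
# mean absolute error in months; the MAD definition here (mean absolute
# deviation of the absolute errors about their mean) is an assumption.
import numpy as np

def mae_and_mad(y_true, y_pred):
    err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    mae = err.mean()                 # mean absolute error
    mad = np.abs(err - mae).mean()   # spread of the absolute errors
    return float(mae), float(mad)
```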
Pneumonia Detection
For pneumonia detection on chest radiographs, the RadImageNet-trained models showed significant improvement in the area under the receiver operating characteristic curve (AUROC) compared to the ImageNet-trained models. For the RadImageNet-trained Inception-ResNet-v2, ResNet50, and DenseNet121 networks, the AUROC was 87.42% (95% confidence interval (CI) 86.22%, 88.63%; P = 0.0032), 86.81% (95% CI 85.55%, 88.07%; P = 0.00042), and 86.93% (95% CI 85.68%, 88.19%; P = 0.013), respectively, whereas for the ImageNet-trained Inception-ResNet-v2, ResNet50, and DenseNet121 networks, the AUROC was 86.34% (95% CI 85.04%, 87.54%), 85.48% (95% CI 84.16%, 86.80%), and 86.02% (95% CI 84.72%, 87.32%), respectively (Fig. 3b and Extended Data Fig. 3a). The RadImageNet InceptionV3 network achieved an AUROC of 86.51% (95% CI 85.24%, 87.78%; P = 0.36), which was not significantly different from the ImageNet InceptionV3 network's AUROC of 86.15% (95% CI 84.84%, 87.45%).
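Confidence intervals like those above can be obtained in several ways; the sketch below uses a percentile bootstrap over test cases, which is one common approach and may differ from the study's exact statistical procedure (e.g., DeLong's method).

```python
# Hedged sketch: percentile-bootstrap 95% CI for AUROC over test cases.
# The study's exact interval/significance procedure may differ.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_with_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUROC is undefined without both classes
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), lo, hi
```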
ACL Tear Detection
For ACL tear detection on MRI, we conducted 5-fold cross-validation owing to the limited size of the dataset. Model performance is reported as the mean and standard deviation across the five test folds (Fig. 3c and Extended Data Fig. 3b). The RadImageNet Inception-ResNet-v2, ResNet50, DenseNet121, and InceptionV3 networks demonstrated AUROCs of 92.27% ± 3.45% (P < 0.0001), 87.11% ± 3.10% (P < 0.0001), 90.41% ± 5.90% (P = 0.12), and 92.33% ± 3.13% (P < 0.0001), respectively, while the ImageNet Inception-ResNet-v2, ResNet50, DenseNet121, and InceptionV3 networks showed AUROCs of 83.80% ± 6.49%, 81.72% ± 3.61%, 90.27% ± 2.08%, and 60.20% ± 9.78%, respectively.
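A minimal sketch of such a 5-fold protocol, aggregating per-fold AUROC into a mean ± standard deviation, follows; label stratification and the build_model() factory (returning a compiled Keras model, as in the earlier sketch) are illustrative assumptions.

```python
# Sketch of 5-fold cross-validation reporting AUROC as mean +/- s.d. across
# folds. Stratification by label and build_model() are assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_model, epochs=10):
    aurocs = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()  # fresh weights for every fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        scores = model.predict(X[test_idx]).ravel()
        aurocs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aurocs)), float(np.std(aurocs))
```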
SARS-CoV-2 Detection
For SARS-CoV-2 detection on chest CT, the RadImageNet-trained models showed significant improvement in AUROC compared to the ImageNet-trained models. For the RadImageNet-trained Inception-ResNet-v2 and ResNet50 networks, the AUROC was 99.88% (95% CI 99.84%, 99.92%; P < 0.0001) and 99.89% (95% CI 99.85%, 99.92%; P < 0.0001), respectively, whereas for the ImageNet-trained Inception-ResNet-v2 and ResNet50 networks, the AUROC was 99.28% (95% CI 99.17%, 99.40%) and 99.59% (95% CI 99.47%, 99.71%), respectively (Fig. 3d and Extended Data Fig. 3c). The RadImageNet DenseNet121 and InceptionV3 networks achieved AUROCs of 99.67% (95% CI 99.56%, 99.78%; P = 0.29) and 99.74% (95% CI 99.66%, 99.83%; P = 0.74), respectively, which were not significantly different from the ImageNet DenseNet121 and InceptionV3 networks' AUROCs of 99.62% (95% CI 99.55%, 99.68%) and 99.73% (95% CI 99.64%, 99.82%), respectively.
Hemorrhage Detection
For hemorrhage detection on head CT, the RadImageNet-trained models demonstrated a significant improvement in AUROC compared to the ImageNet-trained models. For the RadImageNet-trained Inception-ResNet-v2, ResNet50, and InceptionV3 networks, the AUROC was 96.29% (95% CI 96.14%, 96.43%; P = 0.0090), 96.31% (95% CI 96.17%, 96.45%; P < 0.0001), and 96.40% (95% CI 96.26%, 96.54%; P = 0.00064), respectively, whereas for the ImageNet-trained Inception-ResNet-v2, ResNet50, and InceptionV3 networks, the AUROC was 96.16% (95% CI 96.02%, 96.31%), 96.06% (95% CI 95.91%, 96.22%), and 96.23% (95% CI 96.09%, 96.38%), respectively (Fig. 3e and Extended Data Fig. 4). The RadImageNet DenseNet121 achieved an AUROC of 96.22% (95% CI 96.07%, 96.37%; P = 0.14), which was not significantly different from the ImageNet DenseNet121's AUROC of 96.14% (95% CI 95.99%, 96.30%).
Gradient-weighted class activation mapping27 (Grad-CAM) was used to visualize the features learned by the algorithms. We present Grad-CAM visualizations (Fig. 4) for the paired algorithms on each of the five applications to illustrate the distinguishing features captured by the models. Grad-CAM images of successful predictions by the RadImageNet and ImageNet models were used to compare the learned features.
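Grad-CAM weights the last convolutional feature maps by the pooled gradients of the target class score, producing a coarse localization heatmap. A hedged Keras sketch follows; the convolutional layer name is model-specific and must be supplied.

```python
# Hedged Grad-CAM sketch for a Keras model: the target class score's
# gradients are pooled into per-channel weights for the chosen convolutional
# feature maps, yielding a coarse localization heatmap in [0, 1].
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """image: preprocessed batch of shape (1, H, W, 3); returns an (h, w) heatmap."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image)
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_maps)          # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # global-average-pooled gradients
    cam = tf.nn.relu(tf.reduce_sum(conv_maps[0] * weights, axis=-1))  # keep positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```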
Potential Clinical Applications
The pre-trained models derived from RadImageNet can improve predictive performance and generalizability for medical imaging applications. We simulated multiple medical imaging applications spanning modalities, anatomies, and pathologies of which the RadImageNet models had seen some or none during pre-training. Across these applications, the RadImageNet pre-trained models demonstrated significant improvements over the ImageNet pre-trained models. These outcomes suggest that the RadImageNet pre-trained models can improve medical imaging applications where transfer learning is needed. Moreover, gradient class activation maps suggest that the interpretations of the RadImageNet models conform more closely to the regions of interest defined by radiologists.