In this study, we evaluated five pre-trained convolutional neural network models using a 5-fold cross-validation method on our DCE-BMRI dataset. Our objective was to identify the best-performing model, which we defined as the one excelling across all predefined evaluation criteria. Following this, we fine-tuned the chosen model to enhance its performance further and tested its generalization capability on a validation set.
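The 5-fold cross-validation protocol mentioned above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the pipeline used in the study: the function names `k_fold_indices` and `cross_validate`, the fixed seed, and the `fit_and_score` callback (which stands in for training a model and returning one metric) are all assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder so that fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, fit_and_score, k=5, seed=0):
    """Call fit_and_score(train_idx, test_idx) on each fold.

    fit_and_score is a hypothetical callback that trains on the training
    indices and returns a single score measured on the held-out indices.
    Returns the list of per-fold scores and their mean.
    """
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return scores, sum(scores) / len(scores)
```

Each image is held out exactly once, so comparing the mean score of each candidate model across the same folds gives a like-for-like ranking. For class-imbalanced data, a stratified split that preserves the benign-to-malignant ratio in each fold would usually be preferable to the plain split shown here.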
Our findings indicated that the VGG19 model demonstrated superior performance, achieving accuracies of 1.00 and 0.96 on the training and testing sets, respectively. Moreover, VGG19 achieved the highest Area Under the Curve (AUC) of 0.92 on the first validation set, but this dropped to 0.76 on a subsequent validation set, suggesting limitations in its generalization ability. Previous research supports the notion that fine-tuning can enhance the accuracy and precision of such models[17–19].
Consequently, we developed five distinct fine-tuning strategies for VGG19. Strategy S4 emerged as the most successful, yielding the highest test accuracy (0.97) and the lowest test loss on the validation set, indicating a superior generalization capability compared to the other strategies. When comparing the AUC scores of strategies S1–S5 on the validation set, S4 again scored highest, with an AUC of 0.89. These results are promising for improving the accuracy of medical image classification.
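This section does not spell out how strategies S1–S5 differ, so the following is only a hypothetical illustration of one common family of fine-tuning strategies: freezing the early convolutional blocks of VGG19 and retraining the later ones. The layer names follow the `blockN_convM` convention used by the Keras VGG19 implementation. The function `trainable_plan` and its parameter `first_trainable_block` are assumptions introduced for this sketch and do not correspond to any specific strategy in the study.

```python
# Number of convolutional layers in each of the five VGG19 blocks (16 in total).
VGG19_BLOCKS = {1: 2, 2: 2, 3: 4, 4: 4, 5: 4}


def vgg19_conv_layer_names():
    """Return the 16 conv layer names in order, e.g. 'block3_conv2'."""
    return [f"block{b}_conv{i}" for b, n in VGG19_BLOCKS.items() for i in range(1, n + 1)]


def trainable_plan(first_trainable_block):
    """Map each conv layer name to True (fine-tuned) or False (frozen).

    Hypothetical helper: layers in blocks before first_trainable_block stay
    frozen. A value of 1 fine-tunes every block; a value of 6 freezes all
    convolutional layers so that only a new classifier head would be trained.
    """
    plan = {}
    for name in vgg19_conv_layer_names():
        block = int(name[len("block")])  # single digit following 'block'
        plan[name] = block >= first_trainable_block
    return plan
```

In a deep learning framework, such a plan would typically be applied by setting each layer's trainable flag before compiling the model. Varying `first_trainable_block` then yields a small set of candidate strategies that can be compared on the same validation set.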
We also examined whether the S4 model exhibited different AUC scores across BI-RADS categories 3, 4, and 5. Interestingly, S4 performed best in BI-RADS 3 (AUC 0.90), followed by BI-RADS 4 (AUC 0.86), and performed worst in BI-RADS 5 (AUC 0.65). The BI-RADS 3 category, typically applied when the likelihood of cancer is less than 2%, aims to minimize unnecessary biopsies for pathologically benign findings. However, patient compliance with follow-up MRI recommendations every six months is notably low in this category[20].
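A per-category AUC comparison of this kind can be sketched as follows. This is a minimal pure-Python illustration rather than the analysis code used in the study: it computes the AUC as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case (the Mann-Whitney formulation), and the names `roc_auc` and `auc_by_group` are assumptions made for the example.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5.

    labels are 1 (malignant) or 0 (benign). Returns None when only one class
    is present, because the AUC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(groups, labels, scores):
    """Compute the AUC separately for each group label (e.g. BI-RADS 3, 4, 5)."""
    buckets = defaultdict(lambda: ([], []))
    for g, y, s in zip(groups, labels, scores):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in buckets.items()}
```

Because the estimate is built from positive-negative pairs, a category in which one class is scarce contributes few pairs, and its AUC should be read with correspondingly more caution.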
The BI-RADS system is set to evolve with new breast imaging modalities. Key areas for improvement include expanding the lexicon for common findings and clarifying the application of Category 3[21]. BI-RADS 3 represents a significant portion (13.9%) of diagnostic exams [22], often leading to follow-up procedures for patients classified as ‘probably benign’, yet compliance with these follow-up recommendations remains a challenge [20]. This lack of compliance raises concerns about the clinical and economic implications, particularly regarding the resolution time and outcomes for patients in this category.
In a particular study, only 1.4% of BI-RADS 3 lesions were found to be malignant, including two cases of delayed diagnosis at 13.2 and 33.2 months, respectively. The incidence of delayed diagnosis due to additional MRI-detected lesions during follow-up was notably low (0.7%), consisting exclusively of T1N0 contralateral cancers. This finding suggests that annual follow-up may suffice for BI-RADS 3 lesions identified by MRI before surgery[15]. Consequently, accurately distinguishing between benign and malignant lesions in BI-RADS 3 is crucial. Our study potentially offers significant benefits to a substantial number of patients diagnosed with BI-RADS category lesions using DCE-BMRI imaging.
BI-RADS category 4 lesions are considered suspicious for malignancy, but the estimated likelihood spans a wide range, from 2% to 95%. The BI-RADS 4 classification is to some degree subjective; the outcomes of biopsies in this category vary significantly, and the rate of cancer detection relative to the number of biopsies performed is relatively low (17.8%)[15]. Moreover, unnecessary biopsies can result in a range of adverse effects, including pain, fear, emotional distress, and financial costs.
Breast MRI is highly sensitive, yet it often presents a challenge in differentiating between atypical malignant and benign lesions, leading to potential overclassification in the BI-RADS 4 category and subsequent invasive biopsies. The wide range of positive predictive values for MRI-guided biopsies (2.5–84.0%) [23, 24] indicates that many patients undergo unnecessary procedures. Indeed, numerous women subjected to biopsies for benign findings endure unnecessary discomfort, expenses, potential complications, cosmetic alterations, and anxiety [25]. Identifying predictors of benign BI-RADS 4 masses, therefore, could be highly beneficial.
Efforts to enhance the assessment of BI-RADS 4 lesions could improve the identification of benign lesions, thereby reducing the frequency of unnecessary biopsies. Some researchers have developed predictive models based on imaging features or multiparameter MRI data to better evaluate BI-RADS 4 lesions, though these models typically rely on traditional imaging features subjectively defined by radiologists[26].
The BI-RADS 5 category is applied when imaging findings suggest a malignancy probability of 95% or higher, and surgery is often recommended for this category. According to MC et al.[27], however, the positive predictive value of BI-RADS 5 assessments is only 71.4%, indicating that not all lesions classified as BI-RADS 5 are malignant[28]. It is recognized that a single imaging finding rarely confers such a high risk of malignancy; rather, a combination of features is necessary to elevate a lesion to Category 5[7]. It is also important to acknowledge that even when tissue samples from molecular biopsies are used, they may not fully represent the entire lesion, because biopsies often target only a small, specific area of a heterogeneous lesion, which introduces a sampling bias.
Recall (Rc), also known as sensitivity, measures a classifier's completeness: the proportion of truly positive cases that it correctly identifies. A lower Rc value indicates that the classifier misses more positive cases, that is, it produces more false negatives (FN). Recent publications have led to the introduction of new and updated performance benchmarks in the latest edition, replacing outdated metrics. The recall rate benchmark has been revised accordingly. Initially, about half of all radiologists were unable to meet the 10% benchmark for recall rate, prompting a revision to a more achievable target of 12%, a standard met by over 75% of radiologists[7]. Our study showed that the S4 model exhibited the highest recall rate (0.89) among all DTL models, which is notable given the relatively limited class diversity in our dataset.
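For reference, the classifier recall discussed here is the standard ratio TP / (TP + FN), which can be sketched as below alongside precision for contrast. The function names and the example counts in the usage note are illustrative assumptions, not values taken from the study's confusion matrices.

```python
def recall(tp, fn):
    """Recall (sensitivity): share of actual positives that were detected.

    tp = true positives, fn = false negatives. Returns 0.0 when there are
    no actual positives, to avoid dividing by zero.
    """
    total = tp + fn
    return tp / total if total else 0.0


def precision(tp, fp):
    """Precision: share of predicted positives that are actually positive.

    tp = true positives, fp = false positives. Returns 0.0 when nothing was
    predicted positive.
    """
    total = tp + fp
    return tp / total if total else 0.0
```

With hypothetical counts of 89 detected and 11 missed malignant lesions, `recall(89, 11)` evaluates to 0.89. False positives do not enter the recall formula at all; they lower precision instead.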
This study has several limitations. First, the training set contained a relatively small number of images and, in particular, few examples of rare lesion types. Consequently, our dataset may not fully represent the broader spectrum of breast disease patients, potentially impacting the accuracy of the DTL model. It is therefore essential to conduct further analysis with larger datasets to comprehensively evaluate the robustness of the DTL model. Second, our study relied solely on static Dynamic Contrast-Enhanced Breast Magnetic Resonance Imaging (DCE-BMRI) images, excluding other routine diagnostic procedures such as clinical evaluations, breast ultrasounds, and mammography. Third, we limited our investigation to just five pre-trained models; future research should explore a wider range of models to determine their robustness on larger datasets. Lastly, while this paper does not delve into the various methods of fine-tuning Convolutional Neural Network (CNN) models, these topics will be the focus of our subsequent studies.