Baseline characteristics of patients
In this study, we recruited a total of 773 patients: 543 with breast tumors and 230 with non-breast tumors. The characteristics of all patients are summarized in Supporting Table 1. Significant differences in clinical features, including age, maximum diameter, and BI-RADS classification, were observed between the two cohorts, whereas tumor location and focus did not differ significantly between the training and testing cohorts. After multivariate analysis, age, maximum diameter, and BI-RADS classification were identified as key indicators and were incorporated into the construction of our clinical prediction model. Supporting Table 2 summarizes the characteristics of the training and testing cohorts among non-breast tumor patients; after multivariate analysis, focus emerged as a crucial factor and was incorporated into the corresponding clinical prediction model.
Table 1. Best epoch of DeepLabv3_resnet50 and FCN_resnet50

| Model | Global accuracy (%) | mIoU (%) | Dice range (%) | mDice (%) | Epoch |
|---|---|---|---|---|---|
| DeepLabv3_resnet50 | 99.4 | 86.2 | 84.4–99.7 | 92.0 | 100 |
| FCN_resnet50 | 99.5 | 88.7 | 87.6–99.8 | 93.7 | 100 |
Results of semi-automatic segmentation
We summarized the training results of the two segmentation models (Fig. 4); both achieved high accuracy. As shown in Table 1, the DeepLabv3_resnet50 model attained a global accuracy of 99.4%, a mean intersection over union (mIoU) of 86.2%, and a mean Dice coefficient of 92.0% at its best epoch. Similarly, the FCN_resnet50 model attained a global accuracy of 99.5%, a mean IoU of 88.7%, and a mean Dice coefficient of 93.7% at its best epoch. Both segmentation models were trained for 100 epochs.
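The metrics in Table 1 can all be derived from a per-pixel confusion matrix. A minimal NumPy sketch (with a small synthetic example, since the study's label maps are not available here):

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes=2):
    """Global accuracy, mean IoU, and mean Dice from two integer label maps.

    A sketch of the standard definitions behind Table 1; not the
    study's actual evaluation code.
    """
    pred, target = pred.ravel(), target.ravel()
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(target * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as class c but isn't
    fn = cm.sum(axis=1) - tp   # is class c but predicted otherwise
    global_acc = tp.sum() / cm.sum()
    iou = tp / (tp + fp + fn)          # per-class intersection over union
    dice = 2 * tp / (2 * tp + fp + fn) # per-class Dice coefficient
    return global_acc, iou.mean(), dice.mean()

# Tiny 2x2 example: one background pixel mislabeled as foreground.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
acc, miou, mdice = segmentation_metrics(pred, target)
```

The per-class Dice values are what the "Dice range" column reports as an interval, and their mean is the mDice column.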
Feature extraction and selection
For radiomics, the FCN and DeepLabv3 pathways each yielded a distinct set of radiomics features. After feature screening with t-tests, Spearman rank correlation tests, and LASSO regression, the features with non-zero coefficients were retained for each pathway. These features were then used to construct a radiomics model through 5-fold cross-validation.
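The three-stage screening described above can be sketched as follows. All data here are synthetic, and the thresholds (p < 0.05, |ρ| > 0.9) and the use of `LassoCV` are illustrative assumptions; the paper does not report its exact cutoffs or implementation:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))      # 120 patients x 50 radiomics features (synthetic)
y = rng.integers(0, 2, size=120)    # binary labels (synthetic)

# Stage 1: t-test — keep features that differ between the two groups.
keep = [j for j in range(X.shape[1])
        if stats.ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue < 0.05]

# Stage 2: Spearman correlation — drop one of each highly correlated pair.
def abs_rho(a, b):
    r, _ = stats.spearmanr(a, b)
    return abs(r)

selected = []
for j in keep:
    if all(abs_rho(X[:, j], X[:, k]) <= 0.9 for k in selected):
        selected.append(j)

# Stage 3: LASSO with 5-fold CV — retain non-zero-coefficient features.
if selected:
    lasso = LassoCV(cv=5, random_state=0).fit(X[:, selected], y)
    final = [j for j, c in zip(selected, lasso.coef_) if c != 0]
else:
    final = []
```

With random data the surviving set may be empty; on real radiomics features each stage progressively prunes redundant and uninformative columns.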
For deep learning, a pre-trained ResNet-101 model was used to extract 2048-dimensional deep learning features from the avgpool layer, which were then compressed to 8 dimensions to construct a deep learning model.
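The compression step can be sketched with PCA, which is an assumption here (the paper does not name its dimensionality-reduction method). The 2048-dimensional inputs below are synthetic stand-ins for the avgpool outputs that a pre-trained ResNet-101 would produce, one row per patient:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for ResNet-101 avgpool features (one row per patient).
rng = np.random.default_rng(0)
features = rng.normal(size=(120, 2048))

# Compress 2048 dimensions down to 8; PCA is an illustrative choice.
pca = PCA(n_components=8, random_state=0)
compressed = pca.fit_transform(features)   # shape: (120, 8)
```

The resulting 8-dimensional vectors would then serve as inputs to the downstream deep learning classifier.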
Performance of the radiomics and deep learning models in predicting tumor and non-tumor patients
In this study, we constructed radiomics models and deep learning models based on the DeepLabv3 and FCN segmentation results, respectively (Fig. 5). The AUC of the DeepLabv3-radiomics model ranged from 0.716 to 0.886. The DeepLabv3-deep learning model achieved an AUC of 0.839 (95% CI 0.807–0.872) in the training cohort and 0.716 (95% CI 0.632–0.800) in the testing cohort. The AUC of the FCN-radiomics model ranged from 0.729 to 0.887. The FCN-deep learning model achieved an AUC of 0.846 (95% CI 0.815–0.877) in the training cohort and 0.701 (95% CI 0.609–0.793) in the testing cohort.
Development and performance of the combined and stacking models in predicting tumor and non-tumor patients
In the subsequent study, we combined deep learning features and radiomics features from the two segmentation approaches to construct the DeepLabv3 deep learning radiomics model and the FCN deep learning radiomics model. Figure 6 shows the performance of these two models in the testing cohort, with AUCs of 0.659–0.753 for the DeepLabv3 model and 0.646–0.733 for the FCN model; the DeepLabv3 model performed slightly better.
Finally, we used stacking to fuse the clinical model, the DeepLabv3 deep learning radiomics model, and the FCN deep learning radiomics model via logistic regression, yielding the final stacking model (nomogram). As shown in Fig. 5, the stacking model, which combined the clinical model, combined 1, and combined 2, significantly improved the ability to differentiate between tumor and non-tumor patients, with an AUC of 0.890 (95% CI 0.861–0.918), a sensitivity of 0.844, and a specificity of 0.815. In the testing cohort, the stacking model reached an AUC of 0.780 (95% CI 0.707–0.853), with a sensitivity of 0.713 and a specificity of 0.739, sufficient to identify non-tumor patients. All performance results were evaluated with 5-fold cross-validation.
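The stacking step amounts to fitting a logistic regression on the base models' predicted probabilities. A minimal sketch with synthetic probabilities standing in for the clinical, DeepLabv3, and FCN model outputs (the real pipeline would use out-of-fold predictions from 5-fold cross-validation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)   # tumor vs non-tumor labels (synthetic)

# Synthetic predicted probabilities from the three base models,
# each noisily correlated with the true label.
meta = np.column_stack([
    np.clip(y + rng.normal(scale=0.8, size=200), 0, 1) for _ in range(3)
])

# Stacking: a logistic regression learns how to weight the base models.
stacker = LogisticRegression().fit(meta, y)
fused = stacker.predict_proba(meta)[:, 1]
auc = roc_auc_score(y, fused)
```

Because the meta-learner is a plain logistic regression, its coefficients can be read off as a nomogram, which is how the fused model is presented in the paper.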
To assess the clinical utility of the different prediction models, we evaluated their performance using decision curve analysis (DCA). As shown in Fig. 7, the decision curves of the DeepLabv3 deep learning radiomics model (combined 1), the FCN deep learning radiomics model (combined 2), and the stacking model are presented for both the training and testing cohorts. The stacking model exhibited a markedly higher clinical benefit than the other models.
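A decision curve plots net benefit against the decision threshold, where net benefit trades true positives against false positives weighted by the threshold odds. A sketch of the standard formula on a tiny synthetic example (not the study's data):

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at one probability threshold t:
    NB = TP/n - FP/n * t / (1 - t)."""
    n = len(y_true)
    pred = y_prob >= threshold            # treat patients above t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

# Synthetic labels and predicted probabilities.
y = np.array([1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.7, 0.4, 0.2, 0.8, 0.6])
nb = net_benefit(y, p, 0.5)
```

Sweeping the threshold over (0, 1) and plotting `net_benefit` for each model, alongside the "treat all" and "treat none" reference lines, reproduces the kind of curves shown in Fig. 7.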
Development and performance of the combined and stacking models in predicting adenosis and other types of lesions
For the task of distinguishing adenosis from other lesion types, we developed corresponding DeepLabv3 and FCN deep learning radiomics models. As depicted in Fig. 8, the performance of these two models in the testing cohort revealed AUC values ranging from 0.663 to 0.802 for the DeepLabv3 model and from 0.677 to 0.737 for the FCN model. The final stacking model effectively distinguished patients with adenosis from those with other lesions, achieving an AUC of 0.813 (95% CI 0.752–0.873) in the training cohort, with a sensitivity of 0.613 and a specificity of 0.859. In the testing cohort, the stacking model exhibited an AUC of 0.771 (95% CI 0.617–0.924), with a sensitivity of 0.759 and a specificity of 0.765.
The DCA curves presented in Fig. 9 demonstrate the clinical utility of the DeepLabv3 deep learning radiomics model, the FCN deep learning radiomics model, and the stacking model in both the training and testing cohorts. The stacking model exhibited superior clinical value compared to the other models.