In this section, we describe the philosophy of FEBrNet, its computational operations, and its data flow step by step. In brief, the FEBrNet workflow consists of four steps: 1) feature distillation, 2) entropy matrix generation, 3) responsible frame recommendation, and 4) binary classification. Figure 1 depicts the data flow of FEBrNet. In the first step, state-of-the-art deep learning models are trained to acquire knowledge from static breast ultrasound images. The backbone of the trained model is transferred to the second step for parallel feature extraction from breast ultrasound videos, which are independent of the static images. Combined with key weights from the first step, the frame-by-frame feature vectors are concatenated into a new feature matrix, the feature entropy matrix. In the third step, we design a new entropy-reduction method that selects a subset of frames to represent the entire ultrasound video for a particular breast lesion examination. Finally, binary classification into benign or malignant is conducted on the feature entropy matrix to assist physicians in the diagnosis.
Feature distillation
Two major merits stand out in feature distillation: 1) Pre-accumulated, physician-selected images contain a plethora of breast lesion features, in particular malignant ones; therefore, a Convolutional Neural Network (CNN) can be pretrained on a relatively large ultrasound image dataset to create a model capable of extracting essential malignant characteristics. 2) To a large extent, the backbone of the pretrained model accelerates video feature extraction compared with training a parallel model from scratch. Most standard CNN models, including DenseNet, ResNet, and MobileNet, could be used here. DenseNet and MobileNet are used in this experiment to compare a lightweight model (MobileNet, 3,230,914 parameters) with a more sophisticated one (DenseNet, 7,039,554 parameters).
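As a concrete illustration, the sketch below builds such a pretrained binary classifier in Keras. The input size, pooling layer, and head are our own illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the feature-distillation classifier (illustrative only):
# an ImageNet-pretrained backbone with a small binary-classification head.
import tensorflow as tf

def build_image_classifier(backbone_name: str = "densenet"):
    if backbone_name == "densenet":
        backbone = tf.keras.applications.DenseNet121(
            include_top=False, weights="imagenet", input_shape=(224, 224, 3))
    else:
        backbone = tf.keras.applications.MobileNet(
            include_top=False, weights="imagenet", input_shape=(224, 224, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    # Final dense layer: its weights are reused in step 2 to build entropy matrices.
    out = tf.keras.layers.Dense(1, activation="sigmoid", name="malignancy")(x)
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return model, backbone
```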
We split the entire image dataset into three portions (train : valid : test = 8 : 1 : 1), with images of the same patient appearing in only one of the subsets. For data augmentation, random geometric transformations of the original image are used to mimic the image displacement, zoom, and flip that might occur as physicians scan for nodules in real-world clinical procedures. The specific methods include rotation, zoom, translation, and flip, as well as grayscale adjustment. Binary cross-entropy loss is used to compute the classification loss and update the network weights.
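A minimal sketch of this augmentation pipeline, assuming a Keras ImageDataGenerator; all parameter ranges are illustrative assumptions rather than the authors' settings.

```python
# Sketch of the geometric and grayscale augmentations described above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rotation_range=10,           # small random rotation
    zoom_range=0.1,              # random zoom in/out
    width_shift_range=0.05,      # horizontal translation
    height_shift_range=0.05,     # vertical translation
    horizontal_flip=True,        # mimic probe flipping
    brightness_range=(0.8, 1.2)  # grayscale/brightness adjustment
)
```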
Entropy matrix generation
Various nodular features are scattered over the temporal sequence of frames. The principle of entropy matrix generation is to map all features of all frames into a high-dimensional space, where all features of a single frame can be represented as one vector. Transferred from step 1, the backbone of the pretrained CNN model serves as a feature extractor that distills essential features from each frame of a video in parallel, producing feature matrices. A feature matrix reveals the feature intensity of each frame in diverse feature dimensions, where the number of dimensions is determined by the backbone model (1024 dimensions for the DenseNet121 backbone). By incorporating the weights of the final layer from step 1, which represent the indicative information of malignancy, we obtain the feature entropy matrices.
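The computation can be sketched as follows, assuming the backbone (plus global pooling) outputs one k-dimensional vector per frame and that final_weights holds the k weights of the classification layer from step 1; both names are ours for illustration.

```python
# Sketch of entropy-matrix generation: extract a feature vector per frame with
# the pretrained backbone, then weight each dimension by the final-layer
# weights from step 1. Shapes assume a DenseNet121 backbone (k = 1024).
import numpy as np

def feature_entropy_matrix(frames, feature_extractor, final_weights):
    """frames: (T, H, W, 3) video; final_weights: (k,) from step 1's dense layer."""
    features = feature_extractor.predict(frames)    # (T, k) frame-by-frame features
    return features * final_weights[np.newaxis, :]  # (T, k) feature entropy matrix
```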
Responsible frame recommendation
Here, feature entropy matrices are used to rank the contribution of each frame to breast nodule diagnosis. A key variable we define, called FScore, is the sum of the values of the feature entropy matrix over all feature dimensions at each frame. A frame with a higher FScore contributes more characteristics indicating possible malignancy. FScore thus helps locate the frame that contributes the most to the possibility of malignancy, which we define as the first responsible frame of the model's prediction. However, adjacent frames usually share similar image signatures and have very close FScores, so the frames with the largest and second-largest FScores may look almost identical.
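In code, FScore reduces to a row sum of the feature entropy matrix; the following is a minimal illustration.

```python
# FScore per the definition above: the sum of a frame's entropy values over
# all feature dimensions; the argmax gives the first responsible frame.
import numpy as np

def fscores(entropy_matrix: np.ndarray) -> np.ndarray:
    """entropy_matrix: (T, k) feature entropy matrix; returns one FScore per frame."""
    return entropy_matrix.sum(axis=1)

# The frame with the largest FScore is the first responsible frame:
# first_responsible = int(np.argmax(fscores(M)))
```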
To choose a comprehensive set of responsible frames with varied features, we first extend the concept of FScore from a single frame to a collection of frames, because a video, in essence, is a collection of frames. Second, we propose a novel entropy-reduction method that selects a minimal set of frames with the lowest sum of entropy values; this set of frames is considered the most likely to represent the entire video. The essential philosophy of the method is a greedy mechanism: we repeatedly search for the next frame that reduces the total entropy of the selected frames until the sum can no longer be reduced (a schematic sketch follows). Full mathematical details and examples can be found in the Supplement.
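The schematic below illustrates the greedy loop. The precise entropy objective is defined in the Supplement; set_entropy here is a hypothetical stand-in (the gap between the subset's max-pooled features and the whole video's), used only to make the control flow concrete.

```python
# Schematic sketch of greedy entropy-reduction frame selection.
import numpy as np

def set_entropy(M: np.ndarray, selected: list) -> float:
    """Hypothetical stand-in for the Supplement's objective: residual between
    the selected frames' pooled features and the full video's (lower is better)."""
    return float(np.abs(M.max(axis=0) - M[selected].max(axis=0)).sum())

def greedy_select(M: np.ndarray) -> list:
    selected = [int(np.argmax(M.sum(axis=1)))]  # seed with the top-FScore frame
    current = set_entropy(M, selected)
    while True:
        candidates = [f for f in range(len(M)) if f not in selected]
        if not candidates:
            return selected
        best = min(candidates, key=lambda f: set_entropy(M, selected + [f]))
        new = set_entropy(M, selected + [best])
        if new >= current:                      # entropy sum can no longer be reduced
            return selected
        selected.append(best)
        current = new
```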
Binary classification
For the classification task, we compress (via max-pooling, as in deep learning) the feature entropy matrix of all frames or of the selected responsible frames (option 1 or 2 in Figure 1) into a vector of consistent shape (1×k). This vector records the maximum contribution of the video in each feature dimension and is indicative for classifying the video as benign or malignant. We refer to the compressed feature entropy matrix as the video feature entropy matrix.
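A minimal sketch of this compression step, where M is the (T, k) feature entropy matrix from the previous step:

```python
# Max-pooling the (T, k) feature entropy matrix over frames yields the
# (1, k) video feature entropy matrix used for classification.
import numpy as np

def video_feature_entropy(M, frames=None):
    """frames=None pools over all frames (option 1); a list of responsible-frame
    indices pools over that subset (option 2). Returns shape (1, k)."""
    sub = M if frames is None else M[frames]
    return sub.max(axis=0, keepdims=True)
```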
In FEBrNet, we employ a classic machine learning model to analyze video feature entropy matrices and produce the final benign-malignant predictions. Here, a random forest is adopted and trained on our video training set,30 with 1024 estimators and a maximum depth below 10. The classification is made on entropy matrices instead of the original videos for two reasons: 1) the feature entropy matrices already encapsulate the key information from the previous steps; 2) undesirable noisy frames captured during continuous screening might undermine classification accuracy.
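A sketch of this classifier under the stated hyperparameters, assuming "1024 estimators" means 1024 trees and taking a maximum depth of 9 as one example value below 10:

```python
# Sketch of the random-forest video classifier (hyperparameter values are
# our reading of the description above, not a verified configuration).
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1024, max_depth=9, random_state=0)
# X_train: stacked (n_videos, k) video feature entropy matrices;
# y_train: 0 = benign, 1 = malignant.
# clf.fit(X_train, y_train)
# p_malignant = clf.predict_proba(X_test)[:, 1]
```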
Experiment Design
Data sources and entry criteria
This retrospective study was conducted in accordance with the procedures specified by the hospitals that participated. The Ethics Committees of the Cancer Hospital of The Chinese Academy of Sciences and Shenzhen People's Hospital authorized this research. To ensure the quality of the data, we based the experiment's inclusion and exclusion criteria on clinical guidelines.
The following criteria apply to data inclusion:
(1) Ultrasound detection of breast nodules;
(2) Nodule diameter must be between 5.0 and 30.0 mm;
(3) Breast tissue surrounding the nodule must be at least 3.0 mm thick;
(4) Nodules must be BIRADS 0, 2, 3, 4a, 4b, 4c, or 5;
(5) No intervention or surgery on the nodule has been performed before the ultrasound test;
(6) Patients must undergo surgery or biopsy within one week of the ultrasound data collection and obtain pathological results.
The following criteria are used to exclude data:
(1) normal breasts (BIRADS category 1);
(2) a history of breast surgery or interventional therapies;
(3) poor image quality;
(4) insufficient clinical data and untraceable pathological outcomes.
Study population and data distribution of image set
The study comprised 13702 2D ultrasound breast nodule images with pathology results acquired from 3448 female patients between October 2020 and October 2021 (9177 images from 2457 patients with benign pathology, 4545 images from 991 patients with malignant pathology), as shown in Table 1.
All images used are grayscale ultrasound images, from each of which a region of interest (ROI) is extracted; all non-object regions of the ultrasound image are eliminated. The image dataset is used to build the CNN image classifier in the first step of FEBrNet, which is then transferred to the video classifier.
Table 1. The study population and images in training set, validation set, and test set. Bn: benign. Mal: malignant.
| Set | Category | Bn | Mal | Total |
|---|---|---|---|---|
| Train | Number of patients | 2004 (72.14%) | 774 (27.86%) | 2778 |
| Train | Number of images | 7324 (67.78%) | 3482 (32.22%) | 10806 |
| Validation | Number of patients | 254 (72.99%) | 94 (27.01%) | 348 |
| Validation | Number of images | 889 (68.75%) | 404 (31.25%) | 1293 |
| Test | Number of patients | 199 (61.80%) | 123 (38.20%) | 322 |
| Test | Number of images | 946 (59.68%) | 639 (40.32%) | 1585 |
Study population and data distribution of video set
As shown in Table 2, the study includes 1066 ultrasound breast nodule videos with pathology results from 440 female patients between October 2020 and October 2021 (546 videos from 237 patients with benign pathology and 520 videos from 203 patients with malignant pathology). Additionally, we gathered the physician-chosen responsible frames for each video in the dataset, i.e., the frames that two senior physicians confirmed to contain significant characteristics indicative of malignancy (a variable number of responsible frames per video, including raw frames and annotated frames).
The video dataset is used to train the random forest feature classifier, which processes the pretrained CNN image classifier's features. To prevent information leakage during model training, we ensure that the patients in the video data set do not overlap with the patients in the image data set.
Table 2. The distribution of videos in the training set and test set. Bn: benign. Mal: malignant.
| Category | Train set: Bn | Train set: Mal | Test set: Bn | Test set: Mal |
|---|---|---|---|---|
| Number of patients | 161 (54.8%) | 133 (45.2%) | 76 (52.1%) | 70 (47.9%) |
| Total number of patients | 294 | | 146 | |
| Number of videos | 367 (52.5%) | 332 (47.5%) | 179 (48.8%) | 188 (51.2%) |
| Total number of videos | 699 | | 367 | |
| Number of responsible images selected by senior doctors | 538 (55.3%) | 436 (44.7%) | 1052 (54.0%) | 896 (46.0%) |
| Total number of responsible images selected by senior doctors | 974 | | 1948 | |
Reader studies experiment design
We conducted four reader studies on the same video test set to compare the performance of the AI system and physicians, as well as to assess the benefits of using AI to aid physicians.
Complete physician diagnosis (Complete-Phy): Six physicians independently read the original video and make diagnoses.
Complete AI diagnosis (Complete-AI): Use FEBrNet with a DenseNet or MobileNet backbone to diagnose videos and evaluate its performance as the number of responsible frames employed varies.
Physicians select frames, then AI diagnoses (Phy-AI): AI makes diagnoses based on the responsible frames chosen from the video test set by two senior physicians.
AI selects frames, followed by physician diagnosis (AI-Phy): FEBrNet offers physicians the top three responsible frames and a prediction for each video. Physicians make diagnoses based on this information, and their diagnostic performance is evaluated.
Both 'Complete-Phy' and 'AI-Phy' use the same six physicians. Physicians are classified into three groups (junior, medium-level, and senior, each with two physicians) based on their experience. We conduct 'Complete-Phy' first, followed by 'AI-Phy' one month later, an interval long enough for physicians to forget their previous diagnoses.
Statistical evaluation
In this paper, we ran four trials to evaluate FEBrNet (a sketch of the shared metric computation follows the list):
- Results for three resolutions and two classic models are compared, using AUROC, AUPR, Accuracy, Sensitivity, Specificity, Recall, Precision, and F1-Score, to select the optimal resolution for each FEBrNet backbone.
- Results for eleven different numbers of responsible frames are compared, using the same metrics, to assess the impact of the number of responsible frames on FEBrNet's performance.
- ‘Physicians select frames, then AI diagnoses’ (Phy-AI) and ‘Complete AI diagnosis’ (Complete-AI) are compared, using the same metrics, to benchmark FEBrNet when it utilizes raw videos without human frame selection.
- ‘Complete physician diagnosis’ (Complete-Phy), ‘Complete AI diagnosis’ (Complete-AI), and ‘AI selects frames, followed by physician diagnosis’ (AI-Phy) are compared, using scatter plots of FEBrNet and physician performance together with the same metrics, to evaluate how FEBrNet compares with physicians when diagnosing alone and whether physicians benefit from FEBrNet's assistance.
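For reference, a sketch of the shared metric computation across the four trials, assuming binary ground-truth labels y_true in {0, 1} and predicted malignancy probabilities y_prob as NumPy arrays:

```python
# Sketch of the evaluation metrics listed above, using scikit-learn.
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             accuracy_score, recall_score, precision_score,
                             f1_score)

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)   # binarize probabilities
    return {
        "AUROC": roc_auc_score(y_true, y_prob),
        "AUPR": average_precision_score(y_true, y_prob),
        "Accuracy": accuracy_score(y_true, y_pred),
        "Sensitivity/Recall": recall_score(y_true, y_pred),
        "Specificity": recall_score(y_true, y_pred, pos_label=0),
        "Precision": precision_score(y_true, y_pred),
        "F1-Score": f1_score(y_true, y_pred),
    }
```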