Experiments were carried out on an Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz with 64GB of RAM, one Titan X Pascal GPU with 12GB of memory, and the TensorFlow/Keras framework for Python. The source code and pre-trained models are available at https://github.com/ufopcsilab/EfficientCovidNet. In the following subsections, we present the three experimental setups explored in this work.
In the first setup, in Section 4.1, we investigate the discrepancy between the results reported by the methods considered state-of-the-art for the two studied datasets. The best approach for the COVID-CT dataset reports 86.0% accuracy [9]. For the SARS-CoV-2 CT-scan dataset, the state-of-the-art method achieves 97.38% accuracy [14]. However, the SARS-CoV-2 CT-scan dataset has significantly more images than the COVID-CT dataset and the same number of patients (individuals). To assess whether this difference is due to the evaluation protocol, we perform two experiments: we first investigate the impact of selecting samples/images for the training and test sets at random, and in a second step, we evaluate the impact of performing the selection guided by individuals, that is, ensuring that no samples from the same individual appear simultaneously in the training and test sets.
In the second setup, in Section 4.2, we investigate a very important aspect: the generalization power of a model. A model is only useful if it can also generalize to data from other distributions or other datasets. In this regard, we evaluate how the model, trained with the SARS-CoV-2 CT-scan dataset, behaves when faced with images from another dataset, the COVID-CT dataset. We follow the data-split protocol proposed in [15].
Finally, in the third setup, we explore our EfficientCovidNet model only with the COVID-CT dataset, following the protocol proposed in [15]. This setup aims to expand the comparison of the proposed approach with the literature, since this dataset is the most popular to date. Here we also explore the impact of varying the size of the input images.
4.1 Setup 1: 5-fold evaluation on a large dataset
To evaluate the performance of the proposed approach, we tested the protocol proposed by Soares et al. [14] and three different scenarios using 5-fold cross-validation: (i) “Random”, (ii) “Slice”, and (iii) “Voting”. The “Random” evaluation divides the data into training and test sets randomly. The “Slice” evaluation considers all the CT images independently of each other but respects the patient division, that is, we prevent samples from one individual from appearing simultaneously in the training and test sets. In this manner, the model is always evaluated with samples from unknown individuals. Finally, the “Voting” evaluation considers all images of an individual and uses a voting scheme to reach a diagnosis per individual instead of per instance or image. Considering that several CT images are acquired in a single exam for a single individual, we believe that the disease patterns will not be present in all instances. Thus, an evaluation using a voting scheme, considering all instances of one individual, could increase the chances of success. A sketch contrasting the “Random” and “Slice” splitting strategies is shown below.
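As a minimal sketch (the array names are illustrative, not from the original code), the difference between the “Random” and “Slice” scenarios amounts to whether the 5-fold cross-validation folds are drawn over images or over patients, e.g., with scikit-learn:

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

# Illustrative placeholders: one entry per CT slice.
slices = np.zeros((1000, 1))                     # stand-in for 1000 CT slices
labels = np.random.randint(0, 2, size=1000)      # COVID / Non-COVID per slice
patients = np.random.randint(0, 120, size=1000)  # patient id per slice

# "Random" scenario: folds are drawn over slices, so images of the same
# patient can end up in both the training and the test set.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=42).split(slices):
    pass  # train on slices[train_idx], evaluate on slices[test_idx]

# "Slice" scenario: GroupKFold keeps all slices of a patient in a single
# fold, so the model is always tested on unseen individuals.
for train_idx, test_idx in GroupKFold(n_splits=5).split(slices, labels,
                                                        groups=patients):
    assert set(patients[train_idx]).isdisjoint(set(patients[test_idx]))
```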
4.1.1 Results
Following the protocol proposed in [14], the approach proposed in this work improved all metrics, as shown in Table 5.
Table 5: Classification protocol proposed in [14].

| Approach           | Acc (%) | SeC (%) | +PC (%) |
|--------------------|---------|---------|---------|
| Soares et al. [14] | 97.38   | 95.53   | 99.16   |
| Proposed Approach  | 98.99   | 98.80   | 99.20   |
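For reference, the figures reported throughout this section can be computed from the binary confusion matrix. A minimal sketch follows, under the assumption (conventional in this literature) that SeC denotes sensitivity (recall) and +PC the positive predictive value (precision):

```python
import numpy as np

def report_metrics(y_true, y_pred):
    """Acc, SeC and +PC from binary ground truth and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / len(y_true)
    sec = tp / (tp + fn)              # sensitivity / recall
    ppc = tp / (tp + fp)              # positive predictive value / precision
    f1 = 2 * sec * ppc / (sec + ppc)  # F1-score, used later in Table 8
    return acc, sec, ppc, f1
```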
Despite the outstanding results presented in Table 5, we believe that they are overestimated. For this reason, we introduce a 5-fold cross-validation and some changes to the original protocol, as described above, with the results presented in Table 6.
Table 6: 5-fold classification by slice and with voting.

| Approach | Acc (%)     | SeC (%)     | +PC (%)     |
|----------|-------------|-------------|-------------|
| Random   | 98.5 ± 0.4  | 98.6 ± 0.6  | 98.4 ± 0.6  |
| Slice    | 86.6 ± 10.1 | 94.8 ± 4.5  | 79.7 ± 20.9 |
| Voting   | 89.6 ± 5.1  | 92.0 ± 10.0 | 77.5 ± 23.3 |
The “Random” evaluation presents better results than the two other approaches (“Slice” and “Voting”). One of the reasons is the presence of data from the same patient/individual in both the training and test sets, which leads to an overestimated result. Our hypothesis is that such an approach tends to learn patterns related to the individuals instead of the COVID-19 patterns.
In the “Slice” evaluation, the samples are classified as isolated instances, as in the “Random” one, but ensuring that all samples of an individual are present in only one data partition: the training or the test set. A performance drop is observed, which clearly shows that the “Random” evaluation overestimates performance.
In contrast to the “Slice” evaluation, the “Voting” one considers all images of an individual to decide whether the individual is infected or not. It is worth emphasizing that the same model is used in both approaches, that is, the model trained per image (a single “slice” of the lung). The voting scheme can be implemented as a simple majority vote, as sketched below.
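A minimal sketch of the per-individual majority vote (the function and variable names are illustrative):

```python
import numpy as np

def patient_diagnosis(slice_preds, patient_ids):
    """Aggregate per-slice binary predictions into one diagnosis per individual."""
    diagnoses = {}
    for pid in np.unique(patient_ids):
        votes = slice_preds[patient_ids == pid]
        # An individual is flagged as COVID-positive when the majority
        # of their slices is predicted positive.
        diagnoses[pid] = int(votes.mean() >= 0.5)
    return diagnoses

# Example: three slices for patient 0, two for patient 1.
preds = np.array([1, 0, 1, 0, 0])
pids = np.array([0, 0, 0, 1, 1])
print(patient_diagnosis(preds, pids))  # patient 0 -> positive, patient 1 -> negative
```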
Due to the nature of CT scans, we believe the disease patterns will not manifest in all slices (instances/images) of an individual's CT exam, and the results of the “Slice” and “Voting” evaluations reflect that. We believe this can generate false positives and therefore impact the reported figures (see Table 6). Moreover, this problem can be seen as a multiple instance learning (MIL) problem [25], and a MIL-based approach may be a promising path for future work.
Comparing the results of Tables 5 and 6, we believe that the presence of samples from the same individual in both training and test sets tends to lead to an overestimation of an approach. To circumvent this issue, it is necessary to split the dataset by individual and to use a cross-dataset evaluation.
4.2 Setup 2: Cross-dataset evaluation
In this experiment, we investigate the impact of training a model on one data distribution and evaluating it on another. This scenario is closer to reality, since it is almost impossible to train a model with images acquired from all available sensors, environments, and individuals.
In this setup, the SARS-CoV-2 CT-scan dataset [14] is used only for training, and no image from this dataset is present in the test set. For testing, we use the dataset presented in [15], the COVID-CT, since it is used by several authors in the literature. We follow the protocol proposed in [15] to split the COVID-CT into training and test sets; however, we highlight that only images from the SARS-CoV-2 CT-scan dataset are used to train the model. We also evaluated other test configurations, such as using the COVID-CT training partition as a test set and combining both partitions of the COVID-CT dataset into a larger test set (see Table 7). We also tested the opposite scenario, in which we use all images from the COVID-CT dataset [15] for training and all images of the SARS-CoV-2 CT-scan dataset [14] for testing. A minimal sketch of this cross-dataset procedure is shown below.
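In Keras, the cross-dataset procedure only changes where the evaluation images come from; the directory paths, input size, and training hyper-parameters below are illustrative assumptions, not the original configuration:

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # illustrative input resolution

# Hypothetical directory layouts with one sub-folder per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/sars-cov-2-ct-scan", image_size=IMG_SIZE, batch_size=32)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "data/covid-ct/test", image_size=IMG_SIZE, batch_size=32)

model = tf.keras.applications.EfficientNetB0(
    weights=None, input_shape=IMG_SIZE + (3,), classes=2)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train exclusively on one dataset and evaluate exclusively on the other.
model.fit(train_ds, epochs=10)
model.evaluate(test_ds)
```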
4.2.1 Results
Table 7: Cross-dataset evaluation.

| Training dataset                | Test dataset                    | Acc (%) | SeC (%) | +PC (%) |
|---------------------------------|---------------------------------|---------|---------|---------|
| SARS-CoV-2 CT-scan dataset [14] | COVID-CT [15] (Train)           | 59.12   | 64.14   | 54.95   |
| SARS-CoV-2 CT-scan dataset [14] | COVID-CT [15] (Test)            | 56.16   | 53.06   | 54.74   |
| SARS-CoV-2 CT-scan dataset [14] | COVID-CT [15] (Train + Test)    | 58.31   | 61.03   | 54.90   |
| COVID-CT [15] (Train + Test)    | SARS-CoV-2 CT-scan dataset [14] | 45.25   | 54.39   | 46.36   |
As one can see, the model performance is drastically reduced when we compare the cross-dataset evaluation against an intra-dataset one. We believe the reason for this behavior is the diversity of data acquisition. Images from different datasets can be acquired with different equipment and different image sensors, which changes relevant features in the images and impairs recognition. The model may learn to identify portions and patterns of an image that indicate the presence (or absence) of COVID-19; however, those patterns may not appear in a different dataset.
Training on COVID-CT [15] and testing on the SARS-CoV-2 CT-scan dataset [14] presents even worse results, since the COVID-CT training set is smaller.
We believe such a test should be mandatory for all methods aiming at COVID-19 recognition from CT images, since it is the one that most resembles a real-world scenario.
4.3 Setup 3: Impact of input resolution
In this setup, we evaluate the protocol presented in [15] only on the COVID-CT dataset. Zhao et al. [15] propose to divide the COVID-CT dataset into three sets: training, validation, and testing. We also applied data augmentation: rotation (at most 0.15 degrees to each side), random zooming (to 80% of the area) with a 20% chance, and horizontal flipping with a probability of 50%. We stress that the data augmentation is applied only to the training data, as sketched below. The final training set totaled 2968 images (1442 COVID and 1408 NonCOVID). Using the protocol in [15], the test set consists of 203 images (98 COVID and 105 NonCOVID).
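A possible Keras realisation of this augmentation policy is sketched below. ImageDataGenerator cannot express the 20% application chance for the zoom directly, so that part is only approximated, and the parameter mapping is our reading of the description above:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# rotation_range is given in degrees, matching "at most 0.15 degrees to
# each side"; zoom_range=0.2 draws zoom factors in [0.8, 1.2], where 0.8
# covers the "80% of the area" case (the 20% application chance is only
# approximated, since every batch is transformed); horizontal_flip
# applies with 50% probability.
train_datagen = ImageDataGenerator(
    rotation_range=0.15,
    zoom_range=0.2,
    horizontal_flip=True,
)
# The generator is used only for the training partition; validation and
# test images are fed to the model unaugmented.
```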
4.3.1 Results
In Table 8, we report the results of the proposed approach using the protocol described in [15]. One may observe that the experiments with the same approach used in Setups 1 and 2 (EfficientNet-B3) have worse performance when compared with the ones available in the literature.
Aiming to reduce overfitting during the training of “Architecture 1”, we propose a deeper network. In most cases, when the deeper network is used (see “Architecture 2” in Table 8) rather than the shallower one (see “Architecture 1” in Table 8), a gain is observed in all reported figures.
We emphasize that the architectures with the largest input size (550x550) present the worst performance among the experiments varying the input size, contrary to what one would expect. Our hypothesis is that some small images (281x202) are expanded and severely distorted, which obscures the COVID-19 patterns in the images.
Table 8: Custom input using the EfficientNet-B0 as the base network.

| Depth           | Input size | Acc (%) | SeC (%) | +PC (%) | F1 (%) |
|-----------------|------------|---------|---------|---------|--------|
| EfficientNet-B3 | 300x300    | 77.34   | 69.39   | 80.95   | 74.72  |
| Architecture 1  | 224x224    | 79.31   | 70.41   | 84.15   | 76.70  |
| Architecture 1  | 300x300    | 76.85   | 69.39   | 80.00   | 74.32  |
| Architecture 1  | 350x350    | 80.79   | 79.59   | 80.41   | 80.00  |
| Architecture 1  | 400x400    | 83.25   | 80.61   | 84.04   | 82.29  |
| Architecture 1  | 450x450    | 83.25   | 81.63   | 83.33   | 82.47  |
| Architecture 1  | 500x500    | 83.74   | 83.67   | 82.83   | 83.25  |
| Architecture 1  | 550x550    | 79.31   | 77.55   | 79.17   | 78.35  |
| Architecture 2  | 224x224    | 83.74   | 77.55   | 87.36   | 82.16  |
| Architecture 2  | 300x300    | 81.28   | 79.59   | 81.25   | 80.41  |
| Architecture 2  | 350x350    | 86.21   | 81.63   | 88.89   | 85.10  |
| Architecture 2  | 400x400    | 80.30   | 74.49   | 82.95   | 78.49  |
| Architecture 2  | 450x450    | 77.34   | 75.51   | 77.08   | 76.29  |
| Architecture 2  | 500x500    | 87.68   | 79.59   | 93.98   | 86.19  |
| Architecture 2  | 550x550    | 75.37   | 64.29   | 80.77   | 71.60  |
The best model is the one with Architecture 2 and an input size of 500x500 (source available at https://github.com/ufopcsilab/EfficientCovidNet). The ROC curve of the model is presented in Figure 8. A sketch of how such a custom-resolution model can be assembled is shown below.
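The varying input sizes of Table 8 are possible because EfficientNet-B0 is fully convolutional up to the pooling stage, so it accepts arbitrary spatial resolutions when loaded without its classification top. The sketch below assembles a 500x500 model; the dense head is a hypothetical stand-in, as the exact layers of “Architecture 2” are defined in the repository rather than in the text:

```python
import tensorflow as tf

def build_model(input_size=500):
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet",
        input_shape=(input_size, input_size, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    # Hypothetical classification head; the actual "Architecture 2"
    # layers are available in the repository.
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(base.input, out)

model = build_model(500)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```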
In Table 9, we present a comparison of the best proposed approach against the ones available in the literature. Although Amyar et al. [12] and Mobiny et al. [10] report strong results, both evaluated their approaches with only 105 images (47 COVID and 58 NonCovid) and, therefore, cannot be directly compared to the present work. Thus, the best result previously obtained in this setup was presented in [9]. The work proposed here surpasses it in terms of accuracy and F1-score on the COVID-CT dataset while using a significantly smaller model (about 3× smaller): the base model proposed in [9] requires 14,149,480 parameters, while the one proposed here requires only 4,779,038.
Table 9: Comparison with the literature. # - Evaluated with a different test set: only 105 images (47 COVID and 58 NonCovid).

| Approach               | Acc  | F1    | AUC  |
|------------------------|------|-------|------|
| #Amyar et al. [12]     | 86.0 | -     | 93.0 |
| #Mobiny et al. [10]    | 87.6 | 87.1  | 96.1 |
| Polsinelli et al. [11] | 83.0 | 83.3  | -    |
| He et al. [9]          | 86.0 | 85.0  | 94.0 |
| Proposed approach      | 87.6 | 86.19 | 90.5 |