A. The Experiment Setting
To evaluate the proposed method, five public polyp datasets, i.e., Kvasir-SEG [19], ClinicDB [20], ColonDB [21], EndoScene [22], and ETIS [23], are used. Specifically, the ClinicDB and Kvasir-SEG datasets are used to assess the learning ability of the model. ClinicDB contains 612 images extracted from colonoscopy videos, and Kvasir-SEG includes 1000 polyp images. Following the commonly used split, 548 images from ClinicDB and 900 images from Kvasir-SEG are used for training, and the remaining 64 and 100 images serve as the corresponding test sets.
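For concreteness, a fixed-size train/test split of this kind can be sketched as follows. The function name, the seed, and the random shuffle are illustrative assumptions; the experiments reuse the fixed 548/64 (ClinicDB) and 900/100 (Kvasir-SEG) partitions adopted by prior work rather than a fresh random split.

```python
import random

def split_dataset(image_ids, n_train, seed=0):
    # Hypothetical reproducible split into n_train training ids and the rest
    # for testing; a fixed seed keeps the partition identical across runs.
    ids = sorted(image_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# ClinicDB: 612 images -> 548 for training, 64 for testing
clinicdb_train, clinicdb_test = split_dataset(range(612), 548)
```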
All experiments are implemented with the PyTorch framework. Considering the variation in polyp image sizes, a multi-scale strategy is used during training. The AdamW optimizer, which is widely used in transformer networks [24, 25], updates the network parameters; the learning rate and the weight decay are both set to 1e-4. Input images are resized to 352×352, and the mini-batch size is 16 for 100 epochs. During testing, images are only resized to 352×352, and no post-processing optimization strategy is applied.
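Under these settings, the optimizer configuration and multi-scale resizing can be sketched as below. This is a minimal illustration: the model is a placeholder (VTANet itself is defined elsewhere), and the specific scale ratios are an assumption, since the paper only states that a multi-scale strategy is used.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder standing in for VTANet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scales = [0.75, 1.0, 1.25]  # assumed multi-scale ratios around the 352x352 base size

def training_step(images, masks, scale):
    # Resize the 352x352 batch by the chosen scale, keeping the side divisible by 32.
    size = int(352 * scale) // 32 * 32
    images = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
    masks = F.interpolate(masks, size=(size, size), mode="bilinear", align_corners=False)
    loss = F.binary_cross_entropy_with_logits(model(images), masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```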
The training process uses two loss functions to optimize the model, which can be expressed by the following formula:
\(L={L_{main}}+{L_{aux}}\) (8)
where \({L_{main}}\) and \({L_{aux}}\) are the main and auxiliary loss functions, respectively.
The main loss function calculates the loss between the final segmentation result and the ground truth. The formula can be written as:
\({L_{main}}=L_{{IoU}}^{w}({Y_2},G)+L_{{BCE}}^{w}({Y_2},G)\) (9)
The auxiliary loss function calculates the loss between the intermediate result from the FAM and the ground truth. The formula can be written as:
\({L_{aux}}=L_{{IoU}}^{w}({Y_1},G)+L_{{BCE}}^{w}({Y_1},G)\) (10)
where \(L_{{IoU}}^{w}\) and \(L_{{BCE}}^{w}\) are the weighted intersection over union (IoU) loss and the weighted binary cross-entropy (BCE) loss, respectively. Unlike the standard BCE loss, which treats all pixels equally, these weighted losses constrain the prediction map from both the global-structure (object-level) and local-detail (pixel-level) perspectives.
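These weighted losses are commonly implemented as the "structure loss" popularized by PraNet, where a 31×31 local-mean deviation up-weights pixels near object boundaries. The sketch below follows that formulation and is an assumption about the exact weighting used here; `pred` carries logits, and the total loss combines Eqs. (9) and (10) as in Eq. (8).

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    # Boundary-aware weight: large where a pixel differs from its local mean,
    # i.e. near object edges (assumed weighting, following PraNet).
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    # Weighted BCE over logits.
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))
    # Weighted IoU on the sigmoid probabilities.
    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(y1, y2, gt):
    # Eq. (8): L = L_main(Y2, G) + L_aux(Y1, G)
    return structure_loss(y2, gt) + structure_loss(y1, gt)
```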
Six widely used evaluation indices, namely the Dice, IoU, mean absolute error (MAE), weighted F-measure (\({\text{F}}_{\beta }^{w}\)), S-measure (\({S_\alpha }\)) [26], and E-measure (\({E_\varepsilon }\)) [27], are adopted to evaluate performance. The Dice and IoU are region-level similarity measures that mainly focus on the internal consistency of segmented objects; we report their average values over the test set, denoted mDice and mIoU. MAE measures the average pixel-wise absolute error between the normalized prediction map and the binary ground-truth mask. The weighted F-measure comprehensively considers recall and precision. The S-measure evaluates the structural similarity between the real-valued prediction map and the binary ground truth, considering both object-aware and region-aware structural similarity. The E-measure considers the global mean of the image and local pixel matching simultaneously [28].
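As a concrete reference, the region-level metrics and MAE can be computed per image as below (a minimal sketch: the 0.5 binarization threshold and the epsilon smoothing are assumptions; dataset-level mDice/mIoU average these values over all test images):

```python
import numpy as np

def dice_iou_mae(pred, gt, thresh=0.5, eps=1e-8):
    # pred: predicted saliency map in [0, 1]; gt: binary ground-truth mask.
    p = (pred >= thresh).astype(np.float64)
    g = (gt >= 0.5).astype(np.float64)
    inter = (p * g).sum()
    dice = 2 * inter / (p.sum() + g.sum() + eps)          # region overlap
    iou = inter / (p.sum() + g.sum() - inter + eps)       # intersection over union
    mae = np.abs(pred - gt).mean()                        # pixel-wise absolute error
    return dice, iou, mae
```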
B. Experimental Results
To verify the effectiveness and robustness of the proposed model, seven well-known network models are compared, namely UNet [7], UNet++ [9], MSEG [29], ACSNet [30], PraNet [10], EU-Net [31], and SANet [32].
As can be seen from Table IV, the mDice, mIoU, \({\text{F}}_{\beta }^{w}\), \({S_\alpha }\), \(m{E_\varepsilon }\), and MAE scores of the proposed model on the ETIS dataset are better than those of UNet by 2.89%, 2.74%, 2.7%, 1.09%, 1.64%, and 0.3%, respectively. In addition, Tables I, II, III, and V show that the six evaluation metrics also achieve good results on the other four datasets. Taken together, the results show that the model has strong learning ability.
Figures 5 and 6 show the visualization results of the different segmentation methods on the ClinicDB and ColonDB datasets, and Fig. 7 shows the results on the Kvasir-SEG and EndoScene datasets. From left to right, the segmentation results are obtained by UNet, UNet++, MSEG, ACSNet, PraNet, SANet, EU-Net, and the proposed model, respectively. The red curve marks the boundary of the lesion ground truth [33, 34]. As Figs. 5 and 6 show, the proposed method attends more closely to the lesion area than UNet and UNet++, suppresses unimportant feature regions, and produces more accurate segmentation results. When the pixel colors of the lesion area differ little from those of the background, the model delineates edges more precisely than PraNet. In general, VTANet not only effectively alleviates the disturbance caused by tumor size and surrounding tissue but also yields segmentation results closer to the ground-truth mask. The quantitative evaluation and visual results show that the proposed method achieves better segmentation with fewer missed and false detections in polyp lesion segmentation.
Table I. The Segmentation Results of the Kvasir-SEG Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.818 | 0.746 | 0.794 | 0.858 | 0.881 | 0.055 |
| UNet++ | 0.821 | 0.743 | 0.808 | 0.862 | 0.886 | 0.048 |
| MSEG | 0.897 | 0.839 | 0.885 | 0.912 | 0.942 | 0.028 |
| ACSNet | 0.898 | 0.838 | 0.882 | 0.920 | 0.941 | 0.032 |
| PraNet | 0.898 | 0.840 | 0.885 | 0.915 | 0.944 | 0.030 |
| SANet | 0.904 | 0.847 | 0.892 | 0.915 | 0.949 | 0.027 |
| EU-Net | 0.908 | 0.854 | 0.893 | 0.917 | 0.951 | 0.028 |
| VTANet | 0.921 | 0.865 | 0.912 | 0.923 | 0.956 | 0.023 |
Table II. The Segmentation Results of the ClinicDB Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.823 | 0.755 | 0.811 | 0.889 | 0.913 | 0.019 |
| UNet++ | 0.794 | 0.729 | 0.785 | 0.873 | 0.891 | 0.022 |
| MSEG | 0.909 | 0.864 | 0.907 | 0.938 | 0.961 | 0.007 |
| ACSNet | 0.882 | 0.826 | 0.873 | 0.927 | 0.947 | 0.011 |
| PraNet | 0.899 | 0.849 | 0.896 | 0.936 | 0.979 | 0.009 |
| SANet | 0.912 | 0.856 | 0.907 | 0.929 | 0.968 | 0.012 |
| EU-Net | 0.902 | 0.846 | 0.891 | 0.936 | 0.959 | 0.011 |
| VTANet | 0.916 | 0.867 | 0.916 | 0.943 | 0.972 | 0.010 |
Table III. The Segmentation Results of the ColonDB Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.512 | 0.432 | 0.498 | 0.713 | 0.696 | 0.061 |
| UNet++ | 0.483 | 0.410 | 0.467 | 0.691 | 0.680 | 0.064 |
| MSEG | 0.735 | 0.666 | 0.724 | 0.834 | 0.859 | 0.038 |
| ACSNet | 0.716 | 0.649 | 0.697 | 0.829 | 0.839 | 0.039 |
| PraNet | 0.712 | 0.640 | 0.699 | 0.820 | 0.847 | 0.043 |
| SANet | 0.753 | 0.670 | 0.726 | 0.837 | 0.869 | 0.043 |
| EU-Net | 0.756 | 0.681 | 0.730 | 0.831 | 0.863 | 0.045 |
| VTANet | 0.767 | 0.694 | 0.743 | 0.856 | 0.876 | 0.041 |
Table IV. The Segmentation Results of the ETIS Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.398 | 0.335 | 0.366 | 0.684 | 0.643 | 0.036 |
| UNet++ | 0.401 | 0.344 | 0.390 | 0.683 | 0.629 | 0.035 |
| MSEG | 0.700 | 0.630 | 0.671 | 0.828 | 0.854 | 0.015 |
| ACSNet | 0.578 | 0.509 | 0.530 | 0.754 | 0.737 | 0.059 |
| PraNet | 0.628 | 0.567 | 0.600 | 0.794 | 0.808 | 0.031 |
| EU-Net | 0.687 | 0.609 | 0.636 | 0.793 | 0.807 | 0.067 |
| SANet | 0.750 | 0.654 | 0.685 | 0.849 | 0.881 | 0.015 |
| VTANet | 0.763 | 0.669 | 0.693 | 0.855 | 0.884 | 0.038 |
Table V. The Segmentation Results of the EndoScene Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.710 | 0.627 | 0.684 | 0.843 | 0.847 | 0.022 |
| UNet++ | 0.707 | 0.624 | 0.687 | 0.839 | 0.834 | 0.018 |
| MSEG | 0.874 | 0.804 | 0.852 | 0.924 | 0.948 | 0.009 |
| ACSNet | 0.863 | 0.787 | 0.825 | 0.923 | 0.939 | 0.013 |
| PraNet | 0.871 | 0.797 | 0.843 | 0.925 | 0.950 | 0.010 |
| EU-Net | 0.837 | 0.765 | 0.805 | 0.904 | 0.919 | 0.015 |
| SANet | 0.888 | 0.815 | 0.859 | 0.928 | 0.962 | 0.008 |
| VTANet | 0.904 | 0.826 | 0.872 | 0.941 | 0.978 | 0.009 |
Table VI. The Ablation Results of the ETIS Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| PVT | 0.712 | 0.623 | 0.609 | 0.821 | 0.807 | 0.046 |
| PVT + AAFM | 0.734 | 0.624 | 0.687 | 0.839 | 0.834 | 0.038 |
| PVT + FAM | 0.754 | 0.604 | 0.652 | 0.824 | 0.848 | 0.049 |
| PVT + ASM | 0.715 | 0.657 | 0.625 | 0.823 | 0.839 | 0.043 |
| VTANet | 0.763 | 0.669 | 0.693 | 0.855 | 0.884 | 0.038 |
To verify the generalization ability of the proposed model, three polyp segmentation datasets, ETIS, ColonDB, and EndoScene, are used for testing. ETIS contains 196 images, ColonDB 380, and EndoScene 60. As can be seen from Tables III, IV, and V, the mDice score on the ColonDB dataset is 2.55% higher than that of the UNet model, the mIoU score on the ETIS dataset is 2.74% higher, and the score on the EndoScene dataset is 2.55% higher [35–38]. The results show that the proposed model has strong generalization ability.
Finally, the contribution of each component to the overall model is examined; the training, testing, and hyperparameter settings are consistent with those above. We use PVTv2 as the baseline and verify the proposed model by adding each module to the baseline in turn. The ablation results in Table VI show that each module contributes to improving network performance. After introducing the FAM module, the mDice score is 4.2% higher than that of the basic PVTv2 network, and introducing the AAFM module also improves the performance of the original PVT network [39–40].