A. The Experiment Setting
To evaluate the proposed method, five public polyp datasets, i.e., Kvasir-SEG [19], ClinicDB [20], ColonDB [21], EndoScene [22], and ETIS [23], are used. Specifically, the ClinicDB and Kvasir-SEG datasets are used to assess the learning ability of the model. ClinicDB contains 612 images extracted from colonoscopy videos, and Kvasir-SEG includes 1000 polyp images. Following the commonly used split, 548 images from ClinicDB and 900 images from Kvasir-SEG are used for training, and the remaining 64 and 100 images serve as the corresponding test sets.
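For concreteness, a fixed-size train/test split of this kind can be sketched as follows. The function name, the seed, and the random shuffle are illustrative assumptions; the experiments reuse the fixed 548/64 (ClinicDB) and 900/100 (Kvasir-SEG) partitions adopted by prior work rather than a fresh random split.

```python
import random

def split_dataset(image_ids, n_train, seed=0):
    # Hypothetical reproducible split into n_train training ids and the rest
    # for testing; a fixed seed keeps the partition identical across runs.
    ids = sorted(image_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# ClinicDB: 612 images -> 548 for training, 64 for testing
clinicdb_train, clinicdb_test = split_dataset(range(612), 548)
```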
All experiments are implemented with the PyTorch framework. Considering the variation in polyp image sizes, a multi-scale strategy is used during training. The AdamW optimizer, which is widely used in transformer networks [24, 25], updates the network parameters; the learning rate and the weight decay are both set to 1e-4. Input images are resized to 352×352, and the mini-batch size is 16 for 100 epochs. During testing, images are only resized to 352×352, and no post-processing optimization strategy is applied.
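Under these settings, the optimizer configuration and multi-scale resizing can be sketched as below. This is a minimal illustration: the model is a placeholder (VTANet itself is defined elsewhere), and the specific scale ratios are an assumption, since the paper only states that a multi-scale strategy is used.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder standing in for VTANet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scales = [0.75, 1.0, 1.25]  # assumed multi-scale ratios around the 352x352 base size

def training_step(images, masks, scale):
    # Resize the 352x352 batch by the chosen scale, keeping the side divisible by 32.
    size = int(352 * scale) // 32 * 32
    images = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
    masks = F.interpolate(masks, size=(size, size), mode="bilinear", align_corners=False)
    loss = F.binary_cross_entropy_with_logits(model(images), masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```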
The training process uses two loss functions to optimize the model, which can be expressed by the following formula:
\(L={L_{main}}+{L_{aux}}\) (8)
where \({L_{main}}\) and \({L_{aux}}\) are the main and auxiliary loss functions, respectively.
The main loss function calculates the loss between the final segmentation result and the ground truth. The formula can be written as:
\({L_{main}}=L_{{IoU}}^{w}({Y_2},G)+L_{{BCE}}^{w}({Y_2},G)\) (9)
The auxiliary loss function calculates the loss between the intermediate result from the FAM and the ground truth. The formula can be written as:
\({L_{aux}}=L_{{IoU}}^{w}({Y_1},G)+L_{{BCE}}^{w}({Y_1},G)\) (10)
where \(L_{{IoU}}^{w}\) and \(L_{{BCE}}^{w}\) are the weighted intersection over union (IoU) loss and the weighted binary cross-entropy (BCE) loss, respectively. Unlike the standard BCE loss, which treats all pixels equally, these weighted losses constrain the prediction map from both the global-structure (object-level) and local-detail (pixel-level) perspectives.
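These weighted losses are commonly implemented as the "structure loss" popularized by PraNet, where a 31×31 local-mean deviation up-weights pixels near object boundaries. The sketch below follows that formulation and is an assumption about the exact weighting used here; `pred` carries logits, and the total loss combines Eqs. (9) and (10) as in Eq. (8).

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    # Boundary-aware weight: large where a pixel differs from its local mean,
    # i.e. near object edges (assumed weighting, following PraNet).
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    # Weighted BCE over logits.
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))
    # Weighted IoU on the sigmoid probabilities.
    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(y1, y2, gt):
    # Eq. (8): L = L_main(Y2, G) + L_aux(Y1, G)
    return structure_loss(y2, gt) + structure_loss(y1, gt)
```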
Six widely used evaluation indices, namely the Dice, IoU, mean absolute error (MAE), weighted F-measure (\({\text{F}}_{\beta }^{w}\)), S-measure (\({S_\alpha }\)) [26], and E-measure (\({E_\varepsilon }\)) [27], are adopted to evaluate performance. The Dice and IoU are region-level similarity measures that mainly focus on the internal consistency of segmented objects; we report their average values over the test set, denoted mDice and mIoU. MAE measures the average pixel-wise absolute error between the normalized prediction map and the binary ground-truth mask. The weighted F-measure comprehensively considers recall and precision. The S-measure evaluates the structural similarity between the real-valued prediction map and the binary ground truth, considering both object-aware and region-aware structural similarity. The E-measure considers the global mean of the image and local pixel matching simultaneously [28].
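As a concrete reference, the region-level metrics and MAE can be computed per image as below (a minimal sketch: the 0.5 binarization threshold and the epsilon smoothing are assumptions; dataset-level mDice/mIoU average these values over all test images):

```python
import numpy as np

def dice_iou_mae(pred, gt, thresh=0.5, eps=1e-8):
    # pred: predicted saliency map in [0, 1]; gt: binary ground-truth mask.
    p = (pred >= thresh).astype(np.float64)
    g = (gt >= 0.5).astype(np.float64)
    inter = (p * g).sum()
    dice = 2 * inter / (p.sum() + g.sum() + eps)          # region overlap
    iou = inter / (p.sum() + g.sum() - inter + eps)       # intersection over union
    mae = np.abs(pred - gt).mean()                        # pixel-wise absolute error
    return dice, iou, mae
```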
B. Experimental Results
To verify the effectiveness and robustness of the proposed model, seven well-known network models are compared, namely UNet [7], UNet++ [9], MSEG [29], ACSNet [30], PraNet [10], EU-Net [31], and SANet [32].
As can be seen from Table IV, the mDice, mIoU, \({\text{F}}_{\beta }^{w}\), \({S_\alpha }\), \(m{E_\varepsilon }\), and MAE scores of the proposed model on the ETIS dataset are better than those of UNet by 2.89%, 2.74%, 2.7%, 1.09%, 1.64%, and 0.3%, respectively. In addition, Tables I, II, III, and V show that the six evaluation metrics also achieve good results on the other four datasets. Taken together, the results show that the model has strong learning ability.
Figures 5 and 6 show the visualization results of the different segmentation methods on the ClinicDB and ColonDB datasets, and Fig. 7 shows the results on the Kvasir-SEG and EndoScene datasets. From left to right, the segmentation results are obtained by UNet, UNet++, MSEG, ACSNet, PraNet, SANet, EU-Net, and the proposed model, respectively. The red curve marks the boundary of the lesion ground truth [33, 34]. As Figs. 5 and 6 show, the proposed method attends more closely to the lesion area than UNet and UNet++, suppresses unimportant feature regions, and produces more accurate segmentation results. When the pixel colors of the lesion area differ little from those of the background, the model delineates edges more precisely than PraNet. In general, VTANet not only effectively alleviates the disturbance caused by tumor size and surrounding tissue but also yields segmentation results closer to the ground-truth mask. The quantitative evaluation and visual results show that the proposed method achieves better segmentation with fewer missed and false detections in polyp lesion segmentation.
Table I. The Segmentation Results of the Kvasir-SEG Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.818 | 0.746 | 0.794 | 0.858 | 0.881 | 0.055 |
| UNet++ | 0.821 | 0.743 | 0.808 | 0.862 | 0.886 | 0.048 |
| MSEG | 0.897 | 0.839 | 0.885 | 0.912 | 0.942 | 0.028 |
| ACSNet | 0.898 | 0.838 | 0.882 | 0.920 | 0.941 | 0.032 |
| PraNet | 0.898 | 0.840 | 0.885 | 0.915 | 0.944 | 0.030 |
| SANet | 0.904 | 0.847 | 0.892 | 0.915 | 0.949 | 0.027 |
| EU-Net | 0.908 | 0.854 | 0.893 | 0.917 | 0.951 | 0.028 |
| VTANet | 0.921 | 0.865 | 0.912 | 0.923 | 0.956 | 0.023 |
Table II. The Segmentation Results of the ClinicDB Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.823 | 0.755 | 0.811 | 0.889 | 0.913 | 0.019 |
| UNet++ | 0.794 | 0.729 | 0.785 | 0.873 | 0.891 | 0.022 |
| MSEG | 0.909 | 0.864 | 0.907 | 0.938 | 0.961 | 0.007 |
| ACSNet | 0.882 | 0.826 | 0.873 | 0.927 | 0.947 | 0.011 |
| PraNet | 0.899 | 0.849 | 0.896 | 0.936 | 0.979 | 0.009 |
| SANet | 0.912 | 0.856 | 0.907 | 0.929 | 0.968 | 0.012 |
| EU-Net | 0.902 | 0.846 | 0.891 | 0.936 | 0.959 | 0.011 |
| VTANet | 0.916 | 0.867 | 0.916 | 0.943 | 0.972 | 0.010 |
Table III. The Segmentation Results of the ColonDB Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.512 | 0.432 | 0.498 | 0.713 | 0.696 | 0.061 |
| UNet++ | 0.483 | 0.410 | 0.467 | 0.691 | 0.680 | 0.064 |
| MSEG | 0.735 | 0.666 | 0.724 | 0.834 | 0.859 | 0.038 |
| ACSNet | 0.716 | 0.649 | 0.697 | 0.829 | 0.839 | 0.039 |
| PraNet | 0.712 | 0.640 | 0.699 | 0.820 | 0.847 | 0.043 |
| SANet | 0.753 | 0.670 | 0.726 | 0.837 | 0.869 | 0.043 |
| EU-Net | 0.756 | 0.681 | 0.730 | 0.831 | 0.863 | 0.045 |
| VTANet | 0.767 | 0.694 | 0.743 | 0.856 | 0.876 | 0.041 |
Table IV. The Segmentation Results of the ETIS Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.398 | 0.335 | 0.366 | 0.684 | 0.643 | 0.036 |
| UNet++ | 0.401 | 0.344 | 0.390 | 0.683 | 0.629 | 0.035 |
| MSEG | 0.700 | 0.630 | 0.671 | 0.828 | 0.854 | 0.015 |
| ACSNet | 0.578 | 0.509 | 0.530 | 0.754 | 0.737 | 0.059 |
| PraNet | 0.628 | 0.567 | 0.600 | 0.794 | 0.808 | 0.031 |
| EU-Net | 0.687 | 0.609 | 0.636 | 0.793 | 0.807 | 0.067 |
| SANet | 0.750 | 0.654 | 0.685 | 0.849 | 0.881 | 0.015 |
| VTANet | 0.763 | 0.669 | 0.693 | 0.855 | 0.884 | 0.038 |
Table V. The Segmentation Results of the EndoScene Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| UNet | 0.710 | 0.627 | 0.684 | 0.843 | 0.847 | 0.022 |
| UNet++ | 0.707 | 0.624 | 0.687 | 0.839 | 0.834 | 0.018 |
| MSEG | 0.874 | 0.804 | 0.852 | 0.924 | 0.948 | 0.009 |
| ACSNet | 0.863 | 0.787 | 0.825 | 0.923 | 0.939 | 0.013 |
| PraNet | 0.871 | 0.797 | 0.843 | 0.925 | 0.950 | 0.010 |
| EU-Net | 0.837 | 0.765 | 0.805 | 0.904 | 0.919 | 0.015 |
| SANet | 0.888 | 0.815 | 0.859 | 0.928 | 0.962 | 0.008 |
| VTANet | 0.904 | 0.826 | 0.872 | 0.941 | 0.978 | 0.009 |
Table VI. The Ablation Results of the ETIS Dataset

| Method | mDice | mIoU | \({\text{F}}_{\beta }^{w}\) | \({S_\alpha }\) | \(m{E_\varepsilon }\) | MAE |
|---|---|---|---|---|---|---|
| PVT | 0.712 | 0.623 | 0.609 | 0.821 | 0.807 | 0.046 |
| PVT + AAFM | 0.734 | 0.624 | 0.687 | 0.839 | 0.834 | 0.038 |
| PVT + FAM | 0.754 | 0.604 | 0.652 | 0.824 | 0.848 | 0.049 |
| PVT + ASM | 0.715 | 0.657 | 0.625 | 0.823 | 0.839 | 0.043 |
| VTANet | 0.763 | 0.669 | 0.693 | 0.855 | 0.884 | 0.038 |
To verify the generalization ability of the proposed model, three polyp segmentation datasets, ETIS, ColonDB, and EndoScene, are used for testing. ETIS contains 196 images, ColonDB 380, and EndoScene 60. As can be seen from Tables III, IV, and V, the mDice score on the ColonDB dataset is 2.55% higher than that of the UNet model, the mIoU score on the ETIS dataset is 2.74% higher, and the score on the EndoScene dataset is 2.55% higher [35–38]. The results show that the proposed model has strong generalization ability.
Finally, the contribution of each component to the overall model is examined; the training, testing, and hyperparameter settings are consistent with those above. We use PVTv2 as the baseline and verify the proposed model by adding each module to the baseline in turn. The ablation results in Table VI show that each module contributes to improving network performance. After introducing the FAM module, the mDice score is 4.2% higher than that of the basic PVTv2 network, and introducing the AAFM module also improves the performance of the original PVT network [39–40].