Medical image segmentation plays a pivotal role in disease diagnosis[1], treatment planning[2], and prognosis[3]. Within the realm of computed tomography (CT) imaging, accurately distinguishing between subcutaneous fat, muscle, and intramuscular fat offers invaluable insights for clinicians when diagnosing diabetes[4], obesity[5], and various cancer-related conditions[6, 7]. Body composition analysis is now regarded as a touchstone for gauging health, nutritional, and functional status. Additionally, in surgical contexts, meticulous tissue analysis is vital for both surgical planning and postoperative evaluation[8].
However, the practical implementation of medical image segmentation encounters numerous hurdles. Image quality can be compromised by noise[9] and motion blur[10], and the similar visual characteristics and contrast of tissues like subcutaneous fat, muscle, and intramuscular fat in some imaging techniques complicate segmentation. While clinicians can employ segmentation tools such as Sliceomatic[11] or NIHImage[12], the outputs are suboptimal for images compromised by factors like subcutaneous soft tissue swelling or gas accumulation[13]. Addressing the challenge of extracting meaningful information from these complex medical images is a pressing issue for researchers, with physiological and anatomical individual variations adding another layer of complexity.
Recent years have seen convolutional neural network-based deep learning techniques showing remarkable potential in the domain of medical image segmentation. Models including Unet[14], Attention Unet[15], TransUnet[16], and SwinUnet[17] have emerged for diverse medical imaging tasks. Yet, there remain issues in harnessing and applying multi-scale information when processing CT images.
To bolster medical image segmentation accuracy, we developed SwinUnet3+, which combines Unet with the Transformer. The model is tailored to automatically segment the aforementioned tissues in the thoracoabdominal region under complex conditions. To the best of our knowledge, SwinUnet3+ is the first to integrate a U-shaped Transformer architecture with full-scale skip connections. The model comprises an encoder, bottleneck, decoder, and full-scale skip connections, all built on Swin Transformer blocks. Images are partitioned into distinct, non-overlapping patches that are treated as tokens and fed into the encoder for in-depth feature learning. The contextual features thus derived are upsampled by a patch expanding layer and fused with encoder features through full-scale skip connections to restore the spatial resolution of the feature maps for segmentation prediction. Experiments affirm the method's strong generalization capacity. Our contributions can be distilled into: (1) constructing a U-shaped Transformer architecture fortified with full-scale skip connections; (2) demonstrating the effect of varying the number of skip connections within the network; and (3) enhancing the segmentation of key tissues under complex conditions. This approach aspires to surmount the deep learning challenges posed by high-resolution, multi-scale CT images, striving for a paradigm shift in medical image segmentation.
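As an intuition for the full-scale skip connections, the following is a minimal sketch of how one decoder stage could fuse features from every encoder scale: each map is resized to the stage's resolution, projected to a common channel count, concatenated, and convolved. The layer names, channel sizes, and the use of plain convolutions instead of Swin Transformer blocks are assumptions for illustration only, not the exact SwinUnet3+ implementation.

```python
# Minimal sketch of full-scale skip fusion at one decoder stage.
# Layer names, channel sizes, and the plain convolutions (in place of
# Swin Transformer blocks) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleFusion(nn.Module):
    """Fuse encoder features from every scale into a single decoder stage."""
    def __init__(self, encoder_channels, out_channels):
        super().__init__()
        # One 1x1 projection per encoder scale so all inputs share a channel count.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in encoder_channels]
        )
        self.fuse = nn.Conv2d(out_channels * len(encoder_channels), out_channels, 3, padding=1)

    def forward(self, encoder_feats, target_size):
        # Resize every encoder feature map to this decoder stage's resolution
        # (downsampling or upsampling as needed), then concatenate and fuse.
        resized = [
            F.interpolate(p(f), size=target_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, encoder_feats)
        ]
        return self.fuse(torch.cat(resized, dim=1))

# Example: four encoder scales feeding a decoder stage at 56x56 resolution.
feats = [torch.randn(1, c, s, s) for c, s in [(96, 56), (192, 28), (384, 14), (768, 7)]]
stage = FullScaleFusion([96, 192, 384, 768], out_channels=96)
out = stage(feats, target_size=(56, 56))  # -> (1, 96, 56, 56)
```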
Related Work
1. Application of Unet and its Variants in Medical Image Segmentation
Since its introduction, Unet[14] has gained immense popularity in medical image segmentation due to its stellar performance with limited annotated data. Over the years, numerous Unet variants have been proposed to refine performance and cater to specific requirements.
In 2019, Zhou et al. [18] unveiled UNet++. While traditional Unet models have garnered success, they are not without constraints. Notably, deciding on the model's optimal depth typically demands exhaustive architecture searches, and the skip connections in traditional Unet only support fusion of same-scale feature maps from the encoder and decoder subnets. To circumvent these issues, UNet++ was conceptualized as a nested architecture assimilating Unets of varying depths. These Unets, with partially shared encoders, are co-trained under deep supervision. Furthermore, UNet++ redesigned its skip connections, boosting performance by aggregating multi-scale semantic features on the decoder subnet.
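For intuition, a rough sketch of one nested UNet++ node is given below: it concatenates all same-scale predecessor features with an upsampled deeper feature map before convolving, which is the dense skip pathway described above. Channel counts and layer choices are illustrative assumptions rather than the original implementation.

```python
# Rough sketch of one nested UNet++ node X^{i,j}: it fuses all same-scale
# predecessors X^{i,0..j-1} with the upsampled deeper node X^{i+1,j-1}.
# Channel counts and layers are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedNode(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, same_scale_feats, deeper_feat):
        # Upsample the deeper feature to the current scale, then concatenate
        # it with every earlier node at this scale (the dense skip pathway).
        up = F.interpolate(deeper_feat, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv(torch.cat(same_scale_feats + [up], dim=1))

# Example: node X^{0,2} receives X^{0,0} and X^{0,1} (32 channels each, 64x64)
# plus the deeper node X^{1,1} (64 channels, 32x32).
x00, x01 = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
x11 = torch.randn(1, 64, 32, 32)
node = NestedNode(32 + 32 + 64, 32)
x02 = node([x00, x01], x11)  # -> (1, 32, 64, 64)
```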
The subsequent years saw a proliferation of Unet variants. In 2021, Tran et al. [19] introduced TMD-Unet, which enhanced Unet by revamping network node connections, adopting dilated convolutions, and incorporating multi-scale input features. Zhou's team further contributed with Dimension Fusion Unet (D-UNet)[20], ingeniously merging 2D and 3D convolutions during the encoding phase—a significant stride for multi-modal medical imaging. Concurrently, Abedalla et al. [21] launched Ens4B-UNet, integrating four Unet architectures with pretrained backbones.
To address the challenge of detecting subtle structures in medical images, Su et al. [22] proposed MBFFNet in 2021, a real-time segmentation method tailored to colonoscopy examinations that offers improved accuracy.
Advancements persisted into 2022 and 2023. Lu et al. [23] conducted a meticulous analysis of Unet's segmentation efficacy, eventually formulating a streamlined version termed Semi-Unet. To tackle the semantic gap issue, Wang et al. incorporated a cross-attention mechanism into Unet, resulting in the MCA-UNet model[24]. Khaledyan et al. [25] compared the performance of different Unet segmentation models on breast ultrasound images.
In summary, the application and evolution of Unet and its derivatives in medical image segmentation underscore the monumental potential of deep learning. Continuous innovation has yielded ever-more specialized methods to fulfill the burgeoning requirements of the medical domain.
2. Application of Transformer in Medical Image Segmentation
Medical image segmentation stands at the forefront of medical image processing, proving instrumental for both diagnosis and treatment planning. Recently, the Transformer architecture, along with its variants, has piqued interest within this sphere. Despite its nascent stage in medical image segmentation, the Transformer's success in computer vision, coupled with its intrinsic self-attention mechanism, has positioned it as a promising research focal point.
Transformers excel at modeling long-range pixel interdependencies within medical images, a vital capability considering the high resolutions and 3D formats typical of modalities such as CT and MRI. Given the rich texture and structural detail of medical images, strong model adaptability is paramount.
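A minimal sketch of scaled dot-product self-attention over flattened patch tokens illustrates this long-range modeling: every output token is a weighted mixture of all tokens, so interactions are not limited by spatial distance. The dimensions and random weights below are illustrative only.

```python
# Minimal sketch of scaled dot-product self-attention over patch tokens.
# Dimensions and weights are illustrative only.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (num_tokens, dim), one token per image patch
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (tokens, tokens) pairwise interactions
    return F.softmax(scores, dim=-1) @ v                     # each output mixes information from all tokens

tokens, dim = 196, 64                       # e.g. a 14x14 grid of patch embeddings
x = torch.randn(tokens, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)      # (196, 64): every patch attends to every other patch
```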
Specifically, various studies have emerged: Lee et al. [26] compared methods for embedding shape priors into neural network-based segmentation. Lian et al. [27] introduced DTNet, a versatile network adept at segmenting CMF bones from extensive CBCT data while simultaneously pinpointing large-scale landmarks. Several other researchers have proposed architectures blending convolutional neural networks and Transformers to better extract and exploit both global and local information. Sun et al. [28] proposed the HybridCTrm network, a novel hybrid structure. Chen et al. [29] proposed PCAT-UNet, a Transformer-based U-shaped network that also contains convolutional branches. Wang et al. [30] proposed O-Net, a novel network that integrates the advantages of CNNs and Transformers to better utilize global and local information and improve the segmentation and classification performance of medical images. SwinBTS, proposed by Jiang et al. [31], is a 3D medical image segmentation method that integrates a Transformer, a convolutional neural network, and an encoder-decoder structure, defining 3D brain tumor semantic segmentation as a sequence-to-sequence prediction task.
In essence, Transformers and their offshoots showcase immense promise in medical image segmentation. As the technology advances, they are poised to exert a transformative influence on medical diagnosis and therapeutic planning.
3. Progress in Body Composition Segmentation on CT Images
The intersection of deep learning and medical imaging has garnered significant attention in recent years. Particularly in analyzing body composition from abdominal CT scans, advances in automatic positioning, recognition, and segmentation have been substantial. Automated segmentation techniques have demonstrated not only time efficiency but also accuracy comparable, if not superior, to manual methods. The studies by Dabiri et al. [36] and Ackermans et al. [37] provide strong evidence for the application of automated segmentation: for body composition analysis at the L3 level, automation not only saves significant time but also achieves accuracy similar to, or even higher than, conventional manual analysis. The studies by Arayne et al. [38] and Koitka et al. [39] emphasized the consistency and potential clinical applicability of body composition analysis at different vertebral levels (e.g., T4 and L3). The study of Hong et al. [40] demonstrated the accuracy of thoracolumbar CT analysis for estimating whole-body composition, while the work of Cespedes et al. [41] demonstrated the value of an automated approach in cancer prognosis research. Finally, Lee et al. [42] used an advanced 3D Unet to provide an accurate and efficient method for body composition segmentation from whole-body PET-CT images. Together, these studies highlight the potential of deep learning in medical imaging and provide important directions for future clinical practice.
However, the segmentation of body composition from CT scans is not devoid of challenges. Variabilities introduced by different scanning equipment and parameters can impede model generalization. Moreover, segmentation can falter in unique circumstances like gas accumulation or the presence of artifacts. Precise demarcation of smaller regions or tissue boundaries remains a formidable challenge. Despite substantial progress, achieving wholly accurate and reliable automated body composition analysis necessitates further research and technological innovation.