Medical image segmentation plays a pivotal role in disease diagnosis[1], treatment planning[2], and prognosis[3]. Within the realm of computed tomography (CT) imaging, accurately distinguishing between subcutaneous fat, muscle, and intramuscular fat offers invaluable insights for clinicians when diagnosing diabetes[4], obesity[5], and various cancer-related conditions[6, 7]. Body composition analysis is now regarded as a touchstone for gauging health, nutritional, and functional status. Additionally, in surgical contexts, meticulous tissue analysis is vital for both surgical planning and postoperative evaluation[8].
However, the practical implementation of medical image segmentation encounters numerous hurdles. Image quality can be compromised by noise[9] and motion blur[10], and the similar visual characteristics and contrast of tissues like subcutaneous fat, muscle, and intramuscular fat in some imaging techniques complicate segmentation. While clinicians can employ segmentation tools such as Sliceomatic[11] or NIHImage[12], the outputs are suboptimal for images compromised by factors like subcutaneous soft tissue swelling or gas accumulation[13]. Addressing the challenge of extracting meaningful information from these complex medical images is a pressing issue for researchers, with physiological and anatomical individual variations adding another layer of complexity.
Recent years have seen convolutional neural network-based deep learning techniques showing remarkable potential in the domain of medical image segmentation. Models including Unet[14], Attention Unet[15], TransUnet[16], and SwinUnet[17] have emerged for diverse medical imaging tasks. Yet, there remain issues in harnessing and applying multi-scale information when processing CT images.
To bolster medical image segmentation accuracy, we developed SwinUnet3+, which combines Unet with the Transformer. The model is tailored to automatically segment the aforementioned tissues in the thoracoabdominal region under complex conditions. To the best of our knowledge, SwinUnet3+ is the first to integrate a U-shaped Transformer architecture with full-scale skip connections. The model comprises an encoder, bottleneck, decoder, and full-scale skip connections, all built on Swin Transformer blocks. Images are partitioned into distinct, non-overlapping patches that are treated as tokens and fed into the encoder for in-depth feature learning. The contextual features thus derived are upsampled by a patch expanding layer and fused with encoder features through full-scale skip connections to restore the spatial resolution of the feature maps for segmentation prediction. Experiments affirm the method's strong generalization capacity. Our contributions can be distilled into: (1) constructing a U-shaped Transformer architecture fortified with full-scale skip connections; (2) demonstrating the effect of varying the number of skip connections within the network; and (3) enhancing the segmentation of key tissues under complex conditions. This approach aspires to surmount the deep learning challenges posed by high-resolution, multi-scale CT images, striving for a paradigm shift in medical image segmentation.
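As an intuition for the full-scale skip connections, the following is a minimal sketch of how one decoder stage could fuse features from every encoder scale: each map is resized to the stage's resolution, projected to a common channel count, concatenated, and convolved. The layer names, channel sizes, and the use of plain convolutions instead of Swin Transformer blocks are assumptions for illustration only, not the exact SwinUnet3+ implementation.

```python
# Minimal sketch of full-scale skip fusion at one decoder stage.
# Layer names, channel sizes, and the plain convolutions (in place of
# Swin Transformer blocks) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleFusion(nn.Module):
    """Fuse encoder features from every scale into a single decoder stage."""
    def __init__(self, encoder_channels, out_channels):
        super().__init__()
        # One 1x1 projection per encoder scale so all inputs share a channel count.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in encoder_channels]
        )
        self.fuse = nn.Conv2d(out_channels * len(encoder_channels), out_channels, 3, padding=1)

    def forward(self, encoder_feats, target_size):
        # Resize every encoder feature map to this decoder stage's resolution
        # (downsampling or upsampling as needed), then concatenate and fuse.
        resized = [
            F.interpolate(p(f), size=target_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, encoder_feats)
        ]
        return self.fuse(torch.cat(resized, dim=1))

# Example: four encoder scales feeding a decoder stage at 56x56 resolution.
feats = [torch.randn(1, c, s, s) for c, s in [(96, 56), (192, 28), (384, 14), (768, 7)]]
stage = FullScaleFusion([96, 192, 384, 768], out_channels=96)
out = stage(feats, target_size=(56, 56))  # -> (1, 96, 56, 56)
```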
Related Work
1. Application of Unet and its Variants in Medical Image Segmentation
Since its introduction, Unet[14] has gained immense popularity in medical image segmentation due to its stellar performance with limited annotated data. Over the years, numerous Unet variants have been proposed to refine performance and cater to specific requirements.
In 2019, Zhou et al. [18] unveiled UNet++. While traditional Unet models have garnered success, they are not without constraints. Notably, deciding on the model's optimal depth typically demands exhaustive architecture searches, and the skip connections in traditional Unet only support fusion of same-scale feature maps from the encoder and decoder subnets. To circumvent these issues, UNet++ was conceptualized as a nested architecture assimilating Unets of varying depths. These Unets, with partially shared encoders, are co-trained under deep supervision. Furthermore, UNet++ redesigned its skip connections, boosting performance by aggregating multi-scale semantic features on the decoder subnet.
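For intuition, a rough sketch of one nested UNet++ node is given below: it concatenates all same-scale predecessor features with an upsampled deeper feature map before convolving, which is the dense skip pathway described above. Channel counts and layer choices are illustrative assumptions rather than the original implementation.

```python
# Rough sketch of one nested UNet++ node X^{i,j}: it fuses all same-scale
# predecessors X^{i,0..j-1} with the upsampled deeper node X^{i+1,j-1}.
# Channel counts and layers are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedNode(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, same_scale_feats, deeper_feat):
        # Upsample the deeper feature to the current scale, then concatenate
        # it with every earlier node at this scale (the dense skip pathway).
        up = F.interpolate(deeper_feat, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv(torch.cat(same_scale_feats + [up], dim=1))

# Example: node X^{0,2} receives X^{0,0} and X^{0,1} (32 channels each, 64x64)
# plus the deeper node X^{1,1} (64 channels, 32x32).
x00, x01 = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
x11 = torch.randn(1, 64, 32, 32)
node = NestedNode(32 + 32 + 64, 32)
x02 = node([x00, x01], x11)  # -> (1, 32, 64, 64)
```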
The subsequent years saw a proliferation of Unet variants. In 2021, Tran et al. [19] introduced TMD-Unet, which enhanced Unet by revamping network node connections, adopting dilated convolutions, and incorporating multi-scale input features. Zhou's team further contributed with Dimension Fusion Unet (D-UNet)[20], ingeniously merging 2D and 3D convolutions during the encoding phase—a significant stride for multi-modal medical imaging. Concurrently, Abedalla et al. [21] launched Ens4B-UNet, integrating four Unet architectures with pretrained backbones.
To address the challenge of detecting subtle structures in medical images, Su et al. [22] proposed MBFFNet in 2021, a real-time segmentation method tailored to colonoscopy examinations that offers improved accuracy.
Advancements persisted into 2022 and 2023. Lu et al. [23] conducted a meticulous analysis of Unet's segmentation efficacy, eventually formulating a streamlined version termed Semi-Unet. To tackle the semantic gap issue, Wang et al. incorporated a cross-attention mechanism into Unet, resulting in the MCA-UNet model[24]. Khaledyan et al. [25] compared the performance of different Unet segmentation models on breast ultrasound images.
In summary, the application and evolution of Unet and its derivatives in medical image segmentation underscore the monumental potential of deep learning. Continuous innovation has yielded ever-more specialized methods to fulfill the burgeoning requirements of the medical domain.
2. Application of Transformer in Medical Image Segmentation
Medical image segmentation stands at the forefront of medical image processing, proving instrumental for both diagnosis and treatment planning. Recently, the Transformer architecture, along with its variants, has piqued interest within this sphere. Despite its nascent stage in medical image segmentation, the Transformer's success in computer vision, coupled with its intrinsic self-attention mechanism, has positioned it as a promising research focal point.
Transformers excel at modeling long-range pixel interdependencies within medical images, a vital capability considering the high resolutions and 3D formats typical of modalities such as CT and MRI. Given the rich texture and structural detail of medical images, strong model adaptability is paramount.
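A minimal sketch of scaled dot-product self-attention over flattened patch tokens illustrates this long-range modeling: every output token is a weighted mixture of all tokens, so interactions are not limited by spatial distance. The dimensions and random weights below are illustrative only.

```python
# Minimal sketch of scaled dot-product self-attention over patch tokens.
# Dimensions and weights are illustrative only.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (num_tokens, dim), one token per image patch
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (tokens, tokens) pairwise interactions
    return F.softmax(scores, dim=-1) @ v                     # each output mixes information from all tokens

tokens, dim = 196, 64                       # e.g. a 14x14 grid of patch embeddings
x = torch.randn(tokens, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)      # (196, 64): every patch attends to every other patch
```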
Specifically, various studies have emerged: Lee et al. [26] compared methods for embedding shape priors into neural network-based segmentation. Lian et al. [27] introduced DTNet, a versatile network adept at segmenting CMF bones from extensive CBCT data while simultaneously pinpointing large-scale landmarks. Several other researchers have proposed architectures blending convolutional neural networks and Transformers to better extract and exploit both global and local information. Sun et al. [28] proposed the HybridCTrm network, a novel hybrid structure. Chen et al. [29] proposed PCAT-UNet, a Transformer-based U-shaped network that also contains convolutional branches. Wang et al. [30] proposed O-Net, a novel network that integrates the advantages of CNNs and Transformers to better utilize global and local information and improve the segmentation and classification performance of medical images. SwinBTS, proposed by Jiang et al. [31], is a 3D medical image segmentation method that integrates a Transformer, a convolutional neural network, and an encoder-decoder structure, defining 3D brain tumor semantic segmentation as a sequence-to-sequence prediction task.
In essence, Transformers and their offshoots showcase immense promise in medical image segmentation. As the technology advances, they are poised to exert a transformative influence on medical diagnosis and therapeutic planning.
3. Progress in Body Composition Segmentation on CT Images
The intersection of deep learning and medical imaging has garnered significant attention in recent years. Particularly in analyzing body composition from abdominal CT scans, advances in automatic positioning, recognition, and segmentation have been substantial. Automated segmentation techniques have demonstrated not only time efficiency but also accuracy comparable, if not superior, to manual methods. The studies by Dabiri et al. [36] and Ackermans et al. [37] provide strong evidence for the application of automated segmentation: for body composition analysis at the L3 level, automation not only saves significant time but also achieves accuracy similar to, or even higher than, conventional manual analysis. The studies by Arayne et al. [38] and Koitka et al. [39] emphasized the consistency and potential clinical applicability of body composition analysis at different vertebral levels (e.g., T4 and L3). The study of Hong et al. [40] demonstrated the accuracy of thoracolumbar CT analysis for estimating whole-body composition, while the work of Cespedes et al. [41] demonstrated the value of an automated approach in cancer prognosis research. Finally, Lee et al. [42] used an advanced 3D Unet to provide an accurate and efficient method for body composition segmentation from whole-body PET-CT images. Together, these studies highlight the potential of deep learning in medical imaging and provide important directions for future clinical practice.
However, the segmentation of body composition from CT scans is not devoid of challenges. Variabilities introduced by different scanning equipment and parameters can impede model generalization. Moreover, segmentation can falter in unique circumstances like gas accumulation or the presence of artifacts. Precise demarcation of smaller regions or tissue boundaries remains a formidable challenge. Despite substantial progress, achieving wholly accurate and reliable automated body composition analysis necessitates further research and technological innovation.