Multi-scale implicit feature enhancement module
Many researchers have proposed enhancing object detection models with attention mechanisms (such as CBAM, ECA, SE, CA, and EMA) or multi-scale feature fusion modules to improve the accuracy of small-target detection. Attention mechanisms typically reweight features along the spatial dimension, the channel dimension, or both. While they can significantly improve performance on many datasets, they also have limitations. In underwater environments, where small targets are common, attending only to channel or spatial features can direct excessive attention to high-resolution background regions, causing the model to extract irrelevant contextual information. Multi-scale feature fusion, on the other hand, expands the model's receptive field by using convolutional kernels of different sizes or pooling operations with varying strides, capturing information from the input feature maps at multiple scales. However, larger convolutional kernels increase the computational load and undermine the model's lightweight design, and aggressive pooling, especially with large strides, can discard the fine-grained details of small targets.
To overcome these problems and allow our model to focus on small targets at a lower computational cost, we propose a lightweight Multi-Scale Implicit Feature Enhancement Module (MSFF). This module combines multi-view feature capture with element-wise multiplication for implicit feature enhancement17. By mapping input feature maps into a higher-dimensional space, the MSFF module captures more detailed feature information without significantly increasing computational complexity, thereby strengthening the model's ability to detect small targets. During feature extraction, the MSFF module first applies average pooling to the input feature maps and then extracts and integrates features with a series of depthwise separable convolutions. Branches with different kernel sizes pass through vertical and horizontal separable convolutions and are implicitly fused by element-wise multiplication. This strengthens the high-dimensional mapping and yields a comprehensive, fine-grained representation of features at multiple scales.
The MSFF module further refines the expression of fused features using an activation function. It then integrates these enhanced features with the original input features through residual connections, ensuring effective information interaction across different depth levels. Experiments have shown that our design retains rich details and semantic information from the original images while significantly improving the model's performance in detecting small targets and handling complex scenes. By employing this innovative multi-scale implicit feature enhancement mechanism, MSFF achieves high-dimensional multi-scale fusion of input feature maps. This method markedly improves the model's detection accuracy and robustness. Overall, the MSFF module provides a novel solution for computer vision tasks, demonstrating exceptional adaptability and performance advantages in complex scenarios.
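To make the data flow concrete, below is a minimal PyTorch sketch assembled only from the operations named above (average pooling, depthwise separable convolutions, horizontal/vertical separable branches fused by element-wise multiplication, an activation, and a residual connection). The kernel sizes, channel widths, choice of activation, and exact ordering are illustrative assumptions, not the reference implementation of MSFF.

```python
import torch
import torch.nn as nn

class MSFFSketch(nn.Module):
    """Sketch of a multi-scale implicit feature enhancement block, following
    only the textual description: average pooling, a depthwise separable stem,
    horizontal/vertical depthwise branches at two kernel sizes fused by
    element-wise multiplication, an activation, and a residual connection."""

    def __init__(self, channels: int, kernel_sizes=(7, 11)):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        # Depthwise separable stem: depthwise 3x3 followed by pointwise 1x1.
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)
        # One horizontal + vertical depthwise pair per kernel size (multi-scale branches).
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            ))
        self.act = nn.SiLU()
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        y = self.pw(self.dw(self.pool(x)))
        # Implicit fusion: multiply the multi-scale branch outputs element-wise.
        fused = y
        for branch in self.branches:
            fused = fused * branch(y)
        out = self.proj(self.act(fused))
        return out + identity  # residual connection back to the input
```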
SSM-based feature extraction module
In object detection, Convolutional Neural Networks (CNNs) and Transformers are the most widely used and most developed paradigms, each with its own strengths and limitations. CNNs extract features through local convolutional windows, which limits the receptive field of each layer, and enlarging the kernels to widen that field adds a substantial computational burden. Nevertheless, CNNs learn target features quickly from limited data and offer good computational efficiency with a manageable parameter count; their local-window design, however, restricts their ability to model global and long-range dependencies18–22. Transformers, introduced to computer vision in 2020, overcome this limitation by capturing long-range dependencies through self-attention. The computational complexity of self-attention, however, scales with the square of the input sequence length, increasing both computation and memory demands. Transformers also generally require larger datasets to train effectively, which is a challenge for underwater target detection given the scarcity of real-world datasets23–25.
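For context, the complexity gap described above can be stated explicitly. These are the standard per-layer costs quoted in the literature, not figures measured in this work; L is the token-sequence length, d the embedding dimension, and N the state size of the state-space model.

```latex
% Standard per-layer time complexities (context only):
\underbrace{\mathcal{O}\!\left(L^{2} d\right)}_{\text{self-attention}}
\qquad \text{vs.} \qquad
\underbrace{\mathcal{O}\!\left(L\, d\, N\right)}_{\text{(selective) state-space model}}
```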
This work applies the Mamba model to underwater target detection and achieves notable results. Mamba has three key features: it builds on the HiPPO formulation to handle long-range modeling, compensating for the local-modeling limitation of CNNs; it uses a selective scan mechanism that turns the state-space parameters into functions of the input, allowing the model to adapt to each input in real time; and its computational complexity is linear in sequence length, which avoids the quadratic cost of the Transformer's self-attention.
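The underlying state-space formulation, which is standard in the S4/Mamba literature and reproduced here only for reference, is:

```latex
% Continuous-time state-space model and its zero-order-hold discretization,
% as used in S4/Mamba. In Mamba, B, C, and the step size \Delta are themselves
% functions of the input x_t, which is what makes the scan "selective".
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t), \\
\bar{A} &= \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B, \\
h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t .
\end{aligned}
```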
For underwater target detection, we propose a new feature extraction module called MSDBlock. It improves small-target detection through two key components: the Hybrid Feature Integration Block (HFIB) and the Unidirectional Gating Block (UGB). The HFIB strengthens detection in complex scenes by integrating channel and spatial attention. In underwater environments, where small targets are common and background interference is severe, relying only on local spatial feature extraction can cause excessive focus on high-resolution background information and poor capture of small-target features. To address this, the HFIB contains two components that optimize feature extraction: the Lightweight Spatial Attention Module (LSAM) and the Lightweight Channel Attention Module (LCAM). The LSAM generates spatial attention weights with two convolutional layers followed by a Sigmoid activation, efficiently capturing spatial correlations within the input feature maps. The LCAM produces channel attention weights through global average pooling followed by a series of convolutions, strengthening channel correlations and exploiting global information. Together, these designs enrich the local information available to the subsequent SS2D (2D selective scan) operation without significantly increasing the model's computational burden.
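A minimal PyTorch sketch of attention blocks matching this description is given below. Only the pooling, convolution, and Sigmoid steps are specified above, so the hidden widths, kernel sizes, and intermediate activation are assumptions.

```python
import torch
import torch.nn as nn

class LSAM(nn.Module):
    """Lightweight spatial attention sketch: two conv layers + Sigmoid."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, 1, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.conv2(self.act(self.conv1(x))))  # B x 1 x H x W
        return x * attn

class LCAM(nn.Module):
    """Lightweight channel attention sketch: global average pooling + 1x1 convs + Sigmoid."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.fc(self.pool(x)))  # B x C x 1 x 1
        return x * attn
```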
Experimental results show that the Unidirectional Gating Block (UGB) improves the efficiency and accuracy of feature extraction by routing the input features through a single, more direct gated path. The architecture of the MSDBlock is depicted in Fig. 2a. While research applying Mamba to the visual domain is gradually increasing, studies that combine it with the YOLO framework are still relatively limited. This paper is the first to integrate the Mamba model with the YOLO framework specifically for underwater target detection. We hope this work encourages more researchers to explore Mamba's potential for detecting small underwater targets; our study also validates the model's effectiveness and applicability in such scenarios.
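The sketch below illustrates one plausible composition of these parts, treating the HFIB-style attention (e.g. the LSAM/LCAM pair sketched above) and the SS2D layer as externally supplied modules. The gating structure of the UGB and the exact wiring shown in Fig. 2a are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UGB(nn.Module):
    """Unidirectional gating sketch: a single gate branch modulates a single
    value branch; the actual design in Fig. 2a may differ."""

    def __init__(self, channels: int):
        super().__init__()
        self.value = nn.Conv2d(channels, channels, 1)
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.value(x) * torch.sigmoid(self.gate(x))

class MSDBlockSketch(nn.Module):
    """Hypothetical wiring of the MSDBlock: local attention (hfib) ->
    2D selective scan (ss2d, e.g. from a VMamba implementation) ->
    unidirectional gating, with a residual connection."""

    def __init__(self, channels: int, hfib: nn.Module, ss2d: nn.Module):
        super().__init__()
        self.hfib = hfib    # enriches local spatial/channel context
        self.ss2d = ss2d    # global modeling via selective scan
        self.ugb = UGB(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.ss2d(self.hfib(x))
        return x + self.ugb(y)
```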
Underwater target detection network—UWNet
In underwater target detection, traditional methods often miss detections because of the complexity of underwater environments, which contain numerous small, occluded, and overlapping targets; these challenges significantly impact accuracy and robustness. To overcome them, we present a new network architecture, UWNet, built on the high-performing YOLOv8 framework. First, we replace the original downsampling convolutions with SPDConv26. A conventional strided convolution discards fine-grained spatial information during downsampling, which is especially harmful for small targets. SPDConv instead rearranges the input tensor into multiple spatial subregions stacked along the channel dimension before applying a non-strided convolution, allowing the network to extract features at a finer granularity and improving small-target detection. In addition, integrating the MSDBlock into the backbone gives UWNet global feature extraction, addressing the limitation of traditional CNNs that rely on local window modeling. Finally, coupling the MSFF module with the detection head lets the network aggregate information across scales, leading to more comprehensive feature capture. The architecture of UWNet is illustrated in Fig. 3.
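As a reference for the space-to-depth idea, here is a minimal sketch of an SPDConv-style downsampling layer. The channel widths, normalization, and activation are assumptions; only the subregion rearrangement followed by a non-strided convolution is taken from the description above.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth downsampling followed by a non-strided convolution:
    every 2x2 spatial block is moved into the channel dimension (no pixels
    are discarded), then a stride-1 convolution mixes the channels."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Space-to-depth with scale 2: (B, C, H, W) -> (B, 4C, H/2, W/2).
        tl = x[..., ::2, ::2]    # top-left pixel of each 2x2 block
        bl = x[..., 1::2, ::2]   # bottom-left
        tr = x[..., ::2, 1::2]   # top-right
        br = x[..., 1::2, 1::2]  # bottom-right
        x = torch.cat([tl, bl, tr, br], dim=1)
        return self.conv(x)
```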
We conducted experiments on the URPC2020 dataset from the National Underwater Robot Target Detection Algorithm Competition. The dataset comprises 5,544 images, split into training and validation sets at a 4:1 ratio, and covers four target categories: holothurian, echinus, scallop, and starfish. These targets are small and widely scattered in underwater images, which makes detection challenging. Our method achieved 85.1% mAP50 and 51.0% mAP50-95 on the validation set, with a precision of 84.2% and a recall of 77.2%, improvements of 2.4%, 2.7%, 2.7%, and 0.8% over YOLOv8n on these four metrics, respectively. The comparison results are illustrated in Fig. 5.

To further assess our approach, we compared it with the latest object detection models. Although accuracy in underwater object detection often depends on wider and deeper networks, our model achieved state-of-the-art performance with a remarkably low parameter count, showing that it maintains high accuracy while remaining efficient and therefore well suited to deployment on underwater robots that require real-time detection. On the validation set, UWNet's mAP50 of 85.1% and mAP50-95 of 51.0% are the highest among the compared models, while its total parameter count is only 6.67 million and the final trained model is 13.5 MB, underscoring its lightweight design. UWNet's mAP50-95 exceeds YOLOv9 and YOLOv10 by 2.5% and 1.4%, respectively. Mamba-YOLO, with 21.8 million parameters, reached an mAP50-95 of 50.1% on the validation set, surpassing YOLOv10 and RT-DETR and attaining a high mAP50; UWNet nevertheless surpasses Mamba-YOLO while using less than one-third of its parameters, demonstrating the effectiveness of the proposed MSDBlock in underwater target detection.

Overall, our model achieves the best underwater detection performance while remaining lightweight. The only metric on which it is not optimal is GFLOPs, where it trails YOLOv5s and YOLOv9t slightly, partly because Mamba's selective scanning requires additional computation. Nonetheless, UWNet has fewer parameters than YOLOv5s, and its final trained model is significantly smaller than YOLOv9's, as detailed in Supplementary Table 1. The extra cost is justified by the substantial accuracy gains provided by the MSDBlock, particularly in dynamic underwater environments and under color-bias interference: UWNet exceeds YOLOv5s and YOLOv9t in mAP50 by 2.6% and 2%, respectively. Figure 4 illustrates the trade-off between model parameter count and detection accuracy.
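For completeness, the metrics reported here and in the following subsections follow the standard COCO-style definitions (K is the number of categories, four in this dataset; p_tau(r) is the precision at recall r under IoU threshold tau):

```latex
\begin{aligned}
\mathrm{AP}_{\tau} &= \int_{0}^{1} p_{\tau}(r)\,\mathrm{d}r, \\
\mathrm{mAP50} &= \frac{1}{K}\sum_{k=1}^{K}\mathrm{AP}_{0.5}^{(k)}, \qquad
\mathrm{mAP50\text{-}95} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{10}
  \sum_{\tau\in\{0.50,\,0.55,\,\dots,\,0.95\}}\mathrm{AP}_{\tau}^{(k)}.
\end{aligned}
```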
Figure 4c visually compares the performance of different models across scenes. Green boxes indicate correct detections, blue boxes denote false detections, and red boxes highlight missed targets. The first image contains 46 targets: our model detects most of them with only a few misses and no false detections, performing comparably to Mamba-YOLO, whereas YOLOv8, YOLOv9, and YOLOv10 exhibit a significant number of missed and false detections. The second image features overlapping holothurians, which our method detects correctly. The third image shows an underwater scene under ordinary conditions, in which UWNet achieves the best detection performance among the compared models.
Performance on two test sets
To validate the applicability of the proposed UWNet model in underwater small target scenarios, we evaluated its detection performance using two independent test sets in different environments.
The selected datasets are the A and B test sets from the URPC2020 Underwater Target Detection Algorithm Competition, containing 800 and 1,200 test images, respectively. On the A test set, our method improved mAP50 and mAP50-95 by 4.9% and 3.2% over the baseline model; on the B test set, detection accuracy also improved significantly, with mAP50 and mAP50-95 increasing by 4.9% and 3.5%, respectively. We evaluated the latest YOLO-series, Mamba-based, and Transformer-based detectors on both datasets, as shown in Fig. 5. Our model achieved the highest mAP50 and mAP50-95 on Test A. On Test B, while its mAP50 was slightly lower than that of RT-DETR, it still achieved the highest mAP50-95.
As shown in Fig. 4c, echinus are entirely black and resemble rocks in underwater images, making them easy to confuse with the background; however, because they are abundant, they are relatively easy to detect. Scallops, by contrast, often occupy only a few pixels and resemble the seabed in shape, leading many models to produce frequent false positives and missed detections. Starfish are also small, but they are less densely distributed than scallops and have brightly colored surfaces, making them comparatively easier to detect. Holothurians, although larger in the image than scallops and starfish, change shape when startled and therefore appear with non-uniform forms, making them harder to detect consistently.
On Test A, UWNet achieved the highest mAP50 and mAP50-95 among the compared models. On Test B, its mAP50 was only 1% lower than that of RT-DETR, while its mAP50-95 was the highest. RT-DETR uses ResNet50 as its backbone, giving it 32.66 million parameters, roughly a 390% increase over UWNet, and a much higher computational cost, which makes it unsuitable for deployment on underwater robots requiring real-time detection. RT-DETR also performed inconsistently across the different underwater datasets, indicating considerable room to improve its robustness for underwater detection tasks. On both Test A and Test B, UWNet achieved the highest detection accuracy for the most challenging category, scallop, and near-optimal performance for echinus, the category with the largest number of small bounding boxes. Its detection of starfish was second only to RT-DETR, while its recognition of holothurian still leaves room for improvement. These comparative experiments demonstrate UWNet's suitability for underwater target detection and its robustness across test sets: it outperformed most of the compared models in detection accuracy on the various test sets. More detailed comparison results are presented in Fig. 5, and the full model data can be found in Supplementary Table 2.
Validating the generalization of UWNet across different underwater scenarios
To verify the generalization and robustness of UWNet, we retrained it on two additional underwater datasets: DUO and URPC2021. The DUO dataset includes 6,671 training images and 1,111 test images, while the URPC2021 dataset consists of 7,600 images split into training and validation sets at a 4:1 ratio. UWNet was trained separately on each dataset to evaluate its applicability across data sources; Supplementary Table 3 gives a detailed overview of the results for each model. Under identical experimental conditions, UWNet achieved 86.2% mAP50 and 68.0% mAP50-95 on the DUO dataset, outperforming the other state-of-the-art detectors and improving on the baseline model by 3.1% and 4.5%, respectively. Notably, the AP50 for holothurian, echinus, scallop, and starfish reached 88.5%, 93.3%, 69.2%, and 86.2%, the highest among the compared models. On the URPC2021 dataset, UWNet achieved 84.6% mAP50 and 51.2% mAP50-95; its mAP50 was the highest among the models, and its mAP50-95 was only 0.3% lower than that of Mamba-YOLO. Mamba-YOLO, however, has roughly 227% more parameters than UWNet (21.8 million versus 6.67 million), and UWNet outperformed it on the other test datasets. On URPC2021, UWNet achieved the best detection performance for echinus, scallop, and starfish, with holothurian accuracy only 0.3% lower than that of YOLOv10.
Experiments on the DUO and URPC2021 datasets thus demonstrate UWNet's effectiveness in underwater target detection and its ability to generalize across datasets. To examine the model further, we used GradCAM to generate heatmaps that visualize the regions each model attends to when extracting features from underwater images, allowing a direct comparison between UWNet and the other models. These feature-map visualizations are shown in Fig. 6.
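As an illustration of the visualization procedure, the following is a generic, minimal Grad-CAM sketch rather than the authors' exact code. For a detector, the scalar score to back-propagate (here the hypothetical `score_fn`) would typically be the confidence of the detection of interest, which is an assumption since the paper does not specify it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model: nn.Module, target_layer: nn.Module,
             image: torch.Tensor, score_fn) -> torch.Tensor:
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradients of a scalar score, then ReLU and normalize.
    `image` is a (1, 3, H, W) tensor; `score_fn` maps model output to a scalar."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        model.zero_grad()
        score = score_fn(model(image))  # scalar score for the target of interest (assumption)
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    act, grad = activations[0], gradients[0]        # (1, C, h, w)
    weights = grad.mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
    cam = torch.relu((weights * act).sum(dim=1))    # weighted channel sum + ReLU
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    # Upsample the heatmap to the input resolution for overlaying on the image.
    return F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                         mode="bilinear", align_corners=False).squeeze(1)
```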