Multi-scale implicit feature enhancement module
Many researchers have proposed enhancing object detection models with attention mechanisms (such as CBAM, ECA, SE, CA, and EMA) or multi-scale feature fusion modules to improve the accuracy of small-target detection. Attention mechanisms typically reweight features along the spatial dimension, the channel dimension, or both. While they can significantly improve performance on many datasets, they also have limitations. In underwater environments, where small targets are common, attending only to channel or spatial features can direct excessive attention to high-resolution background regions, causing the model to extract irrelevant contextual information. Multi-scale feature fusion, on the other hand, expands the model's receptive field by using convolutional kernels of different sizes or pooling operations with varying strides, capturing information from the input feature maps at multiple scales. However, larger convolutional kernels increase the computational load and undermine the model's lightweight design, and aggressive pooling, especially with large strides, can discard the fine-grained details of small targets.
To overcome these problems and allow our model to focus on small targets at a lower computational cost, we propose a lightweight Multi-Scale Implicit Feature Enhancement Module (MSFF). This module combines multi-view feature capture with element-wise multiplication for implicit feature enhancement17. By mapping input feature maps into a higher-dimensional space, the MSFF module captures more detailed feature information without significantly increasing computational complexity, thereby strengthening the model's ability to detect small targets. During feature extraction, the MSFF module first applies average pooling to the input feature maps and then extracts and integrates features with a series of depthwise separable convolutions. Branches with different kernel sizes pass through vertical and horizontal separable convolutions and are implicitly fused by element-wise multiplication. This strengthens the high-dimensional mapping and yields a comprehensive, fine-grained representation of features at multiple scales.
The MSFF module further refines the expression of fused features using an activation function. It then integrates these enhanced features with the original input features through residual connections, ensuring effective information interaction across different depth levels. Experiments have shown that our design retains rich details and semantic information from the original images while significantly improving the model's performance in detecting small targets and handling complex scenes. By employing this innovative multi-scale implicit feature enhancement mechanism, MSFF achieves high-dimensional multi-scale fusion of input feature maps. This method markedly improves the model's detection accuracy and robustness. Overall, the MSFF module provides a novel solution for computer vision tasks, demonstrating exceptional adaptability and performance advantages in complex scenarios.
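To make the data flow concrete, below is a minimal PyTorch sketch assembled only from the operations named above (average pooling, depthwise separable convolutions, horizontal/vertical separable branches fused by element-wise multiplication, an activation, and a residual connection). The kernel sizes, channel widths, choice of activation, and exact ordering are illustrative assumptions, not the reference implementation of MSFF.

```python
import torch
import torch.nn as nn

class MSFFSketch(nn.Module):
    """Sketch of a multi-scale implicit feature enhancement block, following
    only the textual description: average pooling, a depthwise separable stem,
    horizontal/vertical depthwise branches at two kernel sizes fused by
    element-wise multiplication, an activation, and a residual connection."""

    def __init__(self, channels: int, kernel_sizes=(7, 11)):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        # Depthwise separable stem: depthwise 3x3 followed by pointwise 1x1.
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)
        # One horizontal + vertical depthwise pair per kernel size (multi-scale branches).
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            ))
        self.act = nn.SiLU()
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        y = self.pw(self.dw(self.pool(x)))
        # Implicit fusion: multiply the multi-scale branch outputs element-wise.
        fused = y
        for branch in self.branches:
            fused = fused * branch(y)
        out = self.proj(self.act(fused))
        return out + identity  # residual connection back to the input
```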
SSM-based feature extraction module
In object detection, Convolutional Neural Networks (CNNs) and Transformers are the most widely used and most developed paradigms, each with its own strengths and limitations. CNNs extract features through local convolutional windows, which limits the receptive field of each layer, and enlarging the kernels to widen that field adds a substantial computational burden. Nevertheless, CNNs learn target features quickly from limited data and offer good computational efficiency with a manageable parameter count; their local-window design, however, restricts their ability to model global and long-range dependencies18–22. Transformers, introduced to computer vision in 2020, overcome this limitation by capturing long-range dependencies through self-attention. The computational complexity of self-attention, however, scales with the square of the input sequence length, increasing both computation and memory demands. Transformers also generally require larger datasets to train effectively, which is a challenge for underwater target detection given the scarcity of real-world datasets23–25.
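For context, the complexity gap described above can be stated explicitly. These are the standard per-layer costs quoted in the literature, not figures measured in this work; L is the token-sequence length, d the embedding dimension, and N the state size of the state-space model.

```latex
% Standard per-layer time complexities (context only):
\underbrace{\mathcal{O}\!\left(L^{2} d\right)}_{\text{self-attention}}
\qquad \text{vs.} \qquad
\underbrace{\mathcal{O}\!\left(L\, d\, N\right)}_{\text{(selective) state-space model}}
```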
This work applies the Mamba model to underwater target detection and achieves notable results. Mamba has three key features: it builds on the HiPPO formulation to handle long-range modeling, compensating for the local-modeling limitation of CNNs; it uses a selective scan mechanism that turns the state-space parameters into functions of the input, allowing the model to adapt to each input in real time; and its computational complexity is linear in sequence length, which avoids the quadratic cost of the Transformer's self-attention.
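The underlying state-space formulation, which is standard in the S4/Mamba literature and reproduced here only for reference, is:

```latex
% Continuous-time state-space model and its zero-order-hold discretization,
% as used in S4/Mamba. In Mamba, B, C, and the step size \Delta are themselves
% functions of the input x_t, which is what makes the scan "selective".
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t), \\
\bar{A} &= \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B, \\
h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t .
\end{aligned}
```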
For underwater target detection, we propose a new feature extraction module called MSDBlock. It improves small-target detection through two key components: the Hybrid Feature Integration Block (HFIB) and the Unidirectional Gating Block (UGB). The HFIB strengthens detection in complex scenes by integrating channel and spatial attention. In underwater environments, where small targets are common and background interference is severe, relying only on local spatial feature extraction can cause excessive focus on high-resolution background information and poor capture of small-target features. To address this, the HFIB contains two components that optimize feature extraction: the Lightweight Spatial Attention Module (LSAM) and the Lightweight Channel Attention Module (LCAM). The LSAM generates spatial attention weights with two convolutional layers followed by a Sigmoid activation, efficiently capturing spatial correlations within the input feature maps. The LCAM produces channel attention weights through global average pooling followed by a series of convolutions, strengthening channel correlations and exploiting global information. Together, these designs enrich the local information available to the subsequent SS2D (2D selective scan) operation without significantly increasing the model's computational burden.
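A minimal PyTorch sketch of attention blocks matching this description is given below. Only the pooling, convolution, and Sigmoid steps are specified above, so the hidden widths, kernel sizes, and intermediate activation are assumptions.

```python
import torch
import torch.nn as nn

class LSAM(nn.Module):
    """Lightweight spatial attention sketch: two conv layers + Sigmoid."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, 1, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.conv2(self.act(self.conv1(x))))  # B x 1 x H x W
        return x * attn

class LCAM(nn.Module):
    """Lightweight channel attention sketch: global average pooling + 1x1 convs + Sigmoid."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.fc(self.pool(x)))  # B x C x 1 x 1
        return x * attn
```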
Experimental results show that the Unidirectional Gating Block (UGB) improves the efficiency and accuracy of feature extraction by routing the input features through a single, more direct gated path. The architecture of the MSDBlock is depicted in Fig. 2a. While research applying Mamba to the visual domain is gradually increasing, studies that combine it with the YOLO framework are still relatively limited. This paper is the first to integrate the Mamba model with the YOLO framework specifically for underwater target detection. We hope this work encourages more researchers to explore Mamba's potential for detecting small underwater targets; our study also validates the model's effectiveness and applicability in such scenarios.
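The sketch below illustrates one plausible composition of these parts, treating the HFIB-style attention (e.g. the LSAM/LCAM pair sketched above) and the SS2D layer as externally supplied modules. The gating structure of the UGB and the exact wiring shown in Fig. 2a are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UGB(nn.Module):
    """Unidirectional gating sketch: a single gate branch modulates a single
    value branch; the actual design in Fig. 2a may differ."""

    def __init__(self, channels: int):
        super().__init__()
        self.value = nn.Conv2d(channels, channels, 1)
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.value(x) * torch.sigmoid(self.gate(x))

class MSDBlockSketch(nn.Module):
    """Hypothetical wiring of the MSDBlock: local attention (hfib) ->
    2D selective scan (ss2d, e.g. from a VMamba implementation) ->
    unidirectional gating, with a residual connection."""

    def __init__(self, channels: int, hfib: nn.Module, ss2d: nn.Module):
        super().__init__()
        self.hfib = hfib    # enriches local spatial/channel context
        self.ss2d = ss2d    # global modeling via selective scan
        self.ugb = UGB(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.ss2d(self.hfib(x))
        return x + self.ugb(y)
```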
Underwater target detection network—UWNet
In underwater target detection, traditional methods often miss detections because of the complexity of underwater environments, which contain numerous small, occluded, and overlapping targets; these challenges significantly impact accuracy and robustness. To overcome them, we present a new network architecture, UWNet, built on the high-performing YOLOv8 framework. First, we replace the original downsampling convolutions with SPDConv26. A conventional strided convolution discards fine-grained spatial information during downsampling, which is especially harmful for small targets. SPDConv instead rearranges the input tensor into multiple spatial subregions stacked along the channel dimension before applying a non-strided convolution, allowing the network to extract features at a finer granularity and improving small-target detection. In addition, integrating the MSDBlock into the backbone gives UWNet global feature extraction, addressing the limitation of traditional CNNs that rely on local window modeling. Finally, coupling the MSFF module with the detection head lets the network aggregate information across scales, leading to more comprehensive feature capture. The architecture of UWNet is illustrated in Fig. 3.
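As a reference for the space-to-depth idea, here is a minimal sketch of an SPDConv-style downsampling layer. The channel widths, normalization, and activation are assumptions; only the subregion rearrangement followed by a non-strided convolution is taken from the description above.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth downsampling followed by a non-strided convolution:
    every 2x2 spatial block is moved into the channel dimension (no pixels
    are discarded), then a stride-1 convolution mixes the channels."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Space-to-depth with scale 2: (B, C, H, W) -> (B, 4C, H/2, W/2).
        tl = x[..., ::2, ::2]    # top-left pixel of each 2x2 block
        bl = x[..., 1::2, ::2]   # bottom-left
        tr = x[..., ::2, 1::2]   # top-right
        br = x[..., 1::2, 1::2]  # bottom-right
        x = torch.cat([tl, bl, tr, br], dim=1)
        return self.conv(x)
```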
We conducted experiments on the URPC2020 dataset from the National Underwater Robot Target Detection Algorithm Competition. The dataset comprises 5,544 images, split into training and validation sets at a 4:1 ratio, and covers four target categories: holothurian, echinus, scallop, and starfish. These targets are small and widely scattered in underwater images, which makes detection challenging. Our method achieved 85.1% mAP50 and 51.0% mAP50-95 on the validation set, with a precision of 84.2% and a recall of 77.2%, improvements of 2.4%, 2.7%, 2.7%, and 0.8% over YOLOv8n on these four metrics, respectively. The comparison results are illustrated in Fig. 5.

To further assess our approach, we compared it with the latest object detection models. Although accuracy in underwater object detection often depends on wider and deeper networks, our model achieved state-of-the-art performance with a remarkably low parameter count, showing that it maintains high accuracy while remaining efficient and therefore well suited to deployment on underwater robots that require real-time detection. On the validation set, UWNet's mAP50 of 85.1% and mAP50-95 of 51.0% are the highest among the compared models, while its total parameter count is only 6.67 million and the final trained model is 13.5 MB, underscoring its lightweight design. UWNet's mAP50-95 exceeds YOLOv9 and YOLOv10 by 2.5% and 1.4%, respectively. Mamba-YOLO, with 21.8 million parameters, reached an mAP50-95 of 50.1% on the validation set, surpassing YOLOv10 and RT-DETR and attaining a high mAP50; UWNet nevertheless surpasses Mamba-YOLO while using less than one-third of its parameters, demonstrating the effectiveness of the proposed MSDBlock in underwater target detection.

Overall, our model achieves the best underwater detection performance while remaining lightweight. The only metric on which it is not optimal is GFLOPs, where it trails YOLOv5s and YOLOv9t slightly, partly because Mamba's selective scanning requires additional computation. Nonetheless, UWNet has fewer parameters than YOLOv5s, and its final trained model is significantly smaller than YOLOv9's, as detailed in Supplementary Table 1. The extra cost is justified by the substantial accuracy gains provided by the MSDBlock, particularly in dynamic underwater environments and under color-bias interference: UWNet exceeds YOLOv5s and YOLOv9t in mAP50 by 2.6% and 2%, respectively. Figure 4 illustrates the trade-off between model parameter count and detection accuracy.
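For completeness, the metrics reported here and in the following subsections follow the standard COCO-style definitions (K is the number of categories, four in this dataset; p_tau(r) is the precision at recall r under IoU threshold tau):

```latex
\begin{aligned}
\mathrm{AP}_{\tau} &= \int_{0}^{1} p_{\tau}(r)\,\mathrm{d}r, \\
\mathrm{mAP50} &= \frac{1}{K}\sum_{k=1}^{K}\mathrm{AP}_{0.5}^{(k)}, \qquad
\mathrm{mAP50\text{-}95} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{10}
  \sum_{\tau\in\{0.50,\,0.55,\,\dots,\,0.95\}}\mathrm{AP}_{\tau}^{(k)}.
\end{aligned}
```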
Figure 4c visually compares the performance of different models across scenes. Green boxes indicate correct detections, blue boxes denote false detections, and red boxes highlight missed targets. The first image contains 46 targets: our model detects most of them with only a few misses and no false detections, performing comparably to Mamba-YOLO, whereas YOLOv8, YOLOv9, and YOLOv10 exhibit a significant number of missed and false detections. The second image features overlapping holothurians, which our method detects correctly. The third image shows an underwater scene under ordinary conditions, in which UWNet achieves the best detection performance among the compared models.
Performance on two test sets
To validate the applicability of the proposed UWNet model in underwater small target scenarios, we evaluated its detection performance using two independent test sets in different environments.
The selected datasets are the A and B test sets from the URPC2020 Underwater Target Detection Algorithm Competition, containing 800 and 1,200 test images, respectively. On the A test set, our method improved mAP50 and mAP50-95 by 4.9% and 3.2% over the baseline model; on the B test set, detection accuracy also improved significantly, with mAP50 and mAP50-95 increasing by 4.9% and 3.5%, respectively. We evaluated the latest YOLO-series, Mamba-based, and Transformer-based detectors on both datasets, as shown in Fig. 5. Our model achieved the highest mAP50 and mAP50-95 on Test A. On Test B, while its mAP50 was slightly lower than that of RT-DETR, it still achieved the highest mAP50-95.
As shown in Fig. 4c, echinus are entirely black and resemble rocks in underwater images, making them easy to confuse with the background; however, because they are abundant, they are relatively easy to detect. Scallops, by contrast, often occupy only a few pixels and resemble the seabed in shape, leading many models to produce frequent false positives and missed detections. Starfish are also small, but they are less densely distributed than scallops and have brightly colored surfaces, making them comparatively easier to detect. Holothurians, although larger in the image than scallops and starfish, change shape when startled and therefore appear with non-uniform forms, making them harder to detect consistently.
On Test A, UWNet achieved the highest mAP50 and mAP50-95 among the compared models. On Test B, its mAP50 was only 1% lower than that of RT-DETR, while its mAP50-95 was the highest. RT-DETR uses ResNet50 as its backbone, giving it 32.66 million parameters, roughly a 390% increase over UWNet, and a much higher computational cost, which makes it unsuitable for deployment on underwater robots requiring real-time detection. RT-DETR also performed inconsistently across the different underwater datasets, indicating considerable room to improve its robustness for underwater detection tasks. On both Test A and Test B, UWNet achieved the highest detection accuracy for the most challenging category, scallop, and near-optimal performance for echinus, the category with the largest number of small bounding boxes. Its detection of starfish was second only to RT-DETR, while its recognition of holothurian still leaves room for improvement. These comparative experiments demonstrate UWNet's suitability for underwater target detection and its robustness across test sets: it outperformed most of the compared models in detection accuracy on the various test sets. More detailed comparison results are presented in Fig. 5, and the full model data can be found in Supplementary Table 2.
Validating the generalization of UWNet across different underwater scenarios
To verify the generalization and robustness of UWNet, we retrained it on two additional underwater datasets: DUO and URPC2021. The DUO dataset includes 6,671 training images and 1,111 test images, while the URPC2021 dataset consists of 7,600 images split into training and validation sets at a 4:1 ratio. UWNet was trained separately on each dataset to evaluate its applicability across data sources; Supplementary Table 3 gives a detailed overview of the results for each model. Under identical experimental conditions, UWNet achieved 86.2% mAP50 and 68.0% mAP50-95 on the DUO dataset, outperforming the other state-of-the-art detectors and improving on the baseline model by 3.1% and 4.5%, respectively. Notably, the AP50 for holothurian, echinus, scallop, and starfish reached 88.5%, 93.3%, 69.2%, and 86.2%, the highest among the compared models. On the URPC2021 dataset, UWNet achieved 84.6% mAP50 and 51.2% mAP50-95; its mAP50 was the highest among the models, and its mAP50-95 was only 0.3% lower than that of Mamba-YOLO. Mamba-YOLO, however, has roughly 227% more parameters than UWNet (21.8 million versus 6.67 million), and UWNet outperformed it on the other test datasets. On URPC2021, UWNet achieved the best detection performance for echinus, scallop, and starfish, with holothurian accuracy only 0.3% lower than that of YOLOv10.
Experiments on the DUO and URPC2021 datasets thus demonstrate UWNet's effectiveness in underwater target detection and its ability to generalize across datasets. To examine the model further, we used GradCAM to generate heatmaps that visualize the regions each model attends to when extracting features from underwater images, allowing a direct comparison between UWNet and the other models. These feature-map visualizations are shown in Fig. 6.
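As an illustration of the visualization procedure, the following is a generic, minimal Grad-CAM sketch rather than the authors' exact code. For a detector, the scalar score to back-propagate (here the hypothetical `score_fn`) would typically be the confidence of the detection of interest, which is an assumption since the paper does not specify it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model: nn.Module, target_layer: nn.Module,
             image: torch.Tensor, score_fn) -> torch.Tensor:
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradients of a scalar score, then ReLU and normalize.
    `image` is a (1, 3, H, W) tensor; `score_fn` maps model output to a scalar."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        model.zero_grad()
        score = score_fn(model(image))  # scalar score for the target of interest (assumption)
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    act, grad = activations[0], gradients[0]        # (1, C, h, w)
    weights = grad.mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
    cam = torch.relu((weights * act).sum(dim=1))    # weighted channel sum + ReLU
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    # Upsample the heatmap to the input resolution for overlaying on the image.
    return F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                         mode="bilinear", align_corners=False).squeeze(1)
```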