Improved weighted bidirectional FPN aquatic real-time target detection model based on cross-scale connections

doi:10.21203/rs.3.rs-4001173/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


As a pillar industry in coastal areas, aquaculture needs artificial intelligence technology to drive its economic development. This paper proposes a new method of multi-scale feature fusion and integrates it into the YOLOv5 backbone network for automated operations in the aquaculture industry. The model performs automated classification and detection of aquatic products, which increases the industry's productivity and fosters economic development. To provide training data for the model, this research creates a dataset comprising 15 species of marine products. In the data preprocessing stage, an underwater image enhancement approach is proposed to raise the dataset's quality, and Mosaic data augmentation is applied to enrich the dataset and bolster its features. A weighted bi-directional feature pyramid network is improved and fused into the neck network to strengthen the network's multi-scale feature fusion and the efficiency of its feature extraction. Moreover, the accuracy and speed of model prediction are significantly increased by integrating the SimAM attention mechanism and introducing the FReLU activation function in the network backbone section. Comparison and ablation experiments show the proposed model's superiority and efficacy. Measured by the mAP and FPS evaluation metrics, the enhanced YOLOv5 target detection model achieves 0.953 and 203 frames per second. Compared to the base YOLOv5 network, these metrics improve by 0.067 and 48 frames per second, respectively. In summary, our method can quickly and accurately identify aquatic products and achieve real-time target detection of marine products, laying the foundation for developing automation systems in the aquaculture industry.

underwater target recognition

automated aquaculture

improved YOLOv5

underwater image enhancement

The oceans are a massive energy reservoir and an indispensable power source for humankind's sustainable development. The rich resources in the oceans can be rationally developed and utilized, while the oceans, lakes, and rivers provide the environment for human economic production. Aquaculture is a rapidly expanding sector of the agricultural structure, contributing significantly to both the fishery economy and the national economy. It can vigorously drive the economic development of coastal areas and gradually become a pillar industry. Globally, the overall production of farmed aquatic products is also rising yearly. The scale of farming and the quality of its output affect the aquaculture industry's financial income. Traditional low-density, low-yield aquaculture methods can no longer keep up with the quality and yield that the development of the marine economy demands. As a result, the aquaculture industry has incorporated modern information technology and intelligent technology to optimize the aquaculture chain. It is therefore crucial to use computer vision technologies to create a new form of aquaculture that can achieve production modernization and automation, improve production efficiency, promote the development of the fishery economy, and save workforce and material resources.

Since 2014, deep learning-based target detection technology has advanced quickly, and its range of applications is growing, including road surface collapse detection [1], crop pest detection [2], ship target detection [3], and automatic driving [4]. The mainstream underwater target recognition technology currently includes sonar image detection [5] and underwater optical image detection [6–8]. Sonar equipment is more expensive and causes a certain degree of noise pollution, which affects the growth and development of aquatic organisms, so underwater target recognition based on sonar equipment is not suitable for aquaculture. Underwater optical images taken by underwater cameras, combined with optical image processing technology, allow submerged targets to be recognized accurately. Underwater cameras mounted on underwater robots and other automated equipment can complete the feeding and fishing of specific types of aquatic products without polluting the underwater environment. As technology advances, so do their precision and frame rate, which can satisfy the automation requirements of the aquaculture sector.

Target detection techniques built on deep learning have become increasingly popular as a result of deep learning's advancement and superior performance. Currently, classical target recognition algorithms mainly include single-stage and multi-stage approaches. Single-stage approaches include the YOLO (You Only Look Once) series [9], SSD [10], and RetinaNet [11] algorithms. In a single-stage target detection algorithm, the target is detected by extracting features only once, so recognition is faster, but accuracy is inferior to that of multi-stage methods. Multi-stage methods include the RCNN [12], Fast RCNN [13], Faster RCNN [14], and Mask RCNN [15] algorithms. Multi-stage target detection methods first extract candidate bounding boxes from the input picture and then perform a secondary correction on each candidate region to obtain the detection results. This raises detection accuracy but lowers inference speed, which suits domains that demand high accuracy but do not require real-time performance. Even though its prediction accuracy is still inferior to that of the two-stage target detection algorithms, the YOLO series' quicker inference time has made it a mainstay in industry. The YOLOv5 network introduces adaptive anchor box calculation and adaptive image scaling based on YOLOv4 [16]. It improves the scale of CSPNet [17] in the backbone network so that it has four scales of network structure: S, M, L, and XL, which can be applied to various fields to meet different needs. Compared with the later YOLOv7 and YOLOv8 networks, the YOLOv5 model is more mature, and its network structure is more stable and more robust. In the research field of this paper in particular, the model needs to perform well in real underwater scenes with a certain degree of real-time capability, which places higher demands on the stability and flexibility of the model. Given these considerations, this paper chooses YOLOv5 as the base network of the model. The feature extraction capability of a target detection model is crucial, and to achieve a higher level of multi-scale feature fusion, this paper introduces an improved weighted bi-directional feature pyramid network (BiFPN [18]) in place of the original PANet [19] structure in the YOLOv5 neck network.

Traditional aquatic organism identification methods can use machine learning techniques to recognize the class of marine organisms being studied. For example, classifiers such as Naive Bayes [20] (NB), Decision Tree [21] (DT), and Support Vector Machine (SVM) algorithms rely on manual feature selection, i.e., the relevant traits are selected based on human judgment. Feature selection of this kind is subjective and incomplete and tends to overlook essential features, so the accuracy of these methods is limited. In the last several years, researchers have made rapid progress by using deep learning to implement target detection algorithms.

In 2020, Song et al. [22] achieved mAP values greater than 90% on a small dataset of aquatic organisms by combining the Mask R-CNN with the MSRCR technique for image augmentation. Although the detection accuracy was increased, the training period was lengthy. The same year, Han et al. [23] integrated the refined YOLOv3 algorithm into an underwater robot for real-time detection of enhanced marine creature images. Nevertheless, the method is plagued by leakage issues and is unable to identify marine species with hazy edges. Using the enhanced YOLOv4 network, Mao et al. [24] presented a model in 2021 for the detection of marine species in shallow waters. To increase detection accuracy, the YOLOv4 network's Embedded Connection (EC) component was built and integrated. This technique lowers computing work while increasing detection accuracy. Iqbal et al. [25] presented a significant end-to-end CNN in 2022 with the goal of classifying fish behavior into two groups: normal and hungry. By changing the number of layers in the fully connected layer and deciding whether to employ the maximum pooling technique, they were able to assess the CNN's performance. According to the experimental results, the detection method's accuracy may be increased by 10% to reach 98% accuracy and demonstrate high performance by including a maximum pooling function into the CNN's shallow architecture and adding three fully connected layers. For the categorization of fish species, Kaya et al. [26] developed the CNN-based model IsVoNet8, which demonstrated a 91.37% classification accuracy in 2023. The same year, Ren et al. [27] conducted a study that used LIBS and Raman spectroscopy to create a new method of fish species identification. They combined two machine learning algorithms, SVM and CNN, with Raman spectroscopic data from 13 different fish species to generate a classification model, with the proposed CNN model achieving the highest classification accuracy of 96.2%. 
Even if the field of underwater target recognition has made great strides in the past, there is still much space for improvement, particularly in the areas of aquatic creature detection, localization, species identification, and quantitative statistics. At the same time, underwater target recognition requires high real-time performance, so designing a fast and high-accuracy underwater target detection model is vital.

This study presents research that enhances the YOLOv5 algorithm and applies it to underwater target detection, aiming to achieve automated rearing and fishing in the aquaculture industry. The real-time classification and detection model for aquaculture that this study suggests greatly enhances the speed and accuracy of underwater target recognition, saves fishermen's time, increases the productivity of fishermen, and fosters the growth of the aquaculture sector. To summarize, our work has made the following major contributions:

This study creates an underwater target recognition dataset consisting of 1108 aquatic images of different resolutions containing 15 common aquatic species. A training set plus a validation set comprise our dataset: 178 images make up the validation set and 930 images make up the training set. The corresponding labels were also created for this dataset;
Because of the low-quality images captured using underwater cameras, there are often cases of low image contrast, high noise, color deviation from the real world, and uneven or darker brightness. An underwater image quality improvement technique is shown as a solution to this issue. It can successfully raise the quality of the image and automatically modify the brightness of the image to fix the color deviation issue. The accuracy of subsequent target recognition is significantly improved;
Because environmental conditions limit underwater image acquisition, the gathered dataset contains a rather limited number of images, and training on it alone gives only mediocre results. This paper therefore introduces the Mosaic data enhancement algorithm, which enriches the dataset and doubles the number of images. The ablation experiments demonstrate that this dataset preparation technique raises the mean average precision (mAP) value while boosting network robustness;
In this paper, the backbone and neck networks of YOLOv5 are improved, and the YOLOv5-SimF target detection model is proposed, which introduces the SimAM [28] attention mechanism and the FReLU activation function [29], which significantly improves the network's inference speed and recognition accuracy, and is used for realizing real-time automated classification and recognition of aquatic products.

This is how the remainder of the paper is structured. The "Materials and Methods" section describes the dataset, the dataset preprocessing process, and the YOLOv5-SimF object detection model. The "Results" section shows the experimental results. The "Discussion" section analyzes and discusses the experimental results. The "Conclusions" section summarizes the work of this paper.

2.1. Experimental dataset and its pre-processing

2.1.1. Dataset preparation

This study creates a dataset containing 15 aquatic species categories with 1108 marine images divided into training and validation sets. Existing open datasets of underwater images cover a wide range of biotic and abiotic targets. In contrast, the dataset created in this paper focuses on underwater biotic targets, which are economically viable. It is a dataset of underwater images collected specifically for the aquatic sector, from which a target detection model trained on this dataset can be used to achieve fine-grained target classification for fish. The dataset's sources are aquatic photographs collected from farms and various images online. The 15 aquatic species are abalone, carp, salmon, jellyfish, scallop, perch, silver pomfret, catfish, grouper, shrimp, tilefish, crab, squid, yellow croaker, and turbot. These cover most of the common aquatic species in the aquaculture industry. We construct corresponding labels for each target in each image of this dataset for the network to learn and train. The dataset species sampling is shown in Fig. 1.

2.1.2. Underwater image enhancement network

Underwater images captured by underwater camera equipment suffer from different degrees of quality degradation, which affects the accuracy of target detection, so it is essential to pre-process underwater images before target detection. Underwater optical images generally have three types of issues: (1) brightness is uneven or too dark; (2) different wavelengths of light are absorbed and scattered by the water medium to different degrees, so underwater images generally show a blue-green color cast; (3) absorption and scattering of the propagating light by the water medium cause image fogging and reduced contrast. For these three types of problems, this paper proposes an underwater image enhancement network that can mitigate the degradation of underwater images. The backbone structure of the underwater image enhancement network is shown in Fig. 2.

First, the white balance algorithm is used to enhance the contrast and adjust the hue of the image; its principle is given in Eq. (1).

$$\left\{ \begin{gathered} C\left( {R^{\prime}} \right)=C\left( R \right) * \frac{{\overline {R} +\overline {G} +\overline {B} }}{{3\overline {R} }} \hfill \\ C\left( {G^{\prime}} \right)=C\left( G \right) * \frac{{\overline {R} +\overline {G} +\overline {B} }}{{3\overline {G} }} \hfill \\ C\left( {B^{\prime}} \right)=C\left( B \right) * \frac{{\overline {R} +\overline {G} +\overline {B} }}{{3\overline {B} }} \hfill \\ \end{gathered} \right.$$

where $C\left( R \right)$, $C\left( G \right)$ and $C\left( B \right)$ represent the R, G and B channel components of the input image, $C\left( {R^{\prime}} \right)$, $C\left( {G^{\prime}} \right)$ and $C\left( {B^{\prime}} \right)$ represent the corresponding channel components of the output image, and $\overline {R}$, $\overline {G}$ and $\overline {B}$ represent the average values of the image in the three channels.
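As an illustration of Eq. (1), the following is a minimal sketch in plain Python. The function name `gray_world_white_balance`, the representation of an image as a list of (R, G, B) tuples, and the clipping to 255 are assumptions made for this example and are not taken from the paper.

```python
def gray_world_white_balance(pixels):
    """Apply the per-channel gains of Eq. (1).

    pixels: list of (R, G, B) tuples with values in [0, 255] (hypothetical layout).
    """
    n = len(pixels)
    # Channel means: R-bar, G-bar, B-bar.
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    # Gain for each channel: (R-bar + G-bar + B-bar) / (3 * channel mean).
    gains = [gray / m if m > 0 else 1.0 for m in means]
    # Clipping to 255 is an added assumption; Eq. (1) itself does not clip.
    return [tuple(min(255.0, p[c] * gains[c]) for c in range(3)) for p in pixels]


balanced = gray_world_white_balance([(40, 120, 200), (60, 100, 160)])
```

After the correction, the three channel means coincide (unless clipping intervenes), which is what removes a global blue-green cast.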

Second, the image's brightness is adjusted by an improved gamma correction. Standard gamma correction is ineffective at correcting overly bright or overly dark regions because its range of gamma values is small. The improved gamma correction handles both kinds of unevenly lit regions better and is given in Eq. (2)-(5).

$$O\left( {x,y} \right)=255 \times {\left( {\frac{{i\left( {x,y} \right)}}{{255}}} \right)^\gamma }$$

$$\gamma 1=\frac{1}{{1+\left( {1 - \theta \times \frac{m}{{255}}} \right) \times \cos \left( {\pi \times \frac{{L\left( {x,y} \right)}}{{255}}} \right)}}$$

$$\gamma 2=\frac{1}{{1+\left( {1 - \theta \times \left( {1 - \frac{m}{{255}}} \right)} \right) \times \cos \left( {\pi \times \frac{{L\left( {x,y} \right)}}{{255}}} \right)}}$$

$$\gamma =\left\{ \begin{gathered} \gamma 1,\frac{m}{{255}} \leqslant 0.5 \hfill \\ \gamma 2,\frac{m}{{255}}>0.5 \hfill \\ \end{gathered} \right.$$

where $O\left( {x,y} \right)$ denotes the pixel value of the image after improved gamma correction, $m$ denotes the average value of the pixels of the input image, $L\left( {x,y} \right)$ denotes the value of the pixels of the input image, and $\theta = 0.6$ gives the best correction effect.
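The piecewise exponent of Eq. (3)-(5) can be sketched for a single pixel as follows. This is a hedged illustration: the function names are hypothetical, the input pixel in Eq. (2) and $L(x,y)$ are assumed to be the same value, and the small guard on the denominator is an addition for the degenerate case where it reaches zero.

```python
import math


def improved_gamma(L, m, theta=0.6, eps=1e-6):
    """Exponent from Eq. (3)-(5) for pixel value L and image mean m, both in [0, 255]."""
    mean_ratio = m / 255.0
    cos_term = math.cos(math.pi * L / 255.0)
    if mean_ratio <= 0.5:
        scale = 1.0 - theta * mean_ratio          # gamma1, Eq. (3)
    else:
        scale = 1.0 - theta * (1.0 - mean_ratio)  # gamma2, Eq. (4)
    # eps guard is an assumption, not part of the paper's formula.
    return 1.0 / max(1.0 + scale * cos_term, eps)


def correct_pixel(L, m, theta=0.6):
    """Eq. (2): O = 255 * (L / 255) ** gamma."""
    return 255.0 * (L / 255.0) ** improved_gamma(L, m, theta)


dark = correct_pixel(64, 80)     # dark pixel: exponent below 1, so it is brightened
bright = correct_pixel(200, 80)  # bright pixel: exponent above 1, so it is darkened
```

Because the cosine term is positive for pixels below mid-gray and negative above it, the exponent adapts per pixel, which is how dark and bright regions are pulled toward the middle of the range.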

Finally, the color deviation of the underwater image is corrected using the unsupervised color correction approach. The algorithm linearly stretches the histograms of the R, G, and B channels in the RGB color model and of the S and I channels in the HSI color model to improve the image contrast and enhance the actual color and brightness of the image. The comparison of the original and improved images of the test set data is shown in Fig. 3.
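The linear histogram stretch applied to each channel can be sketched as below. This is a minimal illustration only: the helper name `stretch_channel` is hypothetical, the stretch uses the plain minimum and maximum of the channel, and any percentile clipping or HSI conversion used in the paper's pipeline is not reproduced here.

```python
def stretch_channel(values, out_min=0.0, out_max=255.0):
    """Linearly map the channel's [min, max] range onto [out_min, out_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat channel has no contrast to stretch; return it unchanged.
        return list(values)
    span = out_max - out_min
    return [out_min + (v - lo) * span / (hi - lo) for v in values]


stretched = stretch_channel([50, 100, 150])
```

The same helper would be applied independently to each of the R, G, B channels and, after conversion, to the S and I channels with an output range of [0, 1].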

To confirm its efficacy, the approach described in this research is compared with four other image enhancement algorithms: MSRCR, UDCP, CLAHE, and Water-Net [30]. The first three are traditional techniques, and the last is a deep learning network. Two measures designed to assess underwater image enhancement algorithms are employed to evaluate image quality: UIQM [31] and UCIQE [32]. Table 1 shows the experimental results; the proposed method scores higher than the other algorithms on both measures.

Table 1. Evaluation results

Algorithm	UIQM	UCIQE
MSRCR	1.965	0.478
UDCP	2.093	0.845
CLAHE	1.862	0.351
Water-Net	1.458	0.745
Proposed method	2.205	0.855

2.1.3. Mosaic augmentation

The Mosaic augmentation method is used to preprocess the dataset so that the model can better learn the distinct features of the various aquatic species. Compared with image data on land, underwater image data are harder to acquire. The acquisition conditions are restricted, and the number of open underwater image datasets is currently small, which is not enough to train multi-species target detection models, so the training results are hardly satisfactory. In addition, fish targets are generally densely aggregated and small, which makes feature extraction more difficult for the network. Therefore, enhancement of the existing dataset is essential. Mosaic data enhancement is an improved version of the CutMix data enhancement method. Compared with other data enhancement methods, Mosaic data enhancement introduces no extraneous pixels, which improves the training efficiency and further enhances the model's localization ability and classification performance while leaving the inference cost unchanged. The method stitches any four images in the dataset together after random scaling, random cropping, and random arrangement. Random scaling adds many small targets, which enriches the training dataset significantly and improves the robustness of the network. At the same time, with only one GPU, good results can be achieved even if the batch size cannot be set very large. Mosaic data augmentation doubles the size of the dataset. Figure 4 depicts the mosaic data augmentation procedure.
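The geometry of the stitching step can be sketched as follows. This is a simplified illustration under stated assumptions: each of the four images is resized to fill its quadrant (random cropping is omitted), boxes are given in normalized (x1, y1, x2, y2) form, and the function names `mosaic_layout` and `remap_boxes` are hypothetical.

```python
import random


def mosaic_layout(size, rng):
    """Pick a random mosaic centre and return the four quadrant rectangles."""
    xc = rng.randint(size // 4, 3 * size // 4)
    yc = rng.randint(size // 4, 3 * size // 4)
    return [
        (0, 0, xc, yc),        # top-left image
        (xc, 0, size, yc),     # top-right image
        (0, yc, xc, size),     # bottom-left image
        (xc, yc, size, size),  # bottom-right image
    ]


def remap_boxes(boxes, quad):
    """Map normalised boxes of one source image into canvas pixel coordinates."""
    qx1, qy1, qx2, qy2 = quad
    w, h = qx2 - qx1, qy2 - qy1
    return [(qx1 + x1 * w, qy1 + y1 * h, qx1 + x2 * w, qy1 + y2 * h)
            for x1, y1, x2, y2 in boxes]


quads = mosaic_layout(640, random.Random(0))
labels = remap_boxes([(0.25, 0.25, 0.75, 0.75)], quads[3])
```

Because the mosaic centre is random, the same source image appears at a different scale in each epoch, which is the source of the additional small targets mentioned above.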

2.2. Adaptive anchor box calculation and adaptive image scaling

To train the model, the network first generates prediction boxes based on the initial anchor boxes. Each prediction box is then compared to the ground-truth box, the error between the two is computed and backpropagated, and the network parameters are updated iteratively. In YOLOv5, the anchor box calculation is embedded in the code, and each training session adaptively determines the ideal anchor box values for the training set in use. Moreover, because the resolution of the images in the dataset varies, each image must be scaled to a standard size before it is input into the model for training. The YOLO algorithm typically uses the sizes 416*416 and 608*608. Because the aspect ratios of the images differ, the size of the black borders added at the edges of the images after scaling and filling also varies. If more pixels are filled, the redundant information slows down inference. To reduce processing and speed up inference, YOLOv5 adaptively adds the smallest possible black borders to the image.
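The adaptive padding can be sketched as a small calculation. This is an illustration under assumptions: the function name `adaptive_letterbox` is hypothetical, and the rule that the padded side only needs to be a multiple of the network stride (taken here as 32) is an assumption not stated in the text.

```python
def adaptive_letterbox(width, height, target=640, stride=32):
    """Return the resized size and the total padding per axis.

    The image is scaled so its longer side equals `target`; the shorter side is
    then padded only up to the next multiple of `stride` (assumed value),
    instead of all the way to `target`.
    """
    r = min(target / width, target / height)
    new_w, new_h = round(width * r), round(height * r)
    pad_w = (target - new_w) % stride
    pad_h = (target - new_h) % stride
    return new_w, new_h, pad_w, pad_h


# An 800x600 image at target 416: resized to 416x312, padded to 416x320.
print(adaptive_letterbox(800, 600, target=416))
```

In this example the fixed-size approach would fill 104 rows of black pixels, whereas the adaptive approach fills 8.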

2.3. YOLOv5-SimF Target Detection Model

2.3.1. Improved YOLOv5 algorithm

The YOLOv5 network has four components: the input, the backbone, the neck, and the head. Before being fed into the backbone network, the input image is first sliced, which moves the information along the width (W) and height (H) of the image into the channel dimension and acts as downsampling without information loss. The Convolutional Layer, Batch Normalization, and FReLU activation function make up the CBF module in the backbone network section. In target identification, semantic segmentation, and image classification, the FReLU activation function performs better than activation functions such as ReLU and SiLU, because it overcomes their insensitivity to spatial cues in visual tasks. FReLU, an activation function created specifically for visual tasks, extends ReLU and PReLU to a 2D activation by adding a spatial condition with very little overhead. It somewhat raises the accuracy of small target detection. The formula for FReLU is given in Eq. (6)-(7).

$$y=\max \left( {x_{c,i,j},T\left( {x_{c,i,j}} \right)} \right)$$

$$T\left( {x_{c,i,j}} \right)=x_{c,i,j}^{\omega }\cdot p_{c}^{\omega }$$

where $T\left( {x_{c,i,j}} \right)$ represents a simple and efficient spatial context feature extractor, $x_{c,i,j}^{\omega }$ represents the window centered at the 2D position $\left( {i,j} \right)$ on channel $c$, and $p_{c}^{\omega }$ represents the shared parameters of this window in the same channel.
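Eq. (6)-(7) can be sketched for a single channel in plain Python. The sketch assumes a 3×3 window with zero padding and omits any normalization after the window operation; the function name `frelu_channel` and the list-based layout are hypothetical.

```python
def frelu_channel(x, kernel):
    """FReLU on one channel: y = max(x, T(x)), with T a 3x3 windowed dot product.

    x: 2D list (H x W) of activations; kernel: 3x3 list of the shared weights p_c.
    Zero padding at the borders is an assumption of this sketch.
    """
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            t = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        t += x[ii][jj] * kernel[di + 1][dj + 1]
            out[i][j] = max(x[i][j], t)
    return out


zeros = [[0.0] * 3 for _ in range(3)]
relu_like = frelu_channel([[-1.0, 2.0], [3.0, -4.0]], zeros)
```

With an all-zero kernel, $T$ is zero everywhere and the function reduces to ReLU; learned window weights let the threshold depend on the neighbouring pixels, which is the spatial sensitivity described above.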

In the backbone section, the image passes through a series of CBF modules for downsampling and CSP modules for feature extraction, producing a set of feature maps with different resolutions. At the end of the backbone network, the SPPCSP module is introduced. In the original YOLOv5 network, SPP is used for processing; the role of SPP is to obtain different receptive fields through maximum pooling and to enlarge the receptive fields so that the network can adapt to feature maps of varying resolution. SPP applies maximum pooling at four different scales, giving four different receptive fields that are used to distinguish between large, medium, and small targets. Here, the SPP module is improved by introducing the SPPCSP module, which divides the features into two parts: one is processed by a conventional CBF module, and the other by the SPP structure. Finally, the two parts are concatenated, which halves the computation and improves speed. In the neck network structure, a feature pyramid network (FPN) considers multiple scales of features and performs multi-scale feature fusion; however, FPN passes information top-down only, so feature fusion is limited by the unidirectional information flow. The original YOLOv5 introduced PANet in the neck network to address this shortcoming. PANet adds bottom-up feature delivery but requires more parameters and a higher computational cost. To address this problem and improve the efficiency of the target detection model, this paper presents a key innovation in the neck network, where BiFPN is improved to make it more suitable for YOLOv5 networks, and its effectiveness is experimentally evaluated. In addition, YOLOv5 provides four different sizes of network models, S, M, L, and XL, which have different depths and widths. YOLOv5s denotes the lightest network, and YOLOv5xl is the deepest and widest network.
Here, the number of network layers and convolutional kernels can be set by modifying a set of hyperparameters, which selects the network specification to use. YOLOv5s has the fastest inference speed and the lowest accuracy, while YOLOv5xl has the slowest inference speed and the highest accuracy, so the variants suit application domains with different needs. In this study, the base network is YOLOv5m, which can meet real-time performance requirements and attain a high accuracy rate. In the head network, the RepConv [33] structure is introduced. RepConv obtains good performance on the VGG [34] structure through reparameterization, which raises the accuracy of the network's predictions without adding more parameters or convolutional computation. The RepConv structure uses different network constructs for training and inference. During training, the output is obtained by summing two branches with different numbers of convolutional kernels and a normalized branch; during inference, the branch parameters are reparameterized into the main branch. The number of channels in the final output feature map is 3 × (NC + 5), where 3 denotes three anchor boxes with different aspect ratios, NC represents the number of categories, and 5 comprises the two parameters of the anchor box's center point, the two parameters of the anchor box's length and width, and a foreground probability parameter for the anchor box. YOLOv5 reduces overfitting by employing Dropblock [35] regularization, which is derived from the 2017 Cutout [36] method of data augmentation. Cutout zeroes out portions of the input image, and Dropblock applies the same idea to each feature map. Rather than using a fixed zeroing ratio, it starts with a small ratio and increases this ratio linearly as training progresses. Compared with Cutout, Dropblock is more efficient and offers a more thorough enhancement of the network's regularization.
Figure 5 displays the core components and network architecture of the YOLOv5-SimF target detection model.
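As a small worked example of the 3 × (NC + 5) rule stated above, the snippet below computes the number of output channels for the 15-category dataset used in this paper. The function name `head_output_channels` is hypothetical.

```python
def head_output_channels(num_classes, anchors_per_cell=3):
    """Channels per output feature map: anchors * (classes + 4 box values + 1 objectness)."""
    box_and_objectness = 5  # centre x, centre y, width, height, foreground probability
    return anchors_per_cell * (num_classes + box_and_objectness)


# 15 aquatic species: 3 * (15 + 5) = 60 channels.
print(head_output_channels(15))
```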

2.3.2. Weighted bi-directional feature pyramid network

The main innovation of this paper lies in the YOLOv5 neck network, where it yields better accuracy and higher efficiency than the original YOLOv5 network. We improve the weighted bi-directional feature pyramid network and introduce it into our proposed network architecture. The experimental results show that this change is compatible with the existing network architecture and substantially improves the performance of the target detection model, mainly in terms of efficiency and accuracy. The main schematic diagram of the improved BiFPN network is shown in Fig. 6(c): effective bidirectional cross-scale connections and weighted feature fusion are introduced to aggregate features at different resolutions of the feature map. Each node in Fig. 6 corresponds to features at a different scale. First, the BiFPN in Fig. 6(c) removes nodes that have only one input edge; since such nodes perform no feature fusion, they contribute little to the feature fusion network, and excising them yields a simplified PANet-style bi-directional network. Second, to improve the network's ability to fuse features, an additional edge is added from the original input node to the output node at the same level. These added edges correspond to the dashed and red solid arrows in Fig. 5(a), and they do not add much computational cost. Finally, to achieve a higher level of feature fusion, the feature network layer, i.e., the top-down and bottom-up bi-directional path, is repeated multiple times. As can be seen in Fig. 5(a) and Fig. 6(c), repeated blocks = 3, i.e., the feature network layer is repeated three times. The principle of multi-scale aggregation is shown in Eq. (8).

$${\overrightarrow P ^{out}}=f\left( {{{\overrightarrow P }^{in}}} \right),\quad {\overrightarrow P ^{in}}=\left( {P_{{l_1}}^{{in}},P_{{l_2}}^{{in}}, \ldots } \right)$$

where $P_{{l_i}}^{{in}}$ denotes the feature at layer $l_i$, $i=1,2,\ldots$; the network aims to find a transformation $f$ that can efficiently aggregate the different input features ${\overrightarrow P ^{in}}$ and output a new set of features ${\overrightarrow P ^{out}}$.

For input features with different resolutions, whose importance varies because they contribute differently to the output features, an additional weight should be assigned to each input, and the network should be allowed to learn the value of the weight. Here, the weights are calculated using fast normalized feature fusion with the following formula:

$$O=\sum\nolimits_{i} {\frac{{\omega _i}}{{\varepsilon +\sum\nolimits_{j} {\omega _j} }} \cdot I_i}$$

where $\omega _i \geqslant 0$ is ensured by applying the ReLU activation function after each $\omega _i$, $\varepsilon =0.0001$ ensures numerical stability, and the normalization keeps each weight between 0 and 1. In summary, BiFPN integrates bidirectional cross-scale connections and the weight calculation of fast normalized feature fusion to optimize multi-scale feature fusion in the neck network, and ablation experiments validate the effectiveness of introducing the improved BiFPN network.
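The fast normalized fusion rule can be sketched as below for feature maps flattened into lists of equal length. The function name `fast_normalized_fusion` and the list-based layout are assumptions made for illustration; in a real network the weights would be learnable parameters.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-shaped inputs I_i with weights w_i / (eps + sum_j w_j)."""
    # ReLU keeps every weight non-negative, as required by the formula.
    weights = [max(0.0, w) for w in raw_weights]
    total = eps + sum(weights)
    length = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) / total
            for k in range(length)]


# Two inputs with equal weights give (almost exactly) their average.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

Compared with a softmax over the weights, this normalization needs no exponentials, which is why it is described as fast.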

2.3.3. SimAM attention mechanism

This study incorporates the SimAM attention mechanism into the neck network to further increase the prediction accuracy of the YOLOv5-SimF model. The SimAM attention module is lightweight, effective, and simple to use. SimAM attends to information in both the channel and spatial dimensions. When computational resources are limited, it requires no extra parameters to calculate the 3D attention weights, which avoids the growth in model parameters that structural modifications would otherwise cause. The attention weights can be computed using only an energy function. The energy function shows that the lower a neuron's energy, the more it differs from the surrounding neurons and the more important it is. Consequently, the SimAM attention mechanism has a wide range of applications and is capable of precisely capturing the important information found in image features. Figure 7 depicts the architecture of the SimAM attention mechanism.

In Fig. 7, the 3-D weights are calculated as follows:

$$\mathop X\limits^{ \bullet } =sigmoid\left( {\frac{1}{E}} \right) \odot X$$

$$E=\frac{{4\left( {{\sigma ^2}+\lambda } \right)}}{{{{\left( {t - \mu } \right)}^2}+2{\sigma ^2}+2\lambda }}$$

where $X$ is the input feature, $E$ is the energy function on each channel, and the sigmoid function is used to limit possible oversize values in $E$. $t$ is the value of the input feature, $t \in X$, $\lambda$ is the constant $1e - 4$, and $\mu$ and ${\sigma ^2}$ denote the mean and variance of each channel in $X$ respectively, which are calculated by the following formulas:

$$\mu =\frac{1}{M}\sum\nolimits_{{i=1}}^{M} {x_i}$$

$${\sigma ^2}=\frac{1}{M}\sum\nolimits_{{i=1}}^{M} {{{\left( {x_i - \mu } \right)}^2}}$$

where $M=H \times W$ denotes the number of neurons on each channel. The weight of each neuron can be obtained through the above calculation. Introducing this attention mechanism effectively improves the accuracy of target detection without increasing the computational burden of the network.
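The SimAM calculation above can be sketched for a single channel in plain Python. The function name and the nested-list representation of an $H \times W$ channel are assumptions for illustration; a practical implementation would operate on tensors over all channels at once.

```python
import math
from typing import List

LAMBDA = 1e-4  # the constant lambda given in the text


def simam_channel(channel: List[List[float]], lam: float = LAMBDA) -> List[List[float]]:
    """Apply sigmoid(1/E) weighting to one H x W channel given as nested lists."""
    values = [t for row in channel for t in row]
    m = len(values)  # M = H * W neurons on the channel
    mu = sum(values) / m
    var = sum((x - mu) ** 2 for x in values) / m

    def reweight(t: float) -> float:
        # Energy E for this neuron; lam > 0 keeps it strictly positive.
        energy = 4.0 * (var + lam) / ((t - mu) ** 2 + 2.0 * var + 2.0 * lam)
        # sigmoid(1 / E) multiplied element-wise with the input value.
        return t / (1.0 + math.exp(-1.0 / energy))

    return [[reweight(t) for t in row] for row in channel]


out = simam_channel([[1.0, 1.0], [1.0, 5.0]])
```

In this example the value 5.0 lies far from the channel mean, so its energy is lower and its attention weight is larger than that of the three neurons with value 1.0.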

2.3.4. Loss function of the model

Three components make up the target detection model's overall loss: confidence loss, localization loss, and classification loss. Classification loss employs a binary cross-entropy loss for each category prediction, which allows for the possibility that a target belongs to multiple categories simultaneously. The calculation formulas are as follows:

$$y_i=sigmoid\left( {x_i} \right)=\frac{1}{{1+{e^{ - x_i}}}}$$

$$Lclass= - \sum\limits_{{i=1}}^{N} {\left[ {y_{i}^{ * }\log \left( {{y_i}} \right)+\left( {1 - y_{i}^{ * }} \right)\log \left( {1 - {y_i}} \right)} \right]}$$

where $N$ denotes the total number of categories, $x_i$ is the predicted value of the current category, ${y_i}$ is the probability of the current category obtained after the activation function, and $y_{i}^{ * }$ is the ground-truth label of the current category, which takes the value 0 or 1.
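As a minimal sketch of the classification loss above, the following Python code applies the sigmoid to each raw prediction and sums the per-category binary cross-entropy terms. The function names are hypothetical, and no clamping is applied, so the sketch assumes moderate prediction values for which the logarithms stay finite.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Map a raw prediction x_i to a probability y_i."""
    return 1.0 / (1.0 + math.exp(-x))


def classification_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Sum of per-category binary cross-entropy terms, negated."""
    total = 0.0
    for x, y_true in zip(logits, labels):
        y = sigmoid(x)
        total += y_true * math.log(y) + (1 - y_true) * math.log(1.0 - y)
    return -total


# Two categories with uninformative predictions (probability 0.5 each).
loss = classification_loss([0.0, 0.0], [1, 0])
```

Each category is scored independently, which is what allows several labels to be 1 for the same target.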

The localization loss in this paper is the CIoU loss [37]. The CIoU loss considers three geometric characteristics: overlap area, centroid distance, and aspect ratio. Building on DIoU, CIoU adds a penalty on the consistency of the prediction box's width and height in order to bring the prediction boxes closer to the ground truth box. By taking the aspect ratio of the bounding box into account, CIoU increases both the speed and the accuracy of prediction box regression. The localization loss is formulated in Eq. (16)-(18).

$$Llocal=1 - CIoU=1 - \left( {IoU - \frac{{{\rho ^2}\left( {b,{b^{gt}}} \right)}}{{{c^2}}} - \alpha \upsilon } \right) \tag{16}$$

$$\upsilon =\frac{4}{{{\pi ^2}}}{\left( {\arctan \frac{{{w^{gt}}}}{{{h^{gt}}}} - \arctan \frac{w}{h}} \right)^2} \tag{17}$$

$$\alpha =\frac{\upsilon }{{\left( {1 - IoU} \right)+\upsilon }} \tag{18}$$

where $IoU$ represents the intersection over union, $\rho$ represents the Euclidean distance between $b$ and ${b^{gt}}$, $b$ represents the center coordinates of the prediction box, ${b^{gt}}$ represents the center coordinates of the real target bounding box, and ${\rho ^2}$ represents the square of the distance between the two center points. $c$ represents the length of the diagonal of the smallest rectangle enclosing the two boxes. $\alpha$ and $\upsilon$ are parameters that measure the consistency of the aspect ratio. ${w^{gt}}$ and ${h^{gt}}$ represent the width and height of the true box, and $w$ and $h$ represent the width and height of the predicted box.
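A minimal Python sketch of the localization loss follows. It uses the standard CIoU formulation from the cited work, in which both the center-distance term and the aspect-ratio term act as penalties added to $1 - IoU$. The corner-format box tuple `(x1, y1, x2, y2)` and the function names are assumptions for illustration, and boxes are assumed to have positive width and height.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred: Box, gt: Box) -> float:
    """1 - IoU + rho^2 / c^2 + alpha * v for boxes with positive width and height."""
    i = iou(pred, gt)
    # rho^2: squared distance between the two box centers.
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) ** 2
            + (pred[1] + pred[3] - gt[1] - gt[3]) ** 2) / 4.0
    # c^2: squared diagonal of the smallest box enclosing both boxes.
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    v = 4.0 / math.pi ** 2 * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    denom = (1.0 - i) + v
    alpha = v / denom if denom > 0 else 0.0  # identical boxes give alpha = 0
    return 1.0 - i + rho2 / c2 + alpha * v


# Two 2 x 2 boxes offset by (1, 1): IoU = 1/7, rho^2 = 2, c^2 = 18, v = 0.
shifted = ciou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

For the shifted pair the aspect ratios match, so the loss reduces to $1 - 1/7 + 2/18$; identical boxes give a loss of 0.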

Confidence loss employs a binary cross-entropy loss function, just like classification loss. The precise calculation formulas are as follows:

$$Lconf\left( {o,c} \right)= - \frac{{\sum\limits_{{i=1}}^{N} {\left( {o_i\ln \left( {\widehat {c}_i} \right)+\left( {1 - o_i} \right)\ln \left( {1 - \widehat {c}_i} \right)} \right)} }}{N}$$

$$\widehat {c}_i=sigmoid\left( {c_i} \right)$$

where $o_i \in \left[ {0,1} \right]$ denotes the IoU of the predicted and real boxes, $c_i$ is the predicted value, $\widehat {c}_i$ is the predicted confidence obtained from $c_i$ via the sigmoid activation function, and $N$ is the number of positive and negative samples.

The total loss function for the target detection model is calculated as follows:

$$LOSS={\lambda _1}Lclass+{\lambda _2}Llocal+{\lambda _3}Lconf$$

where ${\lambda _1}$, ${\lambda _2}$ and ${\lambda _3}$ are balancing parameters.
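The confidence loss and the weighted total loss can be sketched as follows. The function names are hypothetical, and the balancing parameters used in the example call are placeholder values chosen only for illustration, since the text does not state the values of ${\lambda _1}$, ${\lambda _2}$ and ${\lambda _3}$.

```python
import math
from typing import Sequence, Tuple


def confidence_loss(targets: Sequence[float], logits: Sequence[float]) -> float:
    """Mean binary cross-entropy between IoU targets o_i and sigmoid(c_i)."""
    n = len(targets)
    total = 0.0
    for o, c in zip(targets, logits):
        c_hat = 1.0 / (1.0 + math.exp(-c))  # predicted confidence
        total += o * math.log(c_hat) + (1.0 - o) * math.log(1.0 - c_hat)
    return -total / n


def total_loss(l_class: float, l_local: float, l_conf: float,
               lambdas: Tuple[float, float, float]) -> float:
    """Weighted sum of the three loss components."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_class + lam2 * l_local + lam3 * l_conf


l_conf = confidence_loss([1.0, 0.0], [0.0, 0.0])
# Placeholder balancing parameters, not values reported in the paper.
loss = total_loss(1.0, 2.0, 3.0, (0.5, 0.25, 1.0))
```

Unlike the classification labels, the confidence targets $o_i$ are continuous IoU values in $[0, 1]$, so the loss rewards confidence scores that track localization quality.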

The DIoU Non-Maximum Suppression (DIoU NMS) technique is used in the post-processing step of the target detection algorithm to reduce false detections and eliminate duplicate boxes. The conventional NMS algorithm computes the IoU between the highest-scoring detected box and every other detected box, and filters out the boxes whose IoU is greater than the NMS threshold. IoU is therefore the only factor that the traditional NMS algorithm takes into account. In real-world scenarios, however, when two distinct objects are very close to one another, their boxes have a relatively large IoU, so only one detection box is frequently left after NMS processing, which results in a missed detection. The cause is that IoU considers only the overlap region between two boxes and ignores the aspect ratio and the distance between the center points. DIoU NMS therefore takes into account both the IoU and the distance between the center points of the two boxes. If the IoU between two boxes is relatively large but the distance between their centers is also large, the two boxes are regarded as belonging to two different objects and neither is filtered out. In this way, the DIoU NMS algorithm reduces the detection errors of the traditional NMS method.
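The suppression rule described above can be sketched in Python as follows. The sketch assumes the common DIoU NMS criterion, in which a box is suppressed when its IoU with the highest-scoring box minus the normalized squared center distance reaches the threshold. The function names, the corner-format boxes, and the threshold value are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed corner format


def diou(a: Box, b: Box) -> float:
    """IoU minus the normalized squared center distance rho^2 / c^2."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union if union > 0 else 0.0
    rho2 = ((a[0] + a[2] - b[0] - b[2]) ** 2 + (a[1] + a[3] - b[1] - b[3]) ** 2) / 4.0
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2
    return iou - (rho2 / c2 if c2 > 0 else 0.0)


def diou_nms(boxes: Sequence[Box], scores: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Keep only the boxes whose DIoU with the selected box is below the threshold.
        order = [k for k in order if diou(boxes[best], boxes[k]) < threshold]
    return keep


boxes = [(0.0, 0.0, 4.0, 4.0), (1.0, 0.0, 5.0, 4.0), (0.0, 0.0, 4.0, 4.0)]
scores = [0.9, 0.8, 0.7]
kept = diou_nms(boxes, scores, threshold=0.59)
```

In this example the first two boxes have an IoU of 0.6, which plain NMS with the same threshold would suppress. The center-distance penalty lowers the value to about 0.576, so the second box survives, while the exact duplicate of the first box is removed.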

3. Results

3.1. Experimental environment and parameter settings

The environment for model training and testing is a PC with an NVIDIA GeForce RTX 3060 Laptop GPU, 16 GB of memory, an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz CPU, and the Windows 11 operating system. The software environment is PyTorch 1.7.1, CUDA 11.0, cuDNN 8.2, and Python 3.8. The experimental parameter settings are shown in Table 2. The hyperparameters Depth_multiple and Width_multiple in Table 2 specify the depth and width of the network, i.e., the number of convolutional layers and the number of convolutional kernels in the network.

Table 2

Parameter settings.
Parameter	Value
Learning rate	1e-4
Optimizer	Adam
Batch size	8
Epoch	100
Depth_multiple	0.67
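To illustrate how a depth multiplier of 0.67 changes the number of layers, the sketch below assumes the rounding rule used in the public YOLOv5 code base, where the repeat count of a module is multiplied by the depth multiplier, rounded, and kept at a minimum of 1. The function name and the base repeat counts are hypothetical and chosen only for illustration.

```python
def scaled_depth(repeats: int, depth_multiple: float) -> int:
    """Scale a module's repeat count; single-repeat modules are left unchanged."""
    if repeats <= 1:
        return repeats
    return max(round(repeats * depth_multiple), 1)


# Hypothetical base repeat counts, not taken from the paper.
base_repeats = [1, 3, 6, 9]
scaled = [scaled_depth(n, 0.67) for n in base_repeats]
print(scaled)  # [1, 2, 4, 6]
```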

3.2. Evaluation metrics of experimental results

This paper uses the following evaluation measures for the experimental results: F1, AP (Average Precision), mAP (mean Average Precision), R (Recall), P (Precision), and FPS (Frames per Second). Precision P is the proportion of the samples predicted as positive by the model that are truly positive. Recall R is the proportion of all positive samples that the model successfully detects. Average precision (AP) measures the detection accuracy for a single class and is the average of the precision over different recall rates; its value is the area under the PR curve. The precise calculation formulas are as follows:

$$P=\frac{{TP}}{{TP+FP}}$$

$$R=\frac{{TP}}{{TP+FN}}$$

$$AP=\int_{0}^{1} {P\left( R \right)dR}$$

where $TP$ is the number of positive samples predicted as positive, $FP$ is the number of negative samples predicted as positive, and $FN$ is the number of positive samples predicted as negative. Eq. (24) integrates the PR curve, which is obtained by setting different confidence thresholds with R as the horizontal coordinate and P as the vertical coordinate; the area enclosed by this PR curve and the coordinate axes is the value of AP.
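The three formulas above can be sketched in Python as follows. Precision and recall are computed from raw counts, and AP is approximated by trapezoidal integration over a few sampled points of the PR curve. The integration scheme and the function names are assumptions for illustration; detection toolkits often use an interpolated variant instead.

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model detects."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Trapezoidal area under PR points, sorted by recall."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


p = precision(tp=8, fp=2)  # 8 of the 10 predicted positives are correct
r = recall(tp=8, fn=2)     # 8 of the 10 actual positives are found
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
```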

The accuracy of multi-category detection is measured by the mean average precision (mAP), which is the average of the APs of all categories. The mAP50 and mAP50-95 are the evaluation measures used in this work. mAP50 denotes the mean average precision at an intersection over union (IoU) threshold of 0.5, and mAP50-95 denotes the mean average precision averaged over IoU thresholds from 0.5 to 0.95. The F1 Score is the harmonic mean of precision and recall; it takes both into account and provides a balanced assessment of the model. The number of frames per second (FPS) that the model can process is used to gauge its inference speed. The specific calculation formulas are as follows:

$$mAP=\frac{{\sum\limits_{{i=1}}^{N} {AP_i} }}{N} \times 100\%$$

$$F1=\frac{{2PR}}{{P+R}}$$

where $N$ is the number of categories and $AP_i$ is the $AP$ of the $i$-th category.
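As a worked example of the F1 formula, the short Python sketch below uses the precision and recall reported for YOLOv5-SimF in the results tables (P = 0.927, R = 0.925). The function name is a hypothetical choice for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


f1 = f1_score(0.927, 0.925)
print(round(f1, 3))  # 0.926
```

The result agrees with the F1 value of 0.926 reported for the proposed model.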

3.3. Experimental results

For the 15 categories of this study, the experimental results are shown in Table 3.

Table 3

Experimental results of each category.
Class	P	R	AP50	AP50-95
Abalone	0.918	0.882	0.950	0.690
Carp	1	0.989	0.995	0.818
Catfish	0.881	0.900	0.932	0.635
Grouper	0.839	0.875	0.955	0.696
Squid	0.903	0.915	0.944	0.707
Tilefish	0.899	0.938	0.937	0.782
Shrimp	0.968	1	0.995	0.861
Crab	0.972	0.951	0.976	0.746
Turbot	0.819	0.697	0.775	0.519
Yellow croaker	0.994	0.917	0.947	0.724
Salmon	0.952	1	0.995	0.866
Jellyfish	0.984	1	0.995	0.745
Scallop	0.952	0.955	0.993	0.815
Perch	1	0.990	0.995	0.782
Silver pomfret	0.814	0.873	0.907	0.695
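As a consistency check on Table 3, the sketch below averages the per-class AP50 and AP50-95 columns according to the mAP definition. The values are copied from Table 3, and the function name is hypothetical.

```python
from typing import Sequence

# Per-class values from Table 3, in row order (Abalone ... Silver pomfret).
ap50 = [0.950, 0.995, 0.932, 0.955, 0.944, 0.937, 0.995, 0.976,
        0.775, 0.947, 0.995, 0.995, 0.993, 0.995, 0.907]
ap50_95 = [0.690, 0.818, 0.635, 0.696, 0.707, 0.782, 0.861, 0.746,
           0.519, 0.724, 0.866, 0.745, 0.815, 0.782, 0.695]


def mean_average_precision(aps: Sequence[float]) -> float:
    """Average of the per-class AP values."""
    return sum(aps) / len(aps)


map50 = round(mean_average_precision(ap50), 3)
map50_95 = round(mean_average_precision(ap50_95), 3)
print(map50, map50_95)  # 0.953 0.739
```

Both averages match the mAP50 of 0.953 and the mAP50-95 of 0.739 reported for YOLOv5-SimF.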

Figure 8 displays the visualization results of the YOLOv5-SimF model on the experimental test set. It is evident that the 15 species of aquatic products covered in this study can be accurately classified and localized. Paired with suitable hardware, this detection performance makes automated operation in the aquaculture sector possible. The model's high recognition accuracy and low misdetection rate make it applicable to real-world scenarios. Figure 9 displays sampled results of real-time target detection in a natural undersea scene.

4. Discussion

4.1. Experimental label database

The ground truth box of each image in the dataset created in this paper was labeled for the experiments. The distribution of labels in the dataset is shown in Fig. 10. The horizontal axis of the histogram in Fig. 10(a) indicates the categories, and the vertical axis indicates the number of label instances in each category. Figure 10(b) shows the width and height of all ground truth boxes in the dataset, with the center of every ground truth box placed at the same position. The scatter plots in Fig. 10(c) and Fig. 10(d) represent the distribution of the ground truth boxes' center coordinates (x, y) in the image and the distribution of their width and height, respectively.

In addition, Fig. 11 summarizes the labels of the training set and plots the relationships between four label variables: x, y, width, and height. The histogram in the first row of Fig. 11 shows the distribution of the horizontal coordinate x of the center point of the anchor box. The x coordinates are concentrated at positions 0.25, 0.50, and 0.75 of the image, with 0.50, the center of the image, being the most prevalent. The histogram in the second row shows the distribution of the vertical coordinate y of the center point of the anchor box; the y coordinates are distributed evenly across positions 0.25, 0.50, and 0.75 of the image. The histogram in the third row shows the distribution of the width of the anchor boxes, most of which are no wider than half of the image. The histogram in the last row shows the distribution of the height of the anchor boxes, most of which are no taller than half of the image.

4.2. Model performance

The confusion matrix of the target detection model is displayed in Fig. 12. The number in each square of the matrix represents the likelihood that the model identifies the category correctly, while the horizontal and vertical axes represent the actual and predicted categories, respectively. The confusion matrix indicates how accurately the classification model presented in this research distinguishes the categories. Figure 13 displays the curves of the target detection model's results, from which the performance of the model can be assessed more intuitively. Figure 13(a) shows the F1-Confidence curve, which reflects the relationship between the F1 score and the confidence threshold; Fig. 13(b) shows the Precision-Confidence curve, which reflects the precision under different confidence thresholds; Fig. 13(c) shows the PR curve, the area under which is taken as AP; and Fig. 13(d) shows the Recall-Confidence curve, which reflects the relationship between recall and confidence. The evaluation metrics recorded during training are visualized in Fig. 14, where the horizontal axis represents the 100 epochs. In addition to the changes in precision, recall, mAP50, and mAP50-95 throughout training, the figure shows the variations in box loss, objectness loss, and classification loss for the training and validation sets.

4.3. Ablation experiment analysis

We performed ablation experiments to confirm the efficacy of the dataset preprocessing, and Table 4 displays the findings. The experimental findings demonstrate that underwater image enhancement and Mosaic data augmentation together greatly increase the model's prediction accuracy: mAP50 improves by 0.160 and mAP50-95 by 0.206 over the model trained without either step.

Table 4

Ablation experiment results of dataset preprocessing.
Underwater image enhancement network	Mosaic augmentation	P	R	mAP50	mAP50-95
		0.778	0.782	0.793	0.533
√		0.815	0.806	0.822	0.599
	√	0.881	0.900	0.932	0.635
√	√	0.927	0.925	0.953	0.739

We performed ablation experiments on the enhanced model to confirm the efficacy of the proposed improved YOLOv5 model; the outcomes are displayed in Table 5. All models in Table 5 are trained on the preprocessed training dataset. The ablation experiments validate the effectiveness of each component separately; the configurations without the improved BiFPN network use the original YOLOv5 feature extraction network, the FPN + PAN structure. The experimental results show that the improved BiFPN network contributes greatly to the model's performance: introducing it raises mAP50 from 0.902 to 0.953 while FPS drops only from 211 to 203, so the gain in accuracy far outweighs the decrease in speed. Compared with the base YOLOv5 network, the full enhanced model increases FPS by 48 frames per second, F1 Score by 0.061, mAP50-95 by 0.041, and mAP50 by 0.067. These results demonstrate the effectiveness of our improvements to the YOLOv5 model.

Table 5

Ablation experiment results of improved YOLOv5.
SPPCSP	FReLU	SimAM	RepConv	Improved BiFPN	P	R	mAP50	mAP50-95	FPS	F1
					0.856	0.874	0.886	0.698	155	0.865
√					0.854	0.876	0.885	0.697	199	0.865
√	√				0.862	0.874	0.895	0.695	206	0.870
√	√	√			0.879	0.882	0.897	0.705	210	0.880
√	√	√	√		0.889	0.892	0.902	0.709	211	0.891
√	√	√	√	√	0.927	0.925	0.953	0.739	203	0.926
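The improvements quoted for the full model can be reproduced from the first and last rows of Table 5 with a few lines of Python. The dictionary layout is an illustrative choice; the numbers are copied from the table.

```python
# First row of Table 5 (base YOLOv5) and last row (all components enabled).
baseline = {"mAP50": 0.886, "mAP50-95": 0.698, "FPS": 155, "F1": 0.865}
full = {"mAP50": 0.953, "mAP50-95": 0.739, "FPS": 203, "F1": 0.926}

# Rounding removes floating-point noise from the subtraction.
gains = {name: round(full[name] - baseline[name], 3) for name in baseline}
print(gains)  # {'mAP50': 0.067, 'mAP50-95': 0.041, 'FPS': 48, 'F1': 0.061}
```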

We also carried out comparison studies on four sizes of the YOLOv5 network in order to balance the accuracy and speed requirements of target detection. The findings are displayed in Table 6. The four network sizes differ only in the network's width and depth; the experimental dataset and network structure remain unchanged. Table 6 indicates that the larger-scale network models achieve higher detection accuracy but run at a lower frame rate, which makes them less able to meet the demand for real-time detection. Consequently, after analyzing the experimental data, we selected YOLOv5m as the main network model, as it offers both high detection accuracy and real-time detection speed.

Table 6

Comparisons of different scale networks.
Model	P	R	mAP50	mAP50-95	FPS
YOLOv5s	0.897	0.900	0.921	0.705	256
YOLOv5m	0.927	0.925	0.953	0.739	203
YOLOv5l	0.945	0.958	0.960	0.758	152
YOLOv5xl	0.968	0.974	0.971	0.774	98

4.4. Comparative experiment analysis

This paper presents the YOLOv5-SimF target detection model and compares it experimentally with mainstream target detection models. The comparative algorithms are Faster R-CNN, Mask R-CNN, SSD, RetinaNet, YOLOv3 [38], YOLOv4, YOLOv5, and YOLOv7 [39]. The experimental results are displayed in Table 7, which demonstrates that YOLOv5-SimF performs better overall than the other models. Two-stage target detection algorithms, such as Faster R-CNN and Mask R-CNN, can attain comparatively high detection accuracy, but their FPS is significantly lower than that of the other models. YOLOv7's intricate network structure allows it to attain greater accuracy at the cost of reduced detection speed. Figure 15 compares the visualized predictions of the original YOLOv5 and the model proposed in this paper; the improved model is noticeably more accurate at predicting stacked targets and small targets in the background, and its classification and localization accuracy are also improved. The improved model's efficiency, robustness, and accuracy are verified through qualitative and quantitative analyses.

Table 7

Comparison of different target detection models.
Model	P	R	mAP50	mAP50-95	FPS	F1
Faster R-CNN	0.891	0.928	0.919	0.722	78	0.909
Mask R-CNN	0.898	0.935	0.922	0.726	72	0.916
SSD	0.859	0.886	0.875	0.698	123	0.872
RetinaNet	0.852	0.887	0.872	0.692	134	0.869
YOLOv3	0.892	0.909	0.908	0.733	108	0.900
YOLOv4	0.905	0.906	0.912	0.735	122	0.905
YOLOv5	0.915	0.922	0.925	0.736	155	0.918
YOLOv7	0.924	0.916	0.945	0.739	132	0.919
YOLOv5-SimF	0.927	0.925	0.953	0.739	203	0.926

5. Conclusions

In this study, a real-time target detection model, YOLOv5-SimF, is proposed. Our work aims to promote aquaculture development, drive the economic growth of coastal areas, solve problems of traditional aquaculture such as low efficiency and small scale, modernize and automate aquaculture, and save fishermen's labor and material resources. Firstly, this paper creates a dataset containing 15 categories of aquatic products from publicly available data and data collected in this experiment, which provides a resource for the subsequent development of aquatic product identification. Secondly, this study makes two innovations in data preprocessing: underwater image quality enhancement and Mosaic data augmentation. The former corrects the low contrast, uneven brightness, and color bias of underwater images, while the latter enriches the training dataset to help the deep learning network better learn the target's features; ablation experiments validate the efficacy of this data preprocessing. Next, this paper improves the YOLOv5 network to obtain the YOLOv5-SimF target detection model. The FReLU activation function is used to overcome the activation function's spatial insensitivity on visual tasks, which increases the accuracy of small target detection with minimal computational overhead. A novel weighted bi-directional feature pyramid network is proposed and applied to the neck network to improve cross-scale feature fusion, and the SimAM attention module is incorporated to address the information overload problem; together these improve the efficiency and accuracy of target prediction. The RepConv reparameterized convolutional structure is employed in the head network to further improve prediction accuracy. Furthermore, this research adopts the CIoU loss function for box loss, which takes three geometric parameters into account.

Lastly, ablation experiments and comparison experiments are carried out in the experimental section of this paper. The ablation results demonstrate the efficacy of this study's improvements to the YOLOv5 network, which greatly increase the precision and speed of target prediction. In the comparison experiments, the proposed YOLOv5-SimF model is compared with other popular deep learning detection models. The experimental data and the visualized comparison results show that the proposed model performs clearly better overall than the compared models. Statistics on the dataset labels were also compiled before the experiments. Furthermore, we present test results for a natural underwater scenario, including sampled detection results from underwater videos, all of which were recorded with underwater camera equipment. In conclusion, the network model in this study exhibits strong robustness and detection performance, making it a valuable reference for automation in the real-world aquaculture sector.

Author Contributions: Conceptualization, Y.M.; methodology, Y.M.; software, Y.M. and Y.W.; validation, Y.M.; formal analysis, Y.M.; investigation, Y.M.; resources, Y.M.; data curation, Y.M.; writing—original draft preparation, Y.M.; writing—review and editing, Y.M.; visualization, Y.M.; supervision, Y.M.; project administration, Y.M., L.J., L.C. and Y.W.; funding acquisition, L.J. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the National Natural Science Foundation of China, grant number 61561010; Guangxi Innovation-Driven Development Special Fund, grant number AA21077008; Guangxi Key Laboratory of Wireless Wideband Communication and Signal Processing, grant numbers GXKL06220102 and GXKL06220108; Guangxi Bagui Scholar, grant number 2019A51; and Innovation Project of GUET Graduate Education, grant numbers 2022YXW07, 2022YCXS080 and 2023YXW02.

Data Availability Statement: The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflict of interest: The authors declare no competing interests.

Ethical approval: Not applicable.

References

Feng, J. H., Yuan, H., Hu, Y. Q., Lin, J., Liu, S. W., Luo, X.: Research on deep learning method for rail surface defect detection. IET Electrical Systems in Transportation. 10(4), 436–442 (2022)
Zhao, Y., Liu, L., Xie, C., Wang, R., Wang, F., Bu, Y., Zhang, S.: An effective automatic system deployed in agricultural Internet of Things using Multi-Context Fusion Network towards crop disease recognition in the wild. Applied Soft Computing. 89, 106128 (2020)
Zhang, D., Zhan, J., Tan, L., Gao, Y., Župan, R.: Comparison of two deep learning methods for ship target recognition with optical remotely sensed data. Neural Computing and Applications, 33, 4639–4649 (2021)
Jia, X., Tong, Y., Qiao, H., Li, M., Tong, J., Liang, B.: Fast and accurate object detector for autonomous driving based on improved YOLOv5. Scientific Reports, 13(1), 1–13 (2023)
Zou, L., Liang, B., Cheng, X., Li, S., Lin, C.: Sonar Image Target Detection for Underwater Communication System Based on Deep Neural Network. CMES-Computer Modeling in Engineering & Sciences, 137(3) (2023)
Chen, L., Liu, Z., Tong, L., Jiang, Z., Wang, S., Dong, J., Zhou, H.: Underwater object detection using Invert Multi-Class Adaboost with deep learning. In: Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2020)
Zhao, Z., Liu, Y., Sun, X., Liu, J., Yang, X., Zhou, C.: Composited FishNet: Fish detection and species recognition from low-quality underwater videos. IEEE Transactions on Image Processing, 30, 4719–4734 (2021)
Yang, L., Liu, Y., Yu, H., Fang, X., Song, L., Li, D., Chen, Y.: Computer vision models in intelligent aquaculture with emphasis on fish detection and behavior analysis: A review. Archives of Computational Methods in Engineering, 28, 2785–2816 (2021)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788 (2016)
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., Berg, A. C.: Ssd: Single shot multibox detector. In: Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, pp. 21–37 (2016)
Lin, T. Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988 (2017)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587 (2014)
Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 91–99 (2015)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp. 2961–2969 (2017)
Bochkovskiy, A., Wang, C. Y., Liao, H. Y. M.: Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv: 2004.10934 (2020)
Wang, C.-Y., Liao, H.-Y. M., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., Yeh, I.-H.: CSPNet: A new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391 (2020)
Tan, M., Pang, R., Le, Q. V.: Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790 (2020)
Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768 (2018)
Li, D., Du, L.: Recent advances of deep learning algorithms for aquacultural machine vision systems with emphasis on fish. Artificial Intelligence Review, 1–40 (2022)
Larsen, R., Olafsdottir, H., Ersbøll, B. K.: Shape and texture based classification of fish species. In: Proceedings of the Image Analysis: 16th Scandinavian Conference, pp. 745–749 (2009)
Song, S., Zhu, J., Li, X., Huang, Q.: Integrate MSRCR and mask R-CNN to recognize underwater creatures on small sample datasets. IEEE Access, 8, 172848–172858 (2020)
Han, F., Yao, J., Zhu, H., Wang, C.: Underwater image processing and object detection based on deep CNN method. Journal of Sensors, 2020 (2020)
Mao, G., Weng, W., Zhu, J., Zhang, Y., Wu, F., Mao, Y.: Model for marine organism detection in shallow sea using the improved YOLO-V4 network. Transactions of the Chinese Society of Agricultural Engineering, 37(12), 152–158 (2021)
Iqbal, U., Li, D., Akhter, M.: Intelligent Diagnosis of Fish Behavior Using Deep Learning Method. Fishes, 7(4), 201 (2022)
Kaya, V., Akgül, İ., Tanır, Ö. Z.: IsVoNet8: A Proposed Deep Learning Model for Classification of Some Fish Species. Journal of Agricultural Sciences, 29(1), 298–307 (2023)
Ren, L., Tian, Y., Yang, X., Wang, Q., Wang, L., Geng, X., … Lin, H.: Rapid identification of fish species by laser-induced breakdown spectroscopy and Raman spectroscopy coupled with machine learning methods. Food Chemistry, 400, 134043 (2023)
Yang, L., Zhang, R. Y., Li, L., Xie, X.: Simam: A simple, parameter-free attention module for convolutional neural networks. In: Proceedings of the International conference on machine learning, pp. 11863–11874 (2021)
Ma, N., Zhang, X., Sun, J.: Funnel activation for visual recognition. In: Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, pp. 351–368 (2020)
Li, C., Guo, C., Ren, W., Cong, R., Hou, J., Kwong, S., Tao, D.: An underwater image enhancement benchmark dataset and beyond. IEEE Transactions on Image Processing, 29, 4376–4389 (2019)
Panetta, K., Gao, C., Agaian, S.: Human-visual-system-inspired underwater image quality measures. IEEE Journal of Oceanic Engineering, 41(3), 541–551 (2015)
Yang, M., Sowmya, A.: An underwater color image quality evaluation metric. IEEE Transactions on Image Processing, 24(12), 6062–6071 (2015)
Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., Sun, J.: Repvgg: Making vgg-style convnets great again. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13733–13742 (2021)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv: 1409.1556 (2014)
Ghiasi, G., Lin, T. Y., Le, Q. V.: Dropblock: A regularization method for convolutional networks. Advances in neural information processing systems, 31 (2018)
DeVries, T., Taylor, G. W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv: 1708.04552 (2017)
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU loss: Faster and better learning for bounding box regression. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12993–13000 (2020)
Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint arXiv: 1804.02767 (2018)
Wang, C.-Y., Bochkovskiy, A., Liao, H.-Y. M.: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7464–7475 (2023)
