To evaluate object detection performance on infrared images, we train and test detectors on the widely used FLIR infrared dataset. First, we establish YOLOv8s as the baseline. We compare the proposed method with this baseline and with other state-of-the-art object detectors, including Faster R-CNN [18], SSD [19], and YOLOv3 [20]. Second, we verify the impact of the attention module on detection performance through ablation experiments that compare precision, speed, and computational complexity.
Our RAYL is implemented in Python 3.11.4 with PyTorch 2.1.0 and CUDA 12.1 on an NVIDIA GeForce RTX 4070. The training parameters are as follows: stochastic gradient descent (SGD) optimizer, momentum 0.937, weight decay 0.0005, and 300 training epochs. The input image size is 640×640.
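For reproducibility, the training configuration above can be expressed with the public Ultralytics training API roughly as follows; this is a minimal sketch in which `flir.yaml` is a placeholder dataset description, not our exact training script.

```python
from ultralytics import YOLO

# Baseline YOLOv8s trained with the settings listed above
# (SGD, momentum 0.937, weight decay 0.0005, 300 epochs, 640x640 input).
# "flir.yaml" is a hypothetical dataset description file.
model = YOLO("yolov8s.pt")
model.train(
    data="flir.yaml",
    epochs=300,
    imgsz=640,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005,
    device=0,  # single RTX 4070
)
```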
4.1 Results on the FLIR dataset
The FLIR dataset, released by FLIR Systems in July 2018, contains 8,862 training images and 1,366 testing images. It covers three object categories: person, bicycle, and car. The images were captured of pedestrians and vehicles on highways and streets in Santa Barbara, California, during the day (60%) and at night (40%) between November and May of the following year.
We compared our detector RAYL with other deep-learning detectors; the results are shown in Table 2, where YOLOv3-SPP is an improved YOLOv3 for infrared object detection and ThermalDet is an improved RefineDet for infrared object detection. Our RAYL uses two-stage training and adds a Region Attention Module to the backbone of YOLOv8s. The proposed RAYL achieves a 3.1% improvement in mAP0.5 over YOLOv8s.
Table 2
Quantitative comparison on the FLIR dataset.
Model | Person mAP0.5/% | Bicycle mAP0.5/% | Car mAP0.5/% | All mAP0.5/% |
Faster R-CNN[18] | 64.5 | 49.7 | 70.0 | 63.2 |
SSD[19] | 64.2 | 49.5 | 71.2 | 48.5 |
YOLOv3[20] | 70.0 | 53.7 | 77.4 | 58.0 |
YOLOv3-SPP[21] | 70.0 | 55.6 | 73.0 | 66.8 |
RefineDet[22] | 74.2 | 59.4 | 82.6 | 72.9 |
ThermalDet[23] | 71.9 | 58.4 | 77.6 | 74.6 |
YOLOV5s | 84.5 | 62.2 | 90.2 | 79.0 |
YOLOV8s | 83.1 | 62.1 | 90.6 | 78.6 |
YOLOV8s-TSTS&RAM | 85.8 | 67.8 | 91.6 | 81.7 |
We set the confidence threshold to 0.1 to avoid result bias caused by confidence selection, and we set the IoU threshold to 0.3 to better display the prediction results. Lowering the confidence threshold to 0.1 reduces the probability of missed detections but increases the probability of false alarms. Figure 6 shows the detection results of YOLOv8s and RAYL. The images contain occluded and small objects, making it challenging to distinguish objects from the background. The first row shows the detection results of YOLOv8s, which exhibit a significant number of false positives. The second row shows that our RAYL produces a lower false-positive rate and higher accuracy. This improvement can be attributed to the removal of irrelevant information from the feature map, allowing the network to concentrate on the object regions.
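For reference, the visualization settings above (confidence threshold 0.1, IoU threshold 0.3) can be reproduced at inference time roughly as follows; the weight and image paths are placeholders.

```python
from ultralytics import YOLO

# Load trained weights (placeholder path) and run inference with the thresholds
# used for the qualitative comparison in Fig. 6: conf=0.1 lowers missed
# detections at the cost of more false alarms; iou=0.3 controls NMS overlap.
model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict(source="FLIR_test_images/", conf=0.1, iou=0.3, save=True)
for r in results:
    print(r.path, len(r.boxes), "detections")
```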
Ablation studies
To evaluate the impact of the proposed Two-Stage Training Strategy and Region Attention Module on detector performance, we tested detection networks with different structures and compared their detection accuracy, speed, number of parameters, and GFLOPs. Table 3 presents the results of the ablation experiment. +TSTS refers to YOLOv8s trained with our Two-Stage Training Strategy (TSTS); +TSTS&RAM is our method, which adds the Region Attention Module, built from an SE block and a fusion block (an illustrative sketch of the SE component is given after Table 3), and is trained with the two-stage training strategy. The number of parameters and GFLOPs of +TSTS are identical to the baseline YOLOv8s, yet mAP0.5 increases by 1.2% and mAP0.5:0.95 by 1.4%. The combination of TSTS and RAM (+TSTS&RAM) yields a 3.1% increase in mAP0.5 and a 2.7% increase in mAP0.5:0.95 over YOLOv8s. Despite a 1.0 ms increase in inference time, our model still runs at 277 FPS, with only a slight increase in the number of parameters and GFLOPs. The training curves of the ablation experiment are shown in Fig. 7, where our proposed method achieves a clear improvement in mAP0.5 and mAP0.5:0.95 during training. Overall, our method has the highest detection accuracy, indicating that both the Two-Stage Training Strategy and the Region Attention Module contribute to improving detector accuracy.
Table 3
Investigations of RAYL structures on FLIR.
Method | mAP0.5/% | mAP0.5:0.95/% | Time/ms | Params | GFLOPs |
YOLOV8s | 78.6 | 42.5 | 2.6 | 11126745 | 28.4 |
+TSTS | 79.8(↑1.2) | 43.9(↑1.4) | 2.6 | 11126745 | 28.4 |
+TSTS&RAM | 81.7(↑3.1) | 45.2(↑2.7) | 3.6 | 14720251 | 34.8 |
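As a point of reference for the channel-attention part of the RAM, the following is a minimal PyTorch sketch of a standard squeeze-and-excitation (SE) block; the fusion block and the exact insertion points in the YOLOv8s backbone belong to our method and are not reproduced here, and the reduction ratio r = 16 is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel attention.

    Illustrative sketch of the SE component inside the Region Attention
    Module; the reduction ratio r=16 is an assumption.
    """
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global spatial average
        self.fc = nn.Sequential(                 # excitation: channel-wise gating
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                              # re-weight feature channels
```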
We selected the widely used YOLOv5 model and the more recent YOLOv8, YOLOv9, and YOLOv10 for ablation experiments to verify the generalizability of our method, as shown in Table 4. Across all series, the proposed method achieved varying degrees of accuracy improvement, with YOLOv8s and YOLOv9s showing the most significant gains. The combination of TSTS and RAM yielded a 3.1% increase in mAP0.5 and a 2.7% increase in mAP0.5:0.95 over YOLOv8s, and a 2.8% increase in mAP0.5 and a 1.4% increase in mAP0.5:0.95 over YOLOv9s.
Table 4
Investigations of RAYL structures based on the YOLO series on FLIR.
Model | Person mAP0.5/% | Bicycle mAP0.5/% | Car mAP0.5/% | All mAP0.5/% | Person mAP0.5:0.95/% | Bicycle mAP0.5:0.95/% | Car mAP0.5:0.95/% | All mAP0.5:0.95/% | FLOPs/G | Inference time/ms |
YOLOV5s | 84.5 | 62.2 | 90.2 | 79.0 | 40.8 | 23.7 | 59.8 | 41.4 | 17.9 | 2.0 |
YOLOV5s-TSTS | 84.4 | 65.3 | 90.4 | 80.0 | 40.7 | 24.4 | 59.5 | 41.5 | 17.9 | 2.0 |
YOLOV5s-TSTS&RAM | 84.8 | 67.0 | 90.7 | 80.8(↑1.8) | 41.2 | 24.5 | 59.3 | 41.7(↑0.3) | 24.3 | 3.0 |
YOLOV8s | 83.1 | 62.1 | 90.6 | 78.6 | 42.5 | 24.3 | 60.7 | 42.5 | 28.4 | 2.6 |
YOLOV8s-TSTS | 84.7 | 63.3 | 91.4 | 79.8 | 44.3 | 24.7 | 62.8 | 43.9 | 28.4 | 2.6 |
YOLOV8s-TSTS&RAM | 85.8 | 67.8(↑5.7) | 91.6 | 81.7(↑3.1) | 46.1 | 26.3 | 63.4 | 45.2(↑2.7) | 34.8 | 3.6 |
YOLOV9s | 84.0 | 62.6 | 90.4 | 79.0 | 43.1 | 23.8 | 61.0 | 42.6 | 38.7 | 4.6 |
YOLOV9s-TSTS | 84.8 | 64.5 | 91.4 | 80.2 | 43.9 | 24.5 | 62.1 | 43.5 | 38.7 | 4.6 |
YOLOV9s-TSTS&RAM | 85.2 | 68.9 | 91.3 | 81.8(↑2.8) | 43.9 | 25.9 | 62.1 | 44.0(↑1.4) | 45.1 | 5.5 |
YOLOV10s | 83.8 | 62.1 | 90.6 | 78.8 | 42.6 | 23.4 | 60.8 | 42.3 | 24.5 | 3.4 |
YOLOV10s-TSTS | 84.1 | 64.7 | 90.5 | 79.8 | 43.2 | 22.9 | 60.6 | 42.2 | 24.5 | 3.4 |
YOLOV10s-TSTS&RAM | 84.7 | 64.8 | 91.5 | 80.4(↑1.6) | 43.9 | 25.4 | 61.7 | 43.6(↑1.3) | 30.9 | 4.1 |
We also conducted experiments with different sizes of YOLOv8. As shown in Table 5, our method improves the accuracy of the baseline model across three sizes (n, s, m), further demonstrating its general applicability. The gains are most pronounced for YOLOv8n and YOLOv8s, with mAP0.5 improving by 2.4% and 3.1%, respectively. YOLOv8s combined with our method (YOLOv8s-TSTS&RAM) achieves an mAP0.5 of 81.7%, surpassing YOLOv8m at 80.7%, while requiring less than half the FLOPs and substantially less inference time. This indicates that our method effectively exploits the feature extraction ability of small models; a simple routine for reproducing the parameter and latency measurements is sketched after Table 5.
Table 5
Investigations of RAYL structures based on YOLOV8 of different sizes on FLIR.
Model | Person mAP0.5/% | Bicycle mAP0.5/% | Car mAP0.5/% | All mAP0.5/% | Person mAP0.5:0.95/% | Bicycle mAP0.5:0.95/% | Car mAP0.5:0.95/% | All mAP0.5:0.95/% | FLOPs/G | Inference time/ms |
YOLOV8n | 81.6 | 57.5 | 90.1 | 76.4 | 40.4 | 20.7 | 59.8 | 40.3 | 8.1 | 1.0 |
YOLOV8n-TSTS | 82.5 | 56.9 | 90.4 | 76.6 | 42.0 | 21.2 | 62.2 | 41.8 | 8.1 | 1.3 |
YOLOV8n-TSTS&RAM | 84.2 | 61.4 | 90.6 | 78.8(↑2.4) | 44.2 | 23.9 | 62.5 | 43.5(↑3.2) | 9.7 | 1.8 |
YOLOV8s | 83.1 | 62.1 | 90.6 | 78.6 | 42.5 | 24.3 | 60.7 | 42.5 | 28.4 | 2.6 |
YOLOV8s-TSTS | 84.7 | 63.3 | 91.4 | 79.8 | 44.3 | 24.7 | 62.8 | 43.9 | 28.4 | 2.6 |
YOLOV8s-TSTS&RAM | 85.8 | 67.8(↑5.7) | 91.6 | 81.7(↑3.1) | 46.1 | 26.3 | 63.4 | 45.2(↑2.7) | 34.8 | 3.6 |
YOLOV8m | 86.0 | 64.7 | 91.5 | 80.7 | 45.8 | 24.7 | 63.4 | 44.7 | 85.3 | 5.6 |
YOLOV8m-TSTS | 86.0 | 66.2 | 91.6 | 81.3 | 46.8 | 25.5 | 64.3 | 45.5 | 85.4 | 5.6 |
YOLOV8m-TSTS&RAM | 86.5 | 69.6 | 92.0 | 82.7(↑2.0) | 47.5 | 27.0 | 64.5 | 46.3(↑1.6) | 99.8 | 7.6 |
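The parameter counts and inference times reported in Tables 3-5 can be checked approximately with a simple counting-and-timing routine such as the sketch below; it assumes batch size 1 at a 640×640 input on a CUDA device and is illustrative rather than the exact benchmarking script used in our experiments.

```python
import time
import torch

def measure(model: torch.nn.Module, imgsz: int = 640, runs: int = 100, device: str = "cuda"):
    """Count parameters and estimate single-image latency/FPS at 640x640 (illustrative)."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.zeros(1, 3, imgsz, imgsz, device=device)
    with torch.no_grad():
        for _ in range(10):                      # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / runs * 1e3
    return n_params, latency_ms, 1000.0 / latency_ms  # params, ms per image, FPS
```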
4.2 Results on the VisDrone2019 dataset
To validate the proposed YOLOv8s-TSTS&RAM model on visible-light object detection, we conducted additional experiments on the VisDrone2019 dataset, which consists of visible-light images captured by various drone-mounted cameras. The VisDrone2019 dataset was collected by the AISKYEYE team at the Machine Learning and Data Mining Laboratory of Tianjin University. The benchmark includes 288 video clips comprising 261,908 frames and 10,209 static images, covering ten categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. We used the 10,209 static images for our experiments, with 6,471 for training, 548 for validation, and 3,190 for testing.
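For reference, the ten VisDrone2019 categories and the split above can be passed to the detector through a dataset configuration along the following lines; the directory layout and file name are placeholders rather than the official VisDrone2019 structure.

```python
import yaml  # PyYAML

# Hypothetical dataset configuration for the ten VisDrone2019 classes and the
# train/val/test split used in our experiments (paths are placeholders).
visdrone_cfg = {
    "path": "datasets/VisDrone2019",
    "train": "images/train",   # 6,471 images
    "val": "images/val",       # 548 images
    "test": "images/test",     # 3,190 images
    "names": [
        "pedestrian", "people", "bicycle", "car", "van",
        "truck", "tricycle", "awning-tricycle", "bus", "motor",
    ],
}

with open("visdrone.yaml", "w") as f:
    yaml.safe_dump(visdrone_cfg, f, sort_keys=False)
```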
We compared our detector RAYL with other detectors; the results are shown in Table 6. Our RAYL uses two-stage training and adds a Region Attention Module to the backbone of YOLOv8s. RAYL achieves a 5.0% improvement in mAP0.5 and a 3.8% improvement in mAP0.5:0.95 over YOLOv8s.
Table 6
Quantitative comparison on the VisDrone2019 dataset.
Model | Backbone | mAP0.5 (%) | mAP0.5:0.95 (%) |
Faster R-CNN[18] | ResNet | 37.8 | 21.5 |
Cascade R-CNN[24] | ResNet | 39.4 | 24.2 |
CenterNet[25] | ResNet50[26] | 39.1 | 22.8 |
SSD[19] | MobileNetV2[27] | 33.7 | 19 |
YOLOv5s | CSP-Darknet-53 | 31.7 | 17.1 |
YOLOV8s | CSP-Darknet-53 | 37.1 | 22.0 |
YOLOV8s-TSTS&RAM | CSP-Darknet-53 | 42.1 | 25.8 |
The results of the ablation experiments, shown in Table 7, indicate that YOLOv8s-TSTS&RAM improves mAP0.5 from 37.1% to 42.1% and mAP0.5:0.95 from 22.0% to 25.8% compared with the baseline YOLOv8s network, with accuracy improvements observed across all categories. The training curves of the ablation experiment are shown in Fig. 8. During training, the proposed method still achieves a clear improvement in mAP0.5 and mAP0.5:0.95 on the visible-light dataset, indicating that it is robust for small-object detection tasks and allows the backbone network to focus on and extract object feature information more effectively.
Table 7
Investigations of RAYL structures on VisDrone2019.
Class / Metric | YOLOv8s | YOLOV8s-TSTS | YOLOV8s-TSTS&RAM |
Pedestrian/% | 39.5 | 45.2 | 47.4(↑7.9) |
People/% | 30.5 | 34.6 | 35.8 |
Bicycle/% | 10.3 | 15.2 | 14.5 |
Car/% | 78.6 | 81.2 | 81.9 |
Van/% | 44.1 | 45.6 | 47.7 |
Truck/% | 34.6 | 37.0 | 37.4 |
Tricycle/% | 26.3 | 27.8 | 30.1 |
Awning-Tricycle/% | 14.1 | 14.6 | 16.5 |
Bus/% | 52.2 | 60.5 | 61.4 |
Motor/% | 41.0 | 47.1 | 48.9 |
mAP0.5 | 37.1 | 40.9(↑3.8) | 42.1(↑5.0) |
mAP0.5:0.95 | 22.0 | 24.7(↑2.7) | 25.8(↑3.8) |
Time/ms | 2.7 | 2.7 | 3.3 |
FPS | 370 | 370 | 303 |
Params | 11126745 | 11126745 | 14720251 |
GFLOPs | 28.5 | 28.5 | 34.9 |
We also selected YOLOv5, YOLOv8, YOLOv9, and YOLOv10 for ablation experiments to verify the generalizability of our method on VisDrone2019, as shown in Table 8. Across all series, the proposed method achieved varying degrees of accuracy improvement; the combination of TSTS and RAM yielded a 5.0% increase in mAP0.5 and a 3.8% increase in mAP0.5:0.95 over YOLOv8s.
Table 8
Investigations of RAYL structures based on the YOLO series on VisDrone2019.
Model | mAP0.5 (%) | mAP0.5:0.95 (%) | FLOPs(G) | Inference time(ms) |
YOLOV5s | 31.7 | 17.1 | 17.9 | 1.6 |
YOLOV5s-TSTS | 32.6 | 17.4 | 17.9 | 1.6 |
YOLOV5s-TSTS&RAM | 34.2(↑2.5) | 18.7(↑1.6) | 24.3 | 2.3 |
YOLOV8s | 37.1 | 22.0 | 28.5 | 2.7 |
YOLOV8s-TSTS | 40.9 | 24.7 | 28.5 | 2.7 |
YOLOV8s-TSTS&RAM | 42.1(↑5.0) | 25.8(↑3.8) | 34.9 | 3.3 |
YOLOV9s | 41.6 | 25.4 | 38.8 | 6.3 |
YOLOV9s-TSTS | 41.9 | 25.4 | 38.8 | 6.3 |
YOLOV9s-TSTS&RAM | 43.3(↑1.7) | 26.5(↑1.1) | 45.2 | 7.0 |
YOLOV10s | 38.9 | 23.4 | 24.5 | 3.3 |
YOLOV10s-TSTS | 38.4 | 23.0 | 24.5 | 3.3 |
YOLOV10s-TSTS&RAM | 40.4(↑1.5) | 24.6(↑1.2) | 30.9 | 4.9 |
Table 9 presents the results of YOLOv8 across different sizes, demonstrating that our method achieves clear accuracy improvements for the n, s, and m sizes, with mAP0.5 increasing by 2.6%, 5.0%, and 1.8%, respectively, highlighting the robust adaptability of the model on both infrared and visible-light data.
Table 9
Investigations of RAYL structures based on YOLOV8 of different sizes on VisDrone2019.
Model | mAP0.5 (%) | mAP0.5:0.95 (%) | FLOPs(G) | Inference time(ms) |
YOLOV8n | 33.3 | 19.4 | 8.1 | 1.7 |
YOLOV8n-TSTS | 34.2 | 20.2 | 8.1 | 1.7 |
YOLOV8n-TSTS&RAM | 35.9(↑2.6) | 21.4(↑2.0) | 9.7 | 2.1 |
YOLOV8s | 37.1 | 22.0 | 28.5 | 2.7 |
YOLOV8s-TSTS | 40.9 | 24.7 | 28.5 | 2.7 |
YOLOV8s-TSTS&RAM | 42.1(↑5.0) | 25.8(↑3.8) | 34.9 | 3.3 |
YOLOV8m | 43.8 | 27.1 | 85.4 | 4.3 |
YOLOV8m-TSTS | 44.0 | 27.0 | 85.4 | 4.3 |
YOLOV8m-TSTS&RAM | 45.6(↑1.8) | 28.3(↑1.2) | 99.8 | 6.0 |