To evaluate object detection performance on infrared images, we train and test detectors on the widely used FLIR infrared dataset. First, we establish YOLOv8s as the baseline. We compare the proposed method with this baseline and with other state-of-the-art object detectors, including Faster R-CNN [18], SSD [19], and YOLOv3 [20]. Second, we verify the impact of the attention module on detection performance through ablation experiments that compare precision, speed, and computational complexity.
Our RAYL is implemented in Python 3.11.4 with PyTorch 2.1.0 and CUDA 12.1 on an NVIDIA GeForce RTX 4070. The training parameters are as follows: stochastic gradient descent (SGD) optimizer, momentum 0.937, weight decay 0.0005, and 300 training epochs. The input image size is 640×640.
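For reproducibility, the training configuration above can be expressed with the public Ultralytics training API roughly as follows; this is a minimal sketch in which `flir.yaml` is a placeholder dataset description, not our exact training script.

```python
from ultralytics import YOLO

# Baseline YOLOv8s trained with the settings listed above
# (SGD, momentum 0.937, weight decay 0.0005, 300 epochs, 640x640 input).
# "flir.yaml" is a hypothetical dataset description file.
model = YOLO("yolov8s.pt")
model.train(
    data="flir.yaml",
    epochs=300,
    imgsz=640,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005,
    device=0,  # single RTX 4070
)
```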
4.1 Results on the FLIR dataset
The FLIR dataset, released by FLIR Systems in July 2018, contains 8,862 training images and 1,366 testing images. It covers three object categories: person, bicycle, and car. The images were captured of pedestrians and vehicles on highways and streets in Santa Barbara, California, during the day (60%) and at night (40%) between November and May of the following year.
We compared our detector RAYL with other deep-learning detectors; the results are shown in Table 2, where YOLOv3-SPP is an improved YOLOv3 for infrared object detection and ThermalDet is an improved RefineDet for infrared object detection. Our RAYL uses two-stage training and adds a Region Attention Module to the backbone of YOLOv8s. The proposed RAYL achieves a 3.1% improvement in mAP0.5 over YOLOv8s.
Table 2
Quantitative comparison on the FLIR dataset.
Model | Person mAP0.5/% | Bicycle mAP0.5/% | Car mAP0.5/% | All mAP0.5/% |
Faster R-CNN[18] | 64.5 | 49.7 | 70.0 | 63.2 |
SSD[19] | 64.2 | 49.5 | 71.2 | 48.5 |
YOLOv3[20] | 70.0 | 53.7 | 77.4 | 58.0 |
YOLOv3-SPP[21] | 70.0 | 55.6 | 73.0 | 66.8 |
RefineDet[22] | 74.2 | 59.4 | 82.6 | 72.9 |
ThermalDet[23] | 71.9 | 58.4 | 77.6 | 74.6 |
YOLOV5s | 84.5 | 62.2 | 90.2 | 79.0 |
YOLOV8s | 83.1 | 62.1 | 90.6 | 78.6 |
YOLOV8s-TSTS&RAM | 85.8 | 67.8 | 91.6 | 81.7 |
We set the confidence threshold to 0.1 to avoid result bias caused by confidence selection, and we set the IoU threshold to 0.3 to better display the prediction results. Lowering the confidence threshold to 0.1 reduces the probability of missed detections but increases the probability of false alarms. Figure 6 shows the detection results of YOLOv8s and RAYL. The images contain occluded and small objects, making it challenging to distinguish objects from the background. The first row shows the detection results of YOLOv8s, which exhibit a significant number of false positives. The second row shows that our RAYL produces a lower false-positive rate and higher accuracy. This improvement can be attributed to the removal of irrelevant information from the feature map, allowing the network to concentrate on the object regions.
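For reference, the visualization settings above (confidence threshold 0.1, IoU threshold 0.3) can be reproduced at inference time roughly as follows; the weight and image paths are placeholders.

```python
from ultralytics import YOLO

# Load trained weights (placeholder path) and run inference with the thresholds
# used for the qualitative comparison in Fig. 6: conf=0.1 lowers missed
# detections at the cost of more false alarms; iou=0.3 controls NMS overlap.
model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict(source="FLIR_test_images/", conf=0.1, iou=0.3, save=True)
for r in results:
    print(r.path, len(r.boxes), "detections")
```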
Ablation studies
To evaluate the impact of the proposed Two-Stage Training Strategy and Region Attention Module on detector performance, we tested detection networks with different structures and compared their detection accuracy, speed, number of parameters, and GFLOPs. Table 3 presents the results of the ablation experiment. +TSTS refers to YOLOv8s trained with our Two-Stage Training Strategy (TSTS); +TSTS&RAM is our method, which adds the Region Attention Module, built from an SE block and a fusion block (an illustrative sketch of the SE component is given after Table 3), and is trained with the two-stage training strategy. The number of parameters and GFLOPs of +TSTS are identical to the baseline YOLOv8s, yet mAP0.5 increases by 1.2% and mAP0.5:0.95 by 1.4%. The combination of TSTS and RAM (+TSTS&RAM) yields a 3.1% increase in mAP0.5 and a 2.7% increase in mAP0.5:0.95 over YOLOv8s. Despite a 1.0 ms increase in inference time, our model still runs at 277 FPS, with only a slight increase in the number of parameters and GFLOPs. The training curves of the ablation experiment are shown in Fig. 7, where our proposed method achieves a clear improvement in mAP0.5 and mAP0.5:0.95 during training. Overall, our method has the highest detection accuracy, indicating that both the Two-Stage Training Strategy and the Region Attention Module contribute to improving detector accuracy.
Table 3
Investigations of RAYL structures on FLIR.
Method | mAP0.5/% | mAP0.5:0.95/% | Time/ms | Params | GFLOPs |
YOLOV8s | 78.6 | 42.5 | 2.6 | 11126745 | 28.4 |
+TSTS | 79.8(↑1.2) | 43.9(↑1.4) | 2.6 | 11126745 | 28.4 |
+TSTS&RAM | 81.7(↑3.1) | 45.2(↑2.7) | 3.6 | 14720251 | 34.8 |
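As a point of reference for the channel-attention part of the RAM, the following is a minimal PyTorch sketch of a standard squeeze-and-excitation (SE) block; the fusion block and the exact insertion points in the YOLOv8s backbone belong to our method and are not reproduced here, and the reduction ratio r = 16 is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel attention.

    Illustrative sketch of the SE component inside the Region Attention
    Module; the reduction ratio r=16 is an assumption.
    """
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global spatial average
        self.fc = nn.Sequential(                 # excitation: channel-wise gating
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                              # re-weight feature channels
```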
We selected the widely used YOLOv5 model and the more recent YOLOv8, YOLOv9, and YOLOv10 for ablation experiments to verify the generalizability of our method, as shown in Table 4. Across all series, the proposed method achieved varying degrees of accuracy improvement, with YOLOv8s and YOLOv9s showing the most significant gains. The combination of TSTS and RAM yielded a 3.1% increase in mAP0.5 and a 2.7% increase in mAP0.5:0.95 over YOLOv8s, and a 2.8% increase in mAP0.5 and a 1.4% increase in mAP0.5:0.95 over YOLOv9s.
Table 4
Investigations of RAYL structures based on the YOLO series on FLIR.
Model | Person mAP0.5/% | Bicycle mAP0.5/% | Car mAP0.5/% | All mAP0.5/% | Person mAP0.5:0.95/% | Bicycle mAP0.5:0.95/% | Car mAP0.5:0.95/% | All mAP0.5:0.95/% | FLOPs/G | Inference time/ms |
YOLOV5s | 84.5 | 62.2 | 90.2 | 79.0 | 40.8 | 23.7 | 59.8 | 41.4 | 17.9 | 2.0 |
YOLOV5s-TSTS | 84.4 | 65.3 | 90.4 | 80.0 | 40.7 | 24.4 | 59.5 | 41.5 | 17.9 | 2.0 |
YOLOV5s-TSTS&RAM | 84.8 | 67.0 | 90.7 | 80.8(↑1.8) | 41.2 | 24.5 | 59.3 | 41.7(↑0.3) | 24.3 | 3.0 |
YOLOV8s | 83.1 | 62.1 | 90.6 | 78.6 | 42.5 | 24.3 | 60.7 | 42.5 | 28.4 | 2.6 |
YOLOV8s-TSTS | 84.7 | 63.3 | 91.4 | 79.8 | 44.3 | 24.7 | 62.8 | 43.9 | 28.4 | 2.6 |
YOLOV8s-TSTS&RAM | 85.8 | 67.8(↑5.7) | 91.6 | 81.7(↑3.1) | 46.1 | 26.3 | 63.4 | 45.2(↑2.7) | 34.8 | 3.6 |
YOLOV9s | 84.0 | 62.6 | 90.4 | 79.0 | 43.1 | 23.8 | 61.0 | 42.6 | 38.7 | 4.6 |
YOLOV9s-TSTS | 84.8 | 64.5 | 91.4 | 80.2 | 43.9 | 24.5 | 62.1 | 43.5 | 38.7 | 4.6 |
YOLOV9s-TSTS&RAM | 85.2 | 68.9 | 91.3 | 81.8(↑2.8) | 43.9 | 25.9 | 62.1 | 44.0(↑1.4) | 45.1 | 5.5 |
YOLOV10s | 83.8 | 62.1 | 90.6 | 78.8 | 42.6 | 23.4 | 60.8 | 42.3 | 24.5 | 3.4 |
YOLOV10s-TSTS | 84.1 | 64.7 | 90.5 | 79.8 | 43.2 | 22.9 | 60.6 | 42.2 | 24.5 | 3.4 |
YOLOV10s-TSTS&RAM | 84.7 | 64.8 | 91.5 | 80.4(↑1.6) | 43.9 | 25.4 | 61.7 | 43.6(↑1.3) | 30.9 | 4.1 |
We also conducted experiments with different sizes of YOLOv8. As shown in Table 5, our method improves the accuracy of the baseline model across three sizes (n, s, m), further demonstrating its general applicability. The gains are most pronounced for YOLOv8n and YOLOv8s, with mAP0.5 improving by 2.4% and 3.1%, respectively. YOLOv8s combined with our method (YOLOv8s-TSTS&RAM) achieves an mAP0.5 of 81.7%, surpassing YOLOv8m at 80.7%, while requiring less than half the FLOPs and substantially less inference time. This indicates that our method effectively exploits the feature extraction ability of small models; a simple routine for reproducing the parameter and latency measurements is sketched after Table 5.
Table 5
Investigations of RAYL structures based on YOLOV8 of different sizes on FLIR.
Model | Person mAP0.5/% | Bicycle mAP0.5/% | Car mAP0.5/% | All mAP0.5/% | Person mAP0.5:0.95/% | Bicycle mAP0.5:0.95/% | Car mAP0.5:0.95/% | All mAP0.5:0.95/% | FLOPs/G | Inference time/ms |
YOLOV8n | 81.6 | 57.5 | 90.1 | 76.4 | 40.4 | 20.7 | 59.8 | 40.3 | 8.1 | 1.0 |
YOLOV8n-TSTS | 82.5 | 56.9 | 90.4 | 76.6 | 42.0 | 21.2 | 62.2 | 41.8 | 8.1 | 1.3 |
YOLOV8n-TSTS&RAM | 84.2 | 61.4 | 90.6 | 78.8(↑2.4) | 44.2 | 23.9 | 62.5 | 43.5(↑3.2) | 9.7 | 1.8 |
YOLOV8s | 83.1 | 62.1 | 90.6 | 78.6 | 42.5 | 24.3 | 60.7 | 42.5 | 28.4 | 2.6 |
YOLOV8s-TSTS | 84.7 | 63.3 | 91.4 | 79.8 | 44.3 | 24.7 | 62.8 | 43.9 | 28.4 | 2.6 |
YOLOV8s-TSTS&RAM | 85.8 | 67.8(↑5.7) | 91.6 | 81.7(↑3.1) | 46.1 | 26.3 | 63.4 | 45.2(↑2.7) | 34.8 | 3.6 |
YOLOV8m | 86.0 | 64.7 | 91.5 | 80.7 | 45.8 | 24.7 | 63.4 | 44.7 | 85.3 | 5.6 |
YOLOV8m-TSTS | 86.0 | 66.2 | 91.6 | 81.3 | 46.8 | 25.5 | 64.3 | 45.5 | 85.4 | 5.6 |
YOLOV8m-TSTS&RAM | 86.5 | 69.6 | 92.0 | 82.7(↑2.0) | 47.5 | 27.0 | 64.5 | 46.3(↑1.6) | 99.8 | 7.6 |
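The parameter counts and inference times reported in Tables 3-5 can be checked approximately with a simple counting-and-timing routine such as the sketch below; it assumes batch size 1 at a 640×640 input on a CUDA device and is illustrative rather than the exact benchmarking script used in our experiments.

```python
import time
import torch

def measure(model: torch.nn.Module, imgsz: int = 640, runs: int = 100, device: str = "cuda"):
    """Count parameters and estimate single-image latency/FPS at 640x640 (illustrative)."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.zeros(1, 3, imgsz, imgsz, device=device)
    with torch.no_grad():
        for _ in range(10):                      # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / runs * 1e3
    return n_params, latency_ms, 1000.0 / latency_ms  # params, ms per image, FPS
```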
4.2 Results on the VisDrone2019 dataset
To validate the proposed YOLOv8s-TSTS&RAM model on visible-light object detection, we conducted additional experiments on the VisDrone2019 dataset, which consists of visible-light images captured by various drone-mounted cameras. The VisDrone2019 dataset was collected by the AISKYEYE team at the Machine Learning and Data Mining Laboratory of Tianjin University. The benchmark includes 288 video clips comprising 261,908 frames and 10,209 static images, covering ten categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. We used the 10,209 static images for our experiments, with 6,471 for training, 548 for validation, and 3,190 for testing.
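For reference, the ten VisDrone2019 categories and the split above can be passed to the detector through a dataset configuration along the following lines; the directory layout and file name are placeholders rather than the official VisDrone2019 structure.

```python
import yaml  # PyYAML

# Hypothetical dataset configuration for the ten VisDrone2019 classes and the
# train/val/test split used in our experiments (paths are placeholders).
visdrone_cfg = {
    "path": "datasets/VisDrone2019",
    "train": "images/train",   # 6,471 images
    "val": "images/val",       # 548 images
    "test": "images/test",     # 3,190 images
    "names": [
        "pedestrian", "people", "bicycle", "car", "van",
        "truck", "tricycle", "awning-tricycle", "bus", "motor",
    ],
}

with open("visdrone.yaml", "w") as f:
    yaml.safe_dump(visdrone_cfg, f, sort_keys=False)
```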
We compared our detector RAYL with other detectors; the results are shown in Table 6. Our RAYL uses two-stage training and adds a Region Attention Module to the backbone of YOLOv8s. RAYL achieves a 5.0% improvement in mAP0.5 and a 3.8% improvement in mAP0.5:0.95 over YOLOv8s.
Table 6
Quantitative comparison on the VisDrone2019 dataset.
Model | Backbone | mAP0.5 (%) | mAP0.5:0.95 (%) |
Faster R-CNN[18] | ResNet | 37.8 | 21.5 |
Cascade R-CNN[24] | ResNet | 39.4 | 24.2 |
CenterNet[25] | ResNet50[26] | 39.1 | 22.8 |
SSD[19] | MobileNetV2[27] | 33.7 | 19 |
YOLOv5s | CSP-Darknet-53 | 31.7 | 17.1 |
YOLOV8s | CSP-Darknet-53 | 37.1 | 22.0 |
YOLOV8s-TSTS&RAM | CSP-Darknet-53 | 42.1 | 25.8 |
The results of the ablation experiments, shown in Table 7, indicate that YOLOv8s-TSTS&RAM improves mAP0.5 from 37.1% to 42.1% and mAP0.5:0.95 from 22.0% to 25.8% compared with the baseline YOLOv8s network, with accuracy improvements observed across all categories. The training curves of the ablation experiment are shown in Fig. 8. During training, the proposed method still achieves a clear improvement in mAP0.5 and mAP0.5:0.95 on the visible-light dataset, indicating that it is robust for small-object detection tasks and allows the backbone network to focus on and extract object feature information more effectively.
Table 7
Investigations of RAYL structures on VisDrone2019.
Class / Metric | YOLOv8s | YOLOV8s-TSTS | YOLOV8s-TSTS&RAM |
Pedestrian/% | 39.5 | 45.2 | 47.4(↑7.9) |
People/% | 30.5 | 34.6 | 35.8 |
Bicycle/% | 10.3 | 15.2 | 14.5 |
Car/% | 78.6 | 81.2 | 81.9 |
Van/% | 44.1 | 45.6 | 47.7 |
Truck/% | 34.6 | 37.0 | 37.4 |
Tricycle/% | 26.3 | 27.8 | 30.1 |
Awning-Tricycle/% | 14.1 | 14.6 | 16.5 |
Bus/% | 52.2 | 60.5 | 61.4 |
Motor/% | 41.0 | 47.1 | 48.9 |
mAP0.5 | 37.1 | 40.9(↑3.8) | 42.1(↑5.0) |
mAP0.5:0.95 | 22.0 | 24.7(↑2.7) | 25.8(↑3.8) |
Time/ms | 2.7 | 2.7 | 3.3 |
FPS | 370 | 370 | 303 |
Params | 11126745 | 11126745 | 14720251 |
GFLOPs | 28.5 | 28.5 | 34.9 |
We also selected YOLOv5, YOLOv8, YOLOv9, and YOLOv10 for ablation experiments to verify the generalizability of our method on VisDrone2019, as shown in Table 8. Across all series, the proposed method achieved varying degrees of accuracy improvement; the combination of TSTS and RAM yielded a 5.0% increase in mAP0.5 and a 3.8% increase in mAP0.5:0.95 over YOLOv8s.
Table 8
Investigations of RAYL structures based on the YOLO series on VisDrone2019.
Model | mAP0.5 (%) | mAP0.5:0.95 (%) | FLOPs(G) | Inference time(ms) |
YOLOV5s | 31.7 | 17.1 | 17.9 | 1.6 |
YOLOV5s-TSTS | 32.6 | 17.4 | 17.9 | 1.6 |
YOLOV5s-TSTS&RAM | 34.2(↑2.5) | 18.7(↑1.6) | 24.3 | 2.3 |
YOLOV8s | 37.1 | 22.0 | 28.5 | 2.7 |
YOLOV8s-TSTS | 40.9 | 24.7 | 28.5 | 2.7 |
YOLOV8s-TSTS&RAM | 42.1(↑5.0) | 25.8(↑3.8) | 34.9 | 3.3 |
YOLOV9s | 41.6 | 25.4 | 38.8 | 6.3 |
YOLOV9s-TSTS | 41.9 | 25.4 | 38.8 | 6.3 |
YOLOV9s-TSTS&RAM | 43.3(↑1.7) | 26.5(↑1.1) | 45.2 | 7.0 |
YOLOV10s | 38.9 | 23.4 | 24.5 | 3.3 |
YOLOV10s-TSTS | 38.4 | 23.0 | 24.5 | 3.3 |
YOLOV10s-TSTS&RAM | 40.4(↑1.5) | 24.6(↑1.2) | 30.9 | 4.9 |
Table 9 presents the results of YOLOv8 across different sizes, demonstrating that our method achieves clear accuracy improvements for the n, s, and m sizes, with mAP0.5 increasing by 2.6%, 5.0%, and 1.8%, respectively, highlighting the robust adaptability of the model on both infrared and visible-light data.
Table 9
Investigations of RAYL structures based on YOLOV8 of different sizes on VisDrone2019.
Model | mAP0.5 (%) | mAP0.5:0.95 (%) | FLOPs(G) | Inference time(ms) |
YOLOV8n | 33.3 | 19.4 | 8.1 | 1.7 |
YOLOV8n-TSTS | 34.2 | 20.2 | 8.1 | 1.7 |
YOLOV8n-TSTS&RAM | 35.9(↑2.6) | 21.4(↑2.0) | 9.7 | 2.1 |
YOLOV8s | 37.1 | 22.0 | 28.5 | 2.7 |
YOLOV8s-TSTS | 40.9 | 24.7 | 28.5 | 2.7 |
YOLOV8s-TSTS&RAM | 42.1(↑5.0) | 25.8(↑3.8) | 34.9 | 3.3 |
YOLOV8m | 43.8 | 27.1 | 85.4 | 4.3 |
YOLOV8m-TSTS | 44.0 | 27.0 | 85.4 | 4.3 |
YOLOV8m-TSTS&RAM | 45.6(↑1.8) | 28.3(↑1.2) | 99.8 | 6.0 |