4.1 Experimental Platform and Relevant Metrics
Network performance is evaluated primarily through mAP during training and the performance of the trained network on the validation set. Precision (P), recall (R), and mAP are adopted as metrics to quantitatively assess the detection results. The expressions for P and R are as follows:
$$P = TP/\left(TP+FP\right)$$
$$R = TP/\left(TP+FN\right)$$
True positives (TP): The number of samples that are positive and correctly classified as positive by the classifier; True negatives (TN): The number of samples that are negative and correctly classified as negative by the classifier; False positives (FP): The number of samples that are negative but incorrectly classified as positive by the classifier; False negatives (FN): The number of samples that are positive but incorrectly classified as negative by the classifier.
Average Precision (AP) is the area under the P-R curve; a higher AP generally indicates better classifier performance. mAP is the mean of the per-class AP values and provides a comprehensive measure of the average precision over all detected classes. Table 2 presents the experimental platform utilized in this study.
Table 2
Platform | Specifications |
CPU | 13th Gen Intel(R) Core(TM) i5-13600KF |
GPU | NVIDIA GeForce RTX 4070Ti |
Operating System | Windows 11 |
Framework | PyTorch 1.21 |
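To make the metric definitions above concrete, the following Python sketch computes P, R, and AP from a ranked list of detections using all-point interpolation, as in common VOC-style evaluators; it is an illustrative example, not the exact evaluation code used in our experiments:

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP as the area under the P-R curve (all-point interpolation).

    scores: confidence of each detection; is_tp: 1 if the detection
    matches a ground-truth box, else 0; n_gt: number of ground-truth
    objects of this class (TP + FN).
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / n_gt                  # R = TP / (TP + FN)
    precision = tp_cum / (tp_cum + fp_cum)  # P = TP / (TP + FP)
    # Monotone precision envelope, then integrate over recall.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

mAP then follows as the mean of the per-class AP values.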
4.2 Experimental Dataset
To demonstrate the versatility of our model, we conducted experiments on three open-source datasets: NEU-DET[37], GC10[38], and Magnetic Tile Defect (MTD)[39].
The NEU-DET dataset is an open dataset specifically designed for hot-rolled steel strip defect detection. It consists of 1800 images, with 1440 randomly selected for training and 360 for validation. The images are 200×200 pixels and were resized to 224×224 before being input to the network. The dataset includes six defect classes: crazing, inclusion, patches, pitted_surface, rolled-in_scale, and scratches. Figure 6(a) shows the distribution of each class.
The GC10 dataset, collected from real industrial settings, is specifically designed for steel plate surface defect detection. It covers 10 defect classes and comprises 3570 grayscale images, of which 2294 are labeled and available for use. We randomly selected 1836 images for training and 458 for validation. Figure 6(b) illustrates the distribution of each class.
The MTD dataset is a publicly available dataset for magnetic tile defect detection. It consists of 1344 images, with 392 images containing defects. We randomly selected 314 images for training and 78 images for validation. The dataset encompasses five defect classes: MT_Uneven, MT_Blowhole, MT_Break, MT_Crack, and MT_Fray. Figure 6(c) depicts the distribution of each class.
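The random train/validation splits described above can be reproduced with a simple helper; the function name and fixed seed below are illustrative assumptions, not the exact scripts used in this work:

```python
import random

def split_dataset(image_ids, n_train, seed=0):
    """Shuffle image ids reproducibly and split into train/val lists."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# NEU-DET: 1800 images -> 1440 train / 360 val
train_ids, val_ids = split_dataset(range(1800), 1440)
```

The same helper applies to GC10 (1836/458) and MTD (314/78) with the corresponding counts.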
4.3 Ablation Experiments
Results on the NEU-DET dataset are presented in Table 3, indicating a significant improvement in the performance of the YOLOv8-FCS model by introducing three technical improvements: AIS, DyHeadBlock, and EMABTK. The complete model, incorporating all three improvements, achieved mAP50 and mAP50-95 scores of 77.3% and 44.5%, respectively, outperforming the baseline YOLOv8n model with 74.1% and 41.8%. Moreover, despite the increased parameters and computational complexity observed in the complete model, its recall improved, demonstrating enhanced detection capabilities.
Similar effectiveness of these technical improvements was observed on the GC10 dataset, as shown in Table 4. The complete model achieved mAP50 and mAP50-95 scores of 65.5% and 33.6%, respectively, exhibiting stable improvement compared to the baseline model. Despite the comparatively modest progress on this dataset, considering the potentially higher complexity or diversity of the GC10 dataset, such improvement still validates the robustness and adaptability of the YOLOv8-FCS model. Changes in precision and recall also reflect the model's adaptability to different scenarios.
On the MTD dataset, the complete model also demonstrated excellent performance, as shown in Table 5, achieving mAP50 and mAP50-95 scores of 73.8% and 50.3%, respectively. Notably, the complete model achieved the highest precision of 84.8%, showcasing its significant capability in reducing false positives. The improvement in recall also indicates the model's ability to provide more comprehensive coverage of actual targets, which is crucial for applications such as defect detection that require high precision and recall.
Figure 7 depicts the mAP curves during training on the NEU-DET, GC10, and MTD datasets. The mAP values of the YOLOv8-FCS model surpass those of the YOLOv8 model on all three datasets. Furthermore, a comprehensive analysis of the ablation results shows that the three technical improvements (AIS, DyHeadBlock, and EMABTK) are critical for enhancing the performance of the YOLOv8-FCS model. They improve detection accuracy (mAP) and optimize precision and recall, enabling the model to deliver better performance and adaptability across application scenarios. While these improvements inevitably increase parameters and computational complexity, the trade-off is reasonable given the significant performance gain. Overall, the YOLOv8-FCS model, empowered by these improvements, demonstrates strong competitiveness and broad application potential in object detection.
Table 3
The results of ablation experiments on the NEU-DET dataset
Method | Baseline(YOLOv8n) | Our Models |
AIS | | √ | | | √ | | √ | √ |
DyHeadBlock | | | √ | | √ | √ | | √ |
EMABTK | | | | √ | | √ | √ | √ |
mAP50 (%) | 74.1 | 75.6 | 75.4 | 74.3 | 76.0 | 73.7 | 75.8 | 77.3 |
mAP50-95 (%) | 41.8 | 43.4 | 43.8 | 42.0 | 44.2 | 41.1 | 43.3 | 44.5 |
Params (M) | 2.87 | 2.87 | 3.32 | 3.62 | 3.32 | 4.08 | 3.62 | 4.08 |
FLOPs (G) | 8.1 | 8.1 | 9.6 | 9.7 | 9.6 | 11.2 | 9.7 | 11.2 |
FPS | 157.0 | 156.3 | 108.2 | 125.4 | 107.5 | 93.6 | 124.7 | 93.0 |
P (%) | 74.2 | 74.0 | 71.1 | 67.3 | 73.1 | 66.5 | 66.3 | 71.7 |
R (%) | 67.0 | 68.5 | 69.6 | 69.7 | 70.0 | 70.0 | 71.8 | 72.7 |
Table 4
The results of ablation experiments on the GC10 dataset
Method | Baseline(YOLOv8n) | Our Models |
AIS | | √ | | | √ | | √ | √ |
DyHeadBlock | | | √ | | √ | √ | | √ |
EMABTK | | | | √ | | √ | √ | √ |
mAP50 (%) | 64.0 | 67.6 | 64.3 | 65.1 | 66.0 | 64.9 | 66.6 | 65.5 |
mAP50-95 (%) | 32.5 | 33.8 | 33.1 | 33.2 | 32.5 | 32.0 | 33.2 | 33.6 |
Params (M) | 2.87 | 2.87 | 3.33 | 3.62 | 3.33 | 4.08 | 3.62 | 4.08 |
FLOPs (G) | 8.1 | 8.1 | 9.6 | 9.7 | 9.6 | 11.2 | 9.7 | 11.2 |
FPS | 76.3 | 75.6 | 59.3 | 65.5 | 58.7 | 57.6 | 64.7 | 57.0 |
P (%) | 65.2 | 68.4 | 70.7 | 65.9 | 67.6 | 65.7 | 68.1 | 66.9 |
R (%) | 62.4 | 65.0 | 58.8 | 62.4 | 66.1 | 61.7 | 65.5 | 62.9 |
Table 5
The results of ablation experiments on the MTD dataset
Method | Baseline(YOLOv8n) | Our Models |
AIS | | √ | | | √ | | √ | √ |
DyHeadBlock | | | √ | | √ | √ | | √ |
EMABTK | | | | √ | | √ | √ | √ |
mAP50 (%) | 71.2 | 74.2 | 72.6 | 73.7 | 71.9 | 74.7 | 72.7 | 73.8 |
mAP50-95 (%) | 48.7 | 51.5 | 49.1 | 51.5 | 50.2 | 50.4 | 51.4 | 50.3 |
Params (M) | 2.87 | 2.87 | 3.32 | 3.62 | 3.32 | 4.08 | 3.62 | 4.08 |
FLOPs (G) | 8.1 | 8.1 | 9.6 | 9.7 | 9.6 | 11.2 | 9.7 | 11.2 |
FPS | 107.8 | 107.1 | 81.9 | 86.1 | 81.2 | 70.7 | 85.5 | 70.0 |
P (%) | 80.8 | 70.6 | 78.9 | 72.5 | 79.0 | 82.2 | 78.3 | 84.8 |
R (%) | 65.9 | 71.6 | 70.9 | 76.9 | 69.6 | 71.5 | 70.5 | 69.9 |
To gain a deeper understanding of the impact of multiple attention mechanisms on the model's focusing capability, we employed heatmap visualization techniques to analyze the attention distribution of the model, as shown in Fig. 8. The heatmap clearly illustrates that the model significantly reduces its focus on the background while concentrating more on the target objects after introducing attention mechanisms. This finding suggests that attention mechanisms effectively guide the model's attention towards regions that are more crucial for the final detection task, thereby enhancing the model's detection accuracy and efficiency.
Comparative analysis shows a significant performance improvement from incorporating multiple attention mechanisms. The heatmap visualizations further confirm that these mechanisms help the model focus on critical information in the images, reducing false detections and enhancing detection accuracy. These findings demonstrate the efficacy of our model design and provide valuable insights for future research in object detection.
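Heatmaps like those in Fig. 8 are typically produced by normalizing feature activations and overlaying them on the input image. The snippet below is a minimal sketch of the normalization step, assuming channel-averaged activations (the specific visualization tooling used here is not detailed):

```python
import numpy as np

def activation_heatmap(feature_map):
    """Collapse a (C, H, W) activation tensor to a [0, 1] heatmap."""
    hm = np.asarray(feature_map).mean(axis=0)  # average over channels
    hm = np.maximum(hm, 0.0)                   # keep positive evidence only
    hm -= hm.min()
    hm /= hm.max() + 1e-8                      # normalize to [0, 1]
    return hm
```

The resulting map is then upsampled to the input resolution and blended with the image for display.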
4.4 Comparison with Other Algorithmic Detection Results
On the NEU-DET, GC10, and MTD datasets, the object detection models compared exhibit different performance characteristics, as shown in Tables 6-8. Key metrics such as parameters, computational complexity, FPS, mAP, precision, and recall differ significantly among the YOLO series (YOLOv6n, YOLOv6s, YOLOv5n, YOLOv5s, YOLOv3-tiny, YOLOv7-tiny, and the focus of this paper, YOLOv8-FCS), Faster-RCNN, and YOLOX. These differences reflect the design philosophies of the models: one-stage models favor speed and a streamlined pipeline, whereas two-stage models prioritize detection accuracy.
On the NEU-DET dataset, YOLOv8-FCS stands out with an impressive mAP50 of 77.3% and mAP50-95 of 44.5%, demonstrating its efficiency and accuracy in handling challenging industrial images. The other YOLO series models, Faster-RCNN, and YOLOX exhibit competitive performance but fall short in precision, recall, or frame rate. On the GC10 dataset, where models face increased challenges, the overall mAP decreases; YOLOv8-FCS again proves its adaptability and superiority with a mAP50 of 65.5% and mAP50-95 of 33.6%. Models such as YOLOv6s and YOLOv5n show gains in precision and recall, indicating their potential in complex environments. On the MTD dataset, YOLOv8-FCS performs exceptionally well, with a mAP50 of 73.8%, showcasing its strong adaptability to multi-object detection tasks.
Table 6
The results of the comparative experiments on the NEU-DET dataset
Methods | Params (M) | FLOPs (G) | FPS | mAP50 (%) | mAP50-95 (%) | P (%) | R (%) |
YOLOv6n[40] | 4.04 | 11.8 | 169.7 | 73.8 | 42.3 | 69.3 | 68.0 |
YOLOv6s[40] | 15.54 | 44.0 | 153.1 | 73.8 | 41.4 | 69.1 | 69.6 |
YOLOv5n[41] | 1.69 | 4.2 | 191.6 | 75.1 | 37.9 | 70.5 | 70.3 |
YOLOv5s[41] | 6.7 | 15.8 | 178.6 | 77.0 | 40.7 | 74.4 | 70.7 |
YOLOv3-tiny[17] | 8.28 | 12.9 | 434.0 | 74.4 | 36.7 | 73.8 | 67.3 |
YOLOv7-tiny[19] | 5.74 | 13.1 | 78.6 | 73.2 | 36.4 | 73.1 | 66.5 |
Faster-RCNN[9] | 41.37 | 23.1 | 21.1 | 74.5 | 39.1 | - | - |
YOLOX-s[42] | 8.94 | 3.28 | 69.3 | 67.8 | 34.1 | - | - |
YOLOv8-FCS | 4.08 | 11.2 | 93.0 | 77.3 | 44.5 | 71.7 | 72.7 |
Table 7
The results of the comparative experiments on the GC10 dataset
Methods | Params (M) | FLOPs (G) | FPS | mAP50 (%) | mAP50-95 (%) | P (%) | R (%) |
YOLOv6n[40] | 4.04 | 11.8 | 116.0 | 62.3 | 31.1 | 68.0 | 59.2 |
YOLOv6s[40] | 15.54 | 44.0 | 90.6 | 65.0 | 32.5 | 72.3 | 60.7 |
YOLOv5n[41] | 1.69 | 4.2 | 108.7 | 64.8 | 32.8 | 65.3 | 62.6 |
YOLOv5s[41] | 6.7 | 15.8 | 87.9 | 64.0 | 32.5 | 65.0 | 63.1 |
YOLOv3-tiny[17] | 8.29 | 12.9 | 115.4 | 57.7 | 27.4 | 56.2 | 59.0 |
YOLOv7-tiny[19] | 5.75 | 13.1 | 55.9 | 62.3 | 31.0 | 59.1 | 63.7 |
Faster-RCNN[9] | 41.39 | 90.9 | 20.9 | 65.5 | 32.2 | - | - |
YOLOX-s[42] | 8.94 | 26.78 | 55.9 | 57.0 | 27.7 | - | - |
YOLOv8-FCS | 4.08 | 11.2 | 57.0 | 65.5 | 33.6 | 66.9 | 62.9 |
Table 8
The results of the comparative experiments on the MTD dataset
Methods | Params (M) | FLOPs (G) | FPS | mAP50 (%) | mAP50-95 (%) | P (%) | R (%) |
YOLOv6n[40] | 4.04 | 11.8 | 121.1 | 70.2 | 49.2 | 82.0 | 63.5 |
YOLOv6s[40] | 15.54 | 44.0 | 85.8 | 71.5 | 50.5 | 74.6 | 72.5 |
YOLOv5n[41] | 1.68 | 4.1 | 168.4 | 66.4 | 42.1 | 72.1 | 64.0 |
YOLOv5s[41] | 6.7 | 15.8 | 149.0 | 71.6 | 47.2 | 72.0 | 72.5 |
YOLOv3-tiny[17] | 8.27 | 12.9 | 356.7 | 70.4 | 46.4 | 78.1 | 64.6 |
YOLOv7-tiny[19] | 5.74 | 13.1 | 57.8 | 69.1 | 45.6 | 78.2 | 66.9 |
Faster-RCNN[9] | 41.37 | 90.9 | 37.9 | 66.3 | 43.1 | - | - |
YOLOX-s[42] | 8.94 | 26.77 | 19.0 | 68.8 | 44.8 | - | - |
YOLOv8-FCS | 4.08 | 11.2 | 70.0 | 73.8 | 50.3 | 84.8 | 69.9 |
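FPS figures such as those reported above are commonly obtained by timing repeated forward passes after a warm-up phase. The helper below is a hedged sketch of that procedure, not the benchmarking code used in our experiments:

```python
import time

def measure_fps(infer, batch, warmup=10, runs=100):
    """Estimate frames per second from repeated timed inference calls."""
    for _ in range(warmup):
        infer(batch)  # warm-up: exclude one-off setup (JIT, cache) cost
    start = time.perf_counter()
    for _ in range(runs):
        infer(batch)
    elapsed = time.perf_counter() - start
    return runs / elapsed
```

On GPU, an accurate variant would additionally synchronize the device before reading the clock, since kernel launches are asynchronous.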
The YOLOv8-FCS model significantly enhances detection accuracy while maintaining efficiency, owing to its modest parameter count and computational complexity combined with strong FPS, mAP, precision, and recall. These results demonstrate that YOLOv8-FCS is a powerful visual detection model that performs well across tasks and environments, and its success further validates the potential of one-stage detection models in deep learning. The other YOLO series models, Faster-RCNN, and YOLOX also demonstrate robust functionality and application potential in their respective domains, although, in direct comparison with YOLOv8-FCS, they still leave room for improvement on specific key metrics. Overall, the performance of these models not only reflects the latest advancements in object detection technology but also provides valuable insights for future research and applications.
Figure 9 showcases a comparative analysis of partial detection results between the YOLOv8-FCS and YOLOv8 models across three datasets (NEU-DET, GC10, and MTD). The comparison visually confirms the performance improvement brought by the model enhancements, particularly in reducing missed detections and false positives.
Figure 9 also shows that the YOLOv8-FCS model achieves a marked reduction in the false-negative rate on all three datasets after incorporating multiple attention mechanisms. This progress can be attributed to the ability of attention mechanisms to help the model focus on target regions, thereby improving the detection of small objects and objects in complex backgrounds. Notably, on the MTD dataset, which frequently encompasses minor or subtle defects, the enhanced model recognizes these targets more reliably, indicating a strengthened capacity for handling challenging scenarios.
4.5 Qualitative Results
Figure 10 depicts the qualitative results of the YOLOv8-FCS algorithm on three datasets. The figure provides compelling evidence that the YOLOv8-FCS model achieves accurate recognition and precise localization of defects in steel surface images.