4.1 Datasets
In our study, we used two key datasets, the Cityscapes dataset and the KITTI Vision Benchmark Suite, to validate the performance of the proposed object detection algorithm in intelligent urban traffic scenarios. Both datasets are widely used and offer distinct, complementary characteristics.
The Cityscapes dataset provided high-resolution urban scene images distinguished by detailed, fine-grained annotations (Cordts et al., 2015). We leveraged this rich data to validate the algorithm's capability for accurate detection and precise localization of objects in complex urban traffic environments, including analyses of varying traffic flows, road structures, and pedestrian activities to assess robustness and generalizability.
The KITTI Vision Benchmark Suite, in turn, offered images from onboard sensors together with LiDAR point cloud data covering diverse weather and traffic scenarios (Behley et al., 2019). Using this multimodal information, we validated the algorithm's performance under different lighting conditions, dynamic traffic flows, and diverse scenes, which helps evaluate its adaptability to real-world driving and intelligent urban monitoring.
By combining validations from these two datasets, our research aims to comprehensively understand the feasibility and effectiveness of the proposed object detection algorithm in real-world urban traffic environments. Such comprehensive validation contributes to ensuring the algorithm’s reliability in widespread applications and provides robust support for further research in the field of intelligent urban traffic management.
4.2 Experimental Environment
In our experiments, we used a personal computer (PC) as the computing platform. The CPU was an Intel Core i9-9900K running at 3.60 GHz, and two NVIDIA RTX 3090 graphics cards served as GPUs, with 32 GB of system RAM and 11 GB of GDDR6 video memory. On the software side, we ran Windows 10 with Python 3.8 and relied on libraries such as matplotlib 3.3.4 and OpenCV 4.5.5; the CUDA version was 10.0. This hardware and software setup ensured the stability and efficiency of our experiments. For specific details of the experimental environment, please refer to Table 1.
Table 1: Experimental environment
![](https://myfiles.space/user_files/122228_c8a1650c59388082/122228_custom_files/img1706529622.png)
4.3 Evaluation Metrics
In this section, we present the evaluation metrics employed to assess the performance of our proposed model. A comprehensive set of metrics, including precision, recall, F1 score, mean average precision (mAP), mean average precision at different Intersection over Union (IoU) thresholds (mAP@[IoU]), and frame rate, is utilized to provide a thorough analysis of the model’s effectiveness.
Precision, Recall, and F1 Score: Precision, recall, and the F1 score are fundamental metrics for binary classification tasks. Precision represents the ratio of correctly predicted positive instances to the total predicted positives, recall measures the ratio of correctly predicted positive instances to the actual positives, and the F1 score is the harmonic mean of precision and recall. These metrics collectively offer insights into the model’s ability to accurately identify positive instances while minimizing false positives and false negatives.
$$\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$$
$$\text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$
$$\text{F1 Score}=2\times \frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
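As a concrete illustration of these formulas, the minimal Python sketch below computes all three metrics from raw true-positive (TP), false-positive (FP), and false-negative (FN) counts; the counts in the example call are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Hypothetical counts: 95 correct detections, 5 false alarms, 10 misses
p, r, f1 = precision_recall_f1(tp=95, fp=5, fn=10)
print(f"precision={p:.4f}, recall={r:.4f}, F1={f1:.4f}")
```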
Mean Average Precision (mAP): mAP is a widely used metric for evaluating object detection models. For each category, the precision-recall curve is traced over different confidence thresholds, and the area under the curve (AUC) gives that category's average precision (AP); mAP then averages the APs across categories. A higher mAP value indicates superior detection performance.
$$mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i}$$
where \(N\) is the total number of categories and \(AP_{i}\) is the average precision of the \(i\)-th category.
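To make the computation concrete, the sketch below derives the AP of one category from a list of detection confidences and per-detection true-positive flags (the matching of detections to ground truth at a fixed IoU threshold, e.g. 0.5 for mAP@0.5, is assumed to have been done upstream), then averages per-category APs into mAP. For simplicity it integrates the raw precision-recall curve with the trapezoidal rule rather than the interpolated AP used by the official benchmarks.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one category: area under the precision-recall curve
    obtained by sweeping the detection-confidence threshold."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # most confident first
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)             # true positives accumulated so far
    cum_fp = np.cumsum(1.0 - tp)       # false positives accumulated so far
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / (cum_tp + cum_fp)
    return float(np.trapz(precision, recall))  # simplified all-point integration

def mean_average_precision(per_class_aps):
    """mAP = (1/N) * sum_i AP_i over the N categories."""
    return sum(per_class_aps) / len(per_class_aps)

# Toy example: 4 detections for one category, 3 ground-truth objects
ap = average_precision(scores=[0.9, 0.8, 0.7, 0.6], is_tp=[1, 1, 0, 1], num_gt=3)
print(mean_average_precision([ap]))
```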
Frame Rate: Frame rate measures the number of frames processed per unit of time, indicating the model’s efficiency in real-time applications. A higher frame rate is desirable for applications requiring rapid and continuous processing of video frames.
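As an illustration, frame rate can be estimated by timing an inference callable over a batch of frames, as in the sketch below; `infer` and the synthetic frames are placeholders, and for GPU inference one would additionally synchronize the device (e.g., `torch.cuda.synchronize()`) before reading the clock.

```python
import time

def measure_fps(infer, frames, warmup=10):
    """Estimate frames processed per second of wall-clock time for a
    single-frame inference callable `infer`."""
    for frame in frames[:warmup]:      # warm-up iterations, not timed
        infer(frame)
    timed = frames[warmup:]
    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed

# Toy example with a dummy "model" and synthetic frames
fps = measure_fps(lambda f: sum(f), [[0.0] * 1000 for _ in range(110)])
print(f"{fps:.1f} FPS")
```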
These evaluation metrics collectively offer a comprehensive assessment of our model’s precision, recall, detection accuracy, and efficiency, providing valuable insights into its overall performance in the context of intelligent city traffic monitoring.
4.4 Experimental Details
Step 1: Data preprocessing
In the data preprocessing stage, we processed the Cityscapes and KITTI Vision Benchmark Suite datasets to ensure reliable experiments and effective models. For Cityscapes, approximately 50,000 samples were selected and partitioned into training, validation, and testing sets at an 80%/10%/10% ratio: 40,000 training samples, 5,000 validation samples, and 5,000 testing samples. This partitioning leverages the diversity of Cityscapes, ensuring that the model encounters a sufficient number of samples during training while being challenged with different scenarios during validation and testing. For KITTI, approximately 30,000 samples were selected and split at the same 80%/10%/10% ratio: 24,000 training, 3,000 validation, and 3,000 testing samples. This strategy maintains dataset balance and helps verify the model's generalization across datasets.
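A minimal sketch of such an 80%/10%/10% partition is shown below, assuming the sample identifiers are available as a flat list; the paper does not specify whether shuffling or a fixed random seed was used, so both are illustrative choices here.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle a sample list and partition it into train/val/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = samples[:]                    # avoid mutating the caller's list
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# e.g. 50,000 Cityscapes sample IDs -> 40,000 / 5,000 / 5,000
train, val, test = split_dataset(list(range(50_000)))
print(len(train), len(val), len(test))  # 40000 5000 5000
```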
Step 2: Model training
During the network parameter configuration phase, we carefully tuned key parameters to ensure successful training. The learning rate was set to 0.001 through iterative empirical adjustment to ensure robust convergence. We adopted the Adam optimizer, whose adaptive learning rates facilitate faster and more efficient convergence. The batch size was set to 16, striking a balance between computational efficiency and model stability, and weight decay of 0.0005 was introduced to prevent overfitting. Training ran for 300 epochs; the model contains 11,136,374 parameters across 225 layers. These settings ensured stability and effectiveness throughout training and inference for intelligent urban traffic monitoring tasks. The parameter settings are summarized in Table 2.
Table 2: Model parameter settings

| Parameter | Value |
| --- | --- |
| Learning rate (lr) | 0.001 |
| Batch size | 16 |
| Weight decay | 0.0005 |
| Epochs | 300 |
| Layers | 225 |
| Parameters | 11,136,374 |
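For illustration, the sketch below wires the Table 2 hyperparameters into a training loop. PyTorch is assumed; the tiny stand-in network and mean-squared-error loss are placeholders for the actual YOLOv8-DSAF architecture and its detection loss, and only two epochs of synthetic data are run for the demo.

```python
import torch
from torch import nn

# Stand-in network; the real YOLOv8-DSAF architecture is defined elsewhere.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4),
)

# Adam optimizer configured with the Table 2 hyperparameters.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)
loss_fn = nn.MSELoss()  # placeholder for the actual detection loss

BATCH_SIZE, EPOCHS = 16, 300
for epoch in range(2):  # two demo epochs; the paper trains for EPOCHS = 300
    images = torch.randn(BATCH_SIZE, 3, 64, 64)   # synthetic input batch
    targets = torch.randn(BATCH_SIZE, 4)          # synthetic targets
    optimizer.zero_grad()
    loss = loss_fn(model(images), targets)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```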
To validate the convergence of the optimized model on the datasets, we compared key performance indicators during training of the YOLOv8-DSAF model, as shown in Fig. 5. This analysis examines the loss curves for bounding box loss, confidence loss, and category loss, alongside the convergence curves of four performance indicators: precision, recall, mAP@0.5, and mAP@[0.5:0.95]. A comprehensive comparison of these convergence indicators gives a thorough picture of the optimized model's training behavior, providing robust experimental support for model improvement and practical applications.
4.5 Experimental Results and Analysis
As shown in Table 3, we conducted a comparative analysis on the Cityscapes dataset, evaluating Precision, F1 Score, mAP@0.5, and processing speed (FPS) for SSD, Faster-RCNN, FCOS, YOLOv5n, YOLOv7-tiny, YOLOv8n, and our proposed YOLOv8-DSAF. The results demonstrate the outstanding performance of YOLOv8-DSAF: it surpasses all other models in Precision (97.58%), F1 Score (95.48%), and mAP@0.5 (96.18%), while also achieving the highest processing speed at 192 FPS. Notably, the model excels in both accuracy and efficiency, highlighting its suitability for smart city traffic scenarios.
We further validated the model on the KITTI dataset, as shown in Table 4. The results likewise show that our proposed model excels in both accuracy and FPS, surpassing the other models and further corroborating its performance in complex urban traffic scenarios.
Table 3: Performance comparison of different models on the Cityscapes dataset

| Model | Precision (%) | F1 Score (%) | mAP@0.5 (%) | Speed (FPS) |
| --- | --- | --- | --- | --- |
| SSD (You et al., 2020) | 79.76 | 75.48 | 77.68 | 11 |
| Faster-RCNN (Han et al., 2019) | 85.35 | 73.43 | 70.95 | 33 |
| FCOS (Zhang & Zeng, 2020) | 92.44 | 73.48 | 80.68 | 42 |
| YOLOv5n (A. Li et al., 2023) | 87.02 | 85.45 | 81.75 | 63 |
| YOLOv7-tiny (S. Li et al., 2023) | 88.78 | 83.48 | 86.22 | 83 |
| YOLOv8n (Du et al., 2023) | 91.68 | 87.49 | 87.21 | 178 |
| YOLOv8-DSAF | 97.58 | 95.48 | 96.18 | 192 |
Table 4: Performance comparison of different models on the KITTI dataset

| Model | Precision (%) | F1 Score (%) | mAP@0.5 (%) | Speed (FPS) |
| --- | --- | --- | --- | --- |
| SSD (You et al., 2020) | 77.58 | 73.25 | 75.45 | 9 |
| Faster-RCNN (Han et al., 2019) | 82.76 | 71.28 | 68.74 | 31 |
| FCOS (Zhang & Zeng, 2020) | 89.72 | 70.84 | 78.49 | 40 |
| YOLOv5n (A. Li et al., 2023) | 84.88 | 83.06 | 79.53 | 60 |
| YOLOv7-tiny (S. Li et al., 2023) | 86.55 | 80.77 | 84.02 | 81 |
| YOLOv8n (Du et al., 2023) | 89.44 | 85.22 | 85.08 | 176 |
| YOLOv8-DSAF | 95.38 | 92.68 | 93.98 | 190 |
4.6 Ablation Experiments
We conducted ablation experiments on the Cityscapes and KITTI datasets, using YOLOv8n as the baseline model. The results on Cityscapes are presented in Table 5: YOLOv8n achieved 91.68% precision, 87.28% mAP@0.5, and a processing speed of 178 FPS. Variant1 introduced DSConv, improving precision and mAP@0.5 but reducing the speed to 141 FPS. Variant2, building on DSConv, added DPAG, yielding a slight further improvement in precision while restoring the speed to 179 FPS. Finally, Variant3 incorporated FEM on top of the first two variants, further increasing precision and mAP@0.5 while sustaining a high speed of 192 FPS. The ablation results on KITTI are shown in Table 6: Variant1 increased precision and mAP@0.5 with DSConv but decreased the speed to 139 FPS; Variant2, adding DPAG, slightly improved precision while maintaining a high speed of 177 FPS; and Variant3, with DSConv, DPAG, and FEM, further increased precision and mAP@0.5 while reaching 189 FPS.
These experimental results, as shown in Fig. 6, demonstrate that the incremental introduction of DSConv, DPAG, and FEM positively influences model performance. The balance between accuracy and processing speed showcases the feasibility and efficiency of our model in practical applications. This provides strong support for the application of our model in the field of vehicle detection.
Table 5: Ablation experiments on the Cityscapes dataset

| Model | DSConv | DPAG | FEM | Precision (%) | mAP@0.5 (%) | Speed (FPS) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8n | - | - | - | 91.68 | 87.28 | 178 |
| Variant1 | ✓ | - | - | 94.08 | 92.28 | 141 |
| Variant2 | ✓ | ✓ | - | 94.28 | 92.28 | 179 |
| Variant3 | ✓ | ✓ | ✓ | 97.58 | 96.18 | 192 |
Table 6: Ablation experiments on the KITTI dataset

| Model | DSConv | DPAG | FEM | Precision (%) | mAP@0.5 (%) | Speed (FPS) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8n | - | - | - | 90.0 | 85.6 | 176 |
| Variant1 | ✓ | - | - | 92.4 | 90.6 | 139 |
| Variant2 | ✓ | ✓ | - | 92.6 | 90.6 | 177 |
| Variant3 | ✓ | ✓ | ✓ | 94.9 | 93.5 | 189 |
4.7 Presentation of Results
The examples showcased in Fig. 7 illustrate the advantages of our model in reasoning about intricate scenes and contextual details. Across these scenarios, the model exhibits strong information processing and reasoning capabilities, accurately and swiftly grasping multi-layered, multidimensional contexts. It goes beyond mere object recognition to perform comprehensive reasoning in complex environments, maintaining high accuracy under diverse factors such as varying lighting conditions, traffic situations, and crowd movements. This proficiency holds practical value for real-time, precise information processing, particularly in urban management.