4.1 Datasets
In our study, we used two key datasets, the Cityscapes dataset and the KITTI Vision Benchmark Suite, to validate the performance of the proposed object detection algorithm in intelligent urban traffic scenarios. Both datasets are widely used and offer distinct, complementary characteristics.
The Cityscapes dataset provided high-resolution urban scene images distinguished by detailed, fine-grained annotations (Cordts et al., 2015). We leveraged this rich data to validate the algorithm's capability for accurate detection and precise localization of objects in complex urban traffic environments, including analyses of varying traffic flows, road structures, and pedestrian activities to assess robustness and generalizability.
The KITTI Vision Benchmark Suite, in turn, offered images from onboard sensors together with LiDAR point cloud data covering diverse weather and traffic scenarios (Behley et al., 2019). Using this multimodal information, we validated the algorithm's performance under different lighting conditions, dynamic traffic flows, and diverse scenes, which helps evaluate its adaptability to real-world driving and intelligent urban monitoring.
By combining validations from these two datasets, our research aims to comprehensively understand the feasibility and effectiveness of the proposed object detection algorithm in real-world urban traffic environments. Such comprehensive validation contributes to ensuring the algorithm’s reliability in widespread applications and provides robust support for further research in the field of intelligent urban traffic management.
4.2 Experimental Environment
In our experiments, we used a personal computer (PC) as the computing platform. The CPU was an Intel Core i9-9900K running at 3.60 GHz, and two NVIDIA RTX 3090 graphics cards served as GPUs, with 32 GB of system RAM and 11 GB of GDDR6 video memory. On the software side, we ran Windows 10 with Python 3.8 and relied on libraries such as matplotlib 3.3.4 and OpenCV 4.5.5; the CUDA version was 10.0. This hardware and software setup ensured the stability and efficiency of our experiments. For specific details of the experimental environment, please refer to Table 1.
Table 1: Experimental environment
![](https://myfiles.space/user_files/122228_c8a1650c59388082/122228_custom_files/img1706529622.png)
4.3 Evaluation Metrics
In this section, we present the evaluation metrics employed to assess the performance of our proposed model. A comprehensive set of metrics, including precision, recall, F1 score, mean average precision (mAP), mean average precision at different Intersection over Union (IoU) thresholds (mAP@[IoU]), and frame rate, is utilized to provide a thorough analysis of the model’s effectiveness.
Precision, Recall, and F1 Score: Precision, recall, and the F1 score are fundamental metrics for binary classification tasks. Precision represents the ratio of correctly predicted positive instances to the total predicted positives, recall measures the ratio of correctly predicted positive instances to the actual positives, and the F1 score is the harmonic mean of precision and recall. These metrics collectively offer insights into the model’s ability to accurately identify positive instances while minimizing false positives and false negatives.
$$\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$$
$$\text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$
$$\text{F1 Score}=2\times \frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
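As a concrete illustration of these formulas, the minimal Python sketch below computes all three metrics from raw true-positive (TP), false-positive (FP), and false-negative (FN) counts; the counts in the example call are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Hypothetical counts: 95 correct detections, 5 false alarms, 10 misses
p, r, f1 = precision_recall_f1(tp=95, fp=5, fn=10)
print(f"precision={p:.4f}, recall={r:.4f}, F1={f1:.4f}")
```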
Mean Average Precision (mAP): mAP is a widely used metric for evaluating object detection models. For each category, the precision-recall curve is traced over different confidence thresholds, and the area under the curve (AUC) gives that category's average precision (AP); mAP then averages the APs across categories. A higher mAP value indicates superior detection performance.
$$mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i}$$
where \(N\) is the total number of categories and \(AP_{i}\) is the average precision of the \(i\)-th category.
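To make the computation concrete, the sketch below derives the AP of one category from a list of detection confidences and per-detection true-positive flags (the matching of detections to ground truth at a fixed IoU threshold, e.g. 0.5 for mAP@0.5, is assumed to have been done upstream), then averages per-category APs into mAP. For simplicity it integrates the raw precision-recall curve with the trapezoidal rule rather than the interpolated AP used by the official benchmarks.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one category: area under the precision-recall curve
    obtained by sweeping the detection-confidence threshold."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # most confident first
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)             # true positives accumulated so far
    cum_fp = np.cumsum(1.0 - tp)       # false positives accumulated so far
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / (cum_tp + cum_fp)
    return float(np.trapz(precision, recall))  # simplified all-point integration

def mean_average_precision(per_class_aps):
    """mAP = (1/N) * sum_i AP_i over the N categories."""
    return sum(per_class_aps) / len(per_class_aps)

# Toy example: 4 detections for one category, 3 ground-truth objects
ap = average_precision(scores=[0.9, 0.8, 0.7, 0.6], is_tp=[1, 1, 0, 1], num_gt=3)
print(mean_average_precision([ap]))
```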
Frame Rate: Frame rate measures the number of frames processed per unit of time, indicating the model’s efficiency in real-time applications. A higher frame rate is desirable for applications requiring rapid and continuous processing of video frames.
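As an illustration, frame rate can be estimated by timing an inference callable over a batch of frames, as in the sketch below; `infer` and the synthetic frames are placeholders, and for GPU inference one would additionally synchronize the device (e.g., `torch.cuda.synchronize()`) before reading the clock.

```python
import time

def measure_fps(infer, frames, warmup=10):
    """Estimate frames processed per second of wall-clock time for a
    single-frame inference callable `infer`."""
    for frame in frames[:warmup]:      # warm-up iterations, not timed
        infer(frame)
    timed = frames[warmup:]
    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed

# Toy example with a dummy "model" and synthetic frames
fps = measure_fps(lambda f: sum(f), [[0.0] * 1000 for _ in range(110)])
print(f"{fps:.1f} FPS")
```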
These evaluation metrics collectively offer a comprehensive assessment of our model’s precision, recall, detection accuracy, and efficiency, providing valuable insights into its overall performance in the context of intelligent city traffic monitoring.
4.4 Experimental Details
Step 1: Data preprocessing
In the data preprocessing stage, we processed the Cityscapes and KITTI Vision Benchmark Suite datasets to ensure reliable experiments and effective models. For Cityscapes, approximately 50,000 samples were selected and partitioned into training, validation, and testing sets at an 80%/10%/10% ratio: 40,000 training samples, 5,000 validation samples, and 5,000 testing samples. This partitioning leverages the diversity of Cityscapes, ensuring that the model encounters a sufficient number of samples during training while being challenged with different scenarios during validation and testing. For KITTI, approximately 30,000 samples were selected and split at the same 80%/10%/10% ratio: 24,000 training, 3,000 validation, and 3,000 testing samples. This strategy maintains dataset balance and helps verify the model's generalization across datasets.
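A minimal sketch of such an 80%/10%/10% partition is shown below, assuming the sample identifiers are available as a flat list; the paper does not specify whether shuffling or a fixed random seed was used, so both are illustrative choices here.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle a sample list and partition it into train/val/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = samples[:]                    # avoid mutating the caller's list
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# e.g. 50,000 Cityscapes sample IDs -> 40,000 / 5,000 / 5,000
train, val, test = split_dataset(list(range(50_000)))
print(len(train), len(val), len(test))  # 40000 5000 5000
```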
Step 2: Model training
During the network parameter configuration phase, we carefully tuned key parameters to ensure successful training. The learning rate was set to 0.001 through iterative empirical adjustment to ensure robust convergence. We adopted the Adam optimizer, whose adaptive learning rates facilitate faster and more efficient convergence. The batch size was set to 16, striking a balance between computational efficiency and model stability, and weight decay of 0.0005 was introduced to prevent overfitting. Training ran for 300 epochs; the model contains 11,136,374 parameters across 225 layers. These settings ensured stability and effectiveness throughout training and inference for intelligent urban traffic monitoring tasks. The parameter settings are summarized in Table 2.
Table 2: Model parameter settings

| Parameter | Value |
| --- | --- |
| Learning rate (lr) | 0.001 |
| Batch size | 16 |
| Weight decay | 0.0005 |
| Epochs | 300 |
| Layers | 225 |
| Parameters | 11,136,374 |
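For illustration, the sketch below wires the Table 2 hyperparameters into a training loop. PyTorch is assumed; the tiny stand-in network and mean-squared-error loss are placeholders for the actual YOLOv8-DSAF architecture and its detection loss, and only two epochs of synthetic data are run for the demo.

```python
import torch
from torch import nn

# Stand-in network; the real YOLOv8-DSAF architecture is defined elsewhere.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4),
)

# Adam optimizer configured with the Table 2 hyperparameters.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)
loss_fn = nn.MSELoss()  # placeholder for the actual detection loss

BATCH_SIZE, EPOCHS = 16, 300
for epoch in range(2):  # two demo epochs; the paper trains for EPOCHS = 300
    images = torch.randn(BATCH_SIZE, 3, 64, 64)   # synthetic input batch
    targets = torch.randn(BATCH_SIZE, 4)          # synthetic targets
    optimizer.zero_grad()
    loss = loss_fn(model(images), targets)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```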
To validate the convergence of the optimized model on the datasets, we compared key performance indicators during training of the YOLOv8-DSAF model, as shown in Fig. 5. This analysis examines the loss curves for bounding box loss, confidence loss, and category loss, alongside the convergence curves of four performance indicators: precision, recall, mAP@0.5, and mAP@[0.5:0.95]. A comprehensive comparison of these convergence indicators gives a thorough picture of the optimized model's training behavior, providing robust experimental support for model improvement and practical applications.
4.5 Experimental Results and Analysis
As shown in Table 3, we conducted a comparative analysis on the Cityscapes dataset, evaluating Precision, F1 Score, mAP@0.5, and processing speed (FPS) for SSD, Faster-RCNN, FCOS, YOLOv5n, YOLOv7-tiny, YOLOv8n, and our proposed YOLOv8-DSAF. The results demonstrate the outstanding performance of YOLOv8-DSAF: it surpasses all other models in Precision (97.58%), F1 Score (95.48%), and mAP@0.5 (96.18%), while also achieving the highest processing speed at 192 FPS. Notably, the model excels in both accuracy and efficiency, highlighting its suitability for smart city traffic scenarios.
We further validated the model on the KITTI dataset, as shown in Table 4. The results likewise show that our proposed model excels in both accuracy and FPS, surpassing the other models and further corroborating its performance in complex urban traffic scenarios.
Table 3: Performance comparison of different models on the Cityscapes dataset

| Model | Precision (%) | F1 Score (%) | mAP@0.5 (%) | Speed (FPS) |
| --- | --- | --- | --- | --- |
| SSD (You et al., 2020) | 79.76 | 75.48 | 77.68 | 11 |
| Faster-RCNN (Han et al., 2019) | 85.35 | 73.43 | 70.95 | 33 |
| FCOS (Zhang & Zeng, 2020) | 92.44 | 73.48 | 80.68 | 42 |
| YOLOv5n (A. Li et al., 2023) | 87.02 | 85.45 | 81.75 | 63 |
| YOLOv7-tiny (S. Li et al., 2023) | 88.78 | 83.48 | 86.22 | 83 |
| YOLOv8n (Du et al., 2023) | 91.68 | 87.49 | 87.21 | 178 |
| YOLOv8-DSAF | 97.58 | 95.48 | 96.18 | 192 |
Table 4: Performance comparison of different models on the KITTI dataset

| Model | Precision (%) | F1 Score (%) | mAP@0.5 (%) | Speed (FPS) |
| --- | --- | --- | --- | --- |
| SSD (You et al., 2020) | 77.58 | 73.25 | 75.45 | 9 |
| Faster-RCNN (Han et al., 2019) | 82.76 | 71.28 | 68.74 | 31 |
| FCOS (Zhang & Zeng, 2020) | 89.72 | 70.84 | 78.49 | 40 |
| YOLOv5n (A. Li et al., 2023) | 84.88 | 83.06 | 79.53 | 60 |
| YOLOv7-tiny (S. Li et al., 2023) | 86.55 | 80.77 | 84.02 | 81 |
| YOLOv8n (Du et al., 2023) | 89.44 | 85.22 | 85.08 | 176 |
| YOLOv8-DSAF | 95.38 | 92.68 | 93.98 | 190 |
4.6 Ablation Experiments
We conducted ablation experiments on the Cityscapes and KITTI datasets, using YOLOv8n as the baseline model. The results on Cityscapes are presented in Table 5: YOLOv8n achieved 91.68% precision, 87.28% mAP@0.5, and a processing speed of 178 FPS. Variant1 introduced DSConv, improving precision and mAP@0.5 but reducing the speed to 141 FPS. Variant2, building on DSConv, added DPAG, yielding a slight further improvement in precision while restoring the speed to 179 FPS. Finally, Variant3 incorporated FEM on top of the first two variants, further increasing precision and mAP@0.5 while sustaining a high speed of 192 FPS. The ablation results on KITTI are shown in Table 6: Variant1 increased precision and mAP@0.5 with DSConv but decreased the speed to 139 FPS; Variant2, adding DPAG, slightly improved precision while maintaining a high speed of 177 FPS; and Variant3, with DSConv, DPAG, and FEM, further increased precision and mAP@0.5 while reaching 189 FPS.
These experimental results, as shown in Fig. 6, demonstrate that the incremental introduction of DSConv, DPAG, and FEM positively influences model performance. The balance between accuracy and processing speed showcases the feasibility and efficiency of our model in practical applications. This provides strong support for the application of our model in the field of vehicle detection.
Table 5: Ablation experiments on the Cityscapes dataset

| Model | DSConv | DPAG | FEM | Precision (%) | mAP@0.5 (%) | Speed (FPS) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8n | - | - | - | 91.68 | 87.28 | 178 |
| Variant1 | ✓ | - | - | 94.08 | 92.28 | 141 |
| Variant2 | ✓ | ✓ | - | 94.28 | 92.28 | 179 |
| Variant3 | ✓ | ✓ | ✓ | 97.58 | 96.18 | 192 |
Table 6: Ablation experiments on the KITTI dataset

| Model | DSConv | DPAG | FEM | Precision (%) | mAP@0.5 (%) | Speed (FPS) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8n | - | - | - | 90.0 | 85.6 | 176 |
| Variant1 | ✓ | - | - | 92.4 | 90.6 | 139 |
| Variant2 | ✓ | ✓ | - | 92.6 | 90.6 | 177 |
| Variant3 | ✓ | ✓ | ✓ | 94.9 | 93.5 | 189 |
4.7 Presentation of Results
The examples showcased in Fig. 7 illustrate the advantages of our model in reasoning about intricate scenes and contextual details. Across these scenarios, the model exhibits strong information processing and reasoning capabilities, accurately and swiftly grasping multi-layered, multidimensional contexts. It goes beyond mere object recognition to perform comprehensive reasoning in complex environments, maintaining high accuracy under diverse factors such as varying lighting conditions, traffic situations, and crowd movements. This proficiency holds practical value for real-time, precise information processing, particularly in urban management.