YOLOv9s, YOLOv9m, and YOLOv9c were used for preliminary analysis, and YOLOv9c was selected for the study. Its input size, mAPval, speed on CPU (ONNX), speed on an A100 GPU (TensorRT), parameters (millions), and FLOPs (billions) are given in Table 1. Brief characteristics of the models are given in Table 2.
Table 1
Model features of YOLOv9c
| Model | Size (px) | mAPval 50–95 | Speed CPU ONNX (ms) | Speed A100 TensorRT (ms) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLOv9c | 640 | 53.0 | 239.1 | 2.05 | 25.3 | 102.1 |
px: pixels, ms: milliseconds, M: millions, G: billions
Table 1 summarizes the model features of YOLOv9c, presenting key metrics: input size in pixels (px), mean Average Precision on the validation set (mAPval), inference speed in milliseconds (ms) on CPU with ONNX and on an A100 GPU with TensorRT, parameters in millions (M), and floating-point operations (FLOPs) in billions (G). YOLOv9c operates at a resolution of 640x640 pixels with a mAPval of 53.0. Its inference speed is 239.1 ms on CPU using ONNX and 2.05 ms on an A100 with TensorRT. With 25.3 million parameters and 102.1 billion FLOPs, YOLOv9c offers a balance between accuracy and computational efficiency for object detection tasks. Table 2 shows brief characteristics of YOLOv9c compared with other models, such as SSD MobileNet V1, SSD MobileNet V2, and Faster-RCNN ResNet-50 V1, which have previously been applied to stenosis detection [38].
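For illustration, the following minimal sketch shows how a pretrained YOLOv9c model can be loaded and run on a single image with the Ultralytics Python package; the package choice, weight file, and image path are assumptions for illustration, not necessarily the study's exact pipeline.

```python
from ultralytics import YOLO

# Load pretrained YOLOv9c weights (downloaded automatically by Ultralytics).
model = YOLO("yolov9c.pt")

# Run inference on one angiogram frame; the file name is a placeholder.
results = model.predict("angiogram_frame.png", imgsz=640, conf=0.25)

# Each result holds the detected boxes with coordinates, confidence, and class.
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)
```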
Table 2
Brief characteristics of the models
| Model | Inference time, ms | mAP @[0.5:0.95] | Weights, M | Model size, MB |
|---|---|---|---|---|
| YOLOv9c (Proposed Model) | 18 | 52.3 | 25.3 | 49 |
| SSD MobileNet V1 | 56 | 32 | 4.2 | 44 |
| SSD MobileNet V2 | 31 | 22 | 6.1 | 19 |
| SSD ResNet-50 V1 | 76 | 35 | 25.6 | 127 |
| Faster-RCNN ResNet-50 V1 | 89 | 30 | 25.6 | 114 |
| RFCN ResNet-101 V2 | 92 | 30 | 44.7 | 199 |
| Faster-RCNN ResNet-101 V2 | 106 | 32 | 44.7 | 190 |
| Faster-RCNN Inception ResNet V2 | 620 | 37 | 55.9 | 241 |
| Faster-RCNN NASNet | 540 | - | 88.9 | 416 |
ms: milliseconds, M: millions, MB: megabytes
Table 2 provides a concise overview of the object detection models' characteristics: inference time, mean Average Precision (mAP) over the range of IoU thresholds [0.5:0.95], weights in millions (M), and model size in megabytes (MB). The proposed YOLOv9c model stands out with an inference time of 18 milliseconds and a competitive mAP of 52.3, with 25.3 million parameters and a model size of 49 MB. Compared with models such as SSD MobileNet V1, SSD MobileNet V2, and Faster-RCNN ResNet-50 V1, YOLOv9c offers superior performance in both speed and accuracy, making it an efficient choice for real-time object detection applications.
In this study, the proposed YOLOv9c model is utilized as an object detection algorithm aimed at achieving high accuracy and real-time performance in detecting objects within images [42]. YOLOv9c represents a specific variant of the YOLO model that is tailored to meet the study's objectives [43]. Additionally, the study incorporates other well-established object detection models, such as SSD MobileNet V1, SSD MobileNet V2, SSD ResNet-50 V1, Faster-RCNN ResNet-50 V1, RFCN ResNet-101 V2, Faster-RCNN ResNet-101 V2, Faster-RCNN Inception ResNet V2, and Faster-RCNN NASNet, each of which offers distinct architectural features and performance characteristics [42, 44].
The utilization of SSD MobileNet V1, a lightweight convolutional neural network architecture, as the backbone of SSD reflects the study's consideration of computational efficiency without compromising accuracy in object detection tasks [45]. Furthermore, the upgraded version, SSD MobileNet V2, introduces new network building blocks and optimization techniques, enhancing the performance and efficiency of object detection [46]. Similarly, SSD ResNet-50 V1 and Faster-RCNN ResNet-50 V1 leverage the ResNet-50 architecture, known for its effectiveness in training deep models with numerous layers, to facilitate robust feature extraction and object detection [47, 48].
Moreover, the study incorporates Faster-RCNN ResNet-101 V2, which employs the deeper architecture of ResNet-101 V2 to enhance feature representations, potentially improving detection performance [49]. Additionally, the combination of the Faster-RCNN framework with the Inception-ResNet V2 architecture in Faster-RCNN Inception-ResNet V2 aims to leverage the exceptional accuracy and efficiency of Inception-ResNet V2 in image classification tasks for object detection purposes [50]. Furthermore, the integration of Faster-RCNN with the NASNet architecture in Faster-RCNN NASNet reflects the study's focus on optimizing network architecture using automatically designed neural networks [51].
These models were compared with the proposed YOLOv9c model in the present study to evaluate their respective performances in object detection tasks. The detailed results are given in Table 3.
Table 3
Comparative study of the selected models.
| Model | Weights, M | Rel. | Training time, hours | Rel. | Inference time, ms | Rel. | F1-score | Rel. | [email protected] | Rel. |
|---|---|---|---|---|---|---|---|---|---|---|
| Proposed Model (YOLOv9c) | 25.3 | 6.1x | 11 | 0.6x | 18 | 0.4x | 0.98 | 1.36x | 0.98 | 1.42x |
| SSD MobileNet V1 | 4.2 | 1.0x | 16 | 1.0x | 43 | 1.0x | 0.72 | 1.0x | 0.69 | 1.00x |
| SSD MobileNet V2 | 6.1 | 1.4x | 20 | 1.3x | 26 | 0.6x | 0.80 | 1.10x | 0.83 | 1.20x |
| SSD ResNet-50 V1 | 25.6 | 6.0x | 47 | 3.0x | 61 | 1.4x | 0.73 | 1.01x | 0.76 | 1.09x |
| Faster-RCNN ResNet-50 V1 | 25.6 | 6.0x | 28 | 1.8x | 98 | 2.3x | 0.88 | 1.21x | 0.92 | 1.33x |
| RFCN ResNet-101 V2 | 44.7 | 10.5x | 55 | 3.6x | 99 | 2.3x | 0.96 | 1.32x | 0.94 | 1.36x |
| Faster-RCNN ResNet-101 V2 | 44.7 | 10.5x | 55 | 3.5x | 118 | 2.7x | 0.96 | 1.32x | 0.94 | 1.35x |
| Faster-RCNN Inception ResNet V2 | 55.9 | 13.2x | 93 | 6.0x | 363 | 8.4x | 0.94 | 1.30x | 0.95 | 1.38x |
| Faster-RCNN NASNet | 88.9 | 21.0x | 147 | 9.5x | 880 | 20.4x | 0.82 | 1.13x | 0.84 | 1.22x |
| YOLOv8 | - | - | - | - | - | - | 0.68 | 0.94x | 0.65 | 0.94x |
"Model": Names of different object detection models; "Weights (M)": Number of model weights in millions; "Training Time (hours)": Time taken to train the model in hours; "Inference Time (ms)": Time taken for inference or prediction per image in milliseconds; "F1-score": |
The F1-score is a metric used to assess the overall performance of a model by considering both precision and recall simultaneously.; "[email protected]": Mean Average Precision (mAP) at a threshold of 0.5 is a metric commonly used to evaluate object detection performance.
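In symbols, with precision $P$ and recall $R$, and with $\mathrm{AP}_c$ denoting the area under the precision-recall curve for class $c$ among $N$ classes at an IoU threshold of 0.5, the two metrics are

$$
F_1 = \frac{2PR}{P+R}, \qquad \text{mAP@0.5} = \frac{1}{N}\sum_{c=1}^{N}\mathrm{AP}_c .
$$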
Table 3 presents a comparative study of the selected object detection models, reporting weights in millions (M), training time in hours, inference time per image in milliseconds (ms), F1-score, and mean average precision at an IoU threshold of 0.5 ([email protected]), each as absolute (Abs.) and relative (Rel.) values. The proposed YOLOv9c model demonstrates superior performance in inference speed, F1-score, and [email protected] compared with models such as SSD MobileNet V1, SSD MobileNet V2, and Faster-RCNN ResNet-50 V1. The table also includes YOLOv8 for reference, highlighting the improvements achieved with YOLOv9c.
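As a worked check of the Rel. columns (SSD MobileNet V1 is the baseline, with all relative values 1.0x), the short Python sketch below reproduces the ratios from the absolute values; the numbers are copied from Table 3.

```python
# Absolute metrics from Table 3 for the baseline and the proposed model.
baseline = {"weights_M": 4.2, "train_h": 16, "infer_ms": 43, "f1": 0.72, "map50": 0.69}
yolov9c  = {"weights_M": 25.3, "train_h": 11, "infer_ms": 18, "f1": 0.98, "map50": 0.98}

# Relative value = model metric / SSD MobileNet V1 metric.
for key in baseline:
    print(f"{key}: {yolov9c[key] / baseline[key]:.2f}x")
# Output: 6.02x, 0.69x, 0.42x, 1.36x, 1.42x -- close to the table's entries
# of 6.1x, 0.6x, 0.4x, 1.36x, and 1.42x (small differences reflect rounding).
```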
Table 4
Training parameters of the model

| Variable | Values | Variable | Values | Variable | Values |
|---|---|---|---|---|---|
| epochs | 100 | lrf | 0.01 | fl_gamma | 0.0 |
| batch_size | 8 | momentum | 0.937 | hsv_h | 0.015 |
| imgsz | 640 | weight_decay | 0.0005 | hsv_s | 0.7 |
| optimizer | SGD | warmup_epochs | 3.0 | hsv_v | 0.4 |
| workers | 8 | warmup_momentum | 0.8 | degrees | 0.0 |
| label_smoothing | 0.0 | warmup_bias_lr | 0.1 | translate | 0.1 |
| patience | 100 | box | 7.5 | scale | 0.9 |
| freeze | [0] | cls | 0.5 | shear | 0.0 |
| save_period | -1 | cls_pw | 1.0 | perspective | 0.0 |
| local_rank | -1 | obj | 0.7 | flipud | 0.0 |
| min_items | 0 | obj_pw | 1.0 | fliplr | 0.5 |
| close_mosaic | 15 | dfl | 1.5 | mosaic | 1.0 |
| bbox_interval | -1 | iou_t | 0.2 | mixup | 0.15 |
| lr0 | 0.01 | anchor_t | 5.0 | copy_paste | 0.3 |
Table 4 presents the training parameters used for the model, including values such as epochs, learning rate factor (lrf), focal loss gamma (fl_gamma), batch size, momentum, image size (imgsz), optimizer, and various augmentation parameters like hsv (hue, saturation, value), degrees, translation, scaling, shearing, perspective transformation, flipping, and mixup. These parameters are essential for configuring and fine-tuning the training process to achieve optimal performance for object detection tasks.
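As a hedged illustration of how the Table 4 settings could map onto a training call, the sketch below uses the Ultralytics Python API; the dataset configuration file is a placeholder, and parameters such as obj, obj_pw, cls_pw, iou_t, anchor_t, fl_gamma, min_items, and bbox_interval come from the original YOLOv9 repository's hyperparameter files and have no direct equivalent in this API, so only the shared subset is shown.

```python
from ultralytics import YOLO

model = YOLO("yolov9c.pt")  # pretrained weights used as the starting point

# Training settings from Table 4 with direct Ultralytics equivalents
# (batch_size maps to "batch"); "stenosis.yaml" is a placeholder dataset
# config, not the study's actual file.
model.train(
    data="stenosis.yaml",
    epochs=100, batch=8, imgsz=640, optimizer="SGD", workers=8,
    patience=100, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005,
    warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1,
    box=7.5, cls=0.5, dfl=1.5, label_smoothing=0.0, close_mosaic=15,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1,
    scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5,
    mosaic=1.0, mixup=0.15, copy_paste=0.3,
)
```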
Table 3 provides a comparison of the object detection models based on several metrics. The models are the proposed model (PM), SSD MobileNet V1 [38], SSD MobileNet V2 [38], SSD ResNet-50 V1 [38], Faster-RCNN ResNet-50 V1 [38], RFCN ResNet-101 V2 [38], Faster-RCNN ResNet-101 V2 [38], Faster-RCNN Inception ResNet V2 [38], Faster-RCNN NASNet [38], and YOLOv8 [52]. The metrics analyzed are the number of model weights (in millions), training time (in hours), inference time (in milliseconds), F1-score, and [email protected] (mean Average Precision at a threshold of 0.5).
Comparing the proposed model (PM) with the other models, we observe several noteworthy differences that highlight the advantages of the PM in terms of object detection performance.
In terms of model weights, the PM has more weights (25.3 million) than the SSD MobileNet V1 (4.2 million) and SSD MobileNet V2 (6.1 million) models. This larger capacity suggests that the PM incorporates more complex and expressive features, enabling it to capture finer details and improve detection accuracy.
Regarding training time, the PM trained in roughly 0.6 times the time of SSD MobileNet V1 (11 versus 16 hours), indicating lower resource consumption in the learning process.
The inference time, which reflects the speed of object detection at runtime, shows that the PM outperforms the SSD MobileNet V2, with 18 milliseconds per image (roughly 55 frames per second) compared to 26 milliseconds. This improved inference efficiency suggests that the PM achieves real-time performance, making it suitable for time-sensitive applications.
Moving on to the evaluation metrics, the F1-score and [email protected] provide insights into the detection accuracy of the models. The PM exhibited an F1-score of 0.98, surpassing the other models, including the SSD MobileNet V1 (0.72) and SSD MobileNet V2 (0.80). This higher F1-score implies that the PM achieves a better balance between precision and recall, resulting in more accurate object localization.
Similarly, the PM shows a superior [email protected] of 0.98, outperforming the SSD MobileNet V1 (0.69) and SSD MobileNet V2 (0.83). A higher [email protected] indicates that the PM excels at detecting objects with higher precision and recall rates at the specified threshold, making it well suited for applications demanding high detection performance.
Training was performed with the parameters listed in Table 4 using 6660 of the 8325 images in the dataset, and testing was performed with 832 images. The training images were labeled with red rectangles, as shown in Fig. 3. After training and testing were completed, validation was performed; Fig. 4 shows the predicted stenoses and labels.
One hundred fine-tuning epochs were completed successfully in 3.467 hours. Validation was performed with YOLO (2024-3-18), Python 3.12.2, torch 2.2.1+cu118, and CUDA:0 (NVIDIA RTX A4000, 16376 MiB); the model summary is given in Table 5.
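A minimal validation sketch with the same assumed Ultralytics API (the checkpoint path and dataset file are placeholders) that reports the metrics shown in Table 5 might look like:

```python
from ultralytics import YOLO

# Load the fine-tuned checkpoint; the path is a placeholder.
model = YOLO("runs/detect/train/weights/best.pt")

# Validate on the held-out split; the metrics mirror Table 5's columns.
metrics = model.val(data="stenosis.yaml", imgsz=640)
print(metrics.box.mp, metrics.box.mr)      # precision (P) and recall (R)
print(metrics.box.map50, metrics.box.map)  # mAP@50 and mAP@50-95
```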
Table 5
Summary of the validation model
| Class | Images | Instances | Box(P) | R | mAP@50 | mAP@50–95 |
|---|---|---|---|---|---|---|
| All | 832 | 832 | 0.969 | 0.965 | 0.978 | 0.518 |
Table 5 summarizes the validation results of the model, including precision (P), recall (R), mean Average Precision at an IoU threshold of 0.5 (mAP@50), and mean Average Precision over IoU thresholds from 0.5 to 0.95 (mAP@50–95). The evaluation is performed across all classes, comprising 832 images and 832 instances. The model demonstrates strong performance, with high precision, recall, and mAP@50 scores, indicating its effectiveness in object detection tasks.
As shown in Fig. 5, the validation speeds for preprocessing, inference, loss computation, and postprocessing were 0.2 ms, 3.9 ms, 0.0 ms, and 1.3 ms per image, respectively (about 5.4 ms per image in total).
The confusion matrix illustrated in Fig. 6 provides insight into the detection performance for the "stenosis" class and the background class. The model classified 820 instances as "stenosis", while the background class comprises 10 instances; notably, 12 instances were predicted as background instead of being correctly identified as "stenosis".
This observation highlights an important aspect of the detection system's performance. The misclassification of 12 instances as background indicates a potential limitation in the model's ability to accurately differentiate between the "stenosis" class and the background, which can have implications in medical imaging and other applications where correctly identifying stenosis is crucial.
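Using the counts read from Fig. 6 (820 true positives, 12 stenosis instances missed as background, and 10 background regions predicted as stenosis, the latter treated as false positives, which is an assumption about the matrix layout), a short sketch recovers the per-class rates:

```python
tp, fn, fp = 820, 12, 10  # counts read from the Fig. 6 confusion matrix

recall = tp / (tp + fn)       # fraction of true stenoses that were detected
precision = tp / (tp + fp)    # fraction of stenosis predictions that were correct
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f}, precision={precision:.3f}, F1={f1:.3f}")
# recall=0.986, precision=0.988, F1=0.987 under these assumed counts
```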