With the rapid development of deep learning, the field of computer vision has undergone unprecedented changes. As one of the core tasks of computer vision, improvements in object detection performance directly determine the level of intelligence achievable in many application scenarios. The YOLO (You Only Look Once) series of algorithms, known for being both efficient and accurate[1], has received widespread attention and application since its inception. However, as application requirements grow increasingly complex, further improving the accuracy and efficiency of YOLO[2] algorithms has become a research hotspot. YOLOv8, as the latest member of the YOLO series, inherits many advantages of YOLOv5[3][4] and achieves significant performance improvements. Nevertheless, when faced with more complex and diverse scenes and multi-scale object detection tasks, existing Convolutional Neural Network (CNN) structures still have certain limitations. To overcome this challenge, this paper proposes an improved YOLOv8 algorithm that incorporates the Pyramid Vision Transformer (PVT) architecture[5] (Figure 1).
The Pyramid Vision Transformer is a deep learning model that combines the Transformer architecture with a pyramid structure[6]. Unlike the original Vision Transformer (ViT)[7][8][9], PVT produces feature maps of different scales across multiple stages, thereby forming a pyramid structure. This design enables PVT to capture feature information at different scales, enhancing the model's ability to handle objects of varying sizes within an image. By introducing PVT into the backbone network of YOLOv8, we aim to leverage its strong feature extraction and multi-scale feature processing capabilities to further improve the object detection accuracy and efficiency of YOLOv8[10].
Currently, both single-stage and two-stage object detection algorithms based on deep learning largely adopt feature pyramid structures to enrich intermediate feature map information. In methods based on the Vision Transformer (ViT) structure[11], the input image is first converted into a series of patches, which are then fed into the Transformer Encoder to extract object features and obtain feature maps. However, since the feature map size remains the same[12] at every stage, this approach is difficult to apply to downstream vision tasks. To bridge the gap between ViT and feature pyramid techniques, Wang et al.[22] proposed PVT, which can be trained on high-resolution images without significantly increasing the model's computational complexity. PVT employs a progressive shrinking pyramid strategy to produce multi-scale output feature maps. In addition, it introduces Spatial Reduction Attention (SRA)[13][14] to reduce resource consumption and time complexity during attention computation. Compared with CNNs and ViT, PVT not only inherits the global receptive field of ViT but also incorporates the pyramid structure of CNNs, facilitating the acquisition of multi-scale feature maps and seamless transfer to advanced computer vision tasks such as object detection and instance segmentation.
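To make the cost saving concrete, the short calculation below estimates the number of attention score entries (queries × keys) per stage for a 224×224 input, with and without spatial reduction. The stage strides and reduction ratios follow the PVT design described above; the rest is illustrative arithmetic, not part of the proposed method.

```python
# Approximate number of attention score entries (queries x keys) per stage
# for a 224x224 input. Stage strides and spatial-reduction ratios follow
# the PVT design; everything else is illustrative arithmetic.
H = W = 224
strides = [4, 8, 16, 32]      # output strides of the four stages
sr_ratios = [8, 4, 2, 1]      # SRA reduction ratios per stage

for s, r in zip(strides, sr_ratios):
    n_q = (H // s) * (W // s)                    # query tokens at this stage
    n_kv_full = n_q                              # key/value tokens without SRA
    n_kv_sra = (H // (s * r)) * (W // (s * r))   # key/value tokens after SRA
    print(f"stride {s:2d}: {n_q * n_kv_full:>9,} -> {n_q * n_kv_sra:,} entries")
```

At the stride-4 stage, for example, this reduces the score matrix from 3136×3136 to 3136×49, which is where most of the savings in memory and computation come from.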
Figure 2 illustrates the overall architecture of PVT, which comprises four stage modules (Stage1, Stage2, Stage3, and Stage4). Each stage module contains a Patch Embedding and n Transformer Encoder Layers[5], and outputs a feature map downsampled by a factor of 4, 8, 16, or 32, respectively[6]. In the first stage, given an input image of size H×W×3, the image is first divided into 4×4×3 patches, giving (H/4)×(W/4) = HW/16 patches in total; each patch is flattened and linearly projected into an embedding vector. These embedded vectors, together with positional embeddings, are then fed into the Transformer Encoder Layers, and the output is reshaped into an H/4×W/4×C1 feature map f1. The subsequent three stages follow the same process, yielding the H/8×W/8×C2 feature map f2, the H/16×W/16×C3 feature map f3, and the H/32×W/32×C4 feature map f4, respectively.
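The following PyTorch sketch illustrates this stage-wise pipeline: each stage downsamples the input with a patch embedding, runs the resulting tokens through Transformer encoder layers, and reshapes the output back into a 2-D feature map. It is a minimal illustration only; a standard nn.TransformerEncoderLayer stands in for PVT's SRA-based encoder, positional embeddings are omitted, and the channel widths and depths are assumed for demonstration.

```python
import torch
import torch.nn as nn

class PVTStage(nn.Module):
    """One pyramid stage: downsample with a patch embedding, run `depth`
    Transformer encoder layers on the tokens, reshape back to a 2-D map.
    A plain nn.TransformerEncoderLayer stands in for the SRA-based encoder
    of the real PVT; positional embeddings are omitted for brevity."""
    def __init__(self, in_chans, embed_dim, stride, depth, heads):
        super().__init__()
        # Strided convolution = non-overlapping patch split + linear projection.
        self.embed = nn.Conv2d(in_chans, embed_dim, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(embed_dim, heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                           # x: (B, C_in, H, W)
        x = self.embed(x)                           # (B, C, H/s, W/s)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(B, C, H, W)

# Four stages -> feature maps f1..f4 at strides 4, 8, 16, 32.
dims, depths = [64, 128, 320, 512], [2, 2, 2, 2]    # illustrative settings
stages = nn.ModuleList([
    PVTStage(3 if i == 0 else dims[i - 1], dims[i],
             stride=4 if i == 0 else 2, depth=depths[i], heads=dims[i] // 64)
    for i in range(4)
])

x, feats = torch.randn(1, 3, 224, 224), []
for stage in stages:
    x = stage(x)
    feats.append(x)
print([tuple(f.shape[-2:]) for f in feats])   # [(56, 56), (28, 28), (14, 14), (7, 7)]
```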
To reduce the model's computational load, PVT proposes SRA to replace the Multi-Head Attention (MHA) module in the Transformer[5]. Like MHA, SRA takes a query, key, and value as inputs and outputs refined features[15]. Unlike MHA, SRA reduces the spatial dimensions (i.e., width and height) of the key and value before computing self-attention, which significantly decreases the attention computation burden and memory usage[14][16]. The structures of the MHA and SRA modules are depicted in Figure 3.
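A minimal sketch of the SRA idea is given below, assuming a plain nn.MultiheadAttention for the attention itself: the key/value token map is spatially downsampled by a reduction ratio R before attention, so the score matrix shrinks from N×N to N×(N/R²). The dimensions and reduction ratio are illustrative, not taken from the proposed model.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Sketch of SRA: downsample the key/value token map by `sr_ratio`
    before multi-head attention. Queries keep full resolution, so the
    output still has one token per input position."""
    def __init__(self, dim=64, heads=1, sr_ratio=8):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                      # x: (B, N, C), N = H*W
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)   # tokens back to a 2-D map
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, N / sr_ratio^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out                                   # (B, N, C)

tokens = torch.randn(1, 56 * 56, 64)                 # stage-1 tokens of a 224x224 image
out = SpatialReductionAttention()(tokens, 56, 56)
print(out.shape)                                     # torch.Size([1, 3136, 64])
```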
PVT comes in four variants: PVT-Tiny, PVT-Small, PVT-Medium, and PVT-Large. Considering computational cost and complexity, this paper adopts the YOLOv8s model as the baseline and uses PVT-Small as the feature extraction network, replacing the stacked convolutional block structure in YOLOv8s[8][17].
To address the poor detection accuracy of the YOLOv8s model in complex backgrounds[18][19], this paper strengthens the feature extraction capability of its backbone network by substituting the stacked convolutional block structure in YOLOv8s with PVT. This approach leverages PVT's pre-trained weights to better initialize the model parameters and benefits from the Transformer's strong feature modeling capability, enabling multi-scale feature extraction and fusion and thereby improving the model's ability to extract image features. The overall structure of the proposed model is shown in Figure 4 below: PVT replaces the original convolutional block structure in the Backbone stage, while the Head and Prediction stages remain consistent with YOLOv8s.
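The sketch below shows, at a purely illustrative level, how such a model could be wired together: the multi-scale maps from a PVT backbone (the stride-8/16/32 outputs f2, f3, and f4) are passed to an unchanged YOLOv8s-style neck and detection head. The `pvt_stages`, `neck`, and `head` arguments are placeholders for the actual modules, not the YOLOv8s implementation itself.

```python
import torch.nn as nn

class PVTBackboneYOLO(nn.Module):
    """Illustrative wiring of the proposed model: a PVT backbone replaces
    the stacked convolutional blocks of YOLOv8s, and its stride-8/16/32
    feature maps feed the unchanged neck and detection head."""
    def __init__(self, pvt_stages, neck, head):
        super().__init__()
        self.stages = pvt_stages   # e.g. the four PVTStage modules sketched above;
                                   # pre-trained PVT weights would typically be
                                   # loaded here before fine-tuning.
        self.neck = neck           # YOLOv8s neck, kept as in the baseline
        self.head = head           # YOLOv8s detection head, kept as in the baseline

    def forward(self, x):
        feats = []
        for stage in self.stages:          # collect f1..f4
            x = stage(x)
            feats.append(x)
        p3, p4, p5 = feats[1:]             # f2, f3, f4 at strides 8, 16, 32
        return self.head(self.neck([p3, p4, p5]))
```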