With the rapid development of deep learning, the field of computer vision has undergone unprecedented changes. As one of the core tasks of computer vision, improvements in object detection performance directly determine the level of intelligence achievable in many application scenarios. The YOLO (You Only Look Once) series of algorithms, known for being both efficient and accurate[1], has received widespread attention and application since its inception. However, as application requirements grow increasingly complex, further improving the accuracy and efficiency of YOLO[2] algorithms has become a research hotspot. YOLOv8, as the latest member of the YOLO series, inherits many advantages of YOLOv5[3][4] and achieves significant performance improvements. Nevertheless, when faced with more complex and diverse scenes and multi-scale object detection tasks, existing Convolutional Neural Network (CNN) structures still have certain limitations. To overcome this challenge, this paper proposes an improved YOLOv8 algorithm that incorporates the Pyramid Vision Transformer (PVT) architecture[5] (Figure 1).
The Pyramid Vision Transformer is a deep learning model that combines the Transformer architecture with a pyramid structure[6]. Unlike the original Vision Transformer (ViT)[7][8][9], PVT produces feature maps of different scales across multiple stages, thereby forming a pyramid structure. This design enables PVT to capture feature information at different scales, enhancing the model's ability to handle objects of varying sizes within an image. By introducing PVT into the backbone network of YOLOv8, we aim to leverage its strong feature extraction and multi-scale feature processing capabilities to further improve the object detection accuracy and efficiency of YOLOv8[10].
Currently, both single-stage and two-stage object detection algorithms based on deep learning largely adopt feature pyramid structures to enrich intermediate feature map information. In methods based on the Vision Transformer (ViT) structure[11], the input image is first converted into a series of patches, which are then fed into the Transformer Encoder to extract object features and obtain feature maps. However, since the feature map size remains the same[12] at every stage, this approach is difficult to apply to downstream vision tasks. To bridge the gap between ViT and feature pyramid techniques, Wang et al.[22] proposed PVT, which can be trained on high-resolution images without significantly increasing the model's computational complexity. PVT employs a progressive shrinking pyramid strategy to produce multi-scale output feature maps. In addition, it introduces Spatial Reduction Attention (SRA)[13][14] to reduce resource consumption and time complexity during attention computation. Compared with CNNs and ViT, PVT not only inherits the global receptive field of ViT but also incorporates the pyramid structure of CNNs, facilitating the acquisition of multi-scale feature maps and seamless transfer to advanced computer vision tasks such as object detection and instance segmentation.
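To make the cost saving concrete, the short calculation below estimates the number of attention score entries (queries × keys) per stage for a 224×224 input, with and without spatial reduction. The stage strides and reduction ratios follow the PVT design described above; the rest is illustrative arithmetic, not part of the proposed method.

```python
# Approximate number of attention score entries (queries x keys) per stage
# for a 224x224 input. Stage strides and spatial-reduction ratios follow
# the PVT design; everything else is illustrative arithmetic.
H = W = 224
strides = [4, 8, 16, 32]      # output strides of the four stages
sr_ratios = [8, 4, 2, 1]      # SRA reduction ratios per stage

for s, r in zip(strides, sr_ratios):
    n_q = (H // s) * (W // s)                    # query tokens at this stage
    n_kv_full = n_q                              # key/value tokens without SRA
    n_kv_sra = (H // (s * r)) * (W // (s * r))   # key/value tokens after SRA
    print(f"stride {s:2d}: {n_q * n_kv_full:>9,} -> {n_q * n_kv_sra:,} entries")
```

At the stride-4 stage, for example, this reduces the score matrix from 3136×3136 to 3136×49, which is where most of the savings in memory and computation come from.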
Figure 2 illustrates the overall architecture of PVT, which comprises four stage modules (Stage1, Stage2, Stage3, and Stage4). Each stage module contains a Patch Embedding and n Transformer Encoder Layers[5], and outputs a feature map downsampled by a factor of 4, 8, 16, or 32, respectively[6]. In the first stage, given an input image of size H×W×3, the image is first divided into 4×4×3 patches, giving (H/4)×(W/4) = HW/16 patches in total; each patch is flattened and linearly projected into an embedding vector. These embedded vectors, together with positional embeddings, are then fed into the Transformer Encoder Layers, and the output is reshaped into an H/4×W/4×C1 feature map f1. The subsequent three stages follow the same process, yielding the H/8×W/8×C2 feature map f2, the H/16×W/16×C3 feature map f3, and the H/32×W/32×C4 feature map f4, respectively.
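The following PyTorch sketch illustrates this stage-wise pipeline: each stage downsamples the input with a patch embedding, runs the resulting tokens through Transformer encoder layers, and reshapes the output back into a 2-D feature map. It is a minimal illustration only; a standard nn.TransformerEncoderLayer stands in for PVT's SRA-based encoder, positional embeddings are omitted, and the channel widths and depths are assumed for demonstration.

```python
import torch
import torch.nn as nn

class PVTStage(nn.Module):
    """One pyramid stage: downsample with a patch embedding, run `depth`
    Transformer encoder layers on the tokens, reshape back to a 2-D map.
    A plain nn.TransformerEncoderLayer stands in for the SRA-based encoder
    of the real PVT; positional embeddings are omitted for brevity."""
    def __init__(self, in_chans, embed_dim, stride, depth, heads):
        super().__init__()
        # Strided convolution = non-overlapping patch split + linear projection.
        self.embed = nn.Conv2d(in_chans, embed_dim, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(embed_dim, heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                           # x: (B, C_in, H, W)
        x = self.embed(x)                           # (B, C, H/s, W/s)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(B, C, H, W)

# Four stages -> feature maps f1..f4 at strides 4, 8, 16, 32.
dims, depths = [64, 128, 320, 512], [2, 2, 2, 2]    # illustrative settings
stages = nn.ModuleList([
    PVTStage(3 if i == 0 else dims[i - 1], dims[i],
             stride=4 if i == 0 else 2, depth=depths[i], heads=dims[i] // 64)
    for i in range(4)
])

x, feats = torch.randn(1, 3, 224, 224), []
for stage in stages:
    x = stage(x)
    feats.append(x)
print([tuple(f.shape[-2:]) for f in feats])   # [(56, 56), (28, 28), (14, 14), (7, 7)]
```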
To reduce the model's computational load, PVT proposes SRA to replace the Multi-Head Attention (MHA) module in the Transformer[5]. Like MHA, SRA takes a query, key, and value as inputs and outputs refined features[15]. Unlike MHA, SRA reduces the spatial dimensions (i.e., width and height) of the key and value before computing self-attention, which significantly decreases the attention computation burden and memory usage[14][16]. The structures of the MHA and SRA modules are depicted in Figure 3.
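A minimal sketch of the SRA idea is given below, assuming a plain nn.MultiheadAttention for the attention itself: the key/value token map is spatially downsampled by a reduction ratio R before attention, so the score matrix shrinks from N×N to N×(N/R²). The dimensions and reduction ratio are illustrative, not taken from the proposed model.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Sketch of SRA: downsample the key/value token map by `sr_ratio`
    before multi-head attention. Queries keep full resolution, so the
    output still has one token per input position."""
    def __init__(self, dim=64, heads=1, sr_ratio=8):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                      # x: (B, N, C), N = H*W
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)   # tokens back to a 2-D map
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, N / sr_ratio^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out                                   # (B, N, C)

tokens = torch.randn(1, 56 * 56, 64)                 # stage-1 tokens of a 224x224 image
out = SpatialReductionAttention()(tokens, 56, 56)
print(out.shape)                                     # torch.Size([1, 3136, 64])
```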
PVT comes in four variants: PVT-Tiny, PVT-Small, PVT-Medium, and PVT-Large. Considering computational cost and complexity, this paper adopts the YOLOv8s model as the baseline and uses PVT-Small as the feature extraction network, replacing the stacked convolutional block structure in YOLOv8s[8][17].
To address the poor detection accuracy of the YOLOv8s model in complex backgrounds[18][19], this paper strengthens the feature extraction capability of its backbone network by substituting the stacked convolutional block structure in YOLOv8s with PVT. This approach leverages PVT's pre-trained weights to better initialize the model parameters and benefits from the Transformer's strong feature modeling capability, enabling multi-scale feature extraction and fusion and thereby improving the model's ability to extract image features. The overall structure of the proposed model is shown in Figure 4 below: PVT replaces the original convolutional block structure in the Backbone stage, while the Head and Prediction stages remain consistent with YOLOv8s.
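The sketch below shows, at a purely illustrative level, how such a model could be wired together: the multi-scale maps from a PVT backbone (the stride-8/16/32 outputs f2, f3, and f4) are passed to an unchanged YOLOv8s-style neck and detection head. The `pvt_stages`, `neck`, and `head` arguments are placeholders for the actual modules, not the YOLOv8s implementation itself.

```python
import torch.nn as nn

class PVTBackboneYOLO(nn.Module):
    """Illustrative wiring of the proposed model: a PVT backbone replaces
    the stacked convolutional blocks of YOLOv8s, and its stride-8/16/32
    feature maps feed the unchanged neck and detection head."""
    def __init__(self, pvt_stages, neck, head):
        super().__init__()
        self.stages = pvt_stages   # e.g. the four PVTStage modules sketched above;
                                   # pre-trained PVT weights would typically be
                                   # loaded here before fine-tuning.
        self.neck = neck           # YOLOv8s neck, kept as in the baseline
        self.head = head           # YOLOv8s detection head, kept as in the baseline

    def forward(self, x):
        feats = []
        for stage in self.stages:          # collect f1..f4
            x = stage(x)
            feats.append(x)
        p3, p4, p5 = feats[1:]             # f2, f3, f4 at strides 8, 16, 32
        return self.head(self.neck([p3, p4, p5]))
```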