To address the challenges posed by the DOTA v1.0 dataset, this paper proposes a target detection algorithm based on YOLOX (Ge et al., 2021), named YOLOX-SHE. To keep the model lightweight, we build on the tiny version of YOLOX. The network architecture is shown in Figure 4. First, we propose the ST-CSP backbone network, which adds the Swin Transformer structure to the original CSPDarknet53 backbone (Ge et al., 2021) to capture global information at multiple scales and strengthen feature extraction. Second, we introduce the Explicit Visual Center (EVC) structure to capture global long-range dependencies, improving the accuracy and efficiency of the algorithm. Finally, we add a detection head specifically designed for tiny objects to the original three YOLO heads for more effective multi-scale object detection.
4.1 ST-CSP
The Swin Transformer is a deep learning model based on the transformer architecture; it uses a multi-stage attention mechanism and cross-layer connections to improve efficiency and accuracy. Compared with standard vision transformers, it offers better computational efficiency and a smaller model size, making it more practical for real-world applications. Integrating the Swin Transformer structure into the detection network therefore improves inference speed and yields better detection results on the DOTA v1.0 dataset with a smaller model. Adding the Swin Transformer structure to the YOLO model improves performance and mean average precision (mAP) for several reasons (a minimal sketch of the window-based attention follows this list):
Better Feature Representation: As a transformer-based backbone, the Swin Transformer captures semantic information in input images more effectively. Through multiple layers of attention mechanisms and position encoding, it enables global and cross-level feature interactions, resulting in stronger feature representations.
Larger Receptive Field: The Swin Transformer establishes long-range dependencies through self-attention, allowing the model to perceive features over a larger spatial range and better understand objects in the image. This larger receptive field helps capture contextual information and improves object detection accuracy.
Improved Position Encoding: The Swin Transformer uses learnable position encoding to represent the spatial information of input images accurately. This precise position encoding enhances the model's understanding and localization of objects, improving position regression and detection accuracy.
Better Feature Pyramid Network: The Swin Transformer structure can be integrated into a feature pyramid network to capture features at different scales. This is crucial for object detection tasks as it allows the model to capture features of objects at various scales while providing more contextual information, improving detection capability.
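To make the window-based attention concrete, the following minimal PyTorch sketch partitions a feature map into non-overlapping windows and applies multi-head self-attention within each window. The shifted-window scheme, relative position bias, and MLP sub-block of the full Swin Transformer block are omitted, and all sizes are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Minimal window-based multi-head self-attention, sketching the core
    idea behind a Swin Transformer block (shifted windows, relative position
    bias, and the MLP sub-block are omitted for brevity)."""
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, C, H, W); H and W divisible by window_size
        B, C, H, W = x.shape
        ws = self.window_size
        # Partition the feature map into non-overlapping ws x ws windows.
        x = x.view(B, C, H // ws, ws, W // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)  # (B*nW, ws*ws, C)
        # Self-attention is computed independently inside each window.
        y = self.norm(x)
        y, _ = self.attn(y, y, y)
        x = x + y                                                # residual connection
        # Reverse the window partition back to (B, C, H, W).
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x

# Example: a 28x28 feature map with 96 channels.
feat = torch.randn(2, 96, 28, 28)
print(WindowSelfAttention(96)(feat).shape)   # torch.Size([2, 96, 28, 28])
```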
The relationship between targets and their surrounding environment in remote sensing images is usually complex, so capturing global information is essential for better target detection. The multi-stage attention mechanism in the Swin Transformer structure weights different feature levels to capture global information, and its cross-layer connections fuse features at different levels to capture background information, improving target detection accuracy. This paper therefore proposes the ST-CSP backbone network, which integrates the Swin Transformer structure (Liu et al., 2021) into dark5 to obtain more global information, addressing the large number of targets, their varied orientations, and the environmental influences present in the dataset.
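The sketch below illustrates the ST-CSP idea of attaching a transformer block to the deepest (dark5) stage so that the 32x-downsampled features also carry global context. A plain TransformerEncoderLayer over flattened spatial tokens stands in for the Swin Transformer block, and the module name and channel sizes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class Dark5WithTransformer(nn.Module):
    """Sketch of appending a transformer block to the deepest backbone stage
    so its features gain global context; a generic encoder layer is used as
    a stand-in for the Swin Transformer block."""
    def __init__(self, in_channels=256, out_channels=512, num_heads=8):
        super().__init__()
        # Stand-in for the original dark5 convolutions of the CSP backbone.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=out_channels, nhead=num_heads, batch_first=True)

    def forward(self, x):
        x = self.conv(x)                       # (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per location
        tokens = self.transformer(tokens)      # self-attention over all locations
        return tokens.transpose(1, 2).reshape(B, C, H, W)

dark4 = torch.randn(1, 256, 20, 20)            # hypothetical dark4 output (stride 16)
print(Dark5WithTransformer()(dark4).shape)     # torch.Size([1, 512, 10, 10])
```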
4.2 EVC
The Explicit Visual Center (EVC) is a novel network structure that identifies the direction and size of targets from the position information of their center points (Quan et al., 2022). Internally, EVC uses a learnable center point as the main basis for target detection and detects targets by computing the error between the center point and the bounding box of each target. This approach offers higher detection accuracy and faster detection speed on the DOTA v1.0 remote sensing dataset. By combining the position information of the center point with anchor-based candidate box generation, EVC better captures the location and orientation of targets in remote sensing images, improving detection accuracy. EVC also completes detection in a shorter time, which is important for lightweight network models that must run efficiently.
EVC consists of two parallel blocks. A lightweight MLP captures the global information of the top-level features, while a learnable visual center mechanism aggregates local features within the same region of the layer to preserve local information. The lightweight MLP mainly comprises depthwise convolution-based modules and channel MLP-based blocks. The network structure is shown in Figure 6.
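The following simplified PyTorch sketch mirrors this two-branch layout: a depthwise-convolution/channel-MLP branch for global information and a learnable codebook acting as the visual center for local aggregation, with the two outputs concatenated. The layer choices and the codebook-based aggregation are illustrative assumptions, not the reference EVC implementation.

```python
import torch
import torch.nn as nn

class EVCSketch(nn.Module):
    """Simplified two-branch sketch of the Explicit Visual Center: a
    lightweight MLP branch plus a learnable visual-center branch whose
    outputs are concatenated along the channel dimension."""
    def __init__(self, channels=256, codewords=16):
        super().__init__()
        # Lightweight MLP branch: depthwise convolution + 1x1 channel MLP.
        self.mlp_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 1),                              # channel MLP
        )
        # Learnable visual center branch: a codebook of learnable "centers".
        self.codebook = nn.Parameter(torch.randn(codewords, channels))
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                     # x: (B, C, H, W) top-level features
        global_feat = self.mlp_branch(x)
        # Soft-assign each spatial location to the learnable centers.
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        sim = torch.softmax(tokens @ self.codebook.t(), -1)   # (B, H*W, K)
        local_feat = (sim @ self.codebook).transpose(1, 2).reshape(B, C, H, W)
        local_feat = self.proj(local_feat)
        return torch.cat([global_feat, local_feat], dim=1)    # (B, 2C, H, W)

x = torch.randn(1, 256, 10, 10)
print(EVCSketch()(x).shape)                   # torch.Size([1, 512, 10, 10])
```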
4.3 YOLO Head
Adding a YOLO head to the YOLO model improves performance and mAP for several reasons (a minimal sketch of a decoupled detection head follows this list):
Enhanced Feature Representation: The YOLO head performs detection on top of the features extracted by the backbone network. With additional convolutional and fully connected layers, it learns and represents object features better, making the model more sensitive to object features and enabling it to localize and classify them more accurately.
Increased Receptive Field: The convolutional layers in the YOLO head enhance the ability to perceive features at different scales. By employing multiple convolutional layers, the model captures semantic information at different levels, leading to better understanding and analysis of objects in the image.
Finer Position Regression: The YOLO head predicts the bounding box positions of objects for localization. Adding a YOLO head provides more regression capabilities, enabling more accurate localization. By refining position regression, the model better captures object boundary details, improving detection accuracy.
Multi-scale Predictions: YOLO models typically predict objects of different scales on different feature layers. By adding a YOLO head, independent predictions can be made on each feature layer, enhancing the ability to detect objects at varying scales and reducing the impact of object scale.
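As a concrete reference point, the following minimal sketch shows a YOLOX-style decoupled head for a single feature level, with separate classification and regression/objectness branches. The channel width and the 15-class setting (the number of DOTA v1.0 categories) are illustrative, not the exact YOLOX-SHE configuration.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Minimal sketch of a YOLOX-style decoupled detection head for one
    feature level: a shared stem followed by separate classification and
    regression/objectness branches."""
    def __init__(self, in_channels=256, num_classes=15, width=128):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_channels, width, 1), nn.SiLU())
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, num_classes, 1),       # per-location class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.box_pred = nn.Conv2d(width, 4, 1)      # box offsets (x, y, w, h)
        self.obj_pred = nn.Conv2d(width, 1, 1)      # objectness score

    def forward(self, x):
        x = self.stem(x)
        cls = self.cls_branch(x)
        reg = self.reg_branch(x)
        return cls, self.box_pred(reg), self.obj_pred(reg)

feat = torch.randn(1, 256, 40, 40)                  # one FPN level (e.g. stride 16)
cls, box, obj = DecoupledHead()(feat)
print(cls.shape, box.shape, obj.shape)
# torch.Size([1, 15, 40, 40]) torch.Size([1, 4, 40, 40]) torch.Size([1, 1, 40, 40])
```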
To improve the detection of small targets in the dataset, this paper follows TPH-YOLOv5 (Zhu et al., 2021) and adds a fourth YOLO head, dedicated to tiny targets, to the three original heads of YOLOX-Tiny. Although the extra prediction head increases the parameter count and computational cost, it enables more effective detection of objects at multiple scales. Figure 7 shows the structure of the YOLO head.
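A minimal sketch of the resulting four-scale prediction is shown below: an extra stride-4 level is added to the usual stride-8/16/32 levels so that tiny objects are predicted on a higher-resolution feature map. The per-level channel counts and the single-convolution prediction layers are placeholders, not the actual YOLOX-SHE heads.

```python
import torch
import torch.nn as nn

# Sketch of extending the three original detection scales with an extra,
# higher-resolution level (stride 4) aimed at tiny objects. Each level gets
# its own lightweight prediction layer; channel sizes are illustrative.
num_classes = 15                       # DOTA v1.0 categories
strides = [4, 8, 16, 32]               # the stride-4 level is the added tiny-object head
channels = [64, 128, 256, 512]

heads = nn.ModuleList(
    [nn.Conv2d(c, num_classes + 5, 1)  # class scores + box (4) + objectness (1)
     for c in channels]
)

# Hypothetical multi-scale neck outputs for a 640x640 input.
feats = [torch.randn(1, c, 640 // s, 640 // s) for c, s in zip(channels, strides)]
for stride, head, feat in zip(strides, heads, feats):
    print(f"stride {stride:>2}: {tuple(head(feat).shape)}")
# stride  4: (1, 20, 160, 160)
# stride  8: (1, 20, 80, 80)
# stride 16: (1, 20, 40, 40)
# stride 32: (1, 20, 20, 20)
```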