To address the challenges posed by the DOTA v1.0 dataset, this paper proposes a target detection algorithm based on YOLOX (Ge et al., 2021), named YOLOX-SHE. To keep the model lightweight, we build on the tiny version of YOLOX. The network architecture is shown in Figure 4. First, we propose the ST-CSP backbone network, which adds the Swin Transformer structure to the original CSPDarknet53 backbone (Ge et al., 2021) to capture global information at multiple scales and strengthen feature extraction. Second, we introduce the Explicit Visual Center (EVC) structure to capture global long-range dependencies, improving the accuracy and efficiency of the algorithm. Finally, we add a detection head specifically designed for tiny objects to the original three YOLO heads for more effective multi-scale object detection.
4.1 ST-CSP
The Swin Transformer is a deep learning model based on the transformer architecture; it uses a multi-stage attention mechanism and cross-layer connections to improve efficiency and accuracy. Compared with standard vision transformers, it offers better computational efficiency and a smaller model size, making it more practical for real-world applications. Integrating the Swin Transformer structure into the detection network therefore improves inference speed and yields better detection results on the DOTA v1.0 dataset with a smaller model. Adding the Swin Transformer structure to the YOLO model improves performance and mean average precision (mAP) for several reasons (a minimal sketch of the window-based attention follows this list):
Better Feature Representation: As a transformer-based backbone, the Swin Transformer captures semantic information in input images more effectively. Through multiple layers of attention mechanisms and position encoding, it enables global and cross-level feature interactions, resulting in stronger feature representations.
Larger Receptive Field: The Swin Transformer establishes long-range dependencies through self-attention, allowing the model to perceive features over a larger spatial range and better understand objects in the image. This larger receptive field helps capture contextual information and improves object detection accuracy.
Improved Position Encoding: The Swin Transformer uses learnable position encoding to represent the spatial information of input images accurately. This precise position encoding enhances the model's understanding and localization of objects, improving position regression and detection accuracy.
Better Feature Pyramid Network: The Swin Transformer structure can be integrated into a feature pyramid network to capture features at different scales. This is crucial for object detection tasks as it allows the model to capture features of objects at various scales while providing more contextual information, improving detection capability.
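To make the window-based attention concrete, the following minimal PyTorch sketch partitions a feature map into non-overlapping windows and applies multi-head self-attention within each window. The shifted-window scheme, relative position bias, and MLP sub-block of the full Swin Transformer block are omitted, and all sizes are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Minimal window-based multi-head self-attention, sketching the core
    idea behind a Swin Transformer block (shifted windows, relative position
    bias, and the MLP sub-block are omitted for brevity)."""
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, C, H, W); H and W divisible by window_size
        B, C, H, W = x.shape
        ws = self.window_size
        # Partition the feature map into non-overlapping ws x ws windows.
        x = x.view(B, C, H // ws, ws, W // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)  # (B*nW, ws*ws, C)
        # Self-attention is computed independently inside each window.
        y = self.norm(x)
        y, _ = self.attn(y, y, y)
        x = x + y                                                # residual connection
        # Reverse the window partition back to (B, C, H, W).
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x

# Example: a 28x28 feature map with 96 channels.
feat = torch.randn(2, 96, 28, 28)
print(WindowSelfAttention(96)(feat).shape)   # torch.Size([2, 96, 28, 28])
```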
The relationship between targets and their surrounding environment in remote sensing images is usually complex, so capturing global information is essential for better target detection. The multi-stage attention mechanism in the Swin Transformer structure weights different feature levels to capture global information, and its cross-layer connections fuse features at different levels to capture background information, improving target detection accuracy. This paper therefore proposes the ST-CSP backbone network, which integrates the Swin Transformer structure (Liu et al., 2021) into dark5 to obtain more global information, addressing the large number of targets, their varied orientations, and the environmental influences present in the dataset.
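The sketch below illustrates the ST-CSP idea of attaching a transformer block to the deepest (dark5) stage so that the 32x-downsampled features also carry global context. A plain TransformerEncoderLayer over flattened spatial tokens stands in for the Swin Transformer block, and the module name and channel sizes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class Dark5WithTransformer(nn.Module):
    """Sketch of appending a transformer block to the deepest backbone stage
    so its features gain global context; a generic encoder layer is used as
    a stand-in for the Swin Transformer block."""
    def __init__(self, in_channels=256, out_channels=512, num_heads=8):
        super().__init__()
        # Stand-in for the original dark5 convolutions of the CSP backbone.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=out_channels, nhead=num_heads, batch_first=True)

    def forward(self, x):
        x = self.conv(x)                       # (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per location
        tokens = self.transformer(tokens)      # self-attention over all locations
        return tokens.transpose(1, 2).reshape(B, C, H, W)

dark4 = torch.randn(1, 256, 20, 20)            # hypothetical dark4 output (stride 16)
print(Dark5WithTransformer()(dark4).shape)     # torch.Size([1, 512, 10, 10])
```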
4.2 EVC
The Explicit Visual Center (EVC) is a novel network structure that identifies the direction and size of targets from the position information of their center points (Quan et al., 2022). Internally, EVC uses a learnable center point as the main basis for target detection and detects targets by computing the error between the center point and the bounding box of each target. This approach offers higher detection accuracy and faster detection speed on the DOTA v1.0 remote sensing dataset. By combining the position information of the center point with anchor-based candidate box generation, EVC better captures the location and orientation of targets in remote sensing images, improving detection accuracy. EVC also completes detection in a shorter time, which is important for lightweight network models that must run efficiently.
EVC consists of two parallel blocks. A lightweight MLP captures the global information of the top-level features, while a learnable visual center mechanism aggregates local features within the same region of the layer to preserve local information. The lightweight MLP mainly comprises depthwise convolution-based modules and channel MLP-based blocks. The network structure is shown in Figure 6.
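The following simplified PyTorch sketch mirrors this two-branch layout: a depthwise-convolution/channel-MLP branch for global information and a learnable codebook acting as the visual center for local aggregation, with the two outputs concatenated. The layer choices and the codebook-based aggregation are illustrative assumptions, not the reference EVC implementation.

```python
import torch
import torch.nn as nn

class EVCSketch(nn.Module):
    """Simplified two-branch sketch of the Explicit Visual Center: a
    lightweight MLP branch plus a learnable visual-center branch whose
    outputs are concatenated along the channel dimension."""
    def __init__(self, channels=256, codewords=16):
        super().__init__()
        # Lightweight MLP branch: depthwise convolution + 1x1 channel MLP.
        self.mlp_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 1),                              # channel MLP
        )
        # Learnable visual center branch: a codebook of learnable "centers".
        self.codebook = nn.Parameter(torch.randn(codewords, channels))
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                     # x: (B, C, H, W) top-level features
        global_feat = self.mlp_branch(x)
        # Soft-assign each spatial location to the learnable centers.
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        sim = torch.softmax(tokens @ self.codebook.t(), -1)   # (B, H*W, K)
        local_feat = (sim @ self.codebook).transpose(1, 2).reshape(B, C, H, W)
        local_feat = self.proj(local_feat)
        return torch.cat([global_feat, local_feat], dim=1)    # (B, 2C, H, W)

x = torch.randn(1, 256, 10, 10)
print(EVCSketch()(x).shape)                   # torch.Size([1, 512, 10, 10])
```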
4.3 YOLO Head
Adding a YOLO head to the YOLO model improves performance and mAP for several reasons (a minimal sketch of a decoupled detection head follows this list):
Enhanced Feature Representation: The YOLO head performs detection on top of the features extracted by the backbone network. With additional convolutional and fully connected layers, it learns and represents object features better, making the model more sensitive to object features and enabling it to localize and classify them more accurately.
Increased Receptive Field: The convolutional layers in the YOLO head enhance the ability to perceive features at different scales. By employing multiple convolutional layers, the model captures semantic information at different levels, leading to better understanding and analysis of objects in the image.
Finer Position Regression: The YOLO head predicts the bounding box positions of objects for localization. Adding a YOLO head provides more regression capabilities, enabling more accurate localization. By refining position regression, the model better captures object boundary details, improving detection accuracy.
Multi-scale Predictions: YOLO models typically predict objects of different scales on different feature layers. By adding a YOLO head, independent predictions can be made on each feature layer, enhancing the ability to detect objects at varying scales and reducing the impact of object scale.
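As a concrete reference point, the following minimal sketch shows a YOLOX-style decoupled head for a single feature level, with separate classification and regression/objectness branches. The channel width and the 15-class setting (the number of DOTA v1.0 categories) are illustrative, not the exact YOLOX-SHE configuration.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Minimal sketch of a YOLOX-style decoupled detection head for one
    feature level: a shared stem followed by separate classification and
    regression/objectness branches."""
    def __init__(self, in_channels=256, num_classes=15, width=128):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_channels, width, 1), nn.SiLU())
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, num_classes, 1),       # per-location class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.box_pred = nn.Conv2d(width, 4, 1)      # box offsets (x, y, w, h)
        self.obj_pred = nn.Conv2d(width, 1, 1)      # objectness score

    def forward(self, x):
        x = self.stem(x)
        cls = self.cls_branch(x)
        reg = self.reg_branch(x)
        return cls, self.box_pred(reg), self.obj_pred(reg)

feat = torch.randn(1, 256, 40, 40)                  # one FPN level (e.g. stride 16)
cls, box, obj = DecoupledHead()(feat)
print(cls.shape, box.shape, obj.shape)
# torch.Size([1, 15, 40, 40]) torch.Size([1, 4, 40, 40]) torch.Size([1, 1, 40, 40])
```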
To improve the detection of small targets in the dataset, this paper follows TPH-YOLOv5 (Zhu et al., 2021) and adds a fourth YOLO head, dedicated to tiny targets, to the three original heads of YOLOX-Tiny. Although the extra prediction head increases the parameter count and computational cost, it enables more effective detection of objects at multiple scales. Figure 7 shows the structure of the YOLO head.
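A minimal sketch of the resulting four-scale prediction is shown below: an extra stride-4 level is added to the usual stride-8/16/32 levels so that tiny objects are predicted on a higher-resolution feature map. The per-level channel counts and the single-convolution prediction layers are placeholders, not the actual YOLOX-SHE heads.

```python
import torch
import torch.nn as nn

# Sketch of extending the three original detection scales with an extra,
# higher-resolution level (stride 4) aimed at tiny objects. Each level gets
# its own lightweight prediction layer; channel sizes are illustrative.
num_classes = 15                       # DOTA v1.0 categories
strides = [4, 8, 16, 32]               # the stride-4 level is the added tiny-object head
channels = [64, 128, 256, 512]

heads = nn.ModuleList(
    [nn.Conv2d(c, num_classes + 5, 1)  # class scores + box (4) + objectness (1)
     for c in channels]
)

# Hypothetical multi-scale neck outputs for a 640x640 input.
feats = [torch.randn(1, c, 640 // s, 640 // s) for c, s in zip(channels, strides)]
for stride, head, feat in zip(strides, heads, feats):
    print(f"stride {stride:>2}: {tuple(head(feat).shape)}")
# stride  4: (1, 20, 160, 160)
# stride  8: (1, 20, 80, 80)
# stride 16: (1, 20, 40, 40)
# stride 32: (1, 20, 20, 20)
```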