In this section, we detail the proposed Transformer-based Dual-Association Feature Enhancement (DAFE) network, illustrated in Fig. 2, and briefly outline its overall pipeline. First, data augmentation is applied to the raw input images. The augmented images are then processed by a parameter-shared backbone network that uses ViT and ResNet50 for feature extraction. Next, a Relation-based Feature Enhancement (RFE) module refines and enhances the key information in the features extracted by ViT. Finally, a Dual-Association Fusion (DAF) module is employed to fully exploit both global and local features.
3.1 Data Augmentation
A central challenge in occluded pedestrian Re-ID is effectively handling and recognizing partially occluded person images. Traditional random and sequential data augmentation methods may be insufficient to address the data imbalance caused by occlusion. We therefore apply three distinct data augmentation processes to each input image, generating multiple augmented views that enhance data diversity during training and improve the model's robustness.
Specifically, for an input image \(I\), we apply basic augmentation, erasure augmentation, and cropping augmentation, producing \({I}_{base}\), \({I}_{erased}\), and \({I}_{cropped}\), respectively. Basic augmentation comprises fundamental image processing operations such as color adjustment and flipping, increasing the model's adaptability to common image variations. Erasure augmentation randomly erases parts of the image, simulating the occlusion of pedestrian body parts and helping the model learn to identify individuals from partially visible features. Cropping augmentation crops the image irregularly, further simulating occlusion, especially when it occurs at the image's edges or in specific areas.
$${I}_{base}=BA\left(I\right),{I}_{erased}=EA\left(I\right),{I}_{cropped}=CA\left(I\right)$$
1
where \(BA\left(I\right)\), \(EA\left(I\right)\), and \(CA\left(I\right)\) denote basic augmentation, erasure augmentation, and cropping augmentation applied to the original image \(I\), respectively.
By employing this data augmentation technique, the model can simultaneously learn to handle various types of image transformations during the training phase, particularly those stemming from occlusion-induced complex scenes. This not only aids the model in better adapting to various occlusion conditions but also effectively enhances its ability to recognize partially occluded pedestrians in practical applications. Additionally, it ensures the diversity and richness of the training data, thereby enhancing the model's generalization ability across different occlusion scenarios.
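As a concrete illustration, the three pipelines of Eq. (1) could be realized with standard torchvision transforms. The sketch below is only one possible instantiation: the image size, flip probability, color-jitter strength, erasing area, and crop scale are illustrative assumptions, not values prescribed above.

```python
import torchvision.transforms as T

# BA: basic augmentation with color adjustment and flipping.
base_aug = T.Compose([
    T.Resize((256, 128)),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
])

# EA: erasure augmentation; randomly erases a region to simulate occlusion.
erase_aug = T.Compose([
    T.Resize((256, 128)),
    T.ToTensor(),
    T.RandomErasing(p=1.0, scale=(0.02, 0.33)),  # RandomErasing operates on tensors
])

# CA: cropping augmentation; crops simulate occlusion at image edges.
crop_aug = T.Compose([
    T.Resize((288, 144)),
    T.RandomCrop((256, 128)),
    T.ToTensor(),
])

# Eq. (1): I_base, I_erased, I_cropped = BA(I), EA(I), CA(I)
# I_base, I_erased, I_cropped = base_aug(img), erase_aug(img), crop_aug(img)
```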
3.2 Feature Extraction Network
The resulting \({I}_{base}\), \({I}_{erased}\), and \({I}_{cropped}\) are then fed into a parameter-shared multi-branch network for further processing. The same ViT and the same ResNet50, with identical weights and structure, are applied to all three branches, which helps the network learn more efficiently and reduces the risk of overfitting.
The convolutional layers in ResNet50 excel at capturing local information but are less effective at establishing global information connections. In contrast, the primary advantage of Transformers in visual tasks is their capability for global relationship modeling and their proficiency in learning long-range dependencies. During training, the features obtained from both networks are weighted and combined. This approach leverages the strengths of Transformers in modeling feature relationships while retaining the intuitive advantages of convolutional networks in feature extraction.
$$\left\{\begin{array}{c}{f}_{g1}^{1}={F}_{1}\left({I}_{base}\right),{f}_{g1}^{2}={F}_{1}\left({I}_{erased}\right),{f}_{g1}^{3}={F}_{1}\left({I}_{cropped}\right)\\ {f}_{g2}^{1}={F}_{2}\left({I}_{base}\right),{f}_{g2}^{2}={F}_{2}\left({I}_{erased}\right),{f}_{g2}^{3}={F}_{2}\left({I}_{cropped}\right)\end{array}\right.$$
2
where \({F}_{1}(\cdot )\) is the feature extractor with ViT as the backbone network, and \({F}_{2}(\cdot )\) is the feature extractor with ResNet50 as the backbone network.
Combining a multi-branch structure with powerful feature extractors not only enhances the model's adaptability and robustness to various occlusion scenarios but also ensures effective learning and integration of information from each enhanced image.
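A minimal sketch of the parameter-shared dual-backbone extractor of Eq. (2) is given below. It assumes the timm library for the ViT backbone; the specific model variant ("vit_base_patch16_224") and the input resolution it requires are illustrative assumptions rather than prescribed choices.

```python
import torch.nn as nn
import torchvision.models as models
import timm  # assumed dependency providing pretrained ViT models

class DualBackbone(nn.Module):
    """Applies the same F1 (ViT) and F2 (ResNet50) to every augmented view,
    so weights are shared across the three branches (Eq. 2)."""
    def __init__(self):
        super().__init__()
        # F1: ViT backbone; num_classes=0 makes timm return pooled features.
        # Note: this variant expects 224x224 inputs, so views must be resized.
        self.f1 = timm.create_model("vit_base_patch16_224",
                                    pretrained=True, num_classes=0)
        # F2: ResNet50 backbone with the classification head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.f2 = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, views):
        # views: [I_base, I_erased, I_cropped], each a batch of images
        f_g1 = [self.f1(v) for v in views]             # f_g1^1 .. f_g1^3
        f_g2 = [self.f2(v).flatten(1) for v in views]  # f_g2^1 .. f_g2^3
        return f_g1, f_g2
```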
3.3 Relation-based Feature Enhancement Module (RFE)
To enable the model to better adapt to different data and feature distributions, and to capture key information more accurately, we propose the RFE module, whose schematic is shown in Fig. 3. The module adopts a dual-branch structure. In each branch, the input feature map is first pooled over the spatial dimensions by a global average pooling layer, yielding a global descriptor for each channel. A first fully connected layer then reduces the dimensionality of this descriptor, lowering the computational and parameter cost while still capturing the important features, and a ReLU activation introduces non-linearity so that the model can learn more complex features. A second fully connected layer restores the reduced descriptor to the original number of channels, generating a weight coefficient for each channel, which a Sigmoid function normalizes to the range 0 to 1. These coefficients are multiplied with the original features, strengthening the network's focus on important features. In contrast to the upper branch, the fully connected layers of the lower branch (NL_2) use bias terms, enabling more precise adjustment of the per-channel weights.

The outputs of the two branches, each the product of its channel weights and the original features, are then summed, as in Eq. (3). This enhances the model's representation of the input data and facilitates more effective extraction and utilization of information in subsequent layers. Through the RFE module, the model adapts better to different data and feature distributions, captures key information more reliably, and thereby improves overall performance.
$${f}_{g1}^{i{\prime }{\prime }}=S\left(NL\_1\left(R\left(NL\_1\left(Avg\left(re\left({f}_{g1}^{i{\prime }}\right)\right)\right)\right)\right)\right){\bullet }{f}_{g1}^{i{\prime }}+S\left(NL\_2\left(R\left(NL\_2\left(Avg\left(re\left({f}_{g1}^{i{\prime }}\right)\right)\right)\right)\right)\right){\bullet }{f}_{g1}^{i{\prime }}$$
3
where \(i=1,2,3\); \(re\) denotes the transformation of the input feature vector into a feature tensor; \(Avg\) denotes adaptive average pooling; \(R\) denotes the ReLU activation function; and \(S\) denotes the Sigmoid activation function.
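A minimal sketch of the RFE module following Eq. (3) is shown below. The channel-reduction ratio r=16 is an assumed value, and the reshaping step \(re\) is left to the caller.

```python
import torch.nn as nn

class RFE(nn.Module):
    """Dual-branch channel re-weighting per Eq. (3); r is an assumed ratio."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)  # Avg: global average pooling
        # Upper branch NL_1: two fully connected layers without bias
        self.nl1 = nn.Sequential(
            nn.Linear(channels, channels // r, bias=False),
            nn.ReLU(inplace=True),  # R
            nn.Linear(channels // r, channels, bias=False),
        )
        # Lower branch NL_2: same structure, but with bias terms
        self.nl2 = nn.Sequential(
            nn.Linear(channels, channels // r, bias=True),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels, bias=True),
        )
        self.sigmoid = nn.Sigmoid()  # S

    def forward(self, f):
        # f: (B, C, H, W) feature tensor; re() is applied by the caller
        b, c, _, _ = f.shape
        s = self.avg(f).view(b, c)
        w1 = self.sigmoid(self.nl1(s)).view(b, c, 1, 1)
        w2 = self.sigmoid(self.nl2(s)).view(b, c, 1, 1)
        return w1 * f + w2 * f  # sum of the two weighted branches
```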
3.4 Dual-Association Fusion Module (DAF)
To improve pedestrian Re-ID performance, current methods tend to extract global and local features simultaneously and optimize them jointly. Global features capture the overall appearance of a person, while local features focus on finer details. However, these methods often overlook the contextual relationships between features, so some detailed information becomes inaccurate or irrelevant from a global perspective. To address this issue, the DAF module is proposed, as shown in Fig. 4, to facilitate interaction and information exchange between global and local features. The module considers not only the features themselves but also the potential relationships between different features, allowing local details to be captured and described more precisely while preserving global information. Through this interactive approach, DAF enhances global and local features in a targeted manner, so that both can be utilized more effectively.
The global feature \({f}_{g}^{1{\prime }}\) and local features \({f}_{l}^{1}\), \({f}_{l}^{2}\), \({f}_{l}^{3}\), and \({f}_{l}^{4}\), processed by the baseline and RFE modules, undergo incremental refinement in the DAF module and are ultimately merged into an enhanced global feature. Specifically, each local feature is first associated with the global feature through LGAF for enhancement and then combined with the original local feature to obtain \({f}_{l}^{i{\prime }}\):
$${f}_{l}^{i{\prime }}=LGAF({f}_{g}^{1{\prime }},{f}_{l}^{i})+{f}_{l}^{i}$$
4
On this basis, the global feature is further enhanced: it is first fused with the initial refined local feature and then updated iteratively with each remaining one.
$$\left\{\begin{array}{c}{f}_{g}^{1{\prime }{\prime }}=LGAF({f}_{g}^{1{\prime }},{f}_{l}^{1{\prime }})\\ {f}_{g}^{1{\prime }{\prime }}=LGAF({f}_{g}^{1{\prime }{\prime }},{f}_{l}^{i{\prime }}),\quad i=2,3,4\end{array}\right.$$
5
This DAF enhances the correlation between global and local features, enriching the final output of global features with more comprehensive contextual information and detailed features. This, in turn, improves the model's accuracy in identifying pedestrian identities.
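The procedure of Eqs. (4) and (5) can be summarized in the following sketch, which assumes the LGAF component described next and reads Eq. (5) as an iterative update over the four refined local features; this reading is our interpretation of the two-line recurrence.

```python
def daf(lgaf, f_g, f_locals):
    """Sketch of the DAF procedure.
    f_g: the RFE-enhanced global feature f_g^1'
    f_locals: the local features f_l^1 .. f_l^4
    lgaf: an LGAF component (see the sketch after Eq. 6)."""
    # Eq. (4): refine each local feature by associating it with the global one
    refined = [lgaf(f_g, f_l) + f_l for f_l in f_locals]
    # Eq. (5): fold the refined locals back into the global feature
    f_g_out = lgaf(f_g, refined[0])
    for f_l in refined[1:]:
        f_g_out = lgaf(f_g_out, f_l)
    return f_g_out
```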
The structure of the Local-Global Association Fusion component (LGAF) is shown in Fig. 5. Its purpose is to combine local and global features to strengthen the final feature representation. Specifically, the input features are transformed by three different fully connected layers (Fc_query \(Fq\), Fc_part \(Fp\), and Fc_value \(Fv\)), and a dot-product operation between the part and value projections generates an interactive feature representation. A Sigmoid activation then converts this interaction into attention scores, which weight the query features \({f}_{lq}\), emphasizing the features relevant to the query. Finally, the weighted features are fused with the original query and global features to produce the final enhanced feature.
$$\left\{\begin{array}{c}{f}_{lq}=Fq\left({f}_{l}\right);{f}_{lp}=Fp\left({f}_{l}\right);{f}_{gv}=Fv\left({f}_{g}\right)\\ {f}_{g}^{{\prime }}=S({f}_{lp}\otimes {f}_{gv})\otimes {f}_{lq}\oplus {f}_{lq}\oplus {f}_{g}\end{array}\right.$$
6
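A minimal sketch of LGAF following Eq. (6) is given below, assuming one-dimensional feature vectors of a shared dimension d and interpreting \(\otimes \) and \(\oplus \) as element-wise multiplication and addition; both the shapes and the shared dimension are assumptions for illustration.

```python
import torch.nn as nn

class LGAF(nn.Module):
    """Local-Global Association Fusion per Eq. (6); d is an assumed dimension."""
    def __init__(self, d):
        super().__init__()
        self.fc_query = nn.Linear(d, d)  # Fq
        self.fc_part = nn.Linear(d, d)   # Fp
        self.fc_value = nn.Linear(d, d)  # Fv
        self.sigmoid = nn.Sigmoid()      # S

    def forward(self, f_g, f_l):
        f_lq = self.fc_query(f_l)         # query from the local feature
        f_lp = self.fc_part(f_l)          # part projection of the local feature
        f_gv = self.fc_value(f_g)         # value from the global feature
        attn = self.sigmoid(f_lp * f_gv)  # interaction -> attention scores
        return attn * f_lq + f_lq + f_g   # Eq. (6): weighted fusion + residuals
```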
3.5 Loss Function
The cross-entropy loss measures the difference between predicted results and true labels, while the triplet loss optimizes the feature space, pulling features of the same identity closer together and pushing features of different identities farther apart. We therefore train our model with the widely used cross-entropy loss \({L}_{id}\) and triplet loss \({L}_{tri}\).
All global and local features are optimized under the constraints of \({L}_{id}\) and \({L}_{tri}\). Global features are derived from the entire image, while local features are extracted from local regions, ensuring that the model captures useful information at different scales. The final loss function can be represented as:
$$L={\sum }_{i=1}^{3}{L}_{id}({p}_{g}^{i},y)+{\sum }_{i=1}^{3}{L}_{tri}\left({f}_{g}^{i}\right)+{\sum }_{j=1}^{4}{L}_{id}({p}_{l}^{j},y)+{\sum }_{j=1}^{4}{L}_{tri}\left({f}_{l}^{j}\right)$$
7
where \(p\) represents the predicted result, and \(y\) represents the true label.
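A sketch of Eq. (7) follows, assuming a standard batch-hard triplet loss passed in as a callable (labels are needed for triplet mining, although Eq. (7) omits them) and PyTorch cross-entropy for \({L}_{id}\); the margin value mentioned in the comment is a common choice, not one specified here.

```python
import torch.nn.functional as F

def total_loss(p_g, f_g, p_l, f_l, y, triplet):
    """p_g/p_l: lists of identity logits for the global/local branches;
    f_g/f_l: the corresponding feature lists; y: identity labels;
    triplet: a batch-hard triplet loss, e.g. with margin 0.3 (assumed)."""
    loss = sum(F.cross_entropy(p, y) for p in p_g)   # sum_i L_id(p_g^i, y)
    loss += sum(triplet(f, y) for f in f_g)          # sum_i L_tri(f_g^i)
    loss += sum(F.cross_entropy(p, y) for p in p_l)  # sum_j L_id(p_l^j, y)
    loss += sum(triplet(f, y) for f in f_l)          # sum_j L_tri(f_l^j)
    return loss
```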