In this section, we detail the proposed Transformer-based Dual-Association Feature Enhancement (DAFE) network, illustrated in Fig. 2, and briefly outline its overall pipeline. First, data augmentation is applied to the raw input images. The augmented images are then processed by a parameter-shared backbone network that uses ViT and ResNet50 for feature extraction. Next, a Relation-based Feature Enhancement (RFE) module refines and enhances the key information in the features extracted by ViT. Finally, a Dual-Association Fusion (DAF) module is employed to fully exploit both global and local features.
3.1 Data Augmentation
A central challenge in occluded pedestrian Re-ID is effectively handling and recognizing partially occluded person images. Traditional random and sequential data augmentation methods may be insufficient to address the data imbalance caused by occlusion. We therefore apply three distinct data augmentation processes to each input image, generating multiple augmented views that enhance data diversity during training and improve the model's robustness.
Specifically, for an input image \(I\), we apply basic augmentation, erasure augmentation, and cropping augmentation, producing \({I}_{base}\), \({I}_{erased}\), and \({I}_{cropped}\), respectively. Basic augmentation comprises fundamental image processing operations such as color adjustment and flipping, increasing the model's adaptability to common image variations. Erasure augmentation randomly erases parts of the image, simulating the occlusion of pedestrian body parts and helping the model learn to identify individuals from partially visible features. Cropping augmentation crops the image irregularly, further simulating occlusion, especially when it occurs at the image's edges or in specific areas.
$${I}_{base}=BA\left(I\right),{I}_{erased}=EA\left(I\right),{I}_{cropped}=CA\left(I\right)$$
1
where \(BA\left(I\right)\), \(EA\left(I\right)\), and \(CA\left(I\right)\) denote basic augmentation, erasure augmentation, and cropping augmentation applied to the original image \(I\), respectively.
By employing this data augmentation technique, the model can simultaneously learn to handle various types of image transformations during the training phase, particularly those stemming from occlusion-induced complex scenes. This not only aids the model in better adapting to various occlusion conditions but also effectively enhances its ability to recognize partially occluded pedestrians in practical applications. Additionally, it ensures the diversity and richness of the training data, thereby enhancing the model's generalization ability across different occlusion scenarios.
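As a concrete illustration, the three pipelines of Eq. (1) could be realized with standard torchvision transforms. The sketch below is only one possible instantiation: the image size, flip probability, color-jitter strength, erasing area, and crop scale are illustrative assumptions, not values prescribed above.

```python
import torchvision.transforms as T

# BA: basic augmentation with color adjustment and flipping.
base_aug = T.Compose([
    T.Resize((256, 128)),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
])

# EA: erasure augmentation; randomly erases a region to simulate occlusion.
erase_aug = T.Compose([
    T.Resize((256, 128)),
    T.ToTensor(),
    T.RandomErasing(p=1.0, scale=(0.02, 0.33)),  # RandomErasing operates on tensors
])

# CA: cropping augmentation; crops simulate occlusion at image edges.
crop_aug = T.Compose([
    T.Resize((288, 144)),
    T.RandomCrop((256, 128)),
    T.ToTensor(),
])

# Eq. (1): I_base, I_erased, I_cropped = BA(I), EA(I), CA(I)
# I_base, I_erased, I_cropped = base_aug(img), erase_aug(img), crop_aug(img)
```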
3.2 Feature Extraction Network
The resulting \({I}_{base}\), \({I}_{erased}\), and \({I}_{cropped}\) are then fed into a parameter-shared multi-branch network for further processing. The same ViT and the same ResNet50, with identical weights and structure, are applied to all three branches, which helps the network learn more efficiently and reduces the risk of overfitting.
The convolutional layers in ResNet50 excel at capturing local information but are less effective at establishing global information connections. In contrast, the primary advantage of Transformers in visual tasks is their capability for global relationship modeling and their proficiency in learning long-range dependencies. During training, the features obtained from both networks are weighted and combined. This approach leverages the strengths of Transformers in modeling feature relationships while retaining the intuitive advantages of convolutional networks in feature extraction.
$$\left\{\begin{array}{c}{f}_{g1}^{1}={F}_{1}\left({I}_{base}\right),{f}_{g1}^{2}={F}_{1}\left({I}_{erased}\right),{f}_{g1}^{3}={F}_{1}\left({I}_{cropped}\right)\\ {f}_{g2}^{1}={F}_{2}\left({I}_{base}\right),{f}_{g2}^{2}={F}_{2}\left({I}_{erased}\right),{f}_{g2}^{3}={F}_{2}\left({I}_{cropped}\right)\end{array}\right.$$
2
where \({F}_{1}(\cdot )\) is the feature extractor with ViT as the backbone network, and \({F}_{2}(\cdot )\) is the feature extractor with ResNet50 as the backbone network.
Combining a multi-branch structure with powerful feature extractors not only enhances the model's adaptability and robustness to various occlusion scenarios but also ensures effective learning and integration of information from each enhanced image.
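A minimal sketch of the parameter-shared dual-backbone extractor of Eq. (2) is given below. It assumes the timm library for the ViT backbone; the specific model variant ("vit_base_patch16_224") and the input resolution it requires are illustrative assumptions rather than prescribed choices.

```python
import torch.nn as nn
import torchvision.models as models
import timm  # assumed dependency providing pretrained ViT models

class DualBackbone(nn.Module):
    """Applies the same F1 (ViT) and F2 (ResNet50) to every augmented view,
    so weights are shared across the three branches (Eq. 2)."""
    def __init__(self):
        super().__init__()
        # F1: ViT backbone; num_classes=0 makes timm return pooled features.
        # Note: this variant expects 224x224 inputs, so views must be resized.
        self.f1 = timm.create_model("vit_base_patch16_224",
                                    pretrained=True, num_classes=0)
        # F2: ResNet50 backbone with the classification head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.f2 = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, views):
        # views: [I_base, I_erased, I_cropped], each a batch of images
        f_g1 = [self.f1(v) for v in views]             # f_g1^1 .. f_g1^3
        f_g2 = [self.f2(v).flatten(1) for v in views]  # f_g2^1 .. f_g2^3
        return f_g1, f_g2
```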
3.3 Relation-based Feature Enhancement Module (RFE)
To enable the model to better adapt to different data and feature distributions, and to capture key information more accurately, we propose the RFE module, whose schematic is shown in Fig. 3. The module adopts a dual-branch structure. In each branch, the input feature map is first pooled over the spatial dimensions by a global average pooling layer, yielding a global descriptor for each channel. A first fully connected layer then reduces the dimensionality of this descriptor, lowering the computational and parameter cost while still capturing the important features, and a ReLU activation introduces non-linearity so that the model can learn more complex features. A second fully connected layer restores the reduced descriptor to the original number of channels, generating a weight coefficient for each channel, which a Sigmoid function normalizes to the range 0 to 1. These coefficients are multiplied with the original features, strengthening the network's focus on important features. In contrast to the upper branch, the fully connected layers of the lower branch (NL_2) use bias terms, enabling more precise adjustment of the per-channel weights.

The outputs of the two branches, each the product of its channel weights and the original features, are then summed, as in Eq. (3). This enhances the model's representation of the input data and facilitates more effective extraction and utilization of information in subsequent layers. Through the RFE module, the model adapts better to different data and feature distributions, captures key information more reliably, and thereby improves overall performance.
$${f}_{g1}^{i{\prime }{\prime }}=S\left(NL\_1\left(R\left(NL\_1\left(Avg\left(re\left({f}_{g1}^{i{\prime }}\right)\right)\right)\right)\right)\right){\bullet }{f}_{g1}^{i{\prime }}+S\left(NL\_2\left(R\left(NL\_2\left(Avg\left(re\left({f}_{g1}^{i{\prime }}\right)\right)\right)\right)\right)\right){\bullet }{f}_{g1}^{i{\prime }}$$
3
where \(i=1,2,3\); \(re\) denotes the transformation of the input feature vector into a feature tensor; \(Avg\) denotes adaptive average pooling; \(R\) denotes the ReLU activation function; and \(S\) denotes the Sigmoid activation function.
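A minimal sketch of the RFE module following Eq. (3) is shown below. The channel-reduction ratio r=16 is an assumed value, and the reshaping step \(re\) is left to the caller.

```python
import torch.nn as nn

class RFE(nn.Module):
    """Dual-branch channel re-weighting per Eq. (3); r is an assumed ratio."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)  # Avg: global average pooling
        # Upper branch NL_1: two fully connected layers without bias
        self.nl1 = nn.Sequential(
            nn.Linear(channels, channels // r, bias=False),
            nn.ReLU(inplace=True),  # R
            nn.Linear(channels // r, channels, bias=False),
        )
        # Lower branch NL_2: same structure, but with bias terms
        self.nl2 = nn.Sequential(
            nn.Linear(channels, channels // r, bias=True),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels, bias=True),
        )
        self.sigmoid = nn.Sigmoid()  # S

    def forward(self, f):
        # f: (B, C, H, W) feature tensor; re() is applied by the caller
        b, c, _, _ = f.shape
        s = self.avg(f).view(b, c)
        w1 = self.sigmoid(self.nl1(s)).view(b, c, 1, 1)
        w2 = self.sigmoid(self.nl2(s)).view(b, c, 1, 1)
        return w1 * f + w2 * f  # sum of the two weighted branches
```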
3.4 Dual-Association Fusion Module (DAF)
To improve pedestrian Re-ID performance, current methods tend to extract global and local features simultaneously and optimize them jointly. Global features capture the overall appearance of a person, while local features focus on finer details. However, these methods often overlook the contextual relationships between features, so some detailed information becomes inaccurate or irrelevant from a global perspective. To address this issue, the DAF module is proposed, as shown in Fig. 4, to facilitate interaction and information exchange between global and local features. The module considers not only the features themselves but also the potential relationships between different features, allowing local details to be captured and described more precisely while preserving global information. Through this interactive approach, DAF enhances global and local features in a targeted manner, so that both can be utilized more effectively.
The global feature \({f}_{g}^{1{\prime }}\) and local features \({f}_{l}^{1}\), \({f}_{l}^{2}\), \({f}_{l}^{3}\), and \({f}_{l}^{4}\), processed by the baseline and RFE modules, undergo incremental refinement in the DAF module and are ultimately merged into an enhanced global feature. Specifically, each local feature is first associated with the global feature through LGAF for enhancement and then combined with the original local feature to obtain \({f}_{l}^{i{\prime }}\):
$${f}_{l}^{i{\prime }}=LGAF({f}_{g}^{1{\prime }},{f}_{l}^{i})+{f}_{l}^{i}$$
4
On this basis, the global feature is further enhanced: it is first fused with the initial refined local feature and then updated iteratively with each remaining one.
$$\left\{\begin{array}{c}{f}_{g}^{1{\prime }{\prime }}=LGAF({f}_{g}^{1{\prime }},{f}_{l}^{1{\prime }})\\ {f}_{g}^{1{\prime }{\prime }}=LGAF({f}_{g}^{1{\prime }{\prime }},{f}_{l}^{i{\prime }}),\quad i=2,3,4\end{array}\right.$$
5
This DAF enhances the correlation between global and local features, enriching the final output of global features with more comprehensive contextual information and detailed features. This, in turn, improves the model's accuracy in identifying pedestrian identities.
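The procedure of Eqs. (4) and (5) can be summarized in the following sketch, which assumes the LGAF component described next and reads Eq. (5) as an iterative update over the four refined local features; this reading is our interpretation of the two-line recurrence.

```python
def daf(lgaf, f_g, f_locals):
    """Sketch of the DAF procedure.
    f_g: the RFE-enhanced global feature f_g^1'
    f_locals: the local features f_l^1 .. f_l^4
    lgaf: an LGAF component (see the sketch after Eq. 6)."""
    # Eq. (4): refine each local feature by associating it with the global one
    refined = [lgaf(f_g, f_l) + f_l for f_l in f_locals]
    # Eq. (5): fold the refined locals back into the global feature
    f_g_out = lgaf(f_g, refined[0])
    for f_l in refined[1:]:
        f_g_out = lgaf(f_g_out, f_l)
    return f_g_out
```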
The structure of the Local-Global Association Fusion component (LGAF) is shown in Fig. 5. Its purpose is to combine local and global features to strengthen the final feature representation. Specifically, the input features are transformed by three different fully connected layers (Fc_query \(Fq\), Fc_part \(Fp\), and Fc_value \(Fv\)), and a dot-product operation between the part and value projections generates an interactive feature representation. A Sigmoid activation then converts this interaction into attention scores, which weight the query features \({f}_{lq}\), emphasizing the features relevant to the query. Finally, the weighted features are fused with the original query and global features to produce the final enhanced feature.
$$\left\{\begin{array}{c}{f}_{lq}=Fq\left({f}_{l}\right);{f}_{lp}=Fp\left({f}_{l}\right);{f}_{gv}=Fv\left({f}_{g}\right)\\ {f}_{g}^{{\prime }}=S({f}_{lp}\otimes {f}_{gv})\otimes {f}_{lq}\oplus {f}_{lq}\oplus {f}_{g}\end{array}\right.$$
6
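A minimal sketch of LGAF following Eq. (6) is given below, assuming one-dimensional feature vectors of a shared dimension d and interpreting \(\otimes \) and \(\oplus \) as element-wise multiplication and addition; both the shapes and the shared dimension are assumptions for illustration.

```python
import torch.nn as nn

class LGAF(nn.Module):
    """Local-Global Association Fusion per Eq. (6); d is an assumed dimension."""
    def __init__(self, d):
        super().__init__()
        self.fc_query = nn.Linear(d, d)  # Fq
        self.fc_part = nn.Linear(d, d)   # Fp
        self.fc_value = nn.Linear(d, d)  # Fv
        self.sigmoid = nn.Sigmoid()      # S

    def forward(self, f_g, f_l):
        f_lq = self.fc_query(f_l)         # query from the local feature
        f_lp = self.fc_part(f_l)          # part projection of the local feature
        f_gv = self.fc_value(f_g)         # value from the global feature
        attn = self.sigmoid(f_lp * f_gv)  # interaction -> attention scores
        return attn * f_lq + f_lq + f_g   # Eq. (6): weighted fusion + residuals
```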
3.5 Loss Function
The cross-entropy loss measures the difference between predicted results and true labels, while the triplet loss optimizes the feature space, pulling features of the same identity closer together and pushing features of different identities farther apart. We therefore train our model with the widely used cross-entropy loss \({L}_{id}\) and triplet loss \({L}_{tri}\).
All global and local features are optimized under the constraints of \({L}_{id}\) and \({L}_{tri}\). Global features are derived from the entire image, while local features are extracted from local regions, ensuring that the model captures useful information at different scales. The final loss function can be represented as:
$$L={\sum }_{i=1}^{3}{L}_{id}({p}_{g}^{i},y)+{\sum }_{i=1}^{3}{L}_{tri}\left({f}_{g}^{i}\right)+{\sum }_{j=1}^{4}{L}_{id}({p}_{l}^{j},y)+{\sum }_{j=1}^{4}{L}_{tri}\left({f}_{l}^{j}\right)$$
7
where \(p\) represents the predicted result, and \(y\) represents the true label.
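A sketch of Eq. (7) follows, assuming a standard batch-hard triplet loss passed in as a callable (labels are needed for triplet mining, although Eq. (7) omits them) and PyTorch cross-entropy for \({L}_{id}\); the margin value mentioned in the comment is a common choice, not one specified here.

```python
import torch.nn.functional as F

def total_loss(p_g, f_g, p_l, f_l, y, triplet):
    """p_g/p_l: lists of identity logits for the global/local branches;
    f_g/f_l: the corresponding feature lists; y: identity labels;
    triplet: a batch-hard triplet loss, e.g. with margin 0.3 (assumed)."""
    loss = sum(F.cross_entropy(p, y) for p in p_g)   # sum_i L_id(p_g^i, y)
    loss += sum(triplet(f, y) for f in f_g)          # sum_i L_tri(f_g^i)
    loss += sum(F.cross_entropy(p, y) for p in p_l)  # sum_j L_id(p_l^j, y)
    loss += sum(triplet(f, y) for f in f_l)          # sum_j L_tri(f_l^j)
    return loss
```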