The images of this dataset are already cropped around the face, so no face-detection stage is needed to localize the face in each image. However, we need to correct the rotation of the face so that the masked region can be removed efficiently. To do so, we detect 68 facial landmarks using the Dlib-ml open-source library introduced in [8]. Based on the eye locations, we apply a 2D rotation that brings the eyes onto a horizontal line, as presented in Figure 1.
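A minimal sketch of this alignment step, assuming dlib's standard 68-point predictor file shape_predictor_68_face_landmarks.dat is available (the file path and the use of the eyes' midpoint as rotation center are our assumptions):

import cv2
import dlib
import numpy as np

# 68-point landmark model shipped with dlib (path is an assumption).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(image):
    """Rotate a pre-cropped face so that the two eyes lie on a horizontal line."""
    # The dataset images are already cropped around the face, so the whole
    # frame serves as the detection rectangle; no face detector is needed.
    rect = dlib.rectangle(0, 0, image.shape[1], image.shape[0])
    pts = np.array([[p.x, p.y] for p in predictor(image, rect).parts()])
    # In the 68-point scheme, points 36-41 and 42-47 outline the two eyes.
    eye_l = pts[36:42].mean(axis=0)   # eye on the left side of the image
    eye_r = pts[42:48].mean(axis=0)   # eye on the right side of the image
    angle = np.degrees(np.arctan2(eye_r[1] - eye_l[1], eye_r[0] - eye_l[0]))
    center = ((eye_l[0] + eye_r[0]) / 2.0, (eye_l[1] + eye_r[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    return cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))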
The next step applies a cropping filter in order to extract only the non-masked region. To do so, we first normalize all face images to 240 x 240 pixels. Next, we partition each image into blocks. The principle of this technique is to divide the image into 100 fixed-size square blocks (24 x 24 pixels in our case). We then keep only the blocks covering the non-masked region (blocks 1 to 50) and discard the remaining blocks, as presented in Figure 2.
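A short sketch of the block partition, assuming the blocks are numbered row-major from the top-left corner, so that blocks 1-50 cover the upper (eye and forehead) half and the masked lower half is discarded:

import cv2

def extract_unmasked_blocks(face, size=240, block=24):
    """Keep only the top-half blocks (1-50), i.e. the non-masked region."""
    face = cv2.resize(face, (size, size))           # normalize to 240 x 240
    n = size // block                               # 10 blocks per row/column
    blocks = [face[r*block:(r+1)*block, c*block:(c+1)*block]
              for r in range(n) for c in range(n)]  # row-major numbering 1..100
    return blocks[: n * n // 2]                     # blocks 1..50 = upper half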
4.2 Feature extraction layer
We extract deep features from the 2D images using the VGG-16 face CNN descriptor [20]. It is trained on the ImageNet dataset, which contains over 14 million images in 1000 classes. The name VGG-16 comes from its 16 weight layers: 13 convolutional layers and 3 fully connected (dense) layers, interleaved with 5 max-pooling layers and activation layers, which sums to 21 layers overall but only 16 with learnable weights. Figure 4 presents the VGG-16 architecture. In this work, we only consider the feature maps (FMs) at the last convolutional layer, also called channels. These features are then used in the quantization stage.
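A minimal sketch of this extraction with Keras, using the stock ImageNet weights as a stand-in for the VGG-Face weights of [20] (the layer name block5_conv3 is Keras's identifier for VGG-16's last convolutional layer; the spatial size of the resulting feature maps depends on the input resolution):

import numpy as np
import tensorflow as tf

# VGG-16 truncated at its last convolutional layer.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("block5_conv3").output)

def feature_vectors(image):
    """Return one 512-d feature vector per spatial position of the FMs."""
    x = tf.keras.applications.vgg16.preprocess_input(
        np.expand_dims(image.astype("float32"), 0))
    fmap = extractor(x).numpy()[0]           # shape (H, W, 512)
    return fmap.reshape(-1, fmap.shape[-1])  # H*W feature vectors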
From the i-th image, we extract feature maps using the feature extraction layer described above. To measure the similarity between the extracted feature vectors and the codewords, also called term vectors, we apply the RBF kernel as the similarity metric, as proposed in [17]. Thus, the first sublayer is composed of RBF neurons, each corresponding to a codeword.
As presented in Figure 3, the size of the extracted feature map determines the number of feature vectors fed to the BoF layer. We denote by $V_i$ the number of feature vectors extracted from the $i$-th image. For example, 10 x 10 feature maps from the last convolutional layer of the VGG-16 model yield 100 feature vectors for the quantization step of the BoF paradigm. To build the codebook, the RBF neurons can be initialized manually or automatically from all the feature vectors extracted over the whole dataset; the most commonly used automatic algorithm is k-means. Let $F = \{V_{ij},\ i = 1, \dots, N,\ j = 1, \dots, V_i\}$ denote the set of all feature vectors, where $N$ is the number of images, and let $K$ be the number of RBF neuron centers, denoted $c_k$. Note that these RBF centers are learned afterward to obtain the final codewords.
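A sketch of this k-means initialization of the codebook (the codebook size of 64 is a placeholder assumption, not a value from this work):

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_feature_vectors, n_codewords=64):
    """Initialize the RBF centers c_k by k-means over all feature vectors F."""
    F = np.vstack(all_feature_vectors)  # stack the V_i vectors of every image
    km = KMeans(n_clusters=n_codewords, n_init=10, random_state=0).fit(F)
    return km.cluster_centers_          # initial codewords, refined by training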
Quantization is then applied to extract a histogram with a predefined number of bins, each bin corresponding to a codeword. The RBF layer thus serves as a similarity measure and contains two sublayers (both sketched in code after the list):
(I) RBF layer: measures the similarity between the input features of the probe faces and the RBF centers.
Formally, the response of the $j$-th RBF neuron to a feature vector $x$ is defined by:
$\varphi_j(x) = \exp\left(-\lVert x - c_j \rVert_2 / \sigma_j\right)$, (1)
where $c_j$ is the center of the $j$-th RBF neuron and $\sigma_j$ its width.
(II) Quantization layer: the outputs of all the RBF neurons are accumulated in this layer, which produces the histogram of the globally quantized feature vector used for the classification process. The final histogram is defined by:
$h = \frac{1}{V_i} \sum_{j=1}^{V_i} \varphi(V_j)$, (2)
where $\varphi(V_j)$ is the output vector of the RBF layer over the $K$ bins for the $j$-th feature vector of the image.
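A minimal NumPy sketch of both sublayers, assuming centers is the (K, 512) codeword matrix from the k-means step above and sigmas a length-K vector of kernel widths; the averaging in bof_histogram matches Eq. (2):

import numpy as np

def rbf_layer(x, centers, sigmas):
    """Eq. (1): phi_j(x) = exp(-||x - c_j||_2 / sigma_j) for each codeword j."""
    dists = np.linalg.norm(x[None, :] - centers, axis=1)  # ||x - c_j||_2
    return np.exp(-dists / sigmas)

def bof_histogram(feature_vectors, centers, sigmas):
    """Eq. (2): average the RBF responses of the V_i vectors into K bins."""
    responses = np.array([rbf_layer(v, centers, sigmas)
                          for v in feature_vectors])      # shape (V_i, K)
    return responses.mean(axis=0)                         # the term vector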
Once the global histogram is computed, we move to the classification stage to assign each test image to its identity. To do so, we apply a multilayer perceptron (MLP) classifier, where each face is represented by its term vector. The deep BoF network can be trained using back-propagation and gradient descent. Note that 10-fold cross validation is applied in our experiments on the RMFRD dataset. We denote by $V = [v_1, \dots, v_K]$ the term vector of each face, where $v_i$ is the occurrence of term $i$ in the given face, $t$ is the number of attributes, and $m$ is the number of classes (face identities). Test faces are defined by their codeword $V$. The MLP uses the term occurrences $v_i$ as input values with associated weights $w_i$, and a sigmoid function $g$ that sums the weighted inputs and maps the result to an output $y$. Note that the number of hidden layers used in our experiments is given by $(m + t)/2$.
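A sketch of this classification stage with scikit-learn, where the $(m + t)/2$ heuristic is read as the width of a single hidden layer (the usual interpretation of this rule of thumb; the max_iter value is an assumption):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def evaluate(histograms, labels):
    """10-fold cross validation of an MLP over the BoF term vectors."""
    t = histograms.shape[1]               # number of attributes (bins)
    m = len(np.unique(labels))            # number of identities
    clf = MLPClassifier(hidden_layer_sizes=((m + t) // 2,),
                        activation="logistic",  # sigmoid, as in the text
                        max_iter=2000)
    return cross_val_score(clf, histograms, labels, cv=10)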