MFC-NAS comprises three main stages: defining the search space, searching for optimal cells, and stacking the cells, followed by testing (as depicted in Fig. 1). In the search-space definition stage, four cell types are designed: the transfer cell, normal cell, pooling cell, and dropout cell. The transfer cell treats shallow blocks from multiple pre-trained models as searchable weight-sharing layers; by automatically selecting shallow blocks suited to different network structures, it strengthens the model's ability to represent general features and improves its generalization capability. The normal cell is a multi-branch structure composed of multiple block units, each comprising input-output nodes, candidate operations, and connection methods, whose outputs are merged through concatenation. The pooling cell is a single-branch structure formed by a single block unit containing input-output nodes and candidate operations, with various pooling operations as search candidates. The dropout cell takes multiple dropout rates as search candidates, automatically determining a suitable dropout rate to enhance the model's generalization capability.
During the optimal-cell search stage, the search parameters, such as the search strategy, search period, and controller parameters, are first initialized. A recurrent neural network (RNN)[25] then serves as the controller, generating sampling probabilities for the candidate options of each cell, while policy-gradient reinforcement learning is used to update the controller's weights. Suitable candidates are selected from the search space through probabilistic sampling, and periodic training of the controller yields the optimal structures for the transfer cell, normal cell, pooling cell, and dropout cell.
During the cell stacking and testing stage, the optimal transfer cell, normal cell, pooling cell, and dropout cell, derived from the search space, are combined to form the MFC-NAS model. Finally, the model undergoes training and testing, which includes comparison and ablation experiments, to validate this approach.
3.1 Transfer cell
Inspired by the observation that the shallow layers of CNNs are more adept at representing general features[17], we incorporate different shallow layers from various pre-trained models into the weight transfer module, named transfer cell (T-cell), within the search space. This inclusion allows for the automatic selection of shallow layers from different levels of various models, facilitating the sharing of pre-trained general feature information and accelerating model convergence while enhancing generalization capabilities.
We selected five pre-trained candidate models: MobileNet V3-Large, EfficientNet, ResNet-50, Xception, and Inception-v3. As shown in Table 1, each candidate model provides three searchable layers. For instance, the operation code 2,1 denotes ResNet-50 as the candidate model, with searchable layer 2 selected. To thoroughly explore the different shallow layers of the candidate models, the searchable layers are positioned at the end of different shared layers (Si) within the candidate models. As illustrated in Fig. 2, these shared layers comprise various shallow layers, and each shared layer (Si) contains multiple searchable layers (Lj).
Table 1
Candidates for the T-cell search space
Candidate model | OP code (searchable layer 1) | OP code (searchable layer 2) | OP code (searchable layer 3) |
MobileNet V3-Large | 0,0 | 0,1 | 0,2 |
EfficientNet | 1,0 | 1,1 | 1,2 |
ResNet-50 | 2,0 | 2,1 | 2,2 |
Xception | 3,0 | 3,1 | 3,2 |
Inception-v3 | 4,0 | 4,1 | 4,2 |
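For concreteness, decoding a T-cell operation code into a truncated pre-trained backbone can be sketched as follows, assuming a tf.keras implementation. The cut-point indices and the frozen-backbone setting are illustrative assumptions and do not reproduce the exact searchable-layer boundaries of Tables 2-6.

```python
import tensorflow as tf

# Candidate pre-trained models indexed as in Table 1.
CANDIDATE_MODELS = {
    0: tf.keras.applications.MobileNetV3Large,
    1: tf.keras.applications.EfficientNetB0,
    2: tf.keras.applications.ResNet50,
    3: tf.keras.applications.Xception,
    4: tf.keras.applications.InceptionV3,
}

def build_t_cell(op_code, input_shape=(224, 224, 3)):
    """Decode (model index, searchable-layer index) into a truncated backbone."""
    model_idx, layer_idx = op_code                       # e.g. (2, 1): ResNet-50, searchable layer 2
    backbone = CANDIDATE_MODELS[model_idx](
        include_top=False, weights="imagenet", input_shape=input_shape)
    # Illustrative cut points; the real boundaries follow the Si-Lj layers of Tables 2-6.
    cut_points = [len(backbone.layers) // 6,
                  len(backbone.layers) // 4,
                  len(backbone.layers) // 3]
    cut_layer = backbone.layers[cut_points[layer_idx]]
    t_cell = tf.keras.Model(backbone.input, cut_layer.output, name="t_cell")
    t_cell.trainable = False                             # share (freeze) pre-trained weights
    return t_cell

# Usage: op code 2,1 from Table 1 selects ResNet-50 with searchable layer 2.
# t_cell = build_t_cell((2, 1))
```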
The shared layers of the candidate model MobileNet V3-Large[18] are outlined in Table 2, where "conv2d" represents a convolutional layer with a 3×3 kernel and a stride of 2, and "bneck" denotes a depthwise separable convolution (DSC) layer. This model comprises three shared layers, denoted Si, \(i \in \{1,2,3\}\). For instance, S1-L1 refers to the first searchable layer of the first shared layer, namely the first "bneck_3×3" layer within the "3×bneck_3×3" searchable layers, and so forth. In EfficientNet[19], the shared layer "MBConv1" is composed of multiple "bneck" modules from MobileNet V3, as illustrated in Table 3. In ResNet-50[20], the shared layer "conv1" is a convolutional layer with a 7×7 kernel and a stride of 2, while "conv2_x" consists of three convolution blocks, each containing a 1×1 convolution, a 3×3 convolution, and another 1×1 convolution, as detailed in Table 4. The shared layers of Xception[21] (refer to Table 5) include two 3×3 convolutions and fifteen DSC_3×3 layers. The shared layers of Inception-v3[22] (see Table 6) comprise six 3×3 convolutions and three Inception blocks, where an Inception block is a convolution module for multiscale feature fusion.
Table 2
MobileNet V3-Large searchable weight sharing layers
Candidate model | Shared layer | Range of shared layer | Name | Searchable layers |
MobileNet V3-Large | S1 | conv2d, 3×bneck_3×3 | Li | 3×bneck_3×3 |
MobileNet V3-Large | S2 | S1, 3×bneck_5×5 | Li | 3×bneck_5×5 |
MobileNet V3-Large | S3 | S1, S2, 3×bneck_3×3 | Li | 3×bneck_3×3 |
Table 3
EfficientNet searchable weight sharing layers
Candidate model | Shared layer | Range of shared layer | Name | Searchable layers |
EfficientNet | S1 | Conv_3×3, MBConv1(3×3), 2×bneck_3×3, 1×bneck_5×5 | Li | 2×bneck_3×3, 1×bneck_5×5 |
EfficientNet | S2 | S1, 2×bneck_5×5, 1×bneck_3×3 | Li | 2×bneck_5×5, 1×bneck_3×3 |
EfficientNet | S3 | S1, S2, 3×bneck_3×3 | Li | 3×bneck_3×3 |
Table 4
ResNet-50 searchable weight sharing layers
Candidate model | Shared layer | Range of shared layer | Name | Searchable layers |
ResNet-50 | S1 | conv1, conv2_x, 1×[1×1, 3×3, 1×1] | Li | [1×1, 3×3, 1×1] |
ResNet-50 | S2 | S1, 3×[1×1, 3×3, 1×1] | Li | 3×[1×1, 3×3, 1×1] |
ResNet-50 | S3 | S1, S2, 1×[1×1, 3×3, 1×1] | Li | [1×1, 3×3, 1×1] |
Table 5
Xception searchable weight sharing layers
Candidate model | Shared layer | Range of shared layer | Name | Searchable layers |
Xception | S1 | 2×Conv_3×3, 9×DSC_3×3 | Li | 3×DSC_3×3 |
Xception | S2 | S1, 3×DSC_3×3 | Li | 3×DSC_3×3 |
Xception | S3 | S1, S2, 3×DSC_3×3 | Li | 3×DSC_3×3 |
Table 6
Inception-v3 searchable weight sharing layers
Candidate model | Shared layer | Range of shared layer | Name | Searchable layers |
Inception-v3 | S1 | 3×conv_3×3 | Li | 3×conv_3×3 |
Inception-v3 | S2 | S1, 3×conv_3×3 | Li | 3×conv_3×3 |
Inception-v3 | S3 | S1, S2, 3×Inception | Li | 3×Inception |
3.2 Improved normal cell
The structure of the normal cell (N-cell) is essentially similar to that of NASNet[12]. As illustrated in Fig. 3, the N-cell is a multi-branch structure composed of N blocks, each containing two candidate operations (OP) followed by a Combine operation. The Combine operation has two searchable connection candidates, Add and Concat. Here, H[i-1] and H[i] represent the inputs from the (i-1)th and ith cells, respectively, while H[i + 1] denotes the output of the current cell. Unlike the NASNet search space, the OPs here mainly comprise depthwise separable convolution (DSC) and attention-based DSC. The candidate operations are listed in Table 7; DSC reduces the parameter count, while the attention mechanisms help extract more critical features. Table 8 lists the candidate Combine operations.
Table 7
Candidate operations in the N-cell
Candidate OP | OP code | Candidate OP | OP code |
Identity | 0 | DSC 3×3 CBAM | 6 |
Conv 1×1 | 1 | DSC 5×5 CBAM | 7 |
DSC 3×3 | 2 | DSC 3×3 CA | 8 |
DSC 5×5 | 3 | DSC 5×5 CA | 9 |
DSC 3×3 SE | 4 | Max pooling 3×3 | 10 |
DSC 5×5 SE | 5 | Avg pooling 3×3 | 11 |
Table 8
Combine's candidate operations
Candidate OP | OP code |
Add | 0 |
Concat | 1 |
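A minimal sketch of one N-cell block under these definitions, assuming a tf.keras implementation; only a subset of the Table 7 operations is mapped here, and the attention-based DSC operations (codes 4-9) would reuse the module sketched below for Fig. 4.

```python
import tensorflow as tf
from tensorflow.keras import layers

def candidate_op(op_code, filters):
    """Map a Table 7 op code to a Keras layer (subset shown for brevity)."""
    factories = {
        0: lambda: layers.Lambda(lambda t: t),                                             # Identity
        1: lambda: layers.Conv2D(filters, 1, padding="same", activation="relu"),           # Conv 1x1
        2: lambda: layers.SeparableConv2D(filters, 3, padding="same", activation="relu"),  # DSC 3x3
        3: lambda: layers.SeparableConv2D(filters, 5, padding="same", activation="relu"),  # DSC 5x5
        10: lambda: layers.MaxPooling2D(3, strides=1, padding="same"),                     # Max pooling 3x3
        11: lambda: layers.AveragePooling2D(3, strides=1, padding="same"),                 # Avg pooling 3x3
    }
    return factories[op_code]()

def n_cell_block(h_prev, h_curr, op_codes, combine_code, filters=64):
    """One block: two sampled OPs merged by the sampled Combine (Table 8).
    For Add (code 0) the two branches must produce matching shapes."""
    a = candidate_op(op_codes[0], filters)(h_prev)   # OP on H[i-1]
    b = candidate_op(op_codes[1], filters)(h_curr)   # OP on H[i]
    if combine_code == 0:
        return layers.Add()([a, b])                  # Add
    return layers.Concatenate()([a, b])              # Concat
```

In the full N-cell, N such blocks are built from the sampled codes and their outputs are concatenated to form H[i + 1].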
Fig. 4 outlines the structure of the attention-based depthwise separable convolution (DSC) module. This module uses DSC 3×3 and DSC 5×5 as candidate operations within the search space, while SE[26], CBAM[27], and CA[28] are the candidate attention modules. The input feature map first passes through a 1×1 convolution layer followed by batch normalization (BN) and ReLU activation (BN + ReLU). The output then passes through a DSC 3×3 or DSC 5×5 layer, again followed by BN + ReLU. Next, it enters the candidate attention module to obtain attention weights, which are multiplied with the feature map to yield the weighted feature map. Finally, the module produces its output through a 1×1 convolution layer (Conv 1×1) and BN.
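A sketch of this module with SE[26] as the sampled attention candidate, assuming a tf.keras implementation; CBAM or CA would be substituted at the same point, and the reduction ratio of 16 is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_attention(x, ratio=16):
    """Squeeze-and-Excitation: channel attention weights multiplied onto x."""
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)
    w = layers.Dense(max(c // ratio, 1), activation="relu")(w)
    w = layers.Dense(c, activation="sigmoid")(w)
    w = layers.Reshape((1, 1, c))(w)
    return layers.Multiply()([x, w])                 # weighted feature map

def attention_dsc(x, filters, kernel_size=3):
    """Conv 1x1 -> BN+ReLU -> DSC -> BN+ReLU -> attention -> Conv 1x1 -> BN."""
    y = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.SeparableConv2D(filters, kernel_size, padding="same",
                               use_bias=False)(y)    # DSC 3x3 or DSC 5x5
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = se_attention(y)                              # candidate attention module
    y = layers.Conv2D(filters, 1, padding="same", use_bias=False)(y)
    return layers.BatchNormalization()(y)
```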
3.3 Pooling cell
A pooling cell (P-cell) is introduced to dynamically reduce the number of parameters. Similar to the N-cell, the P-cell is structured as a block, but its candidates are limited to two types of operations, max pooling and average pooling (as detailed in Table 9). It consists of two pooling operations followed by a Combine operation, as depicted in Fig. 5.
Table 9
Candidate operations in the P-cell
Candidate OP | OP code |
Max pooling 3×3 | 0 |
Max pooling 5×5 | 1 |
Max pooling 7×7 | 2 |
Avg pooling 3×3 | 3 |
Avg pooling 5×5 | 4 |
Avg pooling 7×7 | 5 |
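A minimal P-cell sketch under the same tf.keras assumptions; the stride of 2 used for downsampling is an assumption, as the text does not fix the pooling stride.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Table 9 op codes mapped to pooling layers (stride 2 assumed for downsampling).
POOLING_OPS = {
    0: lambda: layers.MaxPooling2D(3, strides=2, padding="same"),
    1: lambda: layers.MaxPooling2D(5, strides=2, padding="same"),
    2: lambda: layers.MaxPooling2D(7, strides=2, padding="same"),
    3: lambda: layers.AveragePooling2D(3, strides=2, padding="same"),
    4: lambda: layers.AveragePooling2D(5, strides=2, padding="same"),
    5: lambda: layers.AveragePooling2D(7, strides=2, padding="same"),
}

def p_cell(x, op_codes, combine_code):
    """Two sampled pooling OPs merged by the sampled Combine, as in Fig. 5."""
    a = POOLING_OPS[op_codes[0]]()(x)
    b = POOLING_OPS[op_codes[1]]()(x)
    return layers.Add()([a, b]) if combine_code == 0 else layers.Concatenate()([a, b])
```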
3.4 Dropout cell
In addition, a dropout cell (D-cell) has been devised for adaptive dropout rate selection to mitigate model overfitting. Illustrated in Fig. 6, the D-cell introduces multiple Dropouts with distinct dropout rates into the search space (refer to Table 10). This mechanism empowers the controller to dynamically choose an appropriate Dropout, culminating in a voting mechanism that selects the best prediction result, consequently enhancing the model's generalization capabilities.
Table 10
Candidate dropout rates in the D-cell
Dropout rate | OP code | Dropout rate | OP code |
0.1 | 0 | 0.6 | 5 |
0.2 | 1 | 0.7 | 6 |
0.3 | 2 | 0.8 | 7 |
0.4 | 3 | 0.9 | 8 |
0.5 | 4 | | |
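A minimal sketch of applying the sampled dropout rate ahead of the classifier, assuming a tf.keras implementation; the global-average-pooling head is an assumption, and the voting over multiple sampled branches is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Table 10 op codes 0-8 mapped to dropout rates 0.1-0.9.
DROPOUT_RATES = {code: round(0.1 * (code + 1), 1) for code in range(9)}

def d_cell(x, op_code, num_classes):
    """Apply the searched dropout rate before the classification layer."""
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(DROPOUT_RATES[op_code])(x)    # sampled dropout rate
    return layers.Dense(num_classes, activation="softmax")(x)
```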
3.5 Search strategy
This study employs reinforcement learning to identify the optimal cell architecture. As illustrated in Fig. 1, the controller first estimates the sampling probabilities for each cell and samples from them to obtain the operation codes for each cell. These operation codes are used to construct MFC-NAS candidate models, which are trained on randomly selected subsets of the training data; the controller parameters are then updated by policy gradients, using the validation accuracy as the reward, and the next search iteration begins, continuing until the search completes. The T-cell explores various shallow layers across different pre-trained models, the N-cell explores candidate operations for feature extraction, the P-cell explores candidate operations for downsampling, and the D-cell explores different dropout rates.
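A sketch of the policy-gradient (REINFORCE) update, assuming a tf.keras implementation; for brevity the controller is reduced to independent categorical logits rather than the RNN of [25], validation accuracy serves as the reward, and the baseline (e.g. a moving average of past rewards) is an assumption.

```python
import tensorflow as tf

class SimpleController(tf.keras.Model):
    """Per-decision categorical logits standing in for the RNN controller."""
    def __init__(self, num_choices_per_decision):
        super().__init__()
        self.logits = [tf.Variable(tf.zeros(n), name=f"decision_{i}")
                       for i, n in enumerate(num_choices_per_decision)]

    def sample(self):
        """Sample one op code per searchable decision."""
        return [int(tf.random.categorical(tf.expand_dims(l, 0), 1)[0, 0])
                for l in self.logits]

    def log_prob(self, codes):
        """Log-probability of an op-code list under the current policy."""
        return tf.add_n([tf.nn.log_softmax(l)[c]
                         for l, c in zip(self.logits, codes)])

def controller_update(controller, optimizer, codes, reward, baseline):
    """REINFORCE with baseline: raise the probability of codes that beat the baseline."""
    with tf.GradientTape() as tape:
        loss = -(reward - baseline) * controller.log_prob(codes)
    grads = tape.gradient(loss, controller.trainable_variables)
    optimizer.apply_gradients(zip(grads, controller.trainable_variables))

# Usage: codes = controller.sample(); build and train the candidate model; then
# controller_update(controller, tf.keras.optimizers.Adam(1e-3), codes,
#                   reward=val_accuracy, baseline=running_mean_reward)
```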
3.6 Candidate neural network architecture
The proposed MFC-NAS model is constructed from the operation codes. As illustrated in Fig. 7, once the controller has generated the operation-code lists for all cells, each cell is built from its corresponding list. Construction begins with the T-cell, whose operation-code list indexes the candidates in the T-cell search space, followed sequentially by the N-cell, P-cell, and D-cell. Within the candidate model, the N-cell and P-cell appear in pairs, stacked L times, with the final P-cell feeding the D-cell.
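A minimal sketch of assembling a candidate model from the decoded operation codes, reusing the build_t_cell, n_cell_block, p_cell, and d_cell sketches above; the op-code dictionary layout, L = 3, and the single block per N-cell are illustrative assumptions.

```python
import tensorflow as tf

def build_candidate(op_codes, num_classes, L=3, input_shape=(224, 224, 3)):
    """Stack T-cell -> L x (N-cell, P-cell) -> D-cell from sampled op codes.

    Example op_codes layout (illustrative):
    {"t_cell": (2, 1), "n_cell": ([2, 3], 1), "p_cell": ([0, 3], 1), "d_cell": 4}
    """
    inputs = tf.keras.Input(shape=input_shape)
    x = build_t_cell(op_codes["t_cell"], input_shape)(inputs)   # shared shallow features

    n_ops, n_combine = op_codes["n_cell"]
    p_ops, p_combine = op_codes["p_cell"]
    for _ in range(L):                                          # N-cell / P-cell pairs
        # One block per N-cell, with both block inputs taken from the previous
        # cell; feeding H[i-1] as in Fig. 3 would additionally require shape
        # alignment after pooling, which is omitted here.
        x = n_cell_block(x, x, n_ops, n_combine)
        x = p_cell(x, p_ops, p_combine)

    outputs = d_cell(x, op_codes["d_cell"], num_classes)        # final P-cell feeds the D-cell
    return tf.keras.Model(inputs, outputs, name="mfc_nas_candidate")
```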