A. ResNet-50
The ResNet50 convolutional neural network (CNN) is a very deep network with 50 layers. Although network depth is important for neural networks, deeper networks are more challenging to train. The residual structure of ResNet50, in which skip connections add a block's input to its output, facilitates training and allows networks to be much deeper, which leads to increased performance on different tasks. In addition to being substantially deeper than its "plain" counterparts, ResNet50 has a much reduced number of parameters (weights).
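To make the residual structure concrete, the following is a minimal sketch of a ResNet-style bottleneck block in TensorFlow/Keras; the function name, filter counts, and layer arrangement are illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=1):
    """Illustrative ResNet-style bottleneck block with a skip connection.

    Adding the block input to its output is what makes very deep networks
    such as ResNet50 easier to optimise.
    """
    shortcut = x

    # 1x1 conv reduces channels, 3x3 conv processes features, 1x1 conv restores channels
    y = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)

    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)

    y = layers.Conv2D(4 * filters, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)

    # Project the shortcut when the spatial size or channel count changes
    if stride != 1 or shortcut.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)

    return layers.ReLU()(layers.Add()([y, shortcut]))
```

In practice, a pre-trained ResNet50 can also be loaded directly, e.g. via tf.keras.applications.ResNet50(weights="imagenet"), rather than building the blocks by hand.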
B. Inception V3
InceptionV3's fundamental design is based on GoogLeNet [21]. The utilisation of Lin's "Network in Network" technique [22], which increased the representational power of neural networks, is one of the essential components of the Inception framework. As a result, the dimensionality is reduced with 1 x 1 convolutions, lowering the computational cost. The Inception architecture was created to lower the computational expense of deep learning-based image classification. An Inception module typically contains three possible convolution sizes and one max pooling. InceptionV3's core architecture is made up of stacked Inception modules; a dropout layer, a fully connected layer with 1024 ReLU units, and an average pooling layer with a 5 x 5 filter and stride 3 are all used for dimension reduction. The channels are aggregated following the convolutional operations, and the fusion operator is then applied to the output of the preceding layer. As a result, the design helps to reduce overfitting and enhances the network's adaptability.
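As a rough illustration of how an Inception module combines several convolution sizes with 1 x 1 dimension-reduction convolutions, the sketch below builds a GoogLeNet-style module in TensorFlow/Keras; the function name and filter-count parameters are assumptions for illustration, not the exact InceptionV3 modules.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, f_pool):
    """Illustrative Inception-style module with parallel branches.

    1x1 convolutions reduce the channel dimension before the larger
    convolutions, lowering the computational cost; the branch outputs are
    aggregated by concatenation along the channel axis.
    """
    # Branch 1: 1x1 convolution
    b1 = layers.Conv2D(f1, 1, activation="relu", padding="same")(x)

    # Branch 2: 1x1 reduction followed by 3x3 convolution
    b3 = layers.Conv2D(f3_reduce, 1, activation="relu", padding="same")(x)
    b3 = layers.Conv2D(f3, 3, activation="relu", padding="same")(b3)

    # Branch 3: 1x1 reduction followed by 5x5 convolution
    b5 = layers.Conv2D(f5_reduce, 1, activation="relu", padding="same")(x)
    b5 = layers.Conv2D(f5, 5, activation="relu", padding="same")(b5)

    # Branch 4: max pooling followed by 1x1 convolution
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(f_pool, 1, activation="relu", padding="same")(bp)

    # Channel aggregation (fusion) of all branches
    return layers.Concatenate(axis=-1)([b1, b3, b5, bp])
```

The concatenation at the end is the "channel aggregation" referred to above: each branch contributes feature maps at the same spatial resolution, and the module's output stacks them along the channel dimension.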
C. VGG16
VGG16 is one of the most widely applied pre-trained CNN models. It has 16 weight layers in all, 13 of which are convolutional and three of which are fully connected [23]. The ReLU (rectified linear unit) activation function is utilised to increase the model's nonlinearity, while a Softmax layer on top of the final fully connected layer is employed for classification. There are approximately 138 million parameters in total. According to the VGG16 network configuration, the input is a 224 x 224 RGB image. The only pre-processing is to subtract the mean RGB value, computed over the training set, from each pixel. The image is passed through the first stack of two 3 x 3 convolution layers, each followed by a ReLU activation; each layer has 64 filters. To retain spatial resolution after convolution, the convolution stride is set to 1 pixel. All of the VGG16 network's hidden layers employ the ReLU activation function.
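The sketch below expresses the input pre-processing and the first stack described above in TensorFlow/Keras; the mean RGB values shown are the commonly quoted ImageNet training-set means, and the variable names are illustrative assumptions rather than the paper's code.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Commonly quoted ImageNet mean RGB values used for VGG pre-processing (assumption)
VGG_MEAN_RGB = np.array([123.68, 116.779, 103.939], dtype="float32")

# 224 x 224 RGB input, mean subtraction, then two 3x3 convolutions with
# 64 filters each (stride 1, ReLU), followed by 2x2 max pooling.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Lambda(lambda t: t - VGG_MEAN_RGB)(inputs)               # subtract mean RGB per pixel
x = layers.Conv2D(64, 3, strides=1, padding="same", activation="relu")(x)
x = layers.Conv2D(64, 3, strides=1, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2, strides=2)(x)                            # halve spatial resolution
first_stack = tf.keras.Model(inputs, x)
```

For the complete pre-trained network, the standard Keras model is available via tf.keras.applications.VGG16(weights="imagenet"), which follows the same configuration.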