Convolutional Neural Networks (CNNs) are deep learning models widely used for image classification and other computer vision tasks. Their hierarchical structure of interconnected layers enables automatic learning and extraction of relevant features from input images. Convolutional layers apply learnable filters (kernels) to the input, extracting local features such as edges, textures, and shapes. Pooling layers downsample the input volume, reducing the computational cost of subsequent layers through operations such as max pooling, which keeps the maximum value within a defined window. Activation functions such as ReLU (Rectified Linear Unit) introduce non-linearity, allowing the network to learn complex patterns. At the end of the network, fully connected layers, in which every neuron is connected to every neuron in the previous layer, classify the learned features.
A variant called B-R-CNN combines R-CNN with AdaBoost, a boosting method that improves the accuracy of weak classifiers. B-R-CNN trains multiple weak R-CNN classifiers on different subsets of the training data and then uses AdaBoost to combine them into a stronger classifier, which predicts the class of objects in the image at test time. B-R-CNN has been shown to outperform the original R-CNN in both speed and accuracy, although training several weak classifiers demands more computational resources during the training phase.
Semantic segmentation is a critical computer vision task that assigns each pixel in an image to one of several predefined classes. Fully Convolutional Networks (FCNs) have emerged as a powerful architecture for semantic segmentation because they preserve spatial information while efficiently capturing features. A typical FCN comprises an encoder that captures hierarchical features and a decoder that recovers spatial resolution. Inference proceeds as follows:
1. Initialize the FCN with pre-trained or random weights.
2. Forward pass the input image I through the FCN to obtain output logits. The encoder processes the image through convolutional and pooling layers, producing high-level feature maps at several levels of abstraction.
3. Apply a softmax activation to the logits to obtain per-pixel class probabilities. The softmax normalizes the logits into values interpretable as the probability of each pixel belonging to each class.
4. Assign each pixel to the class with the highest probability to obtain the segmentation mask, which highlights the regions corresponding to distinct objects or classes in the input image.
When training an FCN for semantic segmentation, the choice of loss function is crucial. Typically, the cross-entropy loss is adopted; it quantifies the disparity between the predicted class probabilities and the ground-truth label of each pixel. Through backpropagation and gradient descent, the FCN's parameters (weights) are iteratively adjusted to reduce this loss, fine-tuning the network toward more accurate and effective segmentation.
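Steps 3 and 4 of the inference pipeline above can be sketched in a few lines. This is a minimal NumPy illustration, with toy logits standing in for the output of a real FCN forward pass:

```python
import numpy as np

def softmax(logits, axis=0):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def segmentation_mask(logits):
    """Turn FCN output logits of shape (C, H, W) into an (H, W) mask:
    per-pixel class probabilities, then the argmax class per pixel."""
    probs = softmax(logits, axis=0)
    return probs.argmax(axis=0), probs

# Toy logits for 3 classes on a 2x2 image (stands in for an FCN forward pass).
logits = np.array([[[2.0, 0.1], [0.2, 0.3]],
                   [[0.5, 3.0], [0.1, 0.2]],
                   [[0.1, 0.2], [4.0, 0.1]]])
mask, probs = segmentation_mask(logits)   # mask assigns each pixel its top class
```

Each column of `probs` sums to 1 across classes, and `mask` holds the class index with the highest probability at every pixel.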
3.1 Data Augmentation
Several complementary techniques improve FCN-based segmentation:
- Data augmentation: transforming the training data with flips, rotations, and scaling improves model generalization and robustness.
- Post-processing: techniques such as conditional random fields can refine segmentation results by considering spatial relationships between pixels.
- Transfer learning: pre-training on large datasets such as ImageNet, followed by fine-tuning on segmentation-specific data, accelerates convergence and improves performance.
Fully Convolutional Networks have revolutionized semantic segmentation in computer vision. By capitalizing on their ability to capture spatial information and hierarchical features, FCNs have become a cornerstone of numerous applications, including autonomous driving, medical imaging, and remote sensing. The combination of convolutional operations, pooling, upsampling, and softmax activation, optimized through backpropagation, enables FCNs to achieve accurate and robust semantic segmentation. As computer vision advances, FCNs remain a fundamental tool for segmenting objects and regions within images, enhancing our understanding of visual data and supporting new research and applications.
Convolution operation: Z_i = Sum_k(W_k * X_i) + b_i ----------(1)
Pooling operation: Y_i = Max(X_i) ----------(2)
Upsampling operation: Y_i_upsampled = Upsample(Y_i) ----------(3)
Softmax activation function: SoftMax(Z_i) = exp(Z_i) / Sum_j(exp(Z_j)) ----------(4)
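Eqs. (1) and (2) can be illustrated with a minimal NumPy sketch. As in most deep learning frameworks, the "convolution" is implemented as cross-correlation, and the kernel and input values below are toy examples, not from the paper:

```python
import numpy as np

def conv2d(x, w, b=0.0):
    """Valid 2D convolution (Eq. 1): slide kernel w over x, sum the
    element-wise products, and add bias b. Shapes: x (H, W), w (kh, kw)."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + kh, j:j + kw]) + b
    return out

def max_pool(x, k=2):
    """Max pooling (Eq. 2): keep the maximum within each k x k window."""
    H, W = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
edge = np.array([[1.0, -1.0]])   # toy horizontal-difference kernel
feat = conv2d(x, edge)           # (4, 3) feature map of local differences
pooled = max_pool(x)             # (2, 2) downsampled map
```

Stacking such operations, with a non-linearity between them, is exactly the encoder structure described above.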
Data augmentation is a powerful technique in machine learning and computer vision for increasing the diversity and quality of training data. By applying various transformations to the original images, it improves the robustness and generalization of models. A key aspect is handling noise, which is often present in real-world data; random noise can be injected into images during augmentation, as in Eq. (5), where I(x, y) is the pixel value at coordinates (x, y) in the original image and N(0, sigma^2) denotes Gaussian noise with mean 0 and variance sigma^2.
Dropout is a regularization technique that prevents over-reliance on specific neurons during training. It randomly deactivates neurons with a certain probability, encouraging the network to learn more robust and diverse features, as in Eq. (6). Here, Input(x) is the input to the dropout layer, Output(x) is the output after applying dropout, and Mask(x) is a binary mask whose elements are drawn from a Bernoulli distribution with the probability of retaining a neuron.
Weight decay, also known as L2 regularization, controls the magnitude of the weights in a neural network. It adds a penalty term to the loss function that discourages excessively large weight values, as in Eq. (7). In this formula, L_regularized(W) is the regularized loss, L(W) is the original loss (e.g., cross-entropy), lambda is the regularization parameter, and ||W_i|| is the L2 norm of weight W_i.
By combining data augmentation with noise and regularization techniques such as dropout and weight decay (Eqs. (5)-(7)), CNN models achieve greater resilience to noise in the training data. This robustness improves performance and generalization even on noisy inputs, equipping the models to handle real-world challenges and deliver reliable results across various applications.
I_augmented(x, y) = I(x, y) + N(0, sigma^2) ----------(5)
Output(x) = Input(x) * Mask(x) ----------(6)
L_regularized(W) = L(W) + (lambda / 2) * Sum_i(||W_i||^2) ----------(7)
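Eqs. (5)-(7) can be sketched directly in NumPy. The clipping to [0, 1] and the 1/keep_prob scaling (inverted dropout) are common implementation details assumed here rather than stated in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(image, sigma):
    """Eq. (5): I_augmented = I + N(0, sigma^2), clipped to the valid range."""
    noise = rng.normal(0.0, sigma, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def dropout(x, keep_prob):
    """Eq. (6): multiply by a Bernoulli mask; the 1/keep_prob scaling
    (inverted dropout) keeps the expected activation unchanged."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def l2_penalty(weights, lam):
    """Eq. (7) penalty term: (lambda / 2) * sum of squared L2 norms."""
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)

img = rng.random((8, 8))
noisy = add_gaussian_noise(img, sigma=0.1)   # augmented training image
```

At training time the augmented image replaces the original, the dropout mask is resampled per batch, and the penalty is added to the task loss.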
Table 1
Different Features of Train Data
|           | id                       | class       | segmentation                  |
|-----------|--------------------------|-------------|-------------------------------|
| count     | 115488                   | 115488      | 33913                         |
| unique    | 38496                    | 3           | 33899                         |
| top       | case_123day20_slice_0001 | Large bowel | 12629, 10 12894, 12 13158, 15 |
| frequency | 3                        | 38496       | 2                             |
3.2 Optimization in Blur Techniques
Gaussian blur is a fundamental image processing technique used to reduce image noise and detail, producing a smoother appearance. It is achieved by convolving the original image with a Gaussian kernel, a two-dimensional distribution centered on the origin, given in Eq. (8). The kernel emphasizes nearby pixels while gradually diminishing the influence of distant ones. The kernel size, determined by the parameter k, controls the extent of blurring, while the standard deviation sigma regulates the spread of the Gaussian distribution. Gaussian blur is particularly useful for preparing images for tasks such as edge detection or object recognition.
Motion blur occurs when capturing objects in motion, or when the camera moves during exposure. This effect can be simulated or corrected using motion blur techniques. Convolving the original image with the motion blur kernel of Eq. (9) replicates the effect of objects moving across the field of view; the kernel's size and orientation, controlled by the parameters k and theta, determine the strength and direction of the blur. This technique is essential for applications such as simulating realistic motion effects and restoring images degraded by motion blur.
Deconvolution is a sophisticated technique for recovering the original image from a blurred or distorted version. In image processing it is particularly useful for restoring images affected by various types of blurring, such as motion blur or Gaussian blur. Deconvolution algorithms estimate the original image by accounting for the effects of the blur kernel and noise: applying Wiener deconvolution yields the deblurred image in the frequency domain, as in Eq. (10), and an inverse Fourier transform then produces the deblurred image in the spatial domain. The regularization parameter lambda controls the trade-off between preserving image detail and amplifying noise during deconvolution. Effective deconvolution is crucial for enhancing image quality in tasks such as astronomical imaging, medical imaging, and forensic analysis.
The techniques above (Gaussian blur, motion blur, and deconvolution) can significantly affect image processing outcomes. An essential aspect is optimizing their parameters and regularization values: experimenting with different kernel sizes, standard deviations, motion angles, and regularization parameters to find the best balance between removing unwanted artifacts and preserving image features. For CNN models, integrating these techniques improves performance on blurred images. By training CNNs on datasets containing both blurred and sharp images, the models learn to distinguish and enhance relevant features while suppressing noise and blur. The models can then be fine-tuned, for example via transfer learning, to specific tasks such as image restoration or object detection with better accuracy and robustness.
GK(x, y) = (1 / (2 * pi * sigma^2)) * exp(-(x^2 + y^2) / (2 * sigma^2)) ----------(8)
MBK = (1 / k) * [cos(theta) cos(theta) ... cos(theta);
                 sin(theta) sin(theta) ... sin(theta);
                 cos(theta) cos(theta) ... cos(theta)] ----------(9)
Deblurred image (frequency domain): (1 / (FFT(K) + lambda)) * FFT(I_blurred) ----------(10)
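Eqs. (8) and (10) can be sketched as follows. Note that Eq. (10) as stated is a simplified regularized inverse filter; a full Wiener filter would instead use conj(FFT(K)) / (|FFT(K)|^2 + lambda). The kernel size and sigma below are illustrative choices:

```python
import numpy as np

def gaussian_kernel(k, sigma):
    """Eq. (8): a (2k+1) x (2k+1) Gaussian kernel, normalized to sum to 1."""
    ax = np.arange(-k, k + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.sum()

def deconvolve(blurred, kernel, lam):
    """Eq. (10): regularized inverse filtering in the frequency domain,
    followed by an inverse FFT back to the spatial domain."""
    K = np.fft.fft2(kernel, s=blurred.shape)   # zero-padded kernel spectrum
    F = np.fft.fft2(blurred)
    return np.real(np.fft.ifft2(F / (K + lam)))

g = gaussian_kernel(1, 1.0)   # 3x3 kernel: sums to 1, peaks at the center
```

Larger lambda suppresses noise amplification where FFT(K) is small, at the cost of some residual blur, which is exactly the trade-off discussed above.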
Table 2
Parameters of Keras Layers with Optimizer
| Keras Layers   | Input neurons | Activation    |
|----------------|---------------|---------------|
| Conv2D         | 32, (3, 3)    | ReLU          |
| Max Pooling    | (2, 2)        | ReLU          |
| Conv2D         | 64, (3, 3)    | ReLU          |
| Dense          | 10            | Softmax       |
| Optimizer      | 64, (3, 3)    | Adam          |
| Loss function  | 32, (3, 3)    | Cross entropy |
| Trained images | 70%           | ReLU, Softmax |
| Epochs         | 500           | -             |
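The convolution and pooling layers of Table 2 shrink the spatial dimensions in a predictable way. The sketch below traces output shapes through the Conv2D/MaxPooling/Conv2D stack, assuming 'valid' padding, stride 1, and a hypothetical 64x64 input (the table does not state the input resolution):

```python
def conv2d_out(h, w, kernel, stride=1):
    """Output spatial size of a 'valid' Conv2D layer."""
    return (h - kernel) // stride + 1, (w - kernel) // stride + 1

def pool_out(h, w, pool=2):
    """Output spatial size of max pooling with non-overlapping windows."""
    return h // pool, w // pool

# Trace the Table 2 stack on a hypothetical 64x64 grayscale input.
h, w = 64, 64
h, w = conv2d_out(h, w, 3)   # Conv2D, 32 filters of (3, 3) -> 62x62x32
h, w = pool_out(h, w, 2)     # Max Pooling (2, 2)           -> 31x31x32
h, w = conv2d_out(h, w, 3)   # Conv2D, 64 filters of (3, 3) -> 29x29x64
# Flattening and Dense(10) with softmax then yields the 10 class scores.
```

The same arithmetic applies at any input size, which is useful for checking that the Dense layer's flattened input matches expectations before training.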
3.3 R-CNN Model
The boosted R-CNN model works as follows. Set up a collection of T weak R-CNN models with random initialization, each of which accepts an image as input and produces a list of object proposals together with associated class probabilities. Initialize each weak R-CNN model's weight to 1/N, where N is the total number of weak classifiers. For each iteration t in the set T, normalize the weights w_i of the weak R-CNN models to sum to 1, then train each weak R-CNN model on a random subset of the training data using the current weights w_i and the base R-CNN architecture.
For each weak R-CNN model:
- Calculate the loss function L_i to assess its performance on the training dataset.
- Compute the error rate E_i as the sum of the L_i weighted by w_i.
- Pick the weak R-CNN model with the lowest error rate E_t as the best weak classifier.
- Compute its classifier weight alpha_t = (1/2) * ln((1 - E_t) / E_t).
- Update the weights w_i of all weak R-CNN models as shown in Eq. (11),
where x_i is an image, y_i is its label, h_t(x_i) is the prediction made by the best weak classifier h_t on image x_i, and y_i * h_t(x_i) is +1 if the prediction is correct and -1 otherwise. The weights w_i are then normalized to sum to 1. To create the strong R-CNN classifier H, all of the weak R-CNN models are combined with their respective weights, as shown in Eq. (12). The average precision (AP) on the test dataset is used to assess how well the strong classifier H performs. The approach improves the weak R-CNN models iteratively by training them on different subsets of the training data and adjusting their weights according to individual performance; the AdaBoost technique increases the weight of the better-performing models while decreasing that of the weaker ones, and their weighted combination yields a strong R-CNN classifier that handles challenging object detection tasks better.
w_i = w_i * exp(-alpha_t * y_i * h_t(x_i)) ----------(11)
H(x) = sign(Sum_t(alpha_t * h_t(x))) ----------(12)
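The boosting loop can be sketched in NumPy. The text describes weights over the weak models, but the form exp(-alpha_t * y_i * h_t(x_i)) in Eq. (11) matches the standard AdaBoost reading in which w_i are per-sample weights; the sketch below follows that standard formulation, with toy +/-1 predictions standing in for R-CNN outputs:

```python
import numpy as np

def adaboost_round(preds, labels, w):
    """One boosting round over candidate weak classifiers.
    preds: (M, N) +/-1 predictions of M weak models on N samples;
    labels: (N,) +/-1 ground truth; w: (N,) sample weights summing to 1."""
    errors = np.array([np.sum(w * (p != labels)) for p in preds])
    best = int(errors.argmin())                # lowest weighted error E_t
    eps = errors[best]
    alpha = 0.5 * np.log((1 - eps) / eps)      # classifier weight alpha_t
    w = w * np.exp(-alpha * labels * preds[best])   # Eq. (11): re-weighting
    return best, alpha, w / w.sum()            # normalize weights to sum to 1

def strong_classify(votes, alphas):
    """Eq. (12): sign of the alpha-weighted vote of the selected weak models."""
    return np.sign(np.sum(alphas * votes))

# Example: two candidate weak models voting on four samples.
preds = np.array([[1, 1, -1, 1],
                  [1, -1, -1, -1]])
labels = np.array([1, 1, -1, -1])
best, alpha, w = adaboost_round(preds, labels, np.full(4, 0.25))
```

Misclassified samples gain weight after each round, so later weak models focus on the hard cases, which is the mechanism the text credits for B-R-CNN's improved accuracy.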
Table 3
Image Segmentation Values on GI Tract Images
| Column       | Non-null (%) |
|--------------|--------------|
| segmentation | 29.364956    |
| size         | 33.333333    |
| id           | 100.000000   |
| class        | 100.000000   |
| case         | 100.000000   |
| day          | 100.000000   |
| slice        | 100.000000   |
Dtype: float64
3.4 VGG Model
The VGG architecture was presented by Karen Simonyan and Andrew Zisserman of Oxford's Visual Geometry Group (VGG) and was developed for the 2014 ImageNet Challenge, bringing about a significant transformation in the field of computer vision. It differed from previous successful models such as AlexNet in several ways: while AlexNet used a large 11x11 receptive field with a 4-pixel stride, VGG employed the smallest practical 3x3 receptive fields with a 1-pixel stride, achieving a large effective receptive field by stacking these small filters. The VGG network is known for its simplicity, using small convolutional filters throughout the architecture. VGG16, for instance, consists of 13 convolutional layers and 3 fully connected layers, making it a 16-layer deep neural network. With 138 million parameters in total, it is relatively large by contemporary standards; despite its size, its key advantage lies in its simplicity, encompassing the fundamental characteristics of convolutional neural networks. It has become a foundational model in computer vision and has been widely used and adapted across applications. VGG19 shares the same basic design but has 19 weight layers (convolutional and fully connected), i.e., three more convolutional layers than VGG16.
VGGNet accepts images at 224x224 resolution. To keep the input size consistent during the ImageNet competition, the model's developers cropped the central 224x224 patch from each submitted image. The VGG convolutional layers play a key role in the network's overall architecture and success.
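As a concrete check on the figures above (13 convolutional layers, roughly 138 million parameters), the standard VGG16 configuration can be tallied in plain Python; the channel list follows the published architecture:

```python
# VGG16 configuration: numbers are conv output channels, 'M' is 2x2 max pooling.
VGG16_CONV = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
              512, 512, 512, 'M', 512, 512, 512, 'M']

def vgg16_param_count(num_classes=1000):
    """Count trainable parameters: the 3x3 convolutions plus three
    fully connected layers (4096, 4096, num_classes)."""
    params, in_ch = 0, 3
    for v in VGG16_CONV:
        if v == 'M':
            continue                      # pooling layers have no parameters
        params += in_ch * v * 3 * 3 + v   # 3x3 kernel weights + biases
        in_ch = v
    # After five 2x2 poolings, a 224x224 input shrinks to 7x7x512 = 25088.
    for fan_in, fan_out in [(512 * 7 * 7, 4096), (4096, 4096), (4096, num_classes)]:
        params += fan_in * fan_out + fan_out
    return params

n_conv = sum(1 for v in VGG16_CONV if v != 'M')   # 13 convolutional layers
total = vgg16_param_count()                        # 138,357,544 parameters
```

Most of the parameters sit in the first fully connected layer, which is why later architectures replaced it with global pooling.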
3.5 Conditional Invertible Neural Networks
A conditional invertible neural network (cINN) is an advanced neural network architecture that builds on the concept of invertible neural networks. An invertible neural network is designed so that both the forward and backward transformations are invertible: data can be mapped from its original form to a learned representation and back again without any loss of information. A cINN incorporates an additional conditioning input into the architecture, which serves as extra information guiding the transformation. In image generation, for example, the conditioning input could specify attributes of the image to be generated, such as the pose of an object or the style of a painting; the network can then generate images with those attributes while maintaining invertibility, so the generated images can be accurately mapped back to their original attributes. Applications of cINNs range from generative modeling and data augmentation to style transfer and controlled transformation tasks. In the gastrointestinal imaging context, intestinal gas (flatulence) refers to the presence of gas in the digestive system, produced by factors including swallowed air, the breakdown of certain foods by gut bacteria, and fermentation in the intestines; while some gas is a normal part of digestion, excessive gas can lead to discomfort and bloating.
Algorithm for computing conditioned reconstruction statistics using invertible neural networks: given a noisy measurement y_delta, an invertible neural network F, and a conditioning network C, the algorithm calculates the mean and variance of the conditioned reconstructions from random samples. Start by computing the filtered back-projection T_FBP(y_delta) and denote it c0. Apply the conditioning network C with parameters Theta to c0, yielding the conditioned outputs c. For each iteration k from 1 to K: (a) draw a random sample z[k] from the normal distribution N(0, I); (b) use the inverse of the invertible network F to compute the reconstruction xˆ[k] from z[k] and the conditioned output c. After the loop, compute the mean reconstruction xˆ by averaging the xˆ[k], and the variance by averaging the squared differences between each xˆ[k] and the mean xˆ, as represented in Eq. (13).
xˆ = (1 / K) * Sum_k(xˆ[k])
σˆ^2 = (1 / K) * Sum_k((xˆ[k] - xˆ)^2) ----------(13)
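The sampling loop and Eq. (13) can be sketched as follows, with a toy invertible map standing in for the trained inverse network and its conditioning:

```python
import numpy as np

rng = np.random.default_rng(1)

def conditioned_statistics(f_inverse, c, dim, K=1000):
    """Eq. (13): draw K latent samples z ~ N(0, I), push each through the
    inverse network conditioned on c, and return the element-wise mean and
    variance of the resulting reconstructions."""
    recs = np.stack([f_inverse(rng.standard_normal(dim), c) for _ in range(K)])
    mean = recs.mean(axis=0)
    var = ((recs - mean) ** 2).mean(axis=0)
    return mean, var

# Toy invertible map standing in for the trained network's inverse.
f_inv = lambda z, c: 0.5 * z + c
mean, var = conditioned_statistics(f_inv, c=np.array([2.0, -1.0]), dim=2, K=20000)
```

The variance map is the useful by-product here: it gives a per-pixel uncertainty estimate alongside the mean reconstruction.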
Table 4
Segmentation Analysis according to Class Label
|   | id         | class       | segmentation |
|---|------------|-------------|--------------|
| 0 | Slice_0001 | large_bowel | NaN          |
| 1 | Slice_0001 | small_bowel | 29.36        |
| 2 | Slice_0001 | stomach     | 33.33        |
| 3 | Slice_0002 | large_bowel | 32.4         |
| 4 | Slice_0002 | small_bowel | 29.36        |
| 5 | Slice_0143 | small_bowel | NaN          |
| 6 | Slice_0143 | stomach     | 32.65        |
| 7 | Slice_0144 | large_bowel | 31.35        |
| 8 | Slice_0144 | small_bowel | 25.67        |
| 9 | Slice_0144 | stomach     | 33.45        |