In this section, the proposed methodology for the segmentation tasks is explained in detail. The entire process preceding training, including the pre-processing techniques and the hyperparameters involved, is described, and the architecture of the U-Net model is outlined in brief [43]. The methodology covers both lung and ROI (region of infection) segmentation in CT scan images of COVID-19 patients. The dataset was downloaded from Zenodo. It contains CT scans of 20 patients, along with infection masks, combined lung-and-infection masks, and lung masks. These masks, which mark the region of infection and the region of the lungs in the actual CT scan, have been verified by an experienced radiologist. The U-Net architecture, which has given state-of-the-art results in medical image segmentation, has been applied. For infection segmentation, three-fold, four-fold and seven-fold cross-validation have been used. The dice coefficient, which measures the overlap or similarity between two images, is used for validation. Fig. 1 illustrates sample images from the dataset containing the original CT scan and its infection mask, and Fig. 2 illustrates samples containing the original CT scan and the lung mask.
In Fig. 1 and Fig. 2, the highlighted portions of the images on the right-hand side show the masked regions. These masks have already been verified by an expert radiologist and are used to train a deep neural architecture that can accurately determine the region of infection and the lungs by segmentation.
U-Net Architecture: U-Net is a CNN architecture developed mainly for biomedical image segmentation. It was designed to work with smaller datasets while still producing accurate segmentations. The architecture consists of a contracting path and an expansive path, which together give the characteristic U shape. Along the contracting path, the images pass through a series of convolution, ReLU and pooling operations in which the spatial dimensionality is reduced while the number of feature maps increases. The expansive path combines spatial information with the feature maps through a series of concatenations, which helps to increase the resolution of the output.
3.1 Pre-processing
The images are first resized to 512x512 dimensions. The CLAHE (Contrast Limited Adaptive Histogram Equalization) technique is then applied to enhance contrast, since medical images, especially CT scans, often suffer from poor contrast. The parameters involved in applying CLAHE are the clip limit and the grid size. A correctly chosen clip limit prevents the over-amplification of noise in the image, which is one of the main advantages of CLAHE over AHE (Adaptive Histogram Equalization). This value is generally kept between 2 and 4; here the clip limit has been tuned to 3. The images also contain large black areas that are unnecessary for the segmentation process, so they can be cropped to retain only the region of interest. This could be done by slicing the images using limits found by trial and error, but such limits would generalize only to the dataset in use. A better approach is to draw contours over the image and crop out the rectangle enclosing the contour with the largest area, which corresponds to the contour covering both lungs. The maximum region of interest can be obtained by taking the next two largest contours by area, one for each lung, and combining them. While cropping a CT scan, the corresponding segmentation map must be cropped by the same limits to avoid mislabelling a region. In global thresholding, an arbitrarily chosen value can be used as the threshold, in contrast to Otsu's method, which determines the value automatically [44]. A total of 3520 slices were obtained. About twenty percent of the slices at the beginning and end of each file generally had no infection mask, and some had no lungs either, so these were discarded as noise. Around 500 further slices had a completely black mask, meaning there was no infection in those slices; these were also kept out of the segmentation model. Finally, about 1600 samples were obtained, which were later split into training and test sets.
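The clip-limit idea described above can be illustrated in isolation. The paper most likely applies CLAHE through a library implementation (e.g. OpenCV's `createCLAHE`); the sketch below is only a minimal NumPy illustration of the clip-and-redistribute step on a single tile, with the function name and synthetic tile chosen for the example. Full CLAHE would repeat this per tile over the grid and blend the resulting mappings.

```python
import numpy as np

def clip_limited_equalize(tile, clip_limit=3.0):
    """Histogram-equalize one greyscale tile with a clip limit.

    CLAHE applies this per tile and bilinearly blends the mappings;
    this sketch shows only the clip-and-redistribute step.
    """
    hist, _ = np.histogram(tile.ravel(), bins=256, range=(0, 256))
    # The clip limit is expressed as a multiple of the average bin height.
    limit = clip_limit * tile.size / 256.0
    excess = np.maximum(hist - limit, 0).sum()
    hist = np.minimum(hist, limit)
    hist = hist + excess / 256.0           # redistribute the clipped counts
    cdf = np.cumsum(hist) / hist.sum()     # normalized cumulative histogram
    lut = np.round(255 * cdf).astype(np.uint8)
    return lut[tile]

# Synthetic low-contrast tile: background 40, a brighter square of 60.
tile = np.full((64, 64), 40, dtype=np.uint8)
tile[16:48, 16:48] = 60
out = clip_limited_equalize(tile, clip_limit=3.0)
# With an effectively unlimited clip limit this reduces to plain histogram
# equalization, which amplifies the contrast (and any noise) much more.
out_noclip = clip_limited_equalize(tile, clip_limit=256.0)
```

Comparing `out` with `out_noclip` shows the effect of the clip limit: unclipped equalization spreads the two intensities much further apart, which is exactly the noise over-amplification that CLAHE's clipping suppresses.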
Since the images and masks were cropped, they were no longer all of the same size, so they were resized to 224x224 dimensions. The same pre-processing steps were followed for all K-fold infection segmentations as well as for lung segmentation. Fig. 3 illustrates a comparison between images enhanced using the CLAHE algorithm and their original CT scans; in the enhanced image, the infection can be clearly distinguished. Fig. 3 also provides a histogram comparison of the original and enhanced images. Fig. 4 illustrates the pre-processed (CLAHE-enhanced and cropped) image along with the infection mask. Fig. 5 illustrates the enhanced CT scan of the lungs along with the lung mask.
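The crop-then-resize step can be sketched as follows. The paper uses contour detection to find the lung region; as a simplified stand-in, this sketch crops to the bounding box of all above-threshold pixels, applies the same limits to the mask, and resizes with nearest-neighbour sampling. All function names, the threshold value and the synthetic data are assumptions for illustration; real code would use a contour routine and proper interpolation (e.g. from OpenCV or PIL).

```python
import numpy as np

def crop_to_content(img, mask, thresh=10):
    """Crop image and mask to the bounding box of the non-black region.

    The paper crops to the largest contours; a bounding box over all
    above-threshold pixels is a simpler stand-in for the same idea.
    """
    rows = np.any(img > thresh, axis=1)
    cols = np.any(img > thresh, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    # The mask must be cropped with the same limits as the scan.
    return img[r0:r1 + 1, c0:c1 + 1], mask[r0:r1 + 1, c0:c1 + 1]

def resize_nearest(img, size=(224, 224)):
    """Nearest-neighbour resize (real code would use cv2/PIL interpolation)."""
    r = (np.arange(size[0]) * img.shape[0] / size[0]).astype(int)
    c = (np.arange(size[1]) * img.shape[1] / size[1]).astype(int)
    return img[np.ix_(r, c)]

scan = np.zeros((512, 512), dtype=np.uint8)
scan[100:400, 80:450] = 120                 # synthetic "lung" region
mask = np.zeros_like(scan)
mask[200:300, 150:300] = 1                  # synthetic infection mask
scan_c, mask_c = crop_to_content(scan, mask)
scan_r, mask_r = resize_nearest(scan_c), resize_nearest(mask_c)
```

Because the same limits are used for both arrays, every mask pixel still labels the same anatomical location after cropping and resizing.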
3.2 Proposed Architecture
The architecture proposed in this paper is based on the concepts employed in the U-Net architecture. The architecture for the segmentation process is illustrated in Fig. 6. It is a modified version of the U-Net architecture, retaining the familiar U shape with various changes made to the individual blocks. The contraction path can be viewed as a sequence of feature extraction blocks: the feature map has to be reduced to a compact representation, from which an image has to be reconstructed. The main idea behind the process is to reuse the feature maps obtained during the contraction path. These features are then used in the expansion path to form a new segmented image containing only the boundary or region of interest of the original image, preserving the integrity of the image.
In Fig. 6, the input images of 224x224 dimensions are given to the input layer. This layer is connected to the first block of the contraction path, which comprises four blocks altogether. The numbers of filters in the convolutional layers of the contraction path are 32, 64, 128 and 256 respectively for each subsequent block. In each block, 3x3 filters and the ReLU activation function are applied. Two convolutional layers are followed by a batch normalization layer, which is followed by a 2D max pooling layer with a 2x2 pool size; this performs the down-sampling operation that reduces the dimensionality of the feature map. The job of each contraction block is to extract fine features of the image, and it can be thought of as a feature extraction task. A dropout regularization layer has been added at the end of each contraction block.
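The shape flow through the contraction path can be checked with simple arithmetic. Assuming 'same' convolution padding (an assumption, since the padding scheme is not stated), only the 2x2 max pooling changes the spatial size, halving it per block while the filter count grows:

```python
# Shape flow through the four contraction blocks, assuming 'same'
# padding so that only the 2x2 max pooling halves the spatial size.
h = w = 224
filters = [32, 64, 128, 256]
shapes = []
for f in filters:
    # two 3x3 convs + batch norm keep (h, w); pooling halves it
    h, w = h // 2, w // 2
    shapes.append((h, w, f))

print(shapes)  # spatial size shrinks while the channel count grows
```

So the 224x224 input is reduced to a 14x14x256 feature map before the two 512-filter bottleneck convolutions.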
After feature extraction is completed by the four contraction blocks, two convolutional layers with 512 filters are added. These are followed by the expansion blocks. Each expansion block begins with a 2D convolutional transpose layer, which is essentially the inverse of a pooling operation: it performs the up-sampling operation and, because its weights are learned, can interpret the raw input to fill in the feature matrix. This layer can be regarded as a combination of a convolutional layer and a 2D up-sampling layer. It is followed by a concatenate layer, which joins its previous layer with the corresponding batch normalization layer from the contraction path in order to combine the location information of the features with the contextual information obtained in the expansion path. Two convolutional layers follow. The numbers of filters in the convolutional and convolutional transpose layers are 256, 128, 64 and 32 respectively for each subsequent block in the expansion path, with 3x3 filters and the ReLU activation function. The final convolutional layer is connected to the output layer, which is also a convolutional layer with a single filter and a sigmoid activation function, giving the single segmented output image.
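The description above can be sketched in Keras. This is a minimal reconstruction from the text, not the authors' code: the dropout rate, padding scheme, and the placement of batch normalization in the bottleneck are assumptions, and all function and variable names are chosen for the example.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3x3 conv + ReLU layers followed by batch normalization."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

inputs = layers.Input((224, 224, 1))
skips, x = [], inputs
for f in [32, 64, 128, 256]:            # contraction path
    x = conv_block(x, f)
    skips.append(x)                     # saved for concatenation later
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.1)(x)          # dropout rate is an assumption

x = conv_block(x, 512)                  # bottleneck convolutions

for f, skip in zip([256, 128, 64, 32], reversed(skips)):   # expansion path
    x = layers.Conv2DTranspose(f, 3, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    x = conv_block(x, f)

outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
model = Model(inputs, outputs)
```

Each stride-2 `Conv2DTranspose` doubles the spatial size, so the concatenation always pairs with a skip connection of matching resolution, ending in a 224x224 single-channel sigmoid map.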
The modifications that improve the performance of this model are the additions of batch normalization layers and the convolutional transpose layers. Using a transpose layer instead of only the up-sampling layer used in the conventional U-Net model allows the coarse input data to be interpreted in a better way, since its weights are learned. Batch normalization is an important part of the contraction process [45]. These layers give the proposed model an advantage over the original U-Net architecture.
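The difference between the two up-sampling mechanisms can be made concrete with a small NumPy sketch of a stride-2 transposed convolution (the function name and toy values are chosen for the example). With a fixed all-ones 2x2 kernel, the transposed convolution reproduces plain nearest-neighbour up-sampling exactly; the point of the transpose layer is that this kernel is learned, so the network can go beyond that fixed rule when filling in the feature matrix.

```python
import numpy as np

def conv_transpose2d(x, k, stride=2):
    """Stride-2 transposed convolution: each input pixel 'stamps' the
    kernel onto the output, so the kernel controls how the coarse map
    is filled in (unlike a fixed nearest-neighbour repeat)."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] * stride + kh - stride,
                    x.shape[1] * stride + kw - stride))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * k
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.ones((2, 2))                    # a learned kernel in practice
up_t = conv_transpose2d(x, kernel)          # 4x4 output
up_nn = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)  # fixed up-sampling
```

Here `up_t` and `up_nn` coincide only because the kernel is all ones; any other learned kernel produces an output that a fixed up-sampling layer cannot express.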
3.2.3 Model Training and Hyperparameter Tuning
The lung and infection segmentation tasks were both trained using the same proposed model. The infection segmentation task was carried out with three, four and seven consecutive folds. Each fold in turn is used for validation, while the remaining folds act as the training set.
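The fold rotation described above can be sketched as follows, using contiguous folds over the sample indices; the function name and the example sample count of 1600 (the approximate number of pre-processed slices) are assumptions for illustration.

```python
def kfold_splits(n_samples, k):
    """Split indices 0..n-1 into k contiguous folds; yield
    (train, validation) index lists with each fold as validation once."""
    bounds = [round(i * n_samples / k) for i in range(k + 1)]
    folds = [list(range(bounds[i], bounds[i + 1])) for i in range(k)]
    for v in range(k):
        train = [i for f, fold in enumerate(folds) if f != v for i in fold]
        yield train, folds[v]

# e.g. four-fold cross-validation over ~1600 pre-processed slices
splits = list(kfold_splits(1600, 4))
```

Every sample therefore appears in the validation set exactly once across the k runs, and in the training set in the other k-1 runs.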
The hyperparameters used in this paper were tuned to their final values through repeated trial-and-error training runs over a fixed number of epochs. All the segmentation tasks were carried out with the same hyperparameters. For model optimization, the Adam optimizer was used with a learning rate of 0.0005. The metric measured is the dice coefficient. The loss is calculated using the following equations.
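As an illustration of the metric, the standard dice coefficient, 2|A ∩ B| / (|A| + |B|), and the corresponding loss can be written as below. Whether the paper's equations include a smoothing term is not stated; the `smooth` constant here is a common convention that keeps the value defined when both masks are empty, and the function names are chosen for the example.

```python
import numpy as np

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Dice coefficient: 2*|A ∩ B| / (|A| + |B|), smoothed so the
    value is defined when both masks are empty."""
    y_true, y_pred = y_true.ravel(), y_pred.ravel()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

def dice_loss(y_true, y_pred):
    """A common segmentation loss: 1 - dice coefficient."""
    return 1.0 - dice_coefficient(y_true, y_pred)

a = np.zeros((4, 4)); a[:2, :2] = 1     # 4 positive pixels
b = np.zeros((4, 4)); b[:2, :] = 1      # 8 positive pixels, 4 overlapping
```

For these masks the coefficient is (2·4 + 1) / (4 + 8 + 1) = 9/13, and a perfect prediction gives a coefficient of 1 and a loss of 0.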