The proposed approach is based on three fundamental pillars: a pre-processing stage, a data augmentation phase to enrich the database, and an MS lesion segmentation step using a DL model. Figure 1 illustrates the general workflow of the proposed segmentation method.
A. Dataset
In this work, two public datasets, ISBI 2015 and MICCAI 2016, were used.
The ISBI 2015 dataset [38] consists of 19 subjects, divided into 5 training cases and 14 test cases. All data were segmented by two evaluators, each with over two years of experience. Only the masks of the training set are publicly accessible, while those of the test set are withheld for the challenge. Subjects in both the training and test sets have an average of 4.4 time points. The clinical characteristics of the subjects, as well as the imaging parameters for each dataset, are detailed in Table 1.
Table 1
Summary of clinical and imaging characteristics of the ISBI 2015 dataset.

| Dataset | Number of subjects | Gender | Age (Mean ± Std) | Time-points (Mean ± Std) | Modality | Voxel size (mm³) |
|---|---|---|---|---|---|---|
| Training set | 5 | 1 Male, 4 Females | 43.5 (± 10.3) | 4.4 (± 0.55) | MPRAGE, PD-w, T2-w, FLAIR | 0.82 × 0.82 × 1.17; 0.82 × 0.82 × 2.2 |
| Testing set | 14 | 3 Males, 11 Females | 39.3 (± 8.9) | 4.4 (± 0.63) | MPRAGE, PD-w, T2-w, FLAIR | 0.82 × 0.82 × 1.17; 0.82 × 0.82 × 2.2 |
This longitudinal dataset includes MRI scans in T1-w, T2-w, PD, and FLAIR modalities, acquired at 3 to 5 time points on a 3T MRI scanner. Each volume contains 182 slices, some of which are entirely black. Each slice measures 181 × 217 pixels, with an in-plane resolution of 0.82 × 0.82 mm and a slice thickness ranging from 1.17 mm to 2.2 mm. Figure 2 illustrates an example of an MRI slice from the T1-w, T2-w, PD, and FLAIR modalities, along with the Ground Truth (GT) provided by expert 1 for patient 1 of the ISBI 2015 training set.
The MICCAI 2016 dataset [39, 40] was generated by participating neurologists as part of the “Observatoire Français de la Sclérose En Plaques” (OFSEP) and comes from four different MRI scanners from several manufacturers (Siemens, Philips, and GE), equipped with both 3 T and 1.5 T magnets.
Table 2
Summary of clinical and imaging characteristics of the MICCAI 2016 dataset.

| Dataset | Patient age (Mean ± Std) | Patient gender | Number of lesions per patient (Mean ± Std) | Lesion load per patient (Mean ± Std) |
|---|---|---|---|---|
| Training dataset | 46.92 (± 10.16) | 8 Males, 30 Females | 40.71 (± 39.48) | 12350.68 (± 15028.476) |
| Testing dataset | 41.6 (± 9.84) | 7 Males, 8 Females | 37.13 (± 27.06) | 20756.42 |
| All dataset | 45.41 (± 10.26) | 15 Males, 38 Females | 39.69 (± 36.18) | 14729.66 (± 17006.0636) |
A total of 53 patients are included in this dataset, divided into 15 training cases and 38 test cases. For each patient, T1-w, T2-w, FLAIR, PD, and T1-Gd images were acquired, with dimensions ranging from 240 × 320 to 512 × 512 pixels. The in-plane resolution of these images ranges from 0.43 × 0.43 mm to 0.70 × 0.74 mm, with slice thicknesses ranging from 0.7 to 3 mm. Table 2 provides more details on the clinical characteristics of the training and testing datasets. For the training set, the slices were manually segmented by seven qualified junior radiologists. Figure 3 presents an example of slices from the MICCAI 2016 dataset.
B. Preprocessing
In this work, we utilized the preprocessed images of the ISBI 2015 and MICCAI 2016 datasets. To enhance the quality and consistency of the data, we conducted additional preprocessing steps. This included resampling the images to a standardized size of 192 × 192 × 192 voxels to ensure uniformity across the dataset. We also normalized the image intensities to reduce the intensity variability that could complicate the training of the DL model. Furthermore, we removed entirely black slices, which could otherwise affect the accuracy of our analysis.
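These steps can be sketched as follows (a minimal NumPy illustration; the nearest-neighbour resampling, the z-score normalization over non-zero voxels, and the function names are our own assumptions, not the paper's exact pipeline):

```python
import numpy as np

def resample_nearest(vol, target=(192, 192, 192)):
    """Nearest-neighbour resampling of a 3-D volume to a fixed grid."""
    idx = [np.clip(np.round(np.linspace(0, s - 1, t)).astype(int), 0, s - 1)
           for s, t in zip(vol.shape, target)]
    return vol[np.ix_(*idx)]

def zscore_normalize(vol, eps=1e-8):
    """Zero-mean / unit-variance intensity normalization over non-zero voxels."""
    mask = vol > 0
    out = vol.astype(np.float32)
    out[mask] = (vol[mask] - vol[mask].mean()) / (vol[mask].std() + eps)
    return out

def drop_black_slices(vol, axis=2):
    """Remove slices along `axis` that are entirely black (all-zero)."""
    keep = [i for i in range(vol.shape[axis])
            if np.take(vol, i, axis=axis).any()]
    return np.take(vol, keep, axis=axis)
```

A real pipeline would use spline interpolation for resampling; nearest-neighbour is used here only to keep the sketch dependency-free.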
C. Data Augmentation
To optimize the performance and robustness of the NN model and to increase both the amount and diversity of the available data, we applied various data augmentation techniques. Using the image data augmenter generator, we applied a set of geometric transformations to each training image, including a random 90° rotation, a vertical flip, a shift, a scaling, and a grid distortion. These manipulations enrich the training dataset and adapt the model to a wider variety of situations. Figure 4 shows an example of slices of the ISBI 2015 dataset after the application of data augmentation techniques.
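For illustration, the lossless subset of these transforms can be applied jointly to an image and its mask in plain NumPy (a minimal sketch; the function name and shift range are our own choices. Scaling and grid distortion require interpolation and are typically delegated to a library such as Albumentations, whose RandomRotate90, VerticalFlip, ShiftScaleRotate, and GridDistortion transforms match the list above):

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the same random geometric transforms to an image and its mask."""
    if rng.random() < 0.5:                      # random 90-degree rotation
        k = rng.integers(1, 4)
        image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:                      # vertical flip
        image, mask = np.flipud(image), np.flipud(mask)
    if rng.random() < 0.5:                      # random shift (wrap-around)
        dy, dx = rng.integers(-10, 11, size=2)
        image = np.roll(image, (dy, dx), axis=(0, 1))
        mask = np.roll(mask, (dy, dx), axis=(0, 1))
    return image, mask
```

Applying identical transforms to image and mask is essential: any mismatch would silently corrupt the ground-truth alignment.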
D. Proposed CNN Architecture
The proposed CNN architecture for MS lesion segmentation is inspired by the U-Net architecture introduced by Ronneberger et al. for medical image segmentation [41]. It incorporates a ResNet-34-based encoder coupled with a U-Net decoder. This approach leverages the robustness of ResNet's residual blocks to extract deep features while preserving information flow throughout the network [42]. The considered model consists of two main paths: a contraction path, the Encoder (E), built from residual blocks, and an expansion path, the Decoder (D), as presented in Fig. 5.
Table 3 provides a comprehensive overview of the CNN architecture used in this study, detailing the specific configuration at each stage of the model and highlighting the hyperparameters chosen to optimize network performance for accurate lesion segmentation. For the encoder, intermediate, and decoder blocks, the number of residual sub-blocks, the types of layers used, and the number and size of the filters at each level are detailed.
Table 3
Detailed configuration of the proposed CNN architecture.

| Block | Number of residual sub-blocks | Type | Filter size / number | Pad size | Stride |
|---|---|---|---|---|---|
| Encoder – Input layer | – | 1 Conv2D; 1 Max Pooling | 7 × 7 × 64; 2 × 2 | 1; – | 2; – |
| Encoder Block 1 | 3 | [2 Conv2D] × 3 | 3 × 3 × 64 | 1 | 1 |
| Encoder Block 2 | 4 | [2 Conv2D] × 4 | 3 × 3 × 128 | 1 | 1 |
| Encoder Block 3 | 6 | [2 Conv2D] × 6 | 3 × 3 × 256 | 1 | 1 |
| Encoder Block 4 | 3 | [2 Conv2D] × 3 | 3 × 3 × 512 | 1 | 1 |
| Intermediate Block | – | 2 Conv2D | 3 × 3 × 1024 | 1 | 1 |
| Decoder Block 1 | – | Conv2DTranspose; 2 Conv2D | 3 × 3; 2 × 2 (× 512) | 1 | 1 |
| Decoder Block 2 | – | Conv2DTranspose; 2 Conv2D | 3 × 3; 2 × 2 (× 256) | 1 | 1 |
| Decoder Block 3 | – | Conv2DTranspose; 2 Conv2D | 3 × 3; 2 × 2 (× 128) | 1 | 1 |
| Decoder Block 4 | – | Conv2DTranspose; 2 Conv2D | 3 × 3; 2 × 2 (× 64) | 1 | 1 |
| Final layer | – | 1 Conv2D | 1 × 1 (× 2) | – | – |
a. Encoder: ResNet-34 backbone: In our model, the encoder is a modified version of ResNet-34, specifically adapted to extract relevant features from brain MRI images of MS patients. It consists of four residual blocks. In the first stage, a 7 × 7 convolution with a stride of 2 and a padding of 1 is applied, followed by a ReLU activation, Batch Normalization (BN), and 2 × 2 MaxPooling. In Fig. 5, \(i\) denotes the index of the encoder and decoder blocks, with \(i=\{1,2,3,4\}\). Each encoder block (Ei) is composed of \(j\) residual sub-blocks (Rj), where \(j\) takes the values {3, 4, 6, 3} for \(i=\{1,2,3,4\}\), respectively. Each Rj includes two successive 3 × 3 convolutional layers with \(k\) filters, where \(k=\{64,128,256,512\}\) according to the decomposition level \(i\). Each layer is followed by a ReLU activation function and a BN layer. An identity mapping is finally added to enhance the learning of features at different layers.
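A single residual sub-block Rj can be sketched as follows (a NumPy illustration of the conv–ReLU pair plus identity shortcut; BN is omitted for brevity, and the direct-convolution helper is our own, not the trained implementation):

```python
import numpy as np

def conv2d(x, w):
    """'Same' 3x3 convolution: x is (H, W, C_in), w is (3, 3, C_in, C_out)."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            # each tap contributes a shifted (H, W, C_in) @ (C_in, C_out) product
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out

def residual_subblock(x, w1, w2):
    """R_j: two 3x3 convs, each followed by ReLU (BN omitted), plus identity."""
    y = np.maximum(conv2d(x, w1), 0)   # conv -> ReLU
    y = np.maximum(conv2d(y, w2), 0)   # conv -> ReLU
    return y + x                        # identity shortcut
```

Because padding is 1 and stride is 1, the spatial size is preserved, so the identity mapping can be added without any projection.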
b. Intermediate Block: The intermediate block bridges the encoder and decoder blocks. It consists of two successive 3x3 convolutional layers, each followed by ReLU activation function and BN. This block prepares the features for reconstruction in the decoder.
c. U-Net Decoder: The U-Net decoder recovers the spatial details lost during the encoding phase and reconstructs a segmentation map at the same resolution as the input image. The decoding process consists of four blocks. Each decoder block (Di) begins with an up-sampling operation, performed by a 3 × 3 Conv2DTranspose layer, which doubles the spatial dimensions of the reduced feature map. At each decoding level, skip connections concatenate the corresponding encoder features (E output), merging global contextual information with fine details. This is followed by two successive 3 × 3 convolutional layers, each paired with a ReLU activation and BN, as illustrated by the first decoder block in Fig. 5. This process is repeated across all four levels. Finally, a 1 × 1 convolution adjusts the depth of the feature map to match the number of classes to be segmented, producing the final segmentation map.
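One decoder level can be sketched as follows (a NumPy shape trace; nearest-neighbour up-sampling stands in here for the learned Conv2DTranspose, and the two 3 × 3 convolutions that follow the concatenation are omitted):

```python
import numpy as np

def decoder_step(dec_feat, enc_feat):
    """One decoder level: double H and W, then concatenate the matching
    encoder features (skip connection) along the channel axis."""
    up = dec_feat.repeat(2, axis=0).repeat(2, axis=1)   # (H, W, C) -> (2H, 2W, C)
    assert up.shape[:2] == enc_feat.shape[:2], "skip connection must match spatially"
    return np.concatenate([up, enc_feat], axis=-1)
```

For example, merging 12 × 12 × 1024 intermediate features with 24 × 24 × 512 encoder features yields a 24 × 24 × 1536 map, which the two subsequent convolutions would reduce back to 512 channels.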
E. Loss Function
In this study, we adopted Losscomb as the loss function, obtained by combining Dice loss with a modified CE [43]. This combination was chosen for its clear advantages in addressing the imbalance between input classes, particularly when segmenting small lesions against a large background. It offers the ability to simultaneously control the trade-off between False Positives (FP) and False Negatives (FN), while the CE term facilitates smooth training. The mathematical formulation of Losscomb is defined by the following equation:
\(Loss_{comb}=\alpha\cdot Dice\:Loss+\beta\cdot CE\) (1)
where α and β are the weights assigned to the Dice loss and CE terms, respectively. In Eq. 1, α controls the contribution of the Dice term to the loss function, and β ∈ [0, 1] controls the level of model penalization for FP/FN. In this work, α and β were both fixed empirically at 0.5 to keep the model parameters out of bad local minima via the global spatial information provided by the Dice term [43].
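As a minimal sketch, Eq. 1 with α = β = 0.5 can be written as follows for a binary lesion map (a standard binary CE stands in here for the modified CE of [43], whose exact form is not reproduced):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss for a binary segmentation map; pred values in [0, 1]."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def cross_entropy(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy with clipping for numerical stability."""
    p = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def loss_comb(pred, target, alpha=0.5, beta=0.5):
    """Eq. (1): weighted sum of Dice loss and CE, with alpha = beta = 0.5."""
    return alpha * dice_loss(pred, target) + beta * cross_entropy(pred, target)
```

The Dice term is global (it aggregates over the whole map, making it robust to class imbalance), while the CE term supplies smooth per-pixel gradients during training.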