2.1 Data and material
MIRIAD (Minimal Interval Resonance Imaging in Alzheimer's Disease) is a series of longitudinal volumetric T1-MRI scans of mild-to-moderate Alzheimer's subjects and controls [27]. An overview of the MIRIAD demographics and publications is given in Malone et al. [27]. All scans were acquired on the same scanner and are accompanied by information on gender, age, and Mini-Mental State Examination (MMSE) scores [27]. In this study, subjects with an MMSE score of 26 or below at baseline are classified as AD, while healthy controls (HC) have an MMSE of 27 or above [27]. This cutoff also defines the class label for each feature vector. Each patient has multiple MRI scans from different time points: many scans were collected of each participant at intervals ranging from two weeks to two years, as the study was designed to investigate the feasibility of using MRI as an outcome measure for clinical trials of Alzheimer's treatments [27]. Table 1 shows the demographics of the included patients.
Table 1
MIRIAD demographic information

| | Alzheimer's Disease (N = 46, Total MRI-scans = 465) | Healthy Controls (N = 23, Total MRI-scans = 243) |
|---|---|---|
| Age at study entry | 69.4 ± 7.1 | 69.7 ± 7.2 |
| Men | 41% | 52% |
| Mean (SD) baseline MMSE | 19.2 ± 4 | 29.4 ± 0.8 |
Each scan is provided in the NIfTI format (Neuroimaging Informatics Technology Initiative) [28], an open file format for volumetric images; each volume has a size of 256 × 256 × 124 voxels. Figure 1 shows a sample of the MRI dataset in axial, sagittal, and coronal views. The raw dataset still contains bone structures, which are not relevant for the diagnosis of Alzheimer's and are removed during pre-processing.
2.2 Feature engineering and pre-processing
Pre-processing is an important step to prepare the dataset for the subsequent training of the classification algorithm. The MIRIAD dataset is pre-processed by applying spatial normalization, bias correction, and grey matter segmentation. Spatial normalization is the process of mapping images from different scans onto a single template. There are two steps: linear transformations (e.g. translation, rotation, shear) and non-linear transformations (e.g. warping). This results in all images referencing the same coordinate space [29] and adjusts, for example, for differences in subject positioning when the MRI was recorded.
The ratio of MRI scans of AD subjects to healthy controls is approximately 2:1. To mitigate this imbalance, data augmentation is performed by creating flipped copies of the healthy-control scans. This results in almost the same number of instances labeled AD and non-AD. This can also be considered a specific type of oversampling in medical imaging.
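This flipping-based oversampling can be sketched in NumPy as follows; the function name and the slice shape are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def augment_by_flipping(scans):
    """Create left-right mirrored copies of each slice and append them,
    doubling the number of instances for the given class.

    `scans` is assumed to have shape (n_slices, height, width)."""
    flipped = np.flip(scans, axis=2)  # mirror along the left-right axis
    return np.concatenate([scans, flipped], axis=0)

# Doubling the 243 healthy-control slices yields 486 instances,
# roughly matching the 465 AD slices (cf. Table 3).
hc = np.zeros((243, 256, 256))
balanced_hc = augment_by_flipping(hc)
```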
Finally, grey matter segmentation is performed and the grey matter is extracted from the raw data. This excludes features that are unlikely to be discriminative in the classification task, e.g. skull tissue (skull-stripping). The Python 'Nipype' library is used as an interface, allowing all processing to be done in Python [30]. An axial MRI slice of the central part of the brain of each patient was used as input for the subsequent classification algorithm.
2.3 Convolutional Neural Network
Convolutional neural networks are a specialized kind of neural network for processing data with a grid-like topology [31]. A CNN consists of several layers: convolutional, pooling, and fully connected layers. Each convolutional layer consists of a certain number of trainable parametric filters. Each convolutional layer is typically followed by a pooling layer, which reduces the feature space. Finally, the data is passed to one or more fully connected layers and the predicted output is produced. A further description of the basic building blocks of a convolutional neural network can be found in a deep-learning textbook [31] and is not repeated here.
The CNN is applied as a classification algorithm to distinguish between Alzheimer's and non-Alzheimer's patients. Classification means learning a mapping from inputs x to outputs y, where y ∈ {1, .., C} with C being the number of classes [32]. If C = 2, this is called binary classification [32]. In our study, a binary classification task is performed to distinguish between patients with Alzheimer's and patients who do not show signs of Alzheimer's.
Loss function and optimization
Binary cross-entropy was chosen as the loss function for the convolutional neural network [33][34]; each training epoch aims to reduce this loss. RMSprop, a gradient-based optimization technique for training neural networks, is used as the optimizer. It has also been applied in deep learning for MR images by Medina et al. [35].
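A minimal NumPy sketch of the binary cross-entropy loss; the clipping constant is an implementation detail to avoid log(0), not a value from the paper:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels in {0, 1} and predicted
    probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

# A confident correct prediction gives a small loss,
# a confident wrong prediction a large one.
low = binary_cross_entropy(np.array([1.0]), np.array([0.99]))
high = binary_cross_entropy(np.array([1.0]), np.array([0.01]))
```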
Convolutional filter size and max-pooling
For a two-dimensional image I as our input (from an MRI scan), a two-dimensional kernel K can be used. In this study, the convolutional filter size was set to (3,3). In convolutional network terminology, the output is referred to as a feature map [31]. The convolutional operation can be described as follows [31]:
$$S\left(i,j\right) = \left(I*K\right)\left(i,j\right)=\sum _{m}\sum _{n}I\left(m,n\right)K\left(i-m,j-n\right)$$
1
A pooling function replaces the output with a summary statistic. For example, the max-pooling operation reports the maximum output within an area [31]. The Max-pooling filter size of the final configuration after hyperparameter tuning was set to (2,2).
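The convolution of Eq. (1) and the max-pooling operation can be sketched in NumPy. Note that, like most deep-learning libraries, the sketch implements the unflipped-kernel (cross-correlation) variant of the convolution; the function names are illustrative:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid-mode 2-D convolution of `image` with a small kernel.

    Deep-learning libraries implement this as cross-correlation,
    i.e. without flipping the kernel, which is what we do here."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping (size, size) max-pooling: keep the maximum
    within each block, halving each spatial dimension for size=2."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    blocks = feature_map[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))
```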
Dropout layer
Dropout provides a computationally inexpensive method for regularizing a model and preventing overfitting [31][36]. During training, units are randomly removed [36]. A randomly selected unit is removed from the network, along with all its incoming and outgoing connections [36]. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently [36]. Dropout introduces an extra hyperparameter, the probability of retaining a unit [36]. A value of p = 1 implies no dropout, and low values of p mean more dropout [36]. The dropout rate was set to 0.4 in our configuration to avoid overfitting.
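A minimal sketch of (inverted) dropout in NumPy; following the Keras convention, `rate` here is the probability of dropping a unit, i.e. rate = 1 − p:

```python
import numpy as np

def dropout(activations, rate=0.4, rng=None):
    """Inverted dropout: zero out units with probability `rate` during
    training and rescale the survivors by 1 / (1 - rate), so the expected
    activation is unchanged and no rescaling is needed at test time."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

x = np.ones(10_000)
y = dropout(x, rate=0.4)
```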
Activation function
Neurons in the activation map are passed through a non-linear function [37]. There are different activation functions, for example the sigmoid function, the rectified linear unit (ReLU), and the leaky rectified linear unit (leaky ReLU). The logistic sigmoid function can be defined as follows [20]:
$${f}_{sigmoid}\left(x\right) = \frac{1}{1+\text{exp}\left(-x\right)}$$
2
Another activation function is the ReLU function [20]:
$${f}_{ReLu}\left(x\right)=\text{max}\left(0,x\right)=\left\{\begin{array}{c}0, x<0\\ x, x\ge 0\end{array}\right.$$
3
Whenever the activation value is zero, the ReLU function has a zero gradient and therefore cannot learn in a gradient-based learning method [20]. To avoid this, a leaky ReLU function can be used:
$${f}_{Leaky Relu}\left(x\right)=\left\{\begin{array}{c}x, x\ge 0\\ \alpha x, x<0\end{array}\right.$$
4
In our study, a leaky rectified linear unit with the parameter α set to 0.1 was used.
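Equation (4) with the chosen α is straightforward to sketch in NumPy:

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU (Eq. 4): identity for x >= 0, slope `alpha` for x < 0,
    so a small gradient survives even when the unit is inactive."""
    return np.where(x >= 0, x, alpha * x)
```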
Regularization
To prevent overfitting, a regularization method can be used when training the neural network. L1 regularization is also known as Lasso regularization, L2 regularization as Ridge regularization, and the combination L1 + L2 as Elastic Net regularization [38]. Small values for the regularization parameters, L1 = 0.001 and L2 = 0.002, were added to prevent overfitting.
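The elastic-net weight penalty added to the loss can be sketched as follows; the helper name is illustrative, and the formula matches the convention used by Keras' `l1_l2` regularizer:

```python
import numpy as np

def elastic_net_penalty(weights, l1=0.001, l2=0.002):
    """Elastic-net penalty added to the training loss:
    l1 * sum(|w|) + l2 * sum(w ** 2)."""
    w = np.asarray(weights)
    return l1 * np.abs(w).sum() + l2 * np.square(w).sum()
```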
Table 2 contains the final settings of the CNN model. The number of layers and the number of convolutional filters per layer were varied; hyperparameter tuning explored both 3-layer and 4-layer settings.
Table 2
Configuration of the applied CNN
| Setting/Parameter | Values in Keras |
|---|---|
| Loss function | binary_crossentropy |
| Optimiser function | RMSprop(lr = 0.001) |
| Convolutional filter (kernel) size | (3, 3) |
| Max-pooling filter size | (2, 2) |
| Activation function for all layers | Leaky ReLU (alpha = 0.1) |
| Weight regularisation added to all models to mitigate overfitting | L1 = 0.001, L2 = 0.002 |
| Dropout layer added to all models to mitigate overfitting | 0.4 |
| Batch size | 100 |
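These settings can be assembled into a Keras model. The sketch below assumes the 3-layer setting; the filter counts (32/64/128) and the single-channel input shape are illustrative assumptions, not the tuned values from the paper:

```python
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.optimizers import RMSprop

reg = regularizers.l1_l2(l1=0.001, l2=0.002)  # elastic-net weight penalty

model = models.Sequential()
model.add(layers.Input(shape=(256, 256, 1)))   # assumed single-channel slice
for filters in (32, 64, 128):                  # illustrative filter counts
    model.add(layers.Conv2D(filters, (3, 3), kernel_regularizer=reg))
    model.add(layers.LeakyReLU(alpha=0.1))
    model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.4))
model.add(layers.Dense(1, activation='sigmoid'))  # binary AD / non-AD output

model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(learning_rate=0.001),
              metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=100, epochs=20,
#           validation_data=(x_val, y_val))
```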
2.4 Performance metrics
The evaluation of model performance is an essential step in understanding and developing a machine learning algorithm. Definitions of conventional performance metrics such as accuracy, precision, specificity, recall, and F1-score are not further described; they can be obtained from machine learning textbooks such as Goodfellow et al. [31], Murphy [32], and Hastie et al. [38]. This study additionally used the Matthews Correlation Coefficient (MCC) [39]. The following abbreviations are used: TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.
The MCC is defined according to [39] as:
$$MCC = \frac{TP \times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}$$
5
The MCC metric is more balanced than metrics like accuracy and F1-score because its score is high only if the classifier performs well on both positive and negative predictions [40]. The MCC is calibrated so that it ranges from −1 to +1. A value of 0 indicates a result close to chance; the closer the score is to +1, the better the result [40]. Receiver Operating Characteristic (ROC) curves have also been plotted for the best outcome.
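Equation (5) is a one-liner over the confusion-matrix counts; this is a minimal NumPy version (scikit-learn provides an equivalent `matthews_corrcoef` working directly on label arrays):

```python
import numpy as np

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient (Eq. 5) from confusion-matrix
    counts; returns 0 when the denominator vanishes."""
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```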
The data are split into training, validation, and test datasets. Approximately 20% of each category is randomly allocated to the test dataset and 10% to the validation dataset. The best configuration of the CNN was determined by the highest MCC on unseen medical images of a set of AD and non-AD patients. The training of the CNN used 20 epochs per instance. Table 3 shows the split of the data for the binary classification with a CNN.
Table 3
Data augmentation and training/validation/test split

| Class label | Number of MRI-scans | Total slices after data augmentation | Slices in training split | Slices in validation split | Slices in test split |
|---|---|---|---|---|---|
| AD | 465 | 465 | 326 | 39 | 100 |
| non-AD | 243 | 486 | 342 | 42 | 102 |
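The split procedure described in Sect. 2.4 can be sketched as follows. The exact per-class counts in Table 3 differ slightly from simple rounding, so this illustrates the procedure rather than reproducing the published numbers; the function name and seed are assumptions:

```python
import numpy as np

def split_slices(n_total, val_frac=0.10, test_frac=0.20, seed=0):
    """Randomly partition slice indices into disjoint training,
    validation, and test sets (roughly 70/10/20)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    n_val = round(n_total * val_frac)
    n_test = round(n_total * test_frac)
    return (idx[n_val + n_test:],          # training indices
            idx[:n_val],                   # validation indices
            idx[n_val:n_val + n_test])     # test indices

train, val, test = split_slices(465)  # e.g. the 465 AD slices
```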