3.1. Implementation Parameters
For the training and evaluation of the CNNs we used the ADNI dataset described in 2.2.1, with each subject represented by 10 axial images following the preprocessing described in 2.2.2. The dataset comprised 188 AD subjects and 229 CN subjects, split with a train/test/validation ratio of 8:1:1. The ratio was applied in a grouped manner, ensuring that all MRI slices belonging to a single subject were placed in the same set (see the sketch after the list below).
The distribution of subjects in each set is as follows:
- Train set: 150 AD subjects and 183 CN subjects (1500 and 1830 image slices, respectively)
- Test set: 20 AD subjects and 24 CN subjects (200 and 240 image slices, respectively)
- Validation set: 18 AD subjects and 22 CN subjects (180 and 220 image slices, respectively)
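As an illustration, a subject-grouped split of this kind can be produced with scikit-learn's GroupShuffleSplit. This is a minimal sketch rather than our exact pipeline; the DataFrame columns (subject_id, slice_path, label) are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical slice-level table: one row per MRI slice.
df = pd.read_csv("slices.csv")  # columns: subject_id, slice_path, label

# First split off 80% of subjects for training. Grouping by subject
# guarantees that all 10 slices of a subject land in the same set.
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, rest_idx = next(gss.split(df, groups=df["subject_id"]))
train, rest = df.iloc[train_idx], df.iloc[rest_idx]

# Split the remaining 20% of subjects evenly into test and validation.
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
test_idx, val_idx = next(gss2.split(rest, groups=rest["subject_id"]))
test, val = rest.iloc[test_idx], rest.iloc[val_idx]
```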
During training, data augmentation was randomly applied as explained in 2.2.2. The models were trained with a batch size of 32, with up to 100 epochs of training allowed per model. Early stopping was employed so that training would stop if there was no reduction in validation loss for 15 consecutive epochs. We also applied learning rate reduction callbacks, so the learning rate would be reduced by a factor of 0.1 if validation loss did not improve for five consecutive epochs, down to a minimum learning rate of 0.5e-6.
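In Keras, this configuration corresponds to the EarlyStopping and ReduceLROnPlateau callbacks. The sketch below reproduces the settings described above; `model`, `train_gen`, and `val_gen` are placeholders for our compiled network and augmented data generators, and whether best weights were restored on stopping is not specified above, so that option is omitted.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop training if validation loss has not improved for 15 epochs.
    EarlyStopping(monitor="val_loss", patience=15),
    # Reduce the learning rate by a factor of 0.1 after 5 stagnant epochs,
    # with a floor of 0.5e-6.
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5,
                      min_lr=0.5e-6),
]

# Batch size 32 is set on the generators themselves.
model.fit(train_gen, validation_data=val_gen, epochs=100,
          callbacks=callbacks)
```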
Training was done on an Amazon Web Services (AWS) g5.xlarge GPU instance. This instance features an NVIDIA A10G Tensor Core GPU with 24 GiB of GPU memory, 250 GB of storage, 4 vCPUs, and 16 GiB of RAM. Evaluation was done using the majority voting strategy explained in 2.3.2.
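As a rough illustration of the subject-level majority vote (the full procedure is given in 2.3.2), the sketch below aggregates per-slice predictions into one label per subject. The array names are hypothetical, and the tie-breaking rule here (a 5-5 split falls to CN) is an assumption, as 2.3.2 defines the actual behavior.

```python
import numpy as np

def majority_vote(slice_preds: np.ndarray, subject_ids: np.ndarray) -> dict:
    """Aggregate binary per-slice predictions (0 = CN, 1 = AD) into one
    label per subject by majority vote over that subject's slices."""
    labels = {}
    for subject in np.unique(subject_ids):
        votes = slice_preds[subject_ids == subject]
        # Ties (possible with 10 slices) fall to CN in this sketch.
        labels[subject] = int(votes.sum() > len(votes) / 2)
    return labels
```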
3.3. Detailed Results Analysis
Our results table contains a great deal of information relating to the different architectural attributes and training strategies we used. In 3.3.1 we will look at the different performance metrics, identifying each model's performance within them and what those performances mean. After that, we will look at the impact of transfer learning compared to training from scratch in 3.3.2, followed by the impact of model depth and parameter count on our results in 3.3.3 and 3.3.4. Finally, we will compare our results to existing methods in 3.4.
TABLE III: Performances of different architectures & training strategies.
TL = Transfer learned models, SC = Models trained from scratch.
| Architecture | Accuracy (TL) | Accuracy (SC) | Precision (TL) | Precision (SC) | Recall (TL) | Recall (SC) | F1-Score (TL) | F1-Score (SC) |
| VGG16 | 0.8182 | 0.7727 | 0.8000 | 0.7500 | 0.8000 | 0.7500 | 0.8000 | 0.7500 |
| VGG19 | 0.7273 | 0.7955 | 0.8000 | 0.7619 | 0.6000 | 0.8000 | 0.6667 | 0.7805 |
| ResNet50 | 0.8409 | 0.7045 | 0.8824 | 0.7059 | 0.7500 | 0.6000 | 0.8108 | 0.6486 |
| ResNet152 | 0.8182 | 0.6591 | 0.9286 | 0.6316 | 0.6500 | 0.6000 | 0.7647 | 0.6154 |
| DenseNet121 | 0.8182 | 0.8409 | 0.8750 | 0.7826 | 0.7000 | 0.9000 | 0.7778 | 0.8372 |
| DenseNet201 | 0.8636 | 0.7955 | 0.8889 | 0.7619 | 0.8000 | 0.8000 | 0.8421 | 0.7805 |
Figure 5: Training history for two different models.
3.3.1. Model Performances
In terms of accuracy and F1-score, the TL DenseNet201 model outperforms all other models, with an accuracy of 0.8636 and an F1-score of 0.8421. It is closely followed by the SC DenseNet121 model, with an accuracy of 0.8409 and an F1-score of 0.8372. These high scores indicate that the models are effectively balancing precision and recall, making them the most robust models trained for this task. In contrast, the worst performing model under these metrics is the SC ResNet152 model, with an accuracy of 0.6591 and an F1-score of 0.6154. The plot of this model's training history in Fig. 5a reveals that the model is underfitting: it does not achieve high accuracy on either the training or the validation data (especially compared to the TL DenseNet201 model). There are several possible reasons for this. One is that, being an SC model, it has significantly more trainable parameters, which makes convergence difficult when datasets are small. Beyond that, the ResNet likely suffered from overregularization, meaning that its batch normalization layers and skip connections regularized the network's weights too heavily, leading to underfitting.
Despite this, the model with the highest precision was the TL ResNet152, with a precision of 0.9286. Although this is a less robust model than those mentioned previously (F1-score of 0.7647), its positive predictive value is very high: 92.9% of the brains classified as having Alzheimer's Disease actually had it. This is helpful information, as although the model is likely to miss positive samples due to its low recall (0.65), its positive predictions carry a very high degree of certainty. However, the previously mentioned TL DenseNet201 has only a slightly lower precision (0.8889) with a much higher F1-score, meaning that it is still better overall for this task. The model with the worst precision was the SC ResNet152, with a precision of 0.6316, meaning that a large share of its positive predictions were false positives.
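To make these metrics concrete, the confusion-matrix counts for the TL ResNet152 can be reconstructed from the reported scores and the test-set sizes (20 AD, 24 CN). The counts below are inferred from that arithmetic rather than taken directly from our logs.

```python
# Inferred counts for TL ResNet152 on the 44-subject test set (20 AD, 24 CN):
tp, fp, fn, tn = 13, 1, 7, 23

precision = tp / (tp + fp)                          # 13 / 14 = 0.9286
recall = tp / (tp + fn)                             # 13 / 20 = 0.6500
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 36 / 44 = 0.8182
f1 = 2 * precision * recall / (precision + recall)  # 0.7647

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"accuracy={accuracy:.4f} f1={f1:.4f}")
```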
In terms of recall, the best performing model is the SC DenseNet121 model, with a score of 0.9, meaning it correctly identified the largest proportion of positive samples out of the models trained. It is followed by a four-way tie between TL VGG16, SC VGG19, TL DenseNet201, and SC DenseNet201, which all had a recall of 0.8, meaning they correctly identified 80% of AD subjects. The worst recall was achieved by TL VGG19, SC ResNet50, and SC ResNet152, all having a recall of 60%.
These results tell us which of our models performed best; in isolation, however, they do not reveal much about how different design decisions affected each model's performance. To understand that, we must compare performances against each other on a macro scale.
3.3.2. Training Strategies
Training strategy is a major component of model performance when using CNNs. Often, transfer learning from existing models can achieve better accuracy with smaller datasets, though this depends on the task. This section compares TL models with SC models.
In general, TL models performed better on this task than their SC counterparts.
TABLE IV: Average performance of training strategies, and their difference.
TL = Transfer learned models, SC = Models trained from scratch.
| Training Strategy | Accuracy | Precision | Recall | F1-Score |
| TL | 0.8144 | 0.8625 | 0.7167 | 0.7770 |
| SC | 0.7614 | 0.7323 | 0.7417 | 0.7354 |
| Difference | 0.0530 | 0.1302 | -0.0250 | 0.0416 |
Looking at the average performance between these models, we can see a noticeable difference. As seen in Table IV, TL models had, on average, an accuracy 5.3 percentage points higher than their SC equivalents. This is similarly reflected in the average F1-scores, with TL models scoring 0.0416 higher on average. This means that, at least among the models tested, transfer learning is the better approach on average: models that employ transfer learning have higher accuracy and are more robust than those trained from scratch. This is likely because the dataset used for this task was relatively small, making convergence more difficult when entire networks must be trained as opposed to just the FC layers.
The difference between TL and SC models is most noticeable in precision. This metric is, on average, 0.1302 higher when models are transfer learned, meaning that the positive AD predictions made by TL models were, on average, 13 percentage points more likely to be accurate than those made by SC models. The only metric that did not improve, and instead got worse on average with transfer learning, was recall: models trained from scratch had 0.025 higher recall than transfer learned models. This is a smaller difference than for the other metrics, but it is important to consider, as it means that transfer learned models on average missed slightly more positive samples when tested.
The main outliers in this comparison are the VGG19 and DenseNet121 models. VGG19 had a TL accuracy of 0.7273 versus a SC accuracy of 0.7955, and a TL F1-score of 0.6667 versus a SC F1-score of 0.7805; training from scratch ultimately produced a more accurate and more robust model despite the limited amount of data. Similarly, DenseNet121 had a TL accuracy of 0.8182 versus a SC accuracy of 0.8409, along with a TL F1-score of 0.7778 versus a SC F1-score of 0.8372. This could be due to a variety of factors, most likely model depth and parameter count, which will be elaborated on in 3.3.3 and 3.3.4. DenseNet121 likely performed better in the SC configuration because the architecture has significantly fewer parameters than the other architectures (shown in Table I), making it easier to train from scratch on smaller datasets. VGG19, by contrast, is by far the most complex model in terms of parameters, and in this case training from scratch appears to have avoided the overfitting seen in the TL model. Looking at the training history for VGG19, we find that on the final epoch the SC model had a training accuracy of 0.7162 and a validation accuracy of 0.68, whilst the TL model had 0.8354 and 0.6675, respectively (the TL and SC training histories can be found in Appendix C and Appendix B, respectively). The larger gap between training and validation accuracy for the TL model indicates a degree of overfitting, meaning the model is overly tuned to the training dataset, leading to poor generalization. In the future, to improve results for a model like this, regularization techniques such as dropout and pooling should be employed [22].
Despite these exceptions, the results clearly demonstrate that transfer learning is the more effective approach for this task: all other architectures performed better when employing transfer learning than when training from scratch. This is likely due to the relatively small dataset used for this project, meaning there was not enough data to train entire models without transfer learning. If a larger dataset were used, training from scratch could yield better results; however, given the small number and size of datasets for this task, transfer learning is likely to remain the dominant training strategy when classifying AD patients using CNNs.
TABLE V: Pearson correlations measuring the impact of model depth and parameter count on different metrics. Each cell gives the correlation between the attribute (row) and the metric (column). TL = Transfer learned models, SC = Models trained from scratch.
| Attribute | Accuracy (TL) | Accuracy (SC) | Precision (TL) | Precision (SC) | Recall (TL) | Recall (SC) | F1-Score (TL) | F1-Score (SC) |
| Depth | 0.6224 | -0.0764 | 0.8226 | -0.2076 | 0.1745 | 0.0394 | 0.5046 | -0.0397 |
| Parameters | -0.6957 | -0.1231 | -0.9337 | -0.0885 | -0.1582 | -0.1176 | -0.5602 | -0.0950 |
3.3.3. Model Depth
One of the major differences between the architectures we employed was depth, i.e. the total number of layers in a network. Deeper networks tend to be able to extract more abstract information from data and, for many tasks, perform better [22]. We quantified this relationship by calculating Pearson correlations between model depth and the different model metrics. The correlation coefficients, included in Table V, describe the strength and direction of the correlation between variables, with values ranging from -1 (strong negative correlation) to 1 (strong positive correlation) [25]. In this context, Pearson correlations help us understand the relationship between depth and performance metrics for both TL and SC models. The correlations were computed between the metric columns in Table III and the architecture attribute values in Table II.
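As an illustration, the depth-accuracy coefficient for TL models can be computed with scipy.stats.pearsonr. The depths below are the nominal layer counts implied by the architecture names and may differ slightly from the values in Table II, so the result only approximates the reported 0.6224.

```python
from scipy.stats import pearsonr

# Nominal depths implied by the architecture names (Table II may count
# layers differently), paired with the TL accuracies from Table III.
depths = [16, 19, 50, 152, 121, 201]
tl_acc = [0.8182, 0.7273, 0.8409, 0.8182, 0.8182, 0.8636]

r, p_value = pearsonr(depths, tl_acc)
print(f"depth vs TL accuracy: r = {r:.4f} (p = {p_value:.3f})")
# r comes out near the 0.6224 reported in Table V.
```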
For TL models, there was a moderate positive correlation between depth and accuracy/F1-score, with coefficients of 0.6224 and 0.5046, respectively. This means that as depth increased, the accuracy and F1-score of the trained models also increased, implying that deeper models are desirable when transfer learning. Precision increased markedly with deeper models, with a correlation coefficient of 0.8226. Meanwhile, depth and recall had a very weak positive correlation of 0.1745, signifying that this metric is not strongly tied to depth. Taken together, these correlations signify that when transfer learning, model performance is moderately positively correlated with the depth of the network employed: as depth increases, performance generally trends upwards.
The same correlation was not found between depth and SC model metrics, with all coefficients below 0.5 in magnitude. Although depth had a notable positive correlation with model performance when transfer learning was employed, it had almost no effect when training from scratch. This is likely because the limited data was insufficient to adequately train models from scratch; as explained in 3.3.2, SC models performed worse across the board on average, with little relation to model depth. As a result, it is not possible to deduce whether model depth had a positive or negative impact on the performance of models trained from scratch for this task.
In our experiments, depth was correlated with better performance in TL models and had no measurable impact on SC models. While these correlations are helpful for understanding the impact of depth on our models' performance, they do not necessarily reveal a causal relationship; results depend on the implementation and the hyperparameters chosen. Experimenting with these model variations is useful, as it helps guide future design choices, particularly for TL models. With SC models, however, our results were inconclusive, and more experimentation is required to understand the importance of model depth for models trained from scratch. It is important to note that depth is only one factor influencing the performance of an architecture; examining the number of parameters is just as important.
3.3.4. Model Parameter Count
A crucial factor in selecting a CNN architecture is the parameter count: the number of trainable weights in an architecture, including the convolutional filter weights and the node connection weights of the fully-connected layers. More parameters can directly improve an architecture's ability to learn; however, a more complex (higher parameter count) model can be harder to train and prone to overfitting, especially with small datasets [22]. Table V contains the Pearson correlation coefficients between architecture parameter counts and the resulting model metrics.
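For reference, these parameter counts can be read off the Keras reference implementations. This is a quick sketch, not our training code, and since the stock ImageNet classification heads are attached here, the counts will differ somewhat from our Table I values, which reflect our custom FC head.

```python
from tensorflow.keras import applications

# Reference implementations with their stock classification heads;
# counts with a custom FC head (as in our setup) will differ.
for arch in (applications.VGG16, applications.VGG19,
             applications.ResNet50, applications.ResNet152,
             applications.DenseNet121, applications.DenseNet201):
    model = arch(weights=None)
    print(f"{model.name}: {model.count_params():,} parameters")
```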
For TL models, there was a notable negative correlation between parameter counts and model accuracy and F1-score, with coefficients of -0.6957 and -0.5602, respectively. In other words, higher parameter counts are associated with lower accuracies and F1-scores for TL models; the increased complexity has made these models less accurate and less robust. The trend is most pronounced for precision, with a coefficient of -0.9337. This is likely due to the higher parameter count architectures either overfitting or not having enough data to train on. Overall, lower parameter counts are correlated with better performance, which is important to consider when designing an architecture: more complex models are not necessarily better, especially in this scenario where training is difficult and datasets are limited. Much like with model depth, there was no observed correlation between the architectures' parameter counts and performance metrics for SC models.
When picking a CNN architecture for the diagnosis of Alzheimer's Disease, one must take parameter count into consideration. As seen in our results, lower parameter counts are associated with better performance when transfer learning, which emphasizes the importance of picking the right balance of complexity. In the future, implementing a fine-tuning approach (explored in 4.1.5) or using different hyperparameters (explored in 4.1.4) could reveal further insights.
3.4. Comparison to Existing Methods
It is important to place our study in the context of similar studies in the literature and to consider other methods of improving performance. In this section, we focus on accuracy to keep comparisons consistent, as not all studies used the same metrics we employed. [16] used a very similar training and evaluation setup to ours, training a CNN on single slices and evaluating using all of a brain's MRI scans. Using 2D CNNs on the axial view, their best result was a TL ResNet18 with an accuracy of 87.50%, slightly higher than our best result of 86.36% with TL DenseNet201. They reached a TL accuracy of 84.38% on both VGG16 and VGG19 (the identical accuracies stem from having only 16 test subjects), compared to our accuracies of 81.82% and 72.73% for TL VGG16 and VGG19, respectively. Their ResNet50 results were also similar to ours, at 78.12% SC accuracy and 81.25% TL accuracy, compared to our 70.45% SC and 84.09% TL accuracy. However, their best result, a custom transfer learning approach with a 3D ResNet model, achieved a remarkable accuracy of 96.88%.
In [21], the authors found results similar to ours regarding the impact of model depth and complexity, noting that hyperparameter tuning could improve the results of shallower and less complex models. Their TL ResNet50 and ResNet152 models both achieved accuracies of 82.68%, similar to our TL ResNet50 at 84.09% and our TL ResNet152 at 81.82%. Their TL DenseNet201 performed worse than ours, however, with an accuracy of 83.8% compared to our 86.36%. This could have a variety of causes, most likely differences in preprocessing setups or hyperparameters. Their best result was a custom DenseNet121-inspired method, achieving an accuracy of 94.97%.
[26] demonstrated a different method for extracting 2D MRI slices, using image entropy to pick the most informative slices. This helped them achieve an accuracy of 92.3% when transfer learning a VGG16 network, notably higher than our TL VGG16 result of 81.82%, underscoring the importance of experimenting with preprocessing setups when dealing with MRI images and CNNs. Finally, [20] implemented a custom 2D slice-based DenseNet model, achieving an accuracy of 92.4%, higher than our best performing model and highlighting the potential of custom architectures inspired by existing methods.
For many of our architectures, we achieved accuracies similar to those of other studies implementing the same architectures, and our TL DenseNet201 model performed competitively against many of the results in the studies mentioned. However, although our results are strong, they do not match the results other studies achieved with custom architectures or more advanced 3D techniques. Regardless, our results are valuable, as they inform us about the impact of different design decisions when creating custom CNNs for this task.