3.1. Implementation Parameters
For the training and evaluation of the CNNs we used the ADNI dataset described in 2.2.1, with each subject represented by 10 axial images following the preprocessing described in 2.2.2. The dataset comprised 188 AD subjects and 229 CN subjects, split with a train/test/validation ratio of 8:1:1. The ratio was applied in a grouped manner, ensuring that all MRI slices belonging to a single subject were placed in the same set (see the sketch after the list below).
The distribution of subjects in each set is as follows:
- Train set: 150 AD subjects and 183 CN subjects (1500 and 1830 image slices, respectively)
- Test set: 20 AD subjects and 24 CN subjects (200 and 240 image slices, respectively)
- Validation set: 18 AD subjects and 22 CN subjects (180 and 220 image slices, respectively)
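As an illustration, a subject-grouped split of this kind can be produced with scikit-learn's GroupShuffleSplit. This is a minimal sketch rather than our exact pipeline; the DataFrame columns (subject_id, slice_path, label) are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical slice-level table: one row per MRI slice.
df = pd.read_csv("slices.csv")  # columns: subject_id, slice_path, label

# First split off 80% of subjects for training. Grouping by subject
# guarantees that all 10 slices of a subject land in the same set.
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, rest_idx = next(gss.split(df, groups=df["subject_id"]))
train, rest = df.iloc[train_idx], df.iloc[rest_idx]

# Split the remaining 20% of subjects evenly into test and validation.
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
test_idx, val_idx = next(gss2.split(rest, groups=rest["subject_id"]))
test, val = rest.iloc[test_idx], rest.iloc[val_idx]
```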
During training, data augmentation was randomly applied as explained in 2.2.2. The models were trained with a batch size of 32, with up to 100 epochs of training allowed per model. Early stopping was employed so that training would stop if there was no reduction in validation loss for 15 consecutive epochs. We also applied learning rate reduction callbacks, so the learning rate would be reduced by a factor of 0.1 if validation loss did not improve for five consecutive epochs, down to a minimum learning rate of 0.5e-6.
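In Keras, this configuration corresponds to the EarlyStopping and ReduceLROnPlateau callbacks. The sketch below reproduces the settings described above; `model`, `train_gen`, and `val_gen` are placeholders for our compiled network and augmented data generators, and whether best weights were restored on stopping is not specified above, so that option is omitted.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop training if validation loss has not improved for 15 epochs.
    EarlyStopping(monitor="val_loss", patience=15),
    # Reduce the learning rate by a factor of 0.1 after 5 stagnant epochs,
    # with a floor of 0.5e-6.
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5,
                      min_lr=0.5e-6),
]

# Batch size 32 is set on the generators themselves.
model.fit(train_gen, validation_data=val_gen, epochs=100,
          callbacks=callbacks)
```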
Training was done on an Amazon Web Services (AWS) g5.xlarge GPU instance. This instance features an NVIDIA A10G Tensor Core GPU with 24 GiB of GPU memory, 250 GB of storage, 4 vCPUs, and 16 GiB of RAM. Evaluation was done using the majority voting strategy explained in 2.3.2.
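As a rough illustration of the subject-level majority vote (the full procedure is given in 2.3.2), the sketch below aggregates per-slice predictions into one label per subject. The array names are hypothetical, and the tie-breaking rule here (a 5-5 split falls to CN) is an assumption, as 2.3.2 defines the actual behavior.

```python
import numpy as np

def majority_vote(slice_preds: np.ndarray, subject_ids: np.ndarray) -> dict:
    """Aggregate binary per-slice predictions (0 = CN, 1 = AD) into one
    label per subject by majority vote over that subject's slices."""
    labels = {}
    for subject in np.unique(subject_ids):
        votes = slice_preds[subject_ids == subject]
        # Ties (possible with 10 slices) fall to CN in this sketch.
        labels[subject] = int(votes.sum() > len(votes) / 2)
    return labels
```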
3.3. Detailed Results Analysis
Our results table contains a great deal of information relating to the different architectural attributes and training strategies we used. In 3.3.1 we will look at the different performance metrics, identifying each model's performance within them and what those performances mean. After that, we will look at the impact of transfer learning compared to training from scratch in 3.3.2, followed by the impact of model depth and parameter count on our results in 3.3.3 and 3.3.4. Finally, we will compare our results to existing methods in 3.4.
TABLE III: Performances of different architectures & training strategies.
TL = Transfer learned models, SC = Models trained from scratch.
| Architecture | Accuracy (TL) | Accuracy (SC) | Precision (TL) | Precision (SC) | Recall (TL) | Recall (SC) | F1-Score (TL) | F1-Score (SC) |
| VGG16 | 0.8182 | 0.7727 | 0.8000 | 0.7500 | 0.8000 | 0.7500 | 0.8000 | 0.7500 |
| VGG19 | 0.7273 | 0.7955 | 0.8000 | 0.7619 | 0.6000 | 0.8000 | 0.6667 | 0.7805 |
| ResNet50 | 0.8409 | 0.7045 | 0.8824 | 0.7059 | 0.7500 | 0.6000 | 0.8108 | 0.6486 |
| ResNet152 | 0.8182 | 0.6591 | 0.9286 | 0.6316 | 0.6500 | 0.6000 | 0.7647 | 0.6154 |
| DenseNet121 | 0.8182 | 0.8409 | 0.8750 | 0.7826 | 0.7000 | 0.9000 | 0.7778 | 0.8372 |
| DenseNet201 | 0.8636 | 0.7955 | 0.8889 | 0.7619 | 0.8000 | 0.8000 | 0.8421 | 0.7805 |
Figure 5: Training history for two different models.
3.3.1. Model Performances
In terms of accuracy and F1-score, the TL DenseNet201 model outperforms all other models, with an accuracy of 0.8636 and an F1-score of 0.8421. It is closely followed by the SC DenseNet121 model, with an accuracy of 0.8409 and an F1-score of 0.8372. These high scores indicate that the models are effectively balancing precision and recall, making them the most robust models trained for this task. In contrast, the worst performing model under these metrics is the SC ResNet152 model, with an accuracy of 0.6591 and an F1-score of 0.6154. The plot of this model's training history in Fig. 5a reveals that the model is underfitting: it does not achieve high accuracy on either the training or the validation data (especially compared to the TL DenseNet201 model). There are several possible reasons for this. One is that, being an SC model, it has significantly more trainable parameters, which makes convergence difficult when datasets are small. Beyond that, the ResNet likely suffered from overregularization, meaning that its batch normalization layers and skip connections regularized the network's weights too heavily, leading to underfitting.
Despite this, the model with the highest precision was the TL ResNet152, with a precision of 0.9286. Although this is a less robust model than those mentioned previously (F1-score of 0.7647), its positive predictive value is very high: 92.9% of the brains classified as having Alzheimer's Disease actually had it. This is helpful information, as although the model is likely to miss positive samples due to its low recall (0.65), its positive predictions carry a very high degree of certainty. However, the previously mentioned TL DenseNet201 has only a slightly lower precision (0.8889) with a much higher F1-score, meaning that it is still better overall for this task. The model with the worst precision was the SC ResNet152, with a precision of 0.6316, meaning that a large share of its positive predictions were false positives.
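To make these metrics concrete, the confusion-matrix counts for the TL ResNet152 can be reconstructed from the reported scores and the test-set sizes (20 AD, 24 CN). The counts below are inferred from that arithmetic rather than taken directly from our logs.

```python
# Inferred counts for TL ResNet152 on the 44-subject test set (20 AD, 24 CN):
tp, fp, fn, tn = 13, 1, 7, 23

precision = tp / (tp + fp)                          # 13 / 14 = 0.9286
recall = tp / (tp + fn)                             # 13 / 20 = 0.6500
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 36 / 44 = 0.8182
f1 = 2 * precision * recall / (precision + recall)  # 0.7647

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"accuracy={accuracy:.4f} f1={f1:.4f}")
```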
In terms of recall, the best performing model is the SC DenseNet121 model, with a score of 0.9, meaning it correctly identified the largest proportion of positive samples out of the models trained. It is followed by a four-way tie between TL VGG16, SC VGG19, TL DenseNet201, and SC DenseNet201, which all had a recall of 0.8, meaning they correctly identified 80% of AD subjects. The worst recall was achieved by TL VGG19, SC ResNet50, and SC ResNet152, all having a recall of 60%.
These results tell us which of our models performed best; in isolation, however, they do not reveal much about how different design decisions affected each model's performance. To understand that, we must compare performances against each other on a macro scale.
3.3.2. Training Strategies
Training strategy is a major component of model performance when using CNNs. Often, transfer learning from existing models can achieve better accuracy with smaller datasets, though this depends on the task. This section compares TL models with SC models.
In general, TL models performed better on this task than their SC counterparts.
TABLE IV: Average performance of training strategies, and their difference.
TL = Transfer learned models, SC = Models trained from scratch.
| Training Strategy | Accuracy | Precision | Recall | F1-Score |
| TL | 0.8144 | 0.8625 | 0.7167 | 0.7770 |
| SC | 0.7614 | 0.7323 | 0.7417 | 0.7354 |
| Difference | 0.0530 | 0.1302 | -0.0250 | 0.0416 |
Looking at the average performance between these models, we can see a noticeable difference. As seen in Table IV, TL models had, on average, an accuracy 5.3 percentage points higher than their SC equivalents. This is similarly reflected in the average F1-scores, with TL models scoring 0.0416 higher on average. This means that, at least among the models tested, transfer learning is the better approach on average: models that employ transfer learning have higher accuracy and are more robust than those trained from scratch. This is likely because the dataset used for this task was relatively small, making convergence more difficult when entire networks must be trained as opposed to just the FC layers.
The difference between TL and SC models is most noticeable in precision. This metric is, on average, 0.1302 higher when models are transfer learned, meaning that the positive AD predictions made by TL models were, on average, 13 percentage points more likely to be accurate than those made by SC models. The only metric that did not improve, and instead got worse on average with transfer learning, was recall: models trained from scratch had 0.025 higher recall than transfer learned models. This is a smaller difference than for the other metrics, but it is important to consider, as it means that transfer learned models on average missed slightly more positive samples when tested.
The main outliers in this comparison are the VGG19 and DenseNet121 models. VGG19 had a TL accuracy of 0.7273 versus a SC accuracy of 0.7955, and a TL F1-score of 0.6667 versus a SC F1-score of 0.7805; training from scratch ultimately produced a more accurate and more robust model despite the limited amount of data. Similarly, DenseNet121 had a TL accuracy of 0.8182 versus a SC accuracy of 0.8409, along with a TL F1-score of 0.7778 versus a SC F1-score of 0.8372. This could be due to a variety of factors, most likely model depth and parameter count, which will be elaborated on in 3.3.3 and 3.3.4. DenseNet121 likely performed better in the SC configuration because the architecture has significantly fewer parameters than the other architectures (shown in Table I), making it easier to train from scratch on smaller datasets. VGG19, by contrast, is by far the most complex model in terms of parameters, and in this case training from scratch appears to have avoided the overfitting seen in the TL model. Looking at the training history for VGG19, we find that on the final epoch the SC model had a training accuracy of 0.7162 and a validation accuracy of 0.68, whilst the TL model had 0.8354 and 0.6675, respectively (the TL and SC training histories can be found in Appendix C and Appendix B, respectively). The larger gap between training and validation accuracy for the TL model indicates a degree of overfitting, meaning the model is overly tuned to the training dataset, leading to poor generalization. In the future, to improve results for a model like this, regularization techniques such as dropout and pooling should be employed [22].
Despite these exceptions, the results clearly demonstrate that transfer learning is the more effective approach for this task: all other architectures performed better when employing transfer learning than when training from scratch. This is likely due to the relatively small dataset used for this project, meaning there was not enough data to train entire models without transfer learning. If a larger dataset were used, training from scratch could yield better results; however, given the small number and size of datasets for this task, transfer learning is likely to remain the dominant training strategy when classifying AD patients using CNNs.
TABLE V: Pearson correlations measuring the impact of model depth and parameter count on different metrics. Each cell gives the correlation between the attribute (row) and the metric (column). TL = Transfer learned models, SC = Models trained from scratch.
| Attribute | Accuracy (TL) | Accuracy (SC) | Precision (TL) | Precision (SC) | Recall (TL) | Recall (SC) | F1-Score (TL) | F1-Score (SC) |
| Depth | 0.6224 | -0.0764 | 0.8226 | -0.2076 | 0.1745 | 0.0394 | 0.5046 | -0.0397 |
| Parameters | -0.6957 | -0.1231 | -0.9337 | -0.0885 | -0.1582 | -0.1176 | -0.5602 | -0.0950 |
3.3.3. Model Depth
One of the major differences between the architectures we employed was depth, i.e. the total number of layers in a network. Deeper networks tend to be able to extract more abstract information from data and, for many tasks, perform better [22]. We quantified this relationship by calculating Pearson correlations between model depth and the different model metrics. The correlation coefficients, included in Table V, describe the strength and direction of the correlation between variables, with values ranging from -1 (strong negative correlation) to 1 (strong positive correlation) [25]. In this context, Pearson correlations help us understand the relationship between depth and performance metrics for both TL and SC models. The correlations were computed between the metric columns in Table III and the architecture attribute values in Table II.
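As an illustration, the depth-accuracy coefficient for TL models can be computed with scipy.stats.pearsonr. The depths below are the nominal layer counts implied by the architecture names and may differ slightly from the values in Table II, so the result only approximates the reported 0.6224.

```python
from scipy.stats import pearsonr

# Nominal depths implied by the architecture names (Table II may count
# layers differently), paired with the TL accuracies from Table III.
depths = [16, 19, 50, 152, 121, 201]
tl_acc = [0.8182, 0.7273, 0.8409, 0.8182, 0.8182, 0.8636]

r, p_value = pearsonr(depths, tl_acc)
print(f"depth vs TL accuracy: r = {r:.4f} (p = {p_value:.3f})")
# r comes out near the 0.6224 reported in Table V.
```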
For TL models, there was a moderate positive correlation between depth and accuracy/F1-score, with coefficients of 0.6224 and 0.5046, respectively. This means that as depth increased, the accuracy and F1-score of the trained models also increased, implying that deeper models are desirable when transfer learning. Precision increased markedly with deeper models, with a correlation coefficient of 0.8226. Meanwhile, depth and recall had a very weak positive correlation of 0.1745, signifying that this metric is not strongly tied to depth. Taken together, these correlations signify that when transfer learning, model performance is moderately positively correlated with the depth of the network employed: as depth increases, performance generally trends upwards.
The same correlation was not found between depth and SC model metrics, with all coefficients below 0.5 in magnitude. Although depth had a notable positive correlation with model performance when transfer learning was employed, it had almost no effect when training from scratch. This is likely because the limited data was insufficient to adequately train models from scratch; as explained in 3.3.2, SC models performed worse across the board on average, with little relation to model depth. As a result, it is not possible to deduce whether model depth had a positive or negative impact on the performance of models trained from scratch for this task.
In our experiments, depth was correlated with better performance in TL models and had no measurable impact on SC models. While these correlations are helpful for understanding the impact of depth on our models' performance, they do not necessarily reveal a causal relationship; results depend on the implementation and the hyperparameters chosen. Experimenting with these model variations is useful, as it helps guide future design choices, particularly for TL models. With SC models, however, our results were inconclusive, and more experimentation is required to understand the importance of model depth for models trained from scratch. It is important to note that depth is only one factor influencing the performance of an architecture; examining the number of parameters is just as important.
3.3.4. Model Parameter Count
A crucial factor in selecting a CNN architecture is the parameter count: the number of trainable weights in an architecture, including the convolutional filter weights and the node connection weights of the fully-connected layers. More parameters can directly improve an architecture's ability to learn; however, a more complex (higher parameter count) model can be harder to train and prone to overfitting, especially with small datasets [22]. Table V contains the Pearson correlation coefficients between architecture parameter counts and the resulting model metrics.
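For reference, these parameter counts can be read off the Keras reference implementations. This is a quick sketch, not our training code, and since the stock ImageNet classification heads are attached here, the counts will differ somewhat from our Table I values, which reflect our custom FC head.

```python
from tensorflow.keras import applications

# Reference implementations with their stock classification heads;
# counts with a custom FC head (as in our setup) will differ.
for arch in (applications.VGG16, applications.VGG19,
             applications.ResNet50, applications.ResNet152,
             applications.DenseNet121, applications.DenseNet201):
    model = arch(weights=None)
    print(f"{model.name}: {model.count_params():,} parameters")
```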
For TL models, there was a notable negative correlation between parameter counts and model accuracy and F1-score, with coefficients of -0.6957 and -0.5602, respectively. In other words, higher parameter counts are associated with lower accuracies and F1-scores for TL models; the increased complexity has made these models less accurate and less robust. The trend is most pronounced for precision, with a coefficient of -0.9337. This is likely due to the higher parameter count architectures either overfitting or not having enough data to train on. Overall, lower parameter counts are correlated with better performance, which is important to consider when designing an architecture: more complex models are not necessarily better, especially in this scenario where training is difficult and datasets are limited. Much like with model depth, there was no observed correlation between the architectures' parameter counts and performance metrics for SC models.
When picking a CNN architecture for the diagnosis of Alzheimer's Disease, one must take parameter count into consideration. As seen in our results, lower parameter counts are associated with better performance when transfer learning, which emphasizes the importance of picking the right balance of complexity. In the future, implementing a fine-tuning approach (explored in 4.1.5) or using different hyperparameters (explored in 4.1.4) could reveal further insights.
3.4. Comparison to Existing Methods
It is important to place our study in the context of similar studies in the literature and to consider other methods of improving performance. In this section, we focus on accuracy to keep comparisons consistent, as not all studies used the same metrics we employed. [16] used a very similar training and evaluation setup to ours, training a CNN on single slices and evaluating using all of a brain's MRI scans. Using 2D CNNs on the axial view, their best result was a TL ResNet18 with an accuracy of 87.50%, slightly higher than our best result of 86.36% with TL DenseNet201. They reached a TL accuracy of 84.38% on both VGG16 and VGG19 (the identical accuracies stem from having only 16 test subjects), compared to our accuracies of 81.82% and 72.73% for TL VGG16 and VGG19, respectively. Their ResNet50 results were also similar to ours, at 78.12% SC accuracy and 81.25% TL accuracy, compared to our 70.45% SC and 84.09% TL accuracy. However, their best result, a custom transfer learning approach with a 3D ResNet model, achieved a remarkable accuracy of 96.88%.
In [21], the authors found results similar to ours regarding the impact of model depth and complexity, noting that hyperparameter tuning could improve the results of shallower and less complex models. Their TL ResNet50 and ResNet152 models both achieved accuracies of 82.68%, similar to our TL ResNet50 at 84.09% and our TL ResNet152 at 81.82%. Their TL DenseNet201 performed worse than ours, however, with an accuracy of 83.8% compared to our 86.36%. This could have a variety of causes, most likely differences in preprocessing setups or hyperparameters. Their best result was a custom DenseNet121-inspired method, achieving an accuracy of 94.97%.
[26] demonstrated a different method for extracting 2D MRI slices, using image entropy to pick the most informative slices. This helped them achieve an accuracy of 92.3% when transfer learning a VGG16 network, notably higher than our TL VGG16 result of 81.82%, underscoring the importance of experimenting with preprocessing setups when dealing with MRI images and CNNs. Finally, [20] implemented a custom 2D slice-based DenseNet model, achieving an accuracy of 92.4%, higher than our best performing model and highlighting the potential of custom architectures inspired by existing methods.
For many of our architectures, we achieved accuracies similar to those of other studies implementing the same architectures, and our TL DenseNet201 model performed competitively against many of the results in the studies mentioned. However, although our results are strong, they do not match the results other studies achieved with custom architectures or more advanced 3D techniques. Regardless, our results are valuable, as they inform us about the impact of different design decisions when creating custom CNNs for this task.