Deep learning has become a popular class of machine learning algorithms in computer vision and has been successfully employed in various tasks, including multimedia analysis (image, video, and audio analysis), natural language processing, and robotics1. In particular, deep convolutional neural networks (CNNs) hierarchically learn high-level and complex features from input data, hence eliminating the need for handcrafting features, as in the case of conventional machine learning schemes2.
The application of these methods in neuroimaging is rapidly growing (see Greenspan et al.3 and Zaharchuk et al.4 for reviews). Several studies employed deep learning methods for image improvement and transformation5–10. Other studies performed lesion detection and segmentation11–13 and image-based diagnosis using different CNN architectures14,15. Deep learning has also been applied to more complex tasks, including identifying patterns of disease subtypes, determining risk factors, and predicting disease progression (see, e.g., Zaharchuk et al.4 and Davatzikos16 for reviews). Early works applied stacked auto-encoders14,17,18 and deep belief networks19 to classify neurological patients from healthy subjects using data collected from different neuroimaging modalities, including magnetic resonance imaging (MRI), positron emission tomography (PET), resting-state functional MRI (rsfMRI), and combinations of these modalities20.
Some authors reported very high accuracies in classifying patients with neurological diseases, such as Alzheimer’s disease (AD) and Parkinson’s disease (PD). For a binary classification of AD vs. healthy controls, Hon and Khan21 reported accuracy of up to 96.25% using a transfer learning strategy. Sarraf et al.22 classified subjects as AD patients or healthy controls with a subject-level accuracy of 100% by adopting the LeNet-5 and GoogLeNet network architectures. In other studies, CNNs have been used to perform multi-class discrimination of subjects. Recently, Wu and colleagues23 adopted a pre-trained CaffeNet and achieved accuracies of 98.71%, 72.04%, and 92.35% for a three-way classification between healthy controls, stable mild cognitive impairment (MCI), and progressive MCI patients, respectively. In another work, Islam and Zhang24 proposed an ensemble system of three homogeneous CNNs and reported an average multi-class classification accuracy of 93.18% on the Open Access Series of Imaging Studies (OASIS) dataset. For the classification of PD, Esmaeilzadeh et al.25 distinguished PD patients from healthy controls based on MRI and demographic information (i.e., age and gender). With their proposed 3D model, they achieved 100% accuracy on the test set. In another study, Sivaranjini and Sujatha26 used a pre-trained 2D CNN AlexNet architecture to classify PD patients vs. healthy controls, obtaining an accuracy of 88.9%.
Although deep learning has shown very good performance in the classification of neurological disorders, many challenges still need to be addressed, including the complexity and difficulty of interpreting the results due to highly nonlinear computations, non-reproducibility of the results, data/information leakage, and, especially, data overfitting (see Vieira et al.20 and Davatzikos16 for reviews).
Overly optimistic results may be due to data leakage – a process caused by incorporating information from the test data into the learning process. While the conclusion that data leakage leads to overly optimistic results will surprise few practitioners, we believe that the extent to which this is happening in neuroimaging applications is mostly unknown, especially in small datasets. As we completed this study, we became aware of independent research by Wen et al.27 that corroborates part of our conclusions regarding the problem of data leakage. They proposed a framework for the reproducible assessment of AD classification methods. However, their architectures were not trained and tested on the smaller datasets typical of clinical practice, and they mainly employed hold-out model validation strategies rather than cross-validation (CV), which gives a better indication of how well a model performs on unseen data28,29. Moreover, the authors focused on illustrating the effect of data leakage on the classification of AD patients only.
Unfortunately, the problem of data leakage caused by incorrect data splits is not limited to AD classification but can also be seen in studies of various other neurological disorders. Data leakage is more commonly observed in 2D architectures, yet some forms, such as late split, can be present in 3D CNN studies as well. Moreover, although deep, complex classifiers are more prone to overfitting, there is no reason to believe that conventional machine learning algorithms cannot be affected by data leakage. A summary of the works with clear and potential data leakage is given in Tables 1 and 2, respectively. Other works providing insufficient information to assess data leakage are reported in Table 3.
Table 1
Summary of the previous studies performing classification of neurological disorders using MRI and with clear data leakage.
| Disorder | Reference | Groups (number of subjects) | Machine learning model | Data split method | Type of data leakage | Accuracy (%) |
|---|---|---|---|---|---|---|
| AD/MCI | Gunawardena et al., 201758 | AD-MCI-HC (36) | 2D CNN | 4:1 train/test slice-level split | wrong split | 96.00 |
| | Hon & Khan, 201721 | AD-HC (200) | 2D CNN (VGG16) | 4:1 train/test slice-level split | wrong split | 96.25 |
| | Jain et al., 201959 | AD-MCI-HC (150) | 2D CNN (VGG16) | Data augmentation + 4:1 train/test slice-level split | late and wrong split | 95.00 |
| | Khagi et al., 201960 | AD-HC (56) | 2D CNN (AlexNet, GoogLeNet, ResNet50, new CNN) | 6:2:2 train/validation/test slice-level split | wrong split | 98.00 |
| | Sarraf et al., 201722 | AD-HC (43) | 2D CNN (LeNet-5) | 3:1:1 train/validation/test slice-level split | wrong split | 96.85 |
| | Wang et al., 201761 | MCI-HC (629) | 3D CNN | Data augmentation + 10:3:3 train/validation/test split by MRI slices | wrong split | 90.60 |
| | Puranik et al., 201862 | AD/EMCI-HC | 2D CNN | 17:3 train/test split by MRI slices | wrong split | 98.40 |
| | Basheera et al., 201963 | AD-HC | 2D CNN | 4:1 train/test split by MRI slices | wrong split | 90.47 |
| | Basaia et al., 201964 | AD-HC | 2D CNN | 9:1 slice-level split | wrong split | 99.00 |
AD = Alzheimer’s disease; HC = Healthy controls; MCI = Mild cognitive impairment.
Table 2
Summary of the previous studies performing classification of neurological disorders using MRI and suspected to have potential data leakage.
| Disorder | Reference | Groups (number of subjects) | Machine learning model | Data split method | Type of data leakage | Accuracy (%) |
|---|---|---|---|---|---|---|
| AD/MCI | Farooq et al., 201735 | AD-MCI-LMCI-HC (355) | 2D CNN (GoogLeNet and modified ResNet) | 3:1 train/test (potential) slice-level split | wrong split | 98.80 |
| | Ramzan et al., 201965 | HC-SMC-EMCI-MCI-LMCI-AD (138) | 2D CNN (ResNet-18) | 7:2:1 train/validation/test (potential) slice-level split | wrong split | 100 |
| | Raza et al., 201966 | AD-HC (432) | 2D CNN (AlexNet) | 4:1 train/test (potential) slice-level split | wrong split | 98.74 |
| | Pathak et al., 202067 | AD-HC | 2D CNN | 3:1 (potential) slice-level split | wrong split | 91.75 |
| ASD | Libero et al., 201568 | ASD-TD (37) | Decision tree | unclear | entire dataset used for feature selection | 91.90 |
| | Zhou et al., 201469 | ASD-TD/HC (280) | Random tree classifier | 4:1 train/test split | entire dataset used for feature selection | 100 |
| PD | Sivaranjini & Sujatha, 201926 | PD-HC (182) | 2D CNN | 4:1 train/test split by MRI slices | wrong split | 88.90 |
| TBI | Lui et al., 201470 | TBI-HC (47) | Multilayer perceptron | 10-fold CV | entire dataset used for feature selection | 86.00 |
| Brain tumor | Hasan et al., 201971 | Tumor-HC (600) | MGLCM + 2D CNN + SVM | 10-fold CV | dataset divided randomly into 10 folds | 99.30 |
AD = Alzheimer’s disease; ASD = Autism spectrum disorder; HC = Healthy controls; MCI = Mild cognitive impairment; PD = Parkinson’s disease; SWEDD = scans without evidence of dopaminergic deficit; TBI = Traumatic brain injury; TD = Typically developing.
Table 3
Summary of the previous studies performing classification of neurological disorders using MRI and that provide insufficient information to assess data leakage.
| Disorder | Reference | Groups (number of subjects) | Machine learning model | Data split method | Accuracy (%) |
|---|---|---|---|---|---|
| AD/MCI | Al-Khuzaie et al., 202172 | AD-HC (240) | 2D CNN | (potential) slice-level split | 99.30 |
| | Billones et al., 201673 | AD-MCI-HC (900) | 2D CNN (modified VGG16) | 7:3 train/test (potential) slice-level split | 91.85 |
| | Wu et al., 201823 | AD-HC | 2D CNN | Data augmentation + 2:1 train/test split by MRI slices | 97.58 |
AD = Alzheimer’s disease; HC = Healthy controls; MCI = Mild cognitive impairment.
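The slice-level leakage recurring in these studies is easy to reproduce. The following sketch is our own illustration (not code from any cited study), assuming scikit-learn and entirely synthetic subject IDs and slice features: a shuffled slice-level split places slices of almost every subject in both the training and test sets, while a subject-level (group) split keeps each subject's slices in a single fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
n_subjects, slices_per_subject = 20, 10
# One row per 2D slice; record which subject each slice came from.
subject_of_slice = np.repeat(np.arange(n_subjects), slices_per_subject)
X = rng.normal(size=(len(subject_of_slice), 64))  # fake slice features

def n_leaked(splits):
    """Count subjects whose slices appear in both train and test of a fold."""
    return sum(len(set(subject_of_slice[tr]) & set(subject_of_slice[te]))
               for tr, te in splits)

# Slice-level split: shuffles slices, ignoring subject identity.
slice_leak = n_leaked(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
# Subject-level split: GroupKFold keeps each subject in exactly one fold.
subject_leak = n_leaked(GroupKFold(n_splits=5).split(X, groups=subject_of_slice))

print(slice_leak, subject_leak)  # slice-level split leaks subjects; group split leaks none
```

With 20 synthetic subjects of 10 slices each, the slice-level split leaks nearly every subject into every test fold, whereas the group split leaks none.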
In this study, we addressed the issue of data leakage in one of the most common classes of deep learning models, i.e., 2D CNNs, caused by incorrect dataset splits of 3D MRI data. Specifically, we quantified the effect of data leakage on CNN models trained on different datasets of T1-weighted brain MRI of healthy controls and patients with neurological disorders using a nested CV scheme with two different data split strategies: a) subject-level split, avoiding any form of data leakage, and b) slice-level split, in which different slices of the same subject are contained in both the training and the test folds (thus data leakage occurs). We focused our attention on both large (about 200 subjects) and small (about 30 subjects) datasets to evaluate a possible increase in performance overestimation when a smaller dataset is used, as is often the case in clinical practice. This paper expands on the preliminary results by Yagis et al.30, offering a broader investigation of the issue. In particular, we performed the classification of AD patients using the following datasets: 1) OASIS-200, consisting of 100 AD patients and 100 healthy controls randomly sampled from the OASIS-1 study31; 2) ADNI, including 100 AD patients and 100 healthy controls randomly sampled from the Alzheimer’s Disease Neuroimaging Initiative (ADNI)32; and 3) OASIS-34, composed of 34 subjects (17 AD patients and 17 healthy controls) randomly selected from the OASIS-200 dataset. Given that the performance of a model trained on a small dataset could depend on the selected samples, we created ten instances of the OASIS-34 dataset by randomly sampling from the OASIS-200 dataset ten times independently. The subject IDs included in each instance are listed in Supplementary Table S1 (Supporting Information). Moreover, we generated a further dataset, called OASIS-random, in which each subject of the OASIS-200 dataset was assigned a fake random label of either AD patient or healthy control.
In this case, the image data had no relationship with the assigned labels. In addition, we included two T1-weighted image datasets of patients with de-novo PD: PPMI, including 100 de-novo PD patients and 100 healthy controls randomly chosen from the public Parkinson’s Progression Markers Initiative (PPMI) dataset33, and Versilia, a small private clinical dataset of 17 patients with de-novo PD and 17 healthy controls. A detailed description of each dataset is reported in the “Methods” section.
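A subject-level nested CV of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: scikit-learn is assumed, the features and labels are synthetic, and a logistic regression with an arbitrary hyper-parameter grid stands in for the CNNs. Grouping by subject ID in both the outer and inner loops prevents slices of one subject from crossing any split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(1)
n_subjects, slices_per_subject = 30, 8
groups = np.repeat(np.arange(n_subjects), slices_per_subject)  # subject ID per slice
X = rng.normal(size=(len(groups), 32))                         # fake slice features
y = np.repeat(rng.integers(0, 2, n_subjects), slices_per_subject)  # one label per subject

outer_scores = []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups):
    # Inner loop: hyper-parameter selection, still split by subject.
    search = GridSearchCV(LogisticRegression(max_iter=200),
                          param_grid={"C": [0.1, 1.0, 10.0]},
                          cv=GroupKFold(n_splits=3))
    search.fit(X[tr], y[tr], groups=groups[tr])
    # Outer test fold: subjects never seen during training or tuning.
    outer_scores.append(search.score(X[te], y[te]))

print(np.mean(outer_scores))  # near chance level here, since the labels are random
```

Because the labels in this toy example are random and the split is leakage-free, the outer-fold accuracy stays near chance, which is exactly the behavior expected of an unbiased estimate.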