3.1. Dataset description
The experiments in this study use the APTOS-19 dataset, a freely available Kaggle dataset that gained attention through its use in a competition on automatic blindness detection related to DR. The dataset consists of 3,662 fundus photographs, each fully labeled with one of five grades (0, 1, 2, 3, and 4) denoting the severity of DR, from 0 for a normal, healthy retina to 4 for the most severe stage. The grades are labeled normal (0), mild (1), moderate (2), severe (3), and proliferative (4). A sample of images from the dataset is shown in Fig. 2.
The dataset contains 1805 images of grade 0, 370 of grade 1, 999 of grade 2, 193 of grade 3, and 295 of grade 4. It is partitioned into a training set of 2930 images and a validation set of 732 images.
3.2. Pre-processing
The dataset includes images classified into five distinct classes based on the severity of DR. Table 1 presents the distribution of sample images among the classes, which is significantly imbalanced. Training deep neural networks on imbalanced data can result in biased classification. Figure 3 displays the pre-processing steps applied to the input images before they are fed into a machine-learning model. As part of the first pre-processing step illustrated in Fig. 3(a), each input image is resized to 337 × 224 (refer to Fig. 3(b)) while maintaining the aspect ratio, reducing the training overhead of the deep networks.
Up-sampling and down-sampling have also been employed to address the dataset imbalance (Roychowdhury et al., 2014). For up-sampling, the minority classes are augmented by randomly cropping 224 × 224 patches (refer to Fig. 3(c)), followed by flipping and 90° rotation (refer to Fig. 3(e)), to balance the sample distribution, enlarge the dataset, and mitigate overfitting. For down-sampling, redundant instances of the majority classes are removed to match the cardinality of the smallest class. In the resulting distributions (before flipping and rotation), each image is mean normalized (refer to Fig. 3(d)) to mitigate feature bias and accelerate training. The dataset is partitioned into training and validation subsets of 80% and 20%, respectively; the validation set is used during training to assess and mitigate overfitting. The learning rate is adjusted adaptively from 0.01 down to 0.0001, based on the observed improvement in validation loss, to prevent overfitting. Image augmentation is implemented with the Keras ImageDataGenerator, using a rescale factor of 1/255, shear and zoom ranges of 0.2, and horizontal and vertical flips enabled. The generator performs the data augmentation automatically on the fly during training.
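The crop/flip/rotate/mean-normalize steps above can be sketched in NumPy (a minimal illustration, not the authors' exact code; the Keras generator performs equivalent operations on the fly):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Random 224x224 crop, random flip, random 90-degree rotation,
    then per-channel mean normalization, as in Fig. 3(c)-(e)."""
    h, w, _ = img.shape
    # Random 224x224 crop from the resized image (Fig. 3(c)).
    top = rng.integers(0, h - 224 + 1)
    left = rng.integers(0, w - 224 + 1)
    patch = img[top:top + 224, left:left + 224]
    # Random horizontal flip and 90-degree rotation (Fig. 3(e)).
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    patch = np.rot90(patch, k=rng.integers(0, 4))
    # Mean normalization to mitigate feature bias (Fig. 3(d)).
    return patch - patch.mean(axis=(0, 1))

img = rng.random((337, 224, 3))  # stands in for a resized fundus image
out = augment(img)
print(out.shape)  # (224, 224, 3)
```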
Table 1
No. of samples of different classes present in the APTOS-19 dataset
Grade             | No DR | Mild | Moderate | Severe | Proliferative
Training images   | 1434  | 300  | 808      | 154    | 234
Validation images | 371   | 70   | 191      | 39     | 61
Total images      | 1805  | 370  | 999      | 193    | 295
3.3. Ensemble model
An ensemble method is a meta-algorithm that consolidates multiple machine-learning techniques into a single predictive model. Ensemble methods can serve various purposes, including reducing variance (bagging), reducing bias (boosting), or improving prediction accuracy (stacking). Stacking leverages multiple predictive models to create a new model by aggregating information from each of them. The stacked approach often performs better than any individual model because it smooths over their individual errors: it emphasizes the strengths of each base model where it excels while downplaying its weaknesses where it underperforms. Stacking achieves the best results when the base models are notably diverse from one another. To enhance the predictive performance of our model, we employed stacking, which is evident from the observed results.
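As a minimal illustration of stacking (using only the two classical base learners on synthetic data; the CNN members and the actual DRDEL ensembling network are omitted here), sklearn's `StackingClassifier` can combine base-model class probabilities under a meta-learner:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class data standing in for extracted fundus-image features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True, random_state=0)),
                ("tree", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # meta-learner consumes class probabilities
)
stack.fit(X, y)
preds = stack.predict(X[:3])
print(preds.shape)  # (3,)
```

Setting `stack_method="predict_proba"` makes the meta-learner operate on the base models' class probabilities, mirroring the probability-level fusion used by DRDEL.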
As part of the proposed approach, ResNet50, DenseNet121, SqueezeNet1_0, SVM, and decision tree have been combined as base learners in an ensemble method. Let \({\Omega }\) = {ResNet50, DenseNet121, SqueezeNet1_0, SVM, decision tree} denote the set of base learners used in the experiment. Each learner is fine-tuned on a dataset (X; Y) of N fundus images X of size 224 × 224 and their corresponding labels \(Y = \{y \mid y \in \{0, 1, 2, 3, 4\}\}\), where the grades denote normal (0), mild (1), moderate (2), severe (3), and proliferative (4). The training set (Xtrain; Ytrain) is divided into mini-batches of size n = 8, denoted as (Xi; Yi) ∈ (Xtrain; Ytrain), i = 1, 2, …, \(\frac{N}{n}\), and each CNN model h ∈ \({\Omega }\) is iteratively optimized (fine-tuned) to minimize the empirical loss.
$$L\left(w,{X}_{i}\right)= \frac{1}{n}\sum _{x\in {X}_{i},\, y\in {Y}_{i}} l\left(h\left(x,w\right),y\right) \qquad \left(1\right)$$
In the equation, \(h\left(x,w\right)\) denotes the CNN model that predicts class y for input \(x\) given the weights w, and \(l(\cdot)\) is the categorical cross-entropy loss function. The training process updates the learning parameters using Nesterov-accelerated Adaptive Moment Estimation (Nadam) (Reyad et al., 2023).
$${w}_{t+1}= {w}_{t}-\frac{\alpha }{\sqrt{{\widehat{v}}_{t}}+\epsilon }\left({\beta }_{1}{\widehat{m}}_{t}+\frac{\left(1-{\beta }_{1}\right)\frac{\partial }{\partial {w}_{t}}L\left({w}_{t},{X}_{i}\right)}{1-{\beta }_{1}^{t}}\right) \qquad \left(2\right)$$
In the equation, \(\widehat{v}\), \(\alpha\), and \(\widehat{m}\) represent the second-order moment of the gradient, the learning rate, and the first-order moment, respectively. The decay rates of the moment estimates, denoted \({\beta }_{1}\) and \({\beta }_{2}\), are both initially set to 0.9. The Nesterov momentum term provides directional guidance for the next step and damps fluctuations. The initial weights \({w}_{t}\) at time t = 0 are set to the weights of the model h ∈ \({\Omega }\) learned through transfer learning (Yosinski et al., 2014). Each model h ∈ \({\Omega }\) applies softmax as the activation function of its output layer to produce the probabilities relating the input to the five classes (normal, mild, moderate, severe, and PDR). The learning rate \(\alpha\) starts at 0.01 and is reduced by a factor of 0.1 until it reaches 0.00001. Training runs for 50 epochs with early stopping to prevent overfitting.
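Equation (2) can be checked with a small NumPy implementation of one Nadam step (a sketch using the paper's β values of 0.9; the ε value and the toy quadratic objective are illustrative assumptions):

```python
import numpy as np

def nadam_step(w, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.9, eps=1e-8):
    """One Nadam update of Eq. (2): bias-corrected first/second moments
    plus a Nesterov look-ahead term on the current gradient."""
    m = beta1 * m + (1 - beta1) * grad        # first-order moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-order moment
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= alpha / (np.sqrt(v_hat) + eps) * (
        beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t))
    return w, m, v

# Toy check: minimize f(w) = w**2 (gradient 2w) starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = nadam_step(w, 2 * w, m, v, t)
print(abs(w))  # close to the minimizer w = 0
```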
3.4. Algorithm
The image dataset is represented as \(I_{3662 \times (224 \times 224 \times 3)}\), and \(L_{3662 \times 1}\) holds the true labels of the corresponding image samples, with each \(L_i \in \{0,1,2,3,4\}\). I is further divided into two subsets, \(IT_{3112 \times (224 \times 224 \times 3)}\) and \(IV_{550 \times (224 \times 224 \times 3)}\), the training and validation sets of the proposed model, respectively. Similarly, \(LT_{3112 \times 1}\) and \(LV_{550 \times 1}\) hold the true labels of the training and validation sets. \(ST_{3112 \times 350}\), a fully structured training set for training the decision tree and SVM, is obtained by passing IT through a pre-trained VGG-16 network. The algorithm produces a hypothesis F that maps each \(IV_i\) to \(\{0,1,2,3,4\}\), yielding a \(550 \times 1\) prediction vector L′. The step-by-step algorithm of the DRDEL model is given in Fig. 4.
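The construction of the 350-attribute structured set ST can be sketched as follows (a hypothetical illustration: random features stand in for the VGG-16 activations, PCA is shown as one plausible reduction method since the exact one is not restated here, and sizes are shrunk for speed):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for deep features extracted by a pre-trained VGG-16 from the
# training images IT: (n_samples, n_deep_features), shrunk for this demo.
deep_features = rng.random((400, 1000))

# Reduce to 350 selected attributes to form the structured training set ST,
# which is then used to fit the SVM and decision-tree base learners.
pca = PCA(n_components=350)
ST = pca.fit_transform(deep_features)
print(ST.shape)  # (400, 350)
```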
3.5. Initialization and parameter setting
The CNNs utilized in this context are pre-trained on the ImageNet dataset, while the built-in SVM and decision tree from the sklearn library serve as the remaining base learners. The DRDEL model is implemented in Python 3.6, and the CNNs are trained on Google Colab using an NVIDIA Tesla K80 GPU with a consistent batch size of 16. The Fastai library, compatible with Python 3.6, is used to manage the CNNs. The three CNNs are first trained for 20 epochs, and a learning rate versus training loss plot is generated for each network. After analyzing these plots, the networks are reloaded, unfrozen, and trained for an additional 10 epochs with learning rates adjusted according to the "loss vs. learning rate" graphs. This process continues until each CNN has reached 50 epochs. Following this approach, four sub-models are obtained per CNN: the first trained for 20 epochs, and the second, third, and fourth trained for an additional 10 epochs each (30, 40, and 50 epochs in total). The training loss is assessed for each sub-model, and the best-performing sub-model of each CNN is chosen for the subsequent prediction and ensembling tasks. The chosen sub-models are slightly modified to output the prediction probabilities of each class for each sample.
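The staged schedule can be summarized programmatically (the per-stage learning rates are illustrative assumptions; in practice each is re-chosen from the corresponding loss-versus-learning-rate plot):

```python
# Staged training: 20 epochs first, then three 10-epoch continuations,
# yielding four sub-models per CNN at 20, 30, 40, and 50 total epochs.
stages = [(20, 1e-2), (10, 1e-3), (10, 1e-4), (10, 1e-5)]

sub_models = []
total_epochs = 0
for epochs, lr in stages:
    total_epochs += epochs
    # Each entry marks a checkpoint from which a sub-model is saved.
    sub_models.append({"total_epochs": total_epochs, "stage_lr": lr})

print([s["total_epochs"] for s in sub_models])  # [20, 30, 40, 50]
```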
Features were extracted from the 3112 training images, each of size 224 × 224, to train the multiclass SVM and multiclass decision tree. After dimensionality reduction of the 3112 samples with 21,055 attributes, a set of 350 selected attributes is used to train the two multiclass classifiers. Each classifier's performance is evaluated for statistical analysis, and prediction probabilities are computed in the same manner as for the CNN sub-models. The class-prediction probabilities are fed into the input layer of a neural network trained to calculate impact weights; this network, which performs the final ensemble task, updates the impact weights via backpropagation. It is trained on a separate labeled dataset of 3112 samples and 25 attributes, where the attributes are the class-wise probabilities assigned to each sample by the five learners. The model's accuracy stabilized after 150 epochs, and DRDEL has been evaluated on the validation set of 550 samples.
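A minimal sketch of this final ensembling step, assuming a small fully connected network over the 25 probability attributes (5 learners × 5 classes); synthetic probabilities stand in for the real base-learner outputs, and sklearn's `MLPClassifier` stands in for the authors' network:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in: 5 base learners x 5 class probabilities per sample.
y = rng.integers(0, 5, size=n)
probs = rng.random((n, 5, 5))
# Make each learner's probabilities weakly peak at the true class.
probs[np.arange(n), :, y] += 2.0
probs /= probs.sum(axis=2, keepdims=True)
X = probs.reshape(n, 25)  # the 25-attribute meta-dataset

# Backprop-trained meta-learner combining the base predictions.
meta = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
meta.fit(X, y)
print(round(meta.score(X, y), 2))
```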