In this section, the selected papers are discussed in turn.
Sumaiya Pathan et al., [3] proposed a multi-step system for automatic glaucoma detection. The process involves preprocessing the images and then isolating the Region Of Interest (ROI) through the analysis of statistical features. Clinical and texture-based features are then extracted from the ROI. Finally, an ensemble of classifier models is constructed using dynamic selection techniques. Evaluations were carried out on both public databases and 300 hospital images. The most promising results came from an ensemble of Random Forest (RF) models with the META-DES dynamic ensemble selection technique. On the hospital database, this method achieved 100% accuracy, specificity, and sensitivity. On RIM-ONE, the average accuracy, specificity, and sensitivity reached 97.86%, 100%, and 93.85%, and on DRISHTI-GS they reached 97%, 90%, and 100%, respectively. These outcomes demonstrate the effectiveness of the system, particularly the RF ensemble with META-DES.
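For readers unfamiliar with dynamic ensemble selection, the following minimal sketch shows how a META-DES ensemble over a Random Forest pool can be assembled with the third-party DESlib library; synthetic features stand in for the clinical and texture descriptors used in the paper, and this illustrates the general technique rather than the authors' code.

```python
# Minimal sketch of META-DES dynamic ensemble selection over a Random Forest
# pool, assuming the third-party `deslib` package; synthetic features stand in
# for the clinical/texture features extracted from the ROI in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from deslib.des.meta_des import METADES

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_dsel, X_test, y_dsel, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

pool = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# META-DES picks, per test sample, the most competent trees from the pool
# using a separate dynamic-selection (DSEL) set.
des = METADES(pool_classifiers=pool).fit(X_dsel, y_dsel)
print(f"META-DES accuracy: {des.score(X_test, y_test):.3f}")
```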
Ruben Hemelings et al., [21] used labeled fundus images from 13 diverse data sources, including BMES, GHS, and eleven publicly available databases. To minimize data discrepancies, the authors developed a standardized image-processing strategy to extract 30° disc-centered images from the original data. Testing involved a total of 149,455 images. For the BMES and GHS cohorts, the Area Under the Receiver Operating Characteristic (AUROC) curve reached 0.976 and 0.984 at the participant level, respectively. At a fixed specificity of 95%, sensitivities were 87.3% and 90.3%, respectively, surpassing the minimum 85% sensitivity recommended by Prevent Blindness America. Across the eleven public databases, the AUROC ranged from 0.854 to 0.988.
Veronika Kurilová et al., [22] showed that an average-voting ensemble of multiple Convolutional Neural Network (CNN) models trained on the REFUGE database achieved the highest accuracy (98%) and AUROC score, surpassing the individual VGG-16, ResNet-50, and MobileNet models. Among the single CNN models, ResNet-50 performed best. The authors note that ensemble methods can significantly improve predictive performance, but that including weaker-performing models can degrade overall results.
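The average-voting idea is straightforward to express in code. The sketch below is a simplified stand-in for the paper's setup: it averages the glaucoma probabilities of three ImageNet-pretrained Keras backbones, which in practice would first be fine-tuned on REFUGE and fed properly preprocessed fundus images.

```python
# Illustrative average-voting ensemble of three ImageNet-pretrained backbones.
# A sketch only: the paper fine-tuned these models on REFUGE fundus images,
# and each backbone would normally get its own preprocess_input step.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16, ResNet50, MobileNet

def build_binary_classifier(backbone_cls, name):
    base = backbone_cls(weights="imagenet", include_top=False, pooling="avg",
                        input_shape=(224, 224, 3))
    out = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
    return tf.keras.Model(base.input, out, name=name)

models = [build_binary_classifier(c, n) for c, n in
          [(VGG16, "vgg16"), (ResNet50, "resnet50"), (MobileNet, "mobilenet")]]

def ensemble_predict(batch):
    # Soft voting: average the per-model glaucoma probabilities.
    probs = np.stack([m.predict(batch, verbose=0) for m in models], axis=0)
    return probs.mean(axis=0)

dummy = np.random.rand(2, 224, 224, 3).astype("float32")
print(ensemble_predict(dummy))  # averaged probabilities for 2 images
```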
Gavin D’Souza et al., [23] used AlterNet-K, a compact model that merges ResNets and multi-head self-attention. Trained on the Rotterdam EyePACS AIROGS database, it achieved 91.6% accuracy, 0.968 AUROC, and a 91.5% F1 score in glaucoma detection, outperforming various transformer and standard CNN models. The model's success is attributed to its alternating pattern of ResNet blocks and multi-head self-attention, which leverages their complementary strengths for better generalizability. The results suggest that smaller, parameter-efficient CNNs combined with multi-head self-attention can achieve high accuracy in medical image classification tasks, potentially outperforming larger models.
Sajib Saha et al., [24] developed a CNN-powered system that achieves exceptional accuracy in glaucoma detection from color fundus images. The system first isolates the optic disc (OD) with a custom YOLO network, then performs glaucomatous vs. non-glaucomatous classification with a MobileNet architecture. Extensive testing against seven state-of-the-art CNNs yielded outstanding results, including 97.4% accuracy and a 97.3% F1 score, with sensitivity, specificity, and AUROC of 97.5%, 97.2%, and 0.993, respectively.
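The two-stage design can be outlined as follows. This is a hedged sketch: `locate_disc` is a hypothetical stub standing in for the paper's custom YOLO detector, and the MobileNet classifier here is untrained.

```python
# Sketch of the two-stage pipeline: locate the optic disc, then classify the
# crop. `locate_disc` is a hypothetical stand-in for the paper's custom YOLO
# detector; it should return a bounding box (x, y, w, h) in pixel coordinates.
import numpy as np
import tensorflow as tf

classifier = tf.keras.applications.MobileNet(
    weights=None, input_shape=(224, 224, 3), classes=2)  # trained separately

def locate_disc(image):
    # Hypothetical detector stub: a centre crop as a placeholder bounding box.
    h, w = image.shape[:2]
    return w // 4, h // 4, w // 2, h // 2

def classify_fundus(image):
    x, y, bw, bh = locate_disc(image)
    crop = image[y:y + bh, x:x + bw]
    crop = tf.image.resize(crop, (224, 224))[tf.newaxis, ...]
    probs = classifier(crop, training=False).numpy()[0]
    return {"non-glaucomatous": probs[0], "glaucomatous": probs[1]}

print(classify_fundus(np.random.rand(512, 512, 3).astype("float32")))
```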
Latif J. et al., [25] employed the Enhanced Grey Wolf Optimized Support Vector Machine (EGWO-SVM) for glaucoma classification. First, they eliminated noise using the Adaptive Median Filter (AMF). Speeded-Up Robust Features (SURF), Histogram of Oriented Gradients (HOG), and global features were then used for feature extraction. Classification used the EGWO technique together with an SVM. Testing on the ORIGA database produced strong performance, with an accuracy of 94%, specificity of 92%, and sensitivity of 92%.
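A simplified version of this feature-then-classify approach is sketched below using HOG features and an SVM; a plain grid search stands in for the EGWO hyperparameter optimization, and random arrays stand in for denoised fundus images.

```python
# HOG-feature + SVM sketch. The paper tunes the SVM with Enhanced Grey Wolf
# Optimization (EGWO); here a plain grid search stands in for that step, and
# random images stand in for AMF-denoised fundus crops.
import numpy as np
from skimage.feature import hog
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
images = rng.random((40, 64, 64))          # placeholder grayscale images
labels = rng.integers(0, 2, size=40)       # 0 = normal, 1 = glaucoma

X = np.array([hog(img, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
              for img in images])

search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                      cv=3).fit(X, labels)
print("best params:", search.best_params_)
```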
Milon Biswas et al., [26] proposed a lightweight CNN to detect retinal disorders, focusing on two binary decisions: distinguishing healthy from non-healthy cases, and screening among the non-healthy images. They evaluated its performance on two well-defined public databases. For differentiating healthy from non-healthy images, the CNN reached an accuracy of 99.67% on the Diabetic Retinopathy (DR) database and 96.5% on the Glaucoma (GL) database. Furthermore, in the non-healthy screening setting, aiming to differentiate between retinal disorders, the CNN achieved an accuracy of 99.03% when distinguishing between GL and DR cases.
Ghorui A. et al., [27] proposed a novel CNN architecture called ProspectNet. It outperforms two established pre-trained networks, VGG16 and DenseNet121, exhibiting higher accuracy with reduced computational time and complexity. They used a combined database from DRISHTI-GS and the glaucoma database on Kaggle, containing color fundus images of normal and glaucomatous eyes. ProspectNet achieved an AUROC of 0.991, a specificity of 98%, and a precision of 98%.
Alice K. et al., [28] aimed to build glaucoma-detection models using ML algorithms and image feature descriptors on a publicly accessible retinal fundus image database, classifying the images as normal or abnormal. Their classification process occurred in two stages: first, image features were extracted using specific filters; then a tree-based ensemble classifier was trained and tested to achieve optimal accuracy. The experiment iteratively explored three effective filters: Edge Histogram (EH), Pyramid Histograms of Orientation Gradients (PHOG), and Fuzzy Color and Texture Histogram (FCTH), and evaluated filter combinations to determine the most effective one. Employing the EH filter in conjunction with FCTH, using an RF classifier, reached the highest accuracy of 80.43% and an AUROC of 0.884.
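The two-stage filter-then-ensemble idea can be illustrated as follows; simple intensity and edge histograms stand in here for the EH/PHOG/FCTH descriptors, which come from specialized image-retrieval libraries.

```python
# Sketch of the two-stage pipeline: hand-crafted descriptors, then a
# tree-based ensemble. Plain grayscale/edge histograms stand in for the
# EH/FCTH descriptors used in the paper.
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestClassifier

def describe(image):
    # Concatenate an intensity histogram with an edge-magnitude histogram.
    edges = ndimage.sobel(image)
    h1, _ = np.histogram(image, bins=16, range=(0, 1))
    h2, _ = np.histogram(np.abs(edges), bins=16)
    return np.concatenate([h1, h2]).astype(float)

rng = np.random.default_rng(1)
images = rng.random((60, 64, 64))          # placeholder fundus images
labels = rng.integers(0, 2, size=60)       # 0 = normal, 1 = abnormal

X = np.array([describe(img) for img in images])
clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```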
Ahmed MT. et al., [29] employed DL techniques to identify open-angle glaucoma in fundus images using three distinct architectures: VGG16, VGG19, and ResNet50. They classified eyes as positive or negative for glaucoma using the Kaggle database. Notably, data augmentation significantly improved the performance of all three models, with accuracies ranging from 93% to 97.56%. Among them, VGG19 proved the most accurate, at 97.56%.
Raju M. et al., [30] utilized four ML classification methods on Electronic Health Records (EHR) from more than 650 medical facilities in the US to predict glaucoma before clinical symptoms manifest, allowing for potential early intervention and preventive treatment. XGBoost, Multilayer Perceptron (MLP), and RF exhibited similarly favorable results with an AUROC of 0.81, while Logistic Regression (LR) achieved 0.73. These models effectively predicted glaucoma one year before onset from patient EHR data, suggesting the potential of ML to identify at-risk patients before glaucoma develops.
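On tabular EHR-style data, this comparison is easy to reproduce in outline. The sketch below evaluates the same four model families by AUROC, assuming the third-party xgboost package is available and using synthetic data in place of patient records.

```python
# Sketch comparing the four model families on tabular features, reporting
# AUROC as in the paper; synthetic data stands in for the EHR extracts.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier  # third-party; assumed installed

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUROC = {auc:.3f}")
```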
N. J. Shyla and W. S. Emmanuel [31] proposed a technique for OD segmentation and classification using DL and Pattern Classification Neural Networks (PCNs). They first resize the input image and employ level-set segmentation for OD segmentation. AlexNet is then used for classification into normal and glaucoma classes. Additionally, the glaucoma images are fed to the PCN to classify them as initial, moderate, or severe. The Neural Network (NN) is trained using statistical features and the Cup-to-Disc Ratio (CDR). This work, evaluated on the DRISHTI-GS, LAG, and RIM-ONE databases, achieved an accuracy, sensitivity, and specificity of 98.42%, 97.6%, and 97.5%, respectively.
Venkateswara Rao Naramala et al., [32] used Restricted Boltzmann Machines (RBMs) to extract and analyze multiple features from retinal images to classify anomalies and automate the diagnostic process. The investigation also used a U-Net model to segment the ocular images and applied the Squirrel Search Algorithm (SSA) to fine-tune the RBM hyperparameters for optimal performance. For evaluation, the RIM-ONE database was used, on which the proposed model achieves 99.2% accuracy.
Yao Li et al., [33] explored Drop-Coating Deposition Raman Spectroscopy (DCDRS) as a new, non-invasive method to distinguish patients with glaucoma from healthy individuals using tear samples. Tears from 63 individuals were analyzed for their Raman spectra. The high-dimensional Raman data were processed with Principal Component Analysis-Linear Discriminant Analysis (PCA-LDA) to identify key features, and an SVM classifier built on the PCA-LDA results was used to categorize the samples. DCDRS successfully differentiated patients with glaucoma from healthy individuals with a total accuracy of 93.2%; differences in the protein and lipid content of tears, reflected in the Raman spectra, contributed to the classification. With 30% of the database held out for validation, the classification accuracy remained at 90.9%.
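The PCA-LDA-then-SVM chain maps naturally onto a scikit-learn pipeline; in the sketch below, random vectors stand in for the preprocessed Raman spectra.

```python
# PCA -> LDA -> SVM sketch for high-dimensional spectra, mirroring the
# PCA-LDA feature reduction described in the paper; random vectors stand in
# for preprocessed Raman spectra.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
spectra = rng.random((63, 1500))         # 63 tear samples, 1500 wavenumbers
labels = rng.integers(0, 2, size=63)     # 0 = healthy, 1 = glaucoma

model = make_pipeline(
    PCA(n_components=10),                # compress collinear spectral bands
    LinearDiscriminantAnalysis(),        # project onto the discriminant axis
    SVC(kernel="linear"),
)
model.fit(spectra, labels)
print("training accuracy:", model.score(spectra, labels))
```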
Reshma Verma et al., [34] compared SVM and K-means clustering for determining the CDR from fundus images. SVM outperformed K-means in both accuracy and consistency of CDR determination, and identification of the severity of early-stage glaucoma was possible. The authors used a convex hull approach for diagnosis and classification and developed a web application offering an inexpensive, user-friendly screening tool. The limitations of this work are that the convex hull algorithm for contour joining may be slow and that the study relies on OCT images captured by trained professionals with specialized equipment.
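For context, once the cup and disc are segmented, the CDR itself is a simple ratio of diameters. A minimal sketch with synthetic binary masks (the helper names are illustrative, not from the paper):

```python
# Vertical cup-to-disc ratio (CDR) from binary cup/disc masks, a minimal
# sketch of the quantity the compared pipelines estimate.
import numpy as np

def vertical_diameter(mask):
    rows = np.where(mask.any(axis=1))[0]
    return 0 if rows.size == 0 else rows.max() - rows.min() + 1

def vertical_cdr(cup_mask, disc_mask):
    return vertical_diameter(cup_mask) / vertical_diameter(disc_mask)

disc = np.zeros((100, 100), bool); disc[20:80, 20:80] = True
cup = np.zeros((100, 100), bool);  cup[35:65, 35:65] = True
print(f"CDR = {vertical_cdr(cup, disc):.2f}")  # 30/60 = 0.50
```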
Zefree Lazarus Mayaluri and Satyabrata Lenka [35] presented a modified dichromatic reflection model to separate specular reflections from corrupted fundus images. A modified U-Net CNN is used for this separation task and also to accurately segment the relevant ROI from the preprocessed images. Relevant features, likely representing morphological and structural characteristics related to glaucoma, are extracted from the segmented images. An SVM classifier, trained with different kernels, then classifies the images into glaucomatous and non-glaucomatous categories based on the extracted features. After comparing seven existing methods for obtaining diffuse and specular components, the authors adopted the one that produced the highest-quality images and used its output in the subsequent steps of the screening process. The experimental results showed a maximum improvement of 37.97 dB in PSNR and 0.961 in SSIM during preprocessing, and the model reached an accuracy of 91.83%, a sensitivity of 96.39%, a specificity of 95.37%, and an AUROC of 0.971 for detection.
M. Raveenthini and R. Lavanya [36] presented a new Computer-Aided Diagnosis (CAD) system to diagnose DR and glaucoma simultaneously, which could be a game changer for large-scale screening programs by significantly reducing manpower and time requirements. They eliminate the need for separate DR and glaucoma systems by using a segmentation-independent approach, avoiding the image-quality and anatomical issues that can affect segmentation accuracy. They constructed an ensemble of an RF classifier and CNNs, utilizing non-linear features such as Higher Order Spectra (HOS), fractal, and entropy features, which capture essential image details beyond basic pixel intensities. The ensemble combines the strengths of the RF and DL models using the sum rule for improved accuracy, sensitivity, and specificity.
Senthil Kumar Arunachalam et al., [37] introduced Deep Neural Perona–Malik Diffusive Mean Shift Mode Seeking Segmented Image Classification (DNP-MDMSMSIC) for the detection of glaucoma and Stargardt disease. It uses space-variant Perona–Malik diffusive preprocessing to reduce noise while preserving edges. Intensity, color, and texture features are extracted with high accuracy, and mean-shift mode-seeking segmentation then segments the image based on these features. A Bregman divergence function classifies images on the basis of segmented-region similarity. On the ACRIMA database, DNP-MDMSMSIC achieved 8% higher accuracy and 20% faster detection than previous methods.
Somasundaram Devaraj and Senthil Kumar Arunachalam [38] proposed the Max Pool Convolution Neural Kuan Filtered Tobit Regressive Segmentation based Radial Basis Image Classifier (MPCNKFTRS-RBIC) for detecting early glaucoma and Stargardt disease with high accuracy and low processing time. It uses a weighted adaptive Kuan filter for preprocessing the fundus image. Intensity, color, and texture features are extracted with high accuracy, and Tobit regressive segmentation then partitions the image based on the extracted features. A radial basis function classifier analyzes the segmented images for classification. MPCNKFTRS-RBIC achieved good performance on various metrics across different image sizes and databases.
Alifia Revan Prananda et al., [39] suggested analyzing damage to the retinal nerve fiber layer for glaucoma detection. The proposed method has two steps: preprocessing and classification. In the first step, unnecessary parts, such as the OD and blood vessels, are removed because they could hinder the analysis. For classification, nine DL architectures were evaluated. The proposed method achieved its highest accuracy of 92.88%, with an AUROC of 0.8934, on the ORIGA database.
Abdelali Elmoufidi et al., [40] suggested automating glaucoma diagnosis from fundus images. Their framework operates as follows: ROIs are decomposed into components using the Bi-dimensional Empirical Mode Decomposition (BEMD) algorithm; DL features are extracted from these decomposed components using the VGG19 CNN architecture; the features are aggregated for each ROI using a bag-of-features approach; due to their high dimensionality, the features are then reduced using PCA; and the resulting bags of features serve as input to an SVM classifier for the final diagnosis. The public ACRIMA and REFUGE databases were used for model training, while testing involved a combination of ACRIMA, REFUGE, ORIGA-light, RIM-ONE, sjchoi86-HRF, and Drishti-GS1. The REFUGE-trained model achieved overall accuracies of 98.31%, 98.61%, 96.43%, 96.67%, 95.24%, and 98.60% on ACRIMA, REFUGE, RIM-ONE, ORIGA-light, Drishti-GS1, and sjchoi86-HRF, respectively. Similarly, the model trained on ACRIMA achieved accuracies of 98.92%, 99.06%, 98.27%, 97.10%, 96.97%, and 96.36% on the same databases, respectively. The above-reviewed articles are summarized in Table (1).
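Before the summary table, the deep-feature portion of this pipeline (VGG19 features, PCA reduction, SVM classification) is sketched below; the BEMD decomposition and bag-of-features aggregation steps from the paper are omitted for brevity, and random arrays stand in for the ROIs.

```python
# Sketch of the feature pipeline: deep features from VGG19, reduced with PCA,
# classified with an SVM. The BEMD decomposition and bag-of-features pooling
# described in the paper are intentionally omitted here.
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

backbone = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                       pooling="avg")  # 512-d feature vector

def deep_features(images):
    x = tf.keras.applications.vgg19.preprocess_input(images * 255.0)
    return backbone.predict(x, verbose=0)

rng = np.random.default_rng(0)
rois = rng.random((30, 224, 224, 3)).astype("float32")  # placeholder ROIs
labels = rng.integers(0, 2, size=30)

model = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
model.fit(deep_features(rois), labels)
print("training accuracy:", model.score(deep_features(rois), labels))
```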
Table (1) A Summary of reviewed papers.
Reference number | Database | Pre-processing | Feature Extraction | Classification and detection | Results (%) |
[3] | RIM-ONE, DRISHTI-GS, and 300 fundus images from a hospital | Blood vessels removal, OD and OC segmentation | Texture directionality feature extracted from N + 1 directional difference of Gaussian, Gabor, Hu-invariant moments, and color features, along with gray-level co-occurrence matrix based features | Using dynamic selection techniques, two types of ensemble of classifiers were used: 1- The homogeneous ensemble of RF classifiers 2- The heterogeneous ensemble of classifiers | The most promising results came from an ensemble of RF: For the hospital database: accuracy: 100, specificity: 100, and sensitivity: 100; RIM-ONE: accuracy: 97.86, specificity: 100, and sensitivity: 93.85; DRISHTI-GS: accuracy: 97, specificity: 90, and sensitivity: 100 |
[21] | BMES, GHS, AIROGS, ORIGA, LAG, ODIR, REFUGE1, REFUGE2, RIM-ONEr3, RIM-ONE DL, GAMMA, ACRIMA, and PAPILA | Different steps for the different databases | - | CNN | BMES and GHS: AUROC reached 0.976 and 0.984, respectively; for the remaining databases, the AUROC ranged from 0.854 to 0.988 |
[22] | REFUGE | - | - | Average-voting ensemble of ResNet-50, VGG-16, and MobileNet | Reported AUROC, precision, recall, true positives, true negatives, false positives, and false negatives; the best accuracy for the ensemble, using average voting, was 98 |
[23] | Rotterdam EyePACS AIROGS | Cropping and resizing images to 224 × 224 | - | AlterNet-K | Accuracy: 91.6, Recall: 90.7, AUROC: 0.968, F1 score: 91.5 |
[24] | LAG, sjchoi86 HRF, ACRIMA, DRISHTI-GS, HRF, DRIONS-DB, and RIM-ONE | OD localization with a YOLO CNN | - | MobileNet | Accuracy: 97.4, F1 score: 97.4, Sensitivity: 97.5, Specificity: 97.2, AUROC: 0.993 |
[25] | ORIGA | AMF to eliminate image noise | SURF, HOG, and global features | EGWO with SVM | Accuracy: 94 Specificity: 92 Sensitivity: 92 |
[26] | DR and GL databases collected from Kaggle and MESSIDOR | - | - | DL (lightweight CNN) | For distinguishing between non-healthy and healthy images, accuracy was 99.67 on the DR database and 96.5 on the GL database; accuracy was 99.03 when distinguishing between GL and DR cases |
[27] | DRISHTI-GS and Kaggle | Binary masking to determine the ROI, grayscale conversion, and resizing to 224 × 224 | - | DL (DenseNet121, VGG16, and ProspectNet) | ProspectNet: AUROC: 0.991, specificity: 98, precision: 98 |
[28] | DB1, DB2, and DB3 obtained from RIM-ONE | DB1 and DB2 images merged into one folder; sequential file naming generates class names for an ARFF file with filenames and class values; images transformed to numeric data using filters | EH, FCTH, and PHOG filters | RF | Accuracy: 80.43 |
[29] | Kaggle | Images resized to 448 × 672 pixels, with data augmentation to reduce bias | - | DL (VGG16, VGG19, and ResNet50) | Highest accuracy: VGG19 at 97.56 |
[30] | EHR from over 650 hospitals and clinics throughout the US | Filtering, transformation, binarization, and joining techniques | Systemic diseases, medications, and demographic information | LR, XGBoost, MLP, and RF | XGBoost, MLP, and RF achieved an AUROC of 0.81; LR achieved 0.73 |
[31] | LAG, RIM-ONE, and DRISHTI-GS | Image resizing and level-set segmentation | Various statistical features and the CDR used to train the NN | AlexNet | Accuracy: 98.42, Specificity: 97.5, Sensitivity: 97.6 |
[32] | RIM-ONE | Cropping, channel separation, data enhancement, and segmentation of images using U-Net | RBM feature extraction | RBM with SSA for selecting optimal hyperparameters and decreasing the RBM weights | Accuracy: 99.2 |
[33] | Clinical data of glaucoma patients and normal people | Noise removal, subtraction, and normalization | PCA-LDA dimensionality reduction | SVM | Accuracy: 93.2 |
[34] | DRISHTI-GS and a real database | CLAHE filter applied to adaptively increase contrast and reduce noise | OD and OC extraction | SVM, K-means clustering, and convex hull | Best accuracy (SVM): 85.39 |
[35] | Mendeley data repository | Modified U-Net CNN to separate specular reflections from corrupted fundus images and to segment the ROI | CDR, retinal nerve fiber layer, neuro-retinal rim, INST, and statistical features | SVM with different kernels | Accuracy: 91.83, Sensitivity: 96.39, Specificity: 95.37, AUROC: 0.971 |
[36] | Combined database from HRF, Kaggle, ORIGA-light, and DR HAGIS | Resizing, green-channel extraction, noise removal, contrast enhancement, and correction of non-uniform illumination | Non-linear features including HOS, fractal, and entropy features | Ensemble of an RF classifier and a CNN, fused with the sum rule | Accuracy: 98.08, Sensitivity: 98.37, Specificity: 99.07 |
[37] | ACRIMA, Retina Image Bank, and DIARETDB0 | Noise removal via space-variant Perona–Malik diffusion | Intensity, color, and texture features via a deep neural network | Mean-shift mode-seeking segmentation and a Bregman divergence function in the output layer for classification | Improvements across different metrics |
[38] | ACRIMA and Retina Image Bank | Image resizing, noise removal with a weighted adaptive Kuan filter, and quality enhancement | Intensity, color, and texture features | Radial basis function classifier (MPCNKFTRS-RBIC) | Improvements across different metrics |
[39] | ORIGA-light | Unnecessary parts such as blood vessels and the OD removed | - | AlexNet, GoogleNet, XceptionNet, ResNet-50, Inception V3, InceptionResNet, NasNet, MobileNet, and DenseNet | Highest accuracy: 92.88 with AUROC: 0.8934 using DenseNet |
[40] | ACRIMA, REFUGE, RIM-ONE, Drishti-GS1, ORIGA-light, and sjchoi86-HRF | ROI decomposition with BEMD | VGG19 CNN features, bag-of-features aggregation, and PCA reduction | SVM | Model trained on REFUGE: accuracies of 98.31, 98.61, 96.43, 96.67, 95.24, and 98.60 on ACRIMA, REFUGE, RIM-ONE, ORIGA-light, Drishti-GS1, and sjchoi86-HRF, respectively; model trained on ACRIMA: 98.92, 99.06, 98.27, 97.10, 96.97, and 96.36 on the same databases |
In the selected papers, researchers used ML only, DL only, or ensemble learning combining different ML and/or DL methods. The proportion of works using each of these three approaches appears in Figure (3).
3.1. Databases
Figure (4) shows the most frequently used databases among those in Table (1). All databases used for glaucoma detection in the selected papers are listed in Table (2).
Table (2) The databases used in the reviewed papers.
Reference number | Database | Availability | Normal images | Glaucomatous or (suspect) images | Total images |
[41] | RIM-ONE | Public | 118 | 51 | 169 |
[42] | RIM-ONEr3 | Public | 85 | 74 | 159 |
[43] | Drishti-GS | Public | 70 | 31 | 101 |
[44] | RIM-ONE DL | Public | 313 | 172 | 485 |
[24] | Sjchoi86 HRF | Public | 300 | 101 | 401 |
[45] | Rotterdam EyePACS AIROGS | Public | - | - | 112,732 |
[21] | ODIR | Public | - | - | 10,000 |
[46] | GAMMA | Public | 150 | 150 | 300 |
[47] | PAPILA | Public | 333 | 155 | 488 |
[48] | LAG | Public | 3432 | 2392 | 5824 |
[49] | DIARETDB0 | Public | 20 | 110 | 130 |
[50] | Kaggle | Public | - | - | 1000 |
[51] | DR HAGIS | Public | 0 | 10 and the remaining images for hypertension, diabetic retinopathy, and age-related macular degeneration | 39 |
[52] | Mendeley data repository | Public | 1060 | 1146 with artifacts | 2206 |
[53] | ORIGA-light | Public | 482 | 168 | 650 |
[54] | ACRIMA | Public | 309 | 396 | 705 |
[55] | DRIONS-DB | Public | 55 | 55 | 110 |
[56] | REFUGE | Public | 1080 | 120 | 1200 |
[57] | MESSIDOR | Public | - | - | 1200 |
[58] | HRF | Public | 15 | 15 + 15 images for DR | 45 |
3.2. Evaluation Criteria
Different evaluation criteria have been used to analyze the effectiveness of the proposed models. They are described as follows [59, 60]:
Sensitivity (recall): represents the percentage of True Positives (TP) that the model correctly identifies. TPs are instances in which the model correctly predicts the presence of the target class. It is given in Eq. (1):
Sensitivity = Recall = TP / (TP + FN) (1)
Here, FN stands for False Negatives. The higher the sensitivity, the better the performance of the model.
Specificity: represents the percentage of True Negatives (TN) that the model correctly identifies. TNs are instances where the model correctly predicts the absence of the target class. It can be calculated from Eq. (2) as follows:
Specificity = TN / (TN + FP) (2)
Here, FP stands for False Positives. A higher specificity indicates better model performance.
Accuracy: indicates how closely the predictions match the ground truth; higher accuracy indicates a better-performing model. In the reviewed papers it is often computed as the mean of sensitivity and specificity (the balanced accuracy), as shown in Eq. (3):
Accuracy = (Sensitivity + Specificity) / 2 (3)
Precision: measures the proportion of positive predictions that are actually correct, computed as TP / (TP + FP).
F1 score: a measure of accuracy given by the harmonic mean of precision and recall. A higher F1 score means better model performance, reflecting the model's ability to correctly identify TPs while avoiding FPs. It can be computed from Eq. (4):
F1 score = (2 × Precision × Recall) / (Precision + Recall) (4)
AUROC: shows the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) at different threshold settings. An AUROC with a higher value indicates a better overall discrimination between positive and negative cases, regardless of specific threshold choices. Table (3) provides a clear and concise overview of these evaluation criteria:
Table (3) Summary of the evaluation criteria used in the reviewed papers.
Metric | Definition | Interpretation | Limitations |
Accuracy | % of correct predictions | Overall performance | Misleading in imbalanced databases |
Specificity | % true negatives correctly identified | Effectiveness in avoiding false positives | Not relevant if TN are unimportant |
Sensitivity (recall) | % true positives correctly identified | Ability to find all relevant cases | Not relevant if FN are unimportant |
Precision | % of positive predictions that are actually correct | Proportion of positives that are TP | Not relevant if FP are unimportant |
F1 score | Harmonic mean of precision and recall | Balanced view of precision and recall | Requires equal importance of FP and FN |
AUROC | Area under ROC curve | Performance across different classification thresholds | Complex interpretation, not directly indicating class probabilities |
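As a concrete companion to Table (3), the sketch below computes each metric from a confusion matrix using made-up predictions; it reports both the standard accuracy and the balanced variant given in Eq. (3).

```python
# Computing the tabulated metrics from a confusion matrix; the labels,
# predictions, and scores here are made up purely for illustration.
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.25]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print("sensitivity: ", sensitivity)
print("specificity: ", specificity)
print("precision:   ", tp / (tp + fp))
print("accuracy:    ", (tp + tn) / (tp + tn + fp + fn))
print("balanced acc:", (sensitivity + specificity) / 2)  # Eq. (3)
print("F1 score:    ", f1_score(y_true, y_pred))
print("AUROC:       ", roc_auc_score(y_true, scores))
```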