In this study, we aimed to develop and validate a generalized glaucoma screening model. The best-performing model achieved promising accuracy, highlighting the potential of our approach to transform glaucoma screening. To the best of our knowledge, no prior study has attempted to use extensive glaucomatous fundus data spanning diverse demographics and ethnicities, with images captured by various fundus cameras at different resolutions. Most deep-learning-based models were trained on a limited dataset of healthy and glaucomatous fundus images from a single institution, which made them non-generalizable across populations and settings.9 Our training dataset included 7,498 glaucoma cases and 10,869 healthy cases gathered from 19 different datasets. This dataset, one of the most extensive collections of fundus images ever used to develop a generalized glaucoma screening model, represents a wide range of ethnic groups and fundus cameras, which could improve our model's performance and make it more applicable globally. Because choosing the right deep-learning architecture for a specific task is critically important,57 the best-performing model was selected from 20 pre-trained models (eFigure 2 in the Supplementary). Pairing this extensive, diverse dataset with a well-suited deep-learning architecture can enhance the model's generalizability, making it a versatile and practical tool for glaucoma screening in diverse populations.
Our best-performing model exhibited exceptional discriminative ability between glaucomatous and healthy discs, indicating that it learned glaucomatous features from heterogeneous data. The vgg19_bn attained an AUROC of 99.2%, exceeding the reported performance of ophthalmologists (82.0%) and previous deep-learning systems (97.0%),58 demonstrating its potential for practical use in glaucoma screening. Li et al. trained and validated their model on 31,745 and 8,000 fundus images, respectively.59 Their model performed exceptionally well, achieving an AUC of 0.986 with a sensitivity of 95.6%, a specificity of 92.0%, and an accuracy of 92.9% for identifying referable glaucomatous optic neuropathy. Our model maintained balance across all performance metrics, as shown in Table 2, for both glaucoma and healthy cases. The model was unbiased towards any particular class, making it a reliable screening tool across wider populations.
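To illustrate how the metrics discussed above relate to one another, a minimal plain-Python sketch computing sensitivity, specificity, and a rank-based AUROC; the scores and labels below are hypothetical examples, not our study data:

```python
def confusion_counts(labels, preds):
    """Count TP, FP, TN, FN, treating 1 = glaucoma as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp, fp, tn, fn

def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U): the probability that a random
    glaucoma case scores higher than a random healthy case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores (predicted probability of glaucoma) and labels.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.92, 0.85, 0.40, 0.30, 0.10, 0.55]
preds = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(labels, preds)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```

A balanced model, as reported in Table 2, keeps sensitivity and specificity close to each other rather than trading one for the other.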
Furthermore, we implemented the DenseNet201, ResNet101, and DenseNet161 architectures (eTable 2 in the Supplementary). DenseNet201 demonstrated a classification accuracy of 96%, with an AUROC of 99%. Steen et al. employed the same DenseNet201 architecture, but their model achieved an accuracy of 87.60%, precision of 87.55%, recall of 87.60%, and an F1 score of 87.57% on publicly available datasets containing 7,299 non-glaucoma and 4,617 glaucoma images.60 A unique strength of our study is its balance of sensitivity and specificity, evident from our model's high AUROC values, which is a significant advantage in real-world clinical settings. Many previous models had difficulty maintaining this balance, resulting in high false-positive or false-negative rates.15,16,61
Liu et al. trained a CNN algorithm for automatically detecting glaucomatous optic neuropathy using a massive dataset of 241,032 images from 68,013 patients, and the model's performance was impressive.62 However, the model struggled with multiethnic data (n = 7,877) and images of varying quality (n = 884), showing drops in AUC of 7.3% and 17.3%, respectively. In comparison, our model showed a modest decline in accuracy, approximately 9.6%, when tested on the DRISHTI-GS dataset. We suspect that part of this performance shift stems from inconsistencies and the lack of a clearly defined protocol for glaucoma classification across the publicly available datasets. We discovered specific variances in the classification criteria for glaucoma within the datasets (Fig. 2), which may have contributed to the drop in accuracy. Despite this, the model's accuracy remained high, indicating strong generalization capability. Nevertheless, evaluating the model's performance across additional datasets would further confirm its reliability and generalizability.
Investigating our model's top losses yielded two significant insights. First, the model did not perform well on borderline cases, suggesting a need for advanced training techniques to handle such intricacies. Second, we identified potential mislabeling of fundus images within our dataset. Such mislabeling could introduce confusion during the model's learning phase, thereby degrading performance. Both findings highlight the need for robust data quality checks and expert verification during dataset preparation. To improve the generalizability of the CNN model for glaucoma screening, accurate labeling by glaucoma experts based on clinical tests should take priority over simply expanding fundus data from multiethnic populations.
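The top-loss analysis above amounts to ranking validation images by their per-example loss; confidently wrong predictions surface likely mislabels, while losses near the decision boundary flag borderline cases. A minimal sketch in plain Python (the filenames, probabilities, and labels below are hypothetical):

```python
import math

def cross_entropy(prob_pos, label):
    """Per-example binary cross-entropy: high loss means the model was
    confidently wrong (possible mislabel) or very uncertain (borderline)."""
    p = prob_pos if label == 1 else 1.0 - prob_pos
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)

def top_losses(items, k=3):
    """Return the k highest-loss (filename, loss) pairs for expert review."""
    scored = [(name, cross_entropy(p, y)) for name, p, y in items]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Hypothetical validation items: (filename, predicted P(glaucoma), label).
items = [
    ("img_001.png", 0.97, 1),
    ("img_002.png", 0.04, 0),
    ("img_003.png", 0.95, 0),  # confident but "wrong": candidate mislabel
    ("img_004.png", 0.55, 1),  # near-boundary prediction: borderline case
]
worst = top_losses(items, k=2)
```

Images surfaced this way can then be routed to glaucoma experts for relabeling or exclusion before retraining.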
We explored the decision-making process of our deep-learning model by employing Grad-CAM to create heatmaps for the input fundus images. These heatmaps highlighted the regions of the fundus images that the model weighed when determining the presence or absence of glaucoma. Interestingly, the model's areas of emphasis align well with those that ophthalmologists would typically examine, such as the optic disc and cup, strengthening the clinical relevance of our model. These visual insights add a layer of transparency to our deep-learning model and provide a key link between automated classification and clinical understanding. Insights from the Grad-CAM heatmaps will be invaluable for ensuring that the model's decision-making process correlates with the clinical indicators of glaucoma, which can build clinicians' trust in these algorithms and allow wide adoption in clinical practice.
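The core Grad-CAM computation can be sketched as follows, assuming the last convolutional layer's activations and the gradients of the glaucoma score with respect to them have already been extracted via hooks; the toy NumPy tensors below stand in for real feature maps:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Combine a conv layer's activations (C, H, W) with the gradients of
    the class score w.r.t. those activations into an H x W heatmap.
    Per Grad-CAM, channel weights are the global-average-pooled gradients."""
    weights = gradients.mean(axis=(1, 2))             # (C,) channel importances
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for overlay
    return cam

# Toy tensors standing in for hooked feature maps and their gradients.
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))   # channels x height x width
grads = rng.random((8, 7, 7))
heatmap = grad_cam(acts, grads)
```

In practice the resulting low-resolution map is upsampled to the fundus image size and overlaid, so regions such as the optic disc and cup light up when they drive the prediction.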
Although our study demonstrates promising results, there are several limitations. First, we observed that our dataset contained mislabeled fundus images, which could impact our model's learning process and accuracy. We employed a data-cleaning procedure to address this, removing 1,306 images using ImageClassifierCleaner.29 This process yielded a cleaner and more reliable dataset, on which we re-trained our model; the refinement considerably enhanced the model's robustness and improved its ability to generalize to unseen data. Second, we observed that class imbalance could reduce the model's effectiveness; we therefore applied class-weight balancing techniques. Furthermore, the data augmentation techniques used during training may produce images that differ from actual clinical images. Next, our Grad-CAM heatmaps indicated that the model occasionally focused on non-relevant regions when making classification decisions, implying that it might be learning from noise or artifacts within the images; in most cases, however, the heatmaps confirmed that the model based its predictions on clinically interpretable features. Finally, our model's external validation was conducted solely on the DRISHTI-GS dataset. We also acknowledge that glaucomatous fundus data from the African continent were publicly unavailable for our model's training and validation (eFigure 1 in the Supplementary). Incorporating glaucoma datasets from African countries could further enhance our model's generalizability, especially in under-resourced areas. Future studies should aim to validate the model across multiple datasets, diverse populations, and varied imaging devices to ensure broader applicability. Additionally, our model did not integrate clinical data, such as patients' glaucoma history, IOP measurements, or visual field data, which could further enhance its predictive capabilities.
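The class-weight balancing mentioned above can be illustrated with inverse-frequency weights derived from our class counts; this particular weighting scheme is a common choice and is shown here as an assumption, not necessarily the exact formula used:

```python
def class_weights(counts):
    """Inverse-frequency weights: weight_c = N / (K * n_c), so the rarer
    class contributes proportionally more to a weighted cross-entropy loss."""
    total = sum(counts.values())   # N: total training images
    k = len(counts)                # K: number of classes
    return {c: total / (k * n) for c, n in counts.items()}

# Class counts from our training set: 7,498 glaucoma, 10,869 healthy.
weights = class_weights({"glaucoma": 7498, "healthy": 10869})
```

The minority (glaucoma) class receives a weight above 1 and the majority class a weight below 1, discouraging the model from simply favoring the more frequent class.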
Despite these limitations, the potential of our refined model for automated glaucoma screening remains significant and provides exciting prospects for future enhancements.
Our study used fundus images to establish a robust computer vision model for glaucoma screening. The best-performing model achieved high values across multiple evaluation metrics for both glaucoma and healthy cohorts, demonstrating its robustness. Our approach promises a fast, cost-effective, and highly accurate tool that can assist ophthalmologists and optometrists in the decision-making process, ultimately improving patient outcomes and reducing the socioeconomic burden of glaucoma. However, the model's accuracy dropped when evaluated on unseen data, indicating potential inconsistencies among the datasets; the model therefore needs to be refined and validated on larger, more diverse datasets to ensure reliability and generalizability. Prospective work will involve validating the model across different datasets, integrating clinical data, and refining the model's architecture.