Deep learning-based computer vision has recently experienced immense breakthroughs. This has had a great impact on all related application domains, including medical imaging. Keeping up with the latest state-of-the-art algorithms can often be challenging and time-consuming, which is why DUNEScan includes the most recent and best-performing supervised and self-supervised methods. Moreover, since the user's privacy and data security are especially important in digital healthcare, all web connections are made over secure protocols.
Available deep learning models
Our web application features six efficient CNN models, including EfficientNet, the architecture behind the winning entries of the 2019-2020 dermatological Kaggle competitions. They are as follows: Inceptionv3 (Szegedy et al., 2016), ResNet50 (He et al., 2016), MobileNetv2 (Sandler et al., 2018), EfficientNet (Tan & Le, 2019), BYOL (Grill et al., 2020) and SwAV (Caron et al., 2020). The model repository features both supervised and self-supervised models. Although recent self-supervised learning models can match the performance of supervised learning models, no skin cancer detection application has so far integrated self-supervised models into its pipeline. The major advantage of self-supervised methods is their ability to leverage large amounts of unlabeled data to pretrain a latent representation, on top of which a simple classifier can be trained to match the accuracy of fully supervised methods (Grill et al., 2020).
Uncertainty estimation
In risk-sensitive fields such as medical imaging, where a false negative prediction can make the difference between life and death, it is crucial to quantify the confidence level of a given model. DUNEScan uses the Monte Carlo dropout technique proposed by Gal & Ghahramani (2016), which randomly disables parameters of the classifier in an independent set of replicates, thereby obtaining an approximate Bayesian posterior over the possible estimates of the model for a given skin lesion image.
The DUNEScan user can select the number of random replicates to be used for a given model. DUNEScan provides uncertainty estimates for each classifier through a boxplot (see Fig. 1b). If the prediction probabilities across the replicates are tightly concentrated around the mean, the classifier is confident in its class prediction for the input image, and the prediction is trustworthy. In contrast, if the prediction probabilities for the benign and malignant image classes are dispersed and their confidence intervals overlap, the classifier is not confident and, hence, the prediction is not trustworthy.
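The replicate scheme above (Monte Carlo dropout) can be sketched in plain Python. The toy two-class classifier, its weights, the dropout rate and the summary statistics below are illustrative assumptions for exposition, not DUNEScan's actual implementation:

```python
import math
import random
import statistics

def predict_with_dropout(features, weights, drop_rate, rng):
    """One stochastic forward pass: each weight is independently
    zeroed with probability drop_rate (with inverted-dropout scaling),
    then a sigmoid maps the score to p(malignant)."""
    z = 0.0
    for w, x in zip(weights, features):
        if rng.random() >= drop_rate:          # keep this parameter
            z += (w / (1.0 - drop_rate)) * x   # rescale survivors
    return 1.0 / (1.0 + math.exp(-z))

def mc_dropout(features, weights, n_replicates=50, drop_rate=0.2, seed=0):
    """Run independent dropout replicates and summarize the resulting
    approximate posterior over p(malignant) as (mean, spread)."""
    rng = random.Random(seed)
    probs = [predict_with_dropout(features, weights, drop_rate, rng)
             for _ in range(n_replicates)]
    return statistics.mean(probs), statistics.stdev(probs)

# A tight spread around the mean suggests a trustworthy prediction;
# a wide spread straddling 0.5 suggests the opposite.
mean_p, spread = mc_dropout(features=[0.8, 1.5, -0.3],
                            weights=[2.0, 1.2, 0.5])
```

The boxplots in Fig. 1b visualize exactly such a per-model replicate distribution, one box per class.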
In addition to the boxplots described above, a classification manifold is also produced with the trained MobileNetv2 model, the fastest of the six available models (see Fig. 1d). This plot provides an alternative illustration of the confidence of the MobileNetv2 classifier obtained for the input image class prediction.
In the classification manifold graph, each green dot represents a benign skin lesion image used for training, and each red dot represents a malignant one (see Fig. 1d). If the input image, represented by a blue dot, is located close to the middle of the benign (green) cluster, the MobileNetv2 model is confident that the lesion is benign; if it is located close to the middle of the malignant (red) cluster, the model is confident that the lesion is malignant. However, if the blue dot is located close to the boundary between the green and red clusters, the model exhibits uncertainty in its prediction.
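The intuition for reading the manifold plot can be made concrete with a nearest-centroid check. The 2-D embedding coordinates and the relative boundary margin below are illustrative assumptions, not DUNEScan's plotting code:

```python
def centroid(points):
    """Mean of a list of 2-D embedding coordinates."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def read_manifold(input_pt, benign_pts, malignant_pts, margin=0.25):
    """Label the input embedding by its nearest class centroid.
    If the two centroid distances differ by less than `margin`
    (relative), the point sits near the cluster boundary and the
    prediction is flagged as uncertain."""
    d_b = dist(input_pt, centroid(benign_pts))
    d_m = dist(input_pt, centroid(malignant_pts))
    if abs(d_b - d_m) < margin * max(d_b, d_m):
        return "uncertain"
    return "benign" if d_b < d_m else "malignant"

# A blue dot deep inside the green (benign) cluster:
label = read_manifold((0.1, 0.1),
                      benign_pts=[(0, 0), (0.2, 0.1), (0.1, 0.3)],
                      malignant_pts=[(3, 3), (3.2, 2.9), (2.8, 3.1)])
# -> "benign"
```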
Description of DUNEScan’s output
DUNEScan first produces and presents the output plot of Grad-CAM (Selvaraju et al., 2017), which highlights the regions of high importance on the input image detected by the MobileNetv2 model (see Fig. 1c). The MobileNetv2 classification manifold described above is then presented, followed by the uncertainty estimate boxplot for each model selected to analyze the input image (see Fig. 1b).
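The core Grad-CAM computation is a gradient-weighted sum of feature maps followed by a ReLU. The minimal sketch below assumes the activations and class-score gradients have already been extracted from the last convolutional layer; the tiny 2x2 feature maps are illustrative, not real MobileNetv2 outputs:

```python
def grad_cam(activations, gradients):
    """activations, gradients: lists of K feature maps, each an
    H x W list of lists. Returns the H x W Grad-CAM heatmap."""
    h, w = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * w for _ in range(h)]
    for fmap, grad in zip(activations, gradients):
        # alpha_k: global-average-pool the class-score gradients
        alpha = sum(sum(row) for row in grad) / (h * w)
        for i in range(h):
            for j in range(w):
                heatmap[i][j] += alpha * fmap[i][j]
    # ReLU: keep only regions that positively support the class
    return [[max(0.0, v) for v in row] for row in heatmap]

# Two 2x2 feature maps with their gradients:
cam = grad_cam(
    activations=[[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]],
    gradients=[[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]],
)
# -> [[1.0, 0.0], [0.0, 2.0]]
```

In practice the heatmap is upsampled to the input resolution and overlaid on the lesion image, as in Fig. 1c.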
Moreover, the output contains a bar graph showing the average prediction probabilities of both classes obtained with each model used (see Fig. 1a). By providing the classification probabilities together with the means to assess the confidence of these predictions, the DUNEScan server allows practitioners to quickly evaluate the probability that a given skin lesion is benign or malignant.
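The per-model averages behind this bar graph amount to a simple aggregation over the dropout replicates; the model names and replicate values below are illustrative, not DUNEScan's server code:

```python
def summarize(replicate_probs):
    """replicate_probs: dict mapping model name -> list of
    p(malignancy) values, one per dropout replicate.
    Returns the per-model class averages shown in the bar graph."""
    summary = {}
    for model, probs in replicate_probs.items():
        p_mal = sum(probs) / len(probs)
        summary[model] = {"malignant": round(p_mal, 2),
                          "benign": round(1.0 - p_mal, 2)}
    return summary

summary = summarize({
    "ResNet50": [0.94, 0.95, 0.96],
    "BYOL": [0.02, 0.01, 0.00],
})
# -> {'ResNet50': {'malignant': 0.95, 'benign': 0.05},
#     'BYOL': {'malignant': 0.01, 'benign': 0.99}}
```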
Testing the application
The application was tested by using images from the HAM10000 dataset (Tschandl et al., 2018). This dataset was used as source data for the International Skin Imaging Collaboration (ISIC) 2018 challenge (Codella et al., 2019) and includes images of skin lesions corresponding to seven different classes: actinic keratosis (akiec), basal cell carcinoma (bcc), benign keratosis (bkl), dermatofibroma (df), melanocytic nevi (nv), melanoma (mel) and vascular lesions (vasc).
Amongst these, melanoma and basal cell carcinoma are considered malignant skin diseases, whereas the other lesion types are considered benign. The class labels assigned to more than 50% of the images were confirmed by histopathology, while for the others the labels were derived from expert consensus or confirmed by in-vivo confocal microscopy. Selected images were analyzed using 50 replicates with all six CNN models available in DUNEScan to give an overall classification prediction.
Melanoma and melanocytic nevi images, the most common malignant and benign classes of lesions in the dataset (representing ~11% and ~67% of the dataset, respectively), were used to assess the performance of the application. In general, the average prediction and the confidence in that prediction vary between the different algorithms; in most cases, however, the algorithms broadly agree on the prediction, with some exceptions.
For example, for the melanoma image Mel1 (ISIC_0024482) presented in Figure 2a, all the algorithms except BYOL give a malignant prediction with a probability greater than 0.80 (Table 1; for improved readability, probabilities are expressed as percentages in Figs. 1-4). As also illustrated in Figures 2-4, half of the algorithms (ResNet50, EfficientNet and SwAV) are highly confident in their predictions, as they all output low-variance probability distributions. The MobileNetv2 and InceptionV3 models also yield reliable predictions, but the spread of their approximate posterior distributions is noticeably larger. However, the BYOL model generally provided low-confidence predictions for these images and thus should be used with caution.
In the case of the melanoma image Mel2 (ISIC_0024751), most algorithms yield a high probability of malignancy (above 0.90), with the exception of InceptionV3 and BYOL, which suggest that the lesion is benign with probabilities of 0.74 and 0.93, respectively (see Table 1 and Fig. 2b). Although the confidence intervals produced by InceptionV3 do not overlap, they are considerably larger than those produced by the other models. Therefore, the results produced by InceptionV3 and BYOL are less reliable than the consensus prediction obtained with the rest of the models for the Mel2 image.
Interestingly, the InceptionV3 model again produces an outlier result with the melanocytic nevus image Nv2 (ISIC_0024334, see Fig. 3b). In this case, all other algorithms predict that the lesion is likely benign (all producing a probability of malignancy of 0.36 or lower), whereas InceptionV3 predicts that the lesion is malignant with a probability of 0.66 (see Table 1). The two models predicting the lesion to be benign with the highest probabilities, ResNet50 (0.98) and SwAV (0.97), have the tightest prediction distributions, whereas those of both InceptionV3 and MobileNetv2 are broad and overlapping (see Fig. 3b). The distributions of the prediction probabilities obtained with EfficientNet are intermediate in spread, but do not overlap. Based on these results, by relying on the models producing predictions with higher confidence (ResNet50, SwAV and EfficientNet), one could conclude that the image is indeed benign.
From the sample of melanocytic nevi images tested, the models seem to have difficulty producing a consensus benign prediction with high probability and confidence. Nevertheless, for most nevi images, such as Nv1 (ISIC_0024320), an overall convincing set of benign prediction probabilities (all 0.52 or greater) is obtained from all models (see Table 1). EfficientNet, which produces the 0.52 probability, is clearly unable to assign the lesion image to one class over the other: its replicate prediction probabilities for the benign and malignant classes overlap, with a mean near 0.50 (see Fig. 3a). All other models, which give higher benign prediction probabilities, show varying levels of confidence based on the corresponding boxplots (see Fig. 3a).
Since melanoma and nevi lesions often appear visually similar, this may explain why, in some cases, most of the models have difficulty favoring one class over the other. For example, with the Nv3 image (ISIC_0024307, Fig. 4a), most models output predictions close to 0.50 for both classes (see Table 1). Interestingly, with this image, only SwAV classifies the lesion as benign with a high probability (0.97) and confidence (see Table 1 and Fig. 4a).
Finally, we present the results obtained with a benign keratosis (bkl) image, Bkl1 (ISIC_0024337, Fig. 4b), which has a clearly different appearance from those of nevi and melanoma lesions. In this case, all models except BYOL predict the lesion to be benign with a probability of 0.70 or greater (Table 1). The InceptionV3 and MobileNetv2 models produce dispersed (but non-overlapping) replicate prediction probability distributions (see Fig. 4b), suggesting that their overall predictions that the lesion is benign (0.84 and 0.88, respectively; see Table 1) may not be highly accurate. However, based on Figure 4b, the predictions from all other models, except BYOL, appear to be trustworthy.
Table 1
Class prediction probabilities obtained for various images by the different CNN models available in DUNEScan. The predictions are reported as the probability of malignancy, p(malignancy); the probability that a lesion is benign is given by 1 − p(malignancy).
| Image Name¹ | Image Identifier² | ResNet50 | EfficientNet | InceptionV3 | MobileNetv2 | SwAV | BYOL |
|-------------|-------------------|----------|--------------|-------------|-------------|------|------|
| Mel1        | ISIC_0024482      | 0.95     | 0.81         | 0.81        | 0.81        | 0.96 | 0.01 |
| Mel2        | ISIC_0024751      | 1.00     | 0.91         | 0.26        | 0.95        | 0.96 | 0.07 |
| Nv1         | ISIC_0024320      | 0.00     | 0.48         | 0.27        | 0.12        | 0.36 | 0.03 |
| Nv2         | ISIC_0024334      | 0.02     | 0.27         | 0.66        | 0.36        | 0.03 | 0.06 |
| Nv3         | ISIC_0024307      | 0.45     | 0.43         | 0.53        | 0.58        | 0.03 | 0.10 |
| Bkl1        | ISIC_0024337      | 0.30     | 0.09         | 0.16        | 0.12        | 0.05 | 0.47 |

¹ Arbitrary name used as a reference in this publication.
² ISIC image identifier.