In this study, we aimed to statistically analyze experts’ scoring differences for patients with UC and to quantitatively assess the impact of this interobserver variability on diagnostic outcomes 17–19. To achieve this, we trained a deep learning network on “consensus data,” consisting only of images for which expert scoring was consistent. We then compared the diagnostic performance of this model on test images with that of models trained on each individual expert’s scoring. Figure 4 shows the flowchart of this study.
• Patients and images
A total of 254 rectal endoscopic images obtained from 115 patients with ulcerative colitis who underwent endoscopy at Ewha Womans University Seoul Hospital between June 10, 2019, and February 29, 2021, were included (Table 3). Patients with Crohn’s disease, rectal resection before the colonoscopy date, or other bowel resection were excluded from the study. The study protocol was approved by the Ethics Committee of Ewha Womans University Seoul Hospital (IRB no. 2023-03-028), and all methods were carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all subjects and/or their legal guardian(s) as required.
Table 3

| Index | | Data |
| --- | --- | --- |
| Sex, n | Male | 57 |
| | Female | 58 |
| Age, years | Mean (range) | 46 (19–78) |
| | Median | 44 |
| Images | Sampling date | 06/2019–02/2021 |
| | Number of images (remission/mild, moderate, severe) | 254 (204, 42, 8) |
The severity of ulcerative colitis was classified into three categories: remission/mild, moderate, and severe. These labels were used as the training targets for the deep learning network models. Figure 5 shows the collected endoscopic images and labels used in this study.
The endoscopic images were captured using a CV-290 (Olympus, Tokyo, Japan). They are RGB images with resolutions of 543 × 475 or 1242 × 1079 pixels and a 10-bit color depth.
• Scoring system
We introduced a consensus approach in this study. Under this approach, five experts independently assign scores, and a final score is adopted only if at least three experts assign the same score; images meeting this criterion constitute the “consensus data.” This method reduces bias from individual experts’ judgments and thereby increases the reliability of the labels.
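As an illustration, the consensus rule reduces to a few lines of code. The following is a minimal sketch; the data layout (a mapping from image IDs to the five experts’ scores) is a hypothetical assumption, and only the three-of-five agreement rule comes from the text.

```python
from collections import Counter

# Minimal sketch of the consensus rule: keep an image only if at least
# three of the five experts assigned it the same score.
def build_consensus_data(expert_scores):
    """expert_scores: {image_id: [five expert scores]} -> {image_id: score}"""
    consensus = {}
    for image_id, scores in expert_scores.items():
        score, count = Counter(scores).most_common(1)[0]
        if count >= 3:  # at least three of five experts agree
            consensus[image_id] = score
    return consensus

# Example: the first image reaches consensus (score 2); the second does not.
print(build_consensus_data({"img_001": [2, 2, 2, 3, 1], "img_002": [0, 1, 2, 3, 4]}))
```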
• Deep learning network
For image classification, we selected 13 CNN-based deep learning network models (Fig. 6): DenseNet121, DenseNet201, MobileNetV2, MobileNetV3Large, InceptionV3, EfficientNetB0, EfficientNetB7, ResNet50, ResNet50V2, ResNet152V2, VGG16, VGG19, and Xception.
These models were chosen for their strong performance in image classification and their extensive application in research. Despite their diverse architectures, all of them use convolutional and pooling layers for feature extraction and dimensionality reduction, which makes them efficient at high-level image classification tasks and well suited to the requirements of our study.
All of these models are TensorFlow implementations initialized with weights pretrained on the ImageNet dataset 20. This method, known as transfer learning, is common practice in many fields of medical imaging and has proven highly successful 21. Its principal advantage is that the pre-learned weights in the lower layers of these models can be reused: these layers typically detect generalized features, such as edges and textures, that are common to many image classification problems. Our study employed end-to-end training and achieved satisfactory results. For cases where this approach is less effective, fine-tuning, in which the lower layers are frozen and only the upper layers are retrained, is a viable alternative.
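As a concrete illustration, the sketch below loads one of the listed architectures with ImageNet weights for end-to-end training. The classification head and input shape handling are assumptions for illustration; only the ImageNet initialization, the choice of architectures, and the three severity classes come from the text.

```python
import tensorflow as tf

# Sketch of the transfer-learning setup, using DenseNet121 as an example.
base = tf.keras.applications.DenseNet121(
    weights="imagenet",   # initialize from ImageNet-pretrained weights
    include_top=False,    # drop the original 1000-class ImageNet head
    input_shape=(475, 543, 3),
)
base.trainable = True     # end-to-end training; set to False to instead
                          # fine-tune only a new head on frozen lower layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),  # remission/mild, moderate, severe
])
```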
We divided the collected endoscopic images into training and test sets at a ratio of about 8:2 (Table 4).
Table 4
Number of images used for the training and test sets

| Set | Remission/Mild | Moderate | Severe |
| --- | --- | --- | --- |
| Training set | 164 | 34 | 6 |
| Testing set | 40 | 8 | 2 |
| Total | 204 | 42 | 8 |
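The per-class counts above reflect a stratified split. The sketch below, assuming scikit-learn, reproduces such a split; the variable names, placeholder data, and random seed are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the class counts from Table 3 (204/42/8).
image_paths = [f"img_{i:03d}.png" for i in range(254)]
labels = [0] * 204 + [1] * 42 + [2] * 8  # 0: remission/mild, 1: moderate, 2: severe

# Stratifying on the labels keeps the ~8:2 ratio within each severity class.
train_x, test_x, train_y, test_y = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42
)
```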
The model training was conducted for 30 epochs with a batch size of 30. The categorical cross-entropy loss function, commonly used for multiclass classification, was selected, and the Adam optimization algorithm was used with a learning rate of 1e-4. All images were resized to 543 × 475 pixels.
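In Keras terms (assumed here; the paper does not name the training API), this configuration corresponds to the following, where `model`, `train_images`, and `train_labels` are placeholders:

```python
import tensorflow as tf

# Sketch of the stated training configuration.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",  # multiclass classification loss
    metrics=["accuracy"],
)
model.fit(train_images, train_labels, epochs=30, batch_size=30)
```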
Accuracy, recall, precision, and F1 score were used to comprehensively assess the performance of our deep learning models. These metrics quantify classification performance: recall reflects the model’s sensitivity, and the F1 score is the harmonic mean of precision and recall.
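For reference, these metrics follow their standard definitions; in the multiclass setting they are computed per class from true/false positives and negatives and then averaged (the averaging scheme is not stated here):

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]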
• Data preprocessing
Endoscopic images inherently contain artifacts such as reflections caused by the light source and dark regions that the light does not reach. These artifacts interfere with deep learning training, and controlling them is crucial to ensure the model’s accuracy and reliability 2,3. To eliminate such artifacts, we first attempted to remove the areas corresponding to light reflection in the RGB channels. However, we could not accurately isolate only the areas of light reflection: parts of ulcer or erosion regions were eliminated alongside the reflective regions, making precise detection of light reflections in RGB space difficult (Fig. 7).
In the data preprocessing stage, the color space of the images was therefore converted from RGB to HSV to eliminate reflections and dark areas (Fig. 8) 22,23. The HSV ranges used were (0, 360) for H, (90, 255) for S, and (65, 236) for V. The regions identified in this way were converted into binary mask images, which were multiplied with the original RGB endoscopic images; the resulting empty spaces, corresponding to reflections and dark areas, were then filled using an inpainting technique 24.
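A minimal sketch of this pipeline, assuming OpenCV, is given below. The stated HSV ranges are read as the retained (artifact-free) region, consistent with the mask-multiplication step; note that 8-bit OpenCV stores H on 0–179, so the (0, 360) H range maps to the full 0–179 span, and the inpainting radius is an assumption.

```python
import cv2
import numpy as np

def mask_and_inpaint(bgr):
    """Zero out reflections/dark areas via an HSV mask, then inpaint them."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    # Pixels inside these ranges are kept; everything else is treated as artifact.
    keep = cv2.inRange(hsv, np.array([0, 90, 65]), np.array([179, 255, 236]))
    masked = cv2.bitwise_and(bgr, bgr, mask=keep)  # multiply by the binary mask
    holes = cv2.bitwise_not(keep)                  # reflections and dark areas
    return cv2.inpaint(masked, holes, 3, cv2.INPAINT_TELEA)  # radius 3 assumed
```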
To enhance the generalization performance of the model, the following data augmentation techniques were applied: rotation (range of 360 degrees), zoom (15%), width shift (20%), height shift (20%), shear (15%), horizontal flipping, and a “reflect” filling mode.
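These settings map directly onto Keras’s ImageDataGenerator (assumed here; the exact augmentation API is not named in the text):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Sketch of the stated augmentation settings.
datagen = ImageDataGenerator(
    rotation_range=360,       # random rotation up to 360 degrees
    zoom_range=0.15,          # zoom by up to 15%
    width_shift_range=0.20,   # horizontal shift up to 20%
    height_shift_range=0.20,  # vertical shift up to 20%
    shear_range=0.15,         # shear up to 15%
    horizontal_flip=True,
    fill_mode="reflect",      # fill newly exposed pixels by reflection
)
```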
• Interobserver variation
We employed the intraclass correlation coefficient (ICC) to assess the agreement among the UCEIS scores (ranging from 0 to 8) assigned by the experts (Table 5). The ICC measures the level of agreement by expressing the variance between the rated subjects (here, the images) as a proportion of the total variance in the observations. Through this approach, we evaluated the consistency of scores assigned by multiple experts to the same images.
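The specific ICC form is not stated here; in one common variance-components formulation (a two-way random-effects, single-rater agreement model), the coefficient is

\[
\mathrm{ICC} = \frac{\sigma^2_{\text{images}}}{\sigma^2_{\text{images}} + \sigma^2_{\text{raters}} + \sigma^2_{\text{error}}},
\]

so that values near 1 indicate that almost all variance stems from true differences between images rather than from rater disagreement or noise.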
Table 5
Interpretation of intraclass correlation coefficients (ICC)

| ICC | Level of agreement |
| --- | --- |
| 0.9–1.0 | Excellent |
| 0.75–0.9 | Good |
| 0.5–0.75 | Moderate |
| < 0.5 | Poor |
We used Fleiss’ kappa to evaluate the agreement in severity classification based on the UCEIS scores assigned by the experts (Table 6). In this context, the labels were classified into four categories: remission, mild, moderate, or severe. Fleiss’ kappa is a statistical method suited to measuring agreement among multiple raters assessing categorical data.
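For reference, with $N$ images, $n = 5$ raters, and $k = 4$ categories, and with $n_{ij}$ the number of raters assigning image $i$ to category $j$, Fleiss’ kappa is defined as

\[
P_i = \frac{1}{n(n-1)}\left(\sum_{j=1}^{k} n_{ij}^2 - n\right), \qquad
p_j = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij},
\]
\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \qquad \text{where } \bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i \text{ and } \bar{P}_e = \sum_{j=1}^{k} p_j^2.
\]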
Table 6
Interpretation of the kappa index

| Kappa | Level of agreement |
| --- | --- |
| 1.00 | Perfect |
| 0.81–0.99 | Near perfect |
| 0.61–0.80 | Substantial |
| 0.41–0.60 | Moderate |
| 0.21–0.40 | Fair |
| 0.10–0.20 | Slight |
| 0 | Equivalent to chance |