In this study, we aimed to statistically analyze experts’ scoring differences for patients with UC and to quantitatively assess the impact of this interobserver variability on diagnostic outcomes 17–19. To achieve this, we trained a deep learning network on “consensus data,” consisting only of images for which expert scoring was consistent. We then compared the diagnostic performance of this model on test images with that of models trained on each individual expert’s scoring. Figure 4 shows the flowchart of this study.
• Patients and images
A total of 254 rectal endoscopic images obtained from 115 patients with ulcerative colitis who underwent endoscopy at Ewha Womans University Seoul Hospital between June 10, 2019, and February 29, 2021, were included (Table 3). Patients with Crohn’s disease, rectal resection before the colonoscopy date, or other bowel resection were excluded from the study. The study protocol was approved by the Ethics Committee of Ewha Womans University Seoul Hospital (IRB no. 2023-03-028), and all methods were carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all subjects and/or their legal guardian(s) as required.
Table 3

| Index | | Data |
| --- | --- | --- |
| Sex, n | Male | 57 |
| | Female | 58 |
| Age, years | Mean (range) | 46 (19–78) |
| | Median | 44 |
| Images | Sampling date | 06/2019–02/2021 |
| | Number of images (remission/mild, moderate, severe) | 254 (204, 42, 8) |
The severity of ulcerative colitis was classified into three categories: remission/mild, moderate, and severe. These labels were used as the training targets for the deep learning network models. Figure 5 shows the collected endoscopic images and labels used in this study.
The endoscopic images were captured using a CV-290 (Olympus, Tokyo, Japan). They are RGB images with resolutions of 543 × 475 or 1242 × 1079 pixels and a 10-bit color depth.
• Scoring system
We introduced a consensus approach in this study. Under this approach, five experts independently assign scores, and a final score is adopted only if at least three experts assign the same score; images meeting this criterion constitute the “consensus data.” This method reduces bias from individual experts’ judgments and thereby increases the reliability of the labels.
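As an illustration, the consensus rule reduces to a few lines of code. The following is a minimal sketch; the data layout (a mapping from image IDs to the five experts’ scores) is a hypothetical assumption, and only the three-of-five agreement rule comes from the text.

```python
from collections import Counter

# Minimal sketch of the consensus rule: keep an image only if at least
# three of the five experts assigned it the same score.
def build_consensus_data(expert_scores):
    """expert_scores: {image_id: [five expert scores]} -> {image_id: score}"""
    consensus = {}
    for image_id, scores in expert_scores.items():
        score, count = Counter(scores).most_common(1)[0]
        if count >= 3:  # at least three of five experts agree
            consensus[image_id] = score
    return consensus

# Example: the first image reaches consensus (score 2); the second does not.
print(build_consensus_data({"img_001": [2, 2, 2, 3, 1], "img_002": [0, 1, 2, 3, 4]}))
```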
• Deep learning network
For image classification, we selected 13 CNN-based deep learning network models (Fig. 6): DenseNet121, DenseNet201, MobileNetV2, MobileNetV3Large, InceptionV3, EfficientNetB0, EfficientNetB7, ResNet50, ResNet50V2, ResNet152V2, VGG16, VGG19, and Xception.
These models were chosen for their strong performance in image classification and their extensive application in research. Despite their diverse architectures, all of them use convolutional and pooling layers for feature extraction and dimensionality reduction, which makes them efficient at high-level image classification tasks and well suited to the requirements of our study.
All of these models are TensorFlow implementations initialized with weights pretrained on the ImageNet dataset 20. This method, known as transfer learning, is common practice in many fields of medical imaging and has proven highly successful 21. Its principal advantage is that the pre-learned weights in the lower layers of these models can be reused: these layers typically detect generalized features, such as edges and textures, that are common to many image classification problems. Our study employed end-to-end training and achieved satisfactory results. For cases where this approach is less effective, fine-tuning, in which the lower layers are frozen and only the upper layers are retrained, is a viable alternative.
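As a concrete illustration, the sketch below loads one of the listed architectures with ImageNet weights for end-to-end training. The classification head and input shape handling are assumptions for illustration; only the ImageNet initialization, the choice of architectures, and the three severity classes come from the text.

```python
import tensorflow as tf

# Sketch of the transfer-learning setup, using DenseNet121 as an example.
base = tf.keras.applications.DenseNet121(
    weights="imagenet",   # initialize from ImageNet-pretrained weights
    include_top=False,    # drop the original 1000-class ImageNet head
    input_shape=(475, 543, 3),
)
base.trainable = True     # end-to-end training; set to False to instead
                          # fine-tune only a new head on frozen lower layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),  # remission/mild, moderate, severe
])
```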
We divided the collected endoscopic images into training and test sets at a ratio of about 8:2 (Table 4).
Table 4
Number of images used for the training and test sets

| Set | Remission/Mild | Moderate | Severe |
| --- | --- | --- | --- |
| Training set | 164 | 34 | 6 |
| Testing set | 40 | 8 | 2 |
| Total | 204 | 42 | 8 |
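The per-class counts above reflect a stratified split. The sketch below, assuming scikit-learn, reproduces such a split; the variable names, placeholder data, and random seed are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the class counts from Table 3 (204/42/8).
image_paths = [f"img_{i:03d}.png" for i in range(254)]
labels = [0] * 204 + [1] * 42 + [2] * 8  # 0: remission/mild, 1: moderate, 2: severe

# Stratifying on the labels keeps the ~8:2 ratio within each severity class.
train_x, test_x, train_y, test_y = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42
)
```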
The model training was conducted for 30 epochs with a batch size of 30. The categorical cross-entropy loss function, commonly used for multiclass classification, was selected, and the Adam optimization algorithm was used with a learning rate of 1e-4. All images were resized to 543 × 475 pixels.
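In Keras terms (assumed here; the paper does not name the training API), this configuration corresponds to the following, where `model`, `train_images`, and `train_labels` are placeholders:

```python
import tensorflow as tf

# Sketch of the stated training configuration.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",  # multiclass classification loss
    metrics=["accuracy"],
)
model.fit(train_images, train_labels, epochs=30, batch_size=30)
```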
Accuracy, recall, precision, and F1 score were used to comprehensively assess the performance of our deep learning models. These metrics quantify classification performance: recall reflects the model’s sensitivity, and the F1 score is the harmonic mean of precision and recall.
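For reference, these metrics follow their standard definitions; in the multiclass setting they are computed per class from true/false positives and negatives and then averaged (the averaging scheme is not stated here):

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]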
• Data preprocessing
Endoscopic images inherently contain artifacts such as reflections caused by the light source and dark regions that the light does not reach. These artifacts interfere with deep learning training, and controlling them is crucial to ensure the model’s accuracy and reliability 2,3. To eliminate such artifacts, we first attempted to remove the areas corresponding to light reflection in the RGB channels. However, we could not accurately isolate only the areas of light reflection: parts of ulcer or erosion regions were eliminated alongside the reflective regions, making precise detection of light reflections in RGB space difficult (Fig. 7).
In the data preprocessing stage, the color space of the images was therefore converted from RGB to HSV to eliminate reflections and dark areas (Fig. 8) 22,23. The HSV ranges used were (0, 360) for H, (90, 255) for S, and (65, 236) for V. The regions identified in this way were converted into binary mask images, which were multiplied with the original RGB endoscopic images; the resulting empty spaces, corresponding to reflections and dark areas, were then filled using an inpainting technique 24.
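A minimal sketch of this pipeline, assuming OpenCV, is given below. The stated HSV ranges are read as the retained (artifact-free) region, consistent with the mask-multiplication step; note that 8-bit OpenCV stores H on 0–179, so the (0, 360) H range maps to the full 0–179 span, and the inpainting radius is an assumption.

```python
import cv2
import numpy as np

def mask_and_inpaint(bgr):
    """Zero out reflections/dark areas via an HSV mask, then inpaint them."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    # Pixels inside these ranges are kept; everything else is treated as artifact.
    keep = cv2.inRange(hsv, np.array([0, 90, 65]), np.array([179, 255, 236]))
    masked = cv2.bitwise_and(bgr, bgr, mask=keep)  # multiply by the binary mask
    holes = cv2.bitwise_not(keep)                  # reflections and dark areas
    return cv2.inpaint(masked, holes, 3, cv2.INPAINT_TELEA)  # radius 3 assumed
```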
To enhance the generalization performance of the model, the following data augmentation techniques were applied: rotation (range of 360 degrees), zoom (15%), width shift (20%), height shift (20%), shear (15%), horizontal flipping, and a “reflect” filling mode.
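These settings map directly onto Keras’s ImageDataGenerator (assumed here; the exact augmentation API is not named in the text):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Sketch of the stated augmentation settings.
datagen = ImageDataGenerator(
    rotation_range=360,       # random rotation up to 360 degrees
    zoom_range=0.15,          # zoom by up to 15%
    width_shift_range=0.20,   # horizontal shift up to 20%
    height_shift_range=0.20,  # vertical shift up to 20%
    shear_range=0.15,         # shear up to 15%
    horizontal_flip=True,
    fill_mode="reflect",      # fill newly exposed pixels by reflection
)
```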
• Interobserver variation
We employed the intraclass correlation coefficient (ICC) to assess the agreement among the UCEIS scores (ranging from 0 to 8) assigned by the experts (Table 5). The ICC measures the level of agreement by expressing the variance between the rated subjects (here, the images) as a proportion of the total variance in the observations. Through this approach, we evaluated the consistency of scores assigned by multiple experts to the same images.
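The specific ICC form is not stated here; in one common variance-components formulation (a two-way random-effects, single-rater agreement model), the coefficient is

\[
\mathrm{ICC} = \frac{\sigma^2_{\text{images}}}{\sigma^2_{\text{images}} + \sigma^2_{\text{raters}} + \sigma^2_{\text{error}}},
\]

so that values near 1 indicate that almost all variance stems from true differences between images rather than from rater disagreement or noise.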
Table 5
Interpretation of intraclass correlation coefficients (ICC)

| ICC | Level of agreement |
| --- | --- |
| 0.9–1.0 | Excellent |
| 0.75–0.9 | Good |
| 0.5–0.75 | Moderate |
| < 0.5 | Poor |
We used Fleiss’ kappa to evaluate the agreement in severity classification based on the UCEIS scores assigned by the experts (Table 6). In this context, the labels were classified into four categories: remission, mild, moderate, or severe. Fleiss’ kappa is a statistical method suited to measuring agreement among multiple raters assessing categorical data.
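For reference, with $N$ images, $n = 5$ raters, and $k = 4$ categories, and with $n_{ij}$ the number of raters assigning image $i$ to category $j$, Fleiss’ kappa is defined as

\[
P_i = \frac{1}{n(n-1)}\left(\sum_{j=1}^{k} n_{ij}^2 - n\right), \qquad
p_j = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij},
\]
\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \qquad \text{where } \bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i \text{ and } \bar{P}_e = \sum_{j=1}^{k} p_j^2.
\]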
Table 6
Interpretation of the kappa index

| Kappa | Level of agreement |
| --- | --- |
| 1.00 | Perfect |
| 0.81–0.99 | Near perfect |
| 0.61–0.80 | Substantial |
| 0.41–0.60 | Moderate |
| 0.21–0.40 | Fair |
| 0.10–0.20 | Slight |
| 0 | Equivalent to chance |