Overview of the Study Protocol
The study protocol, illustrated in Fig. 1, integrates AI object-detection techniques into the pathological workflow. We employed a smartphone-based system to capture images of endometrial cells and subsequently trained the YOLOv5x model. In cases of cancer, we annotated clusters of abnormal cells, whereas benign cases provided essential background data for our model training.
Throughout the evaluation phase, our AI models not only processed validation and test datasets comprising digital static images but also analyzed microscopy images captured via a CCD camera, with the goal of achieving real-time detection of abnormal cell clusters. When the AI model identified abnormalities exceeding a predefined confidence score threshold, the results were visualized by displaying red bounding boxes around the abnormal cell clusters on a monitor. The total count of abnormal cell clusters was computed at the completion of AI analysis for each slide. If the count surpassed the predefined detection threshold, the AI made a positive judgment, indicating the necessity for further biopsy examinations. In contrast, if the count fell below the threshold, the AI judgment was considered negative, signifying a normal or benign case. To assess the accuracy of the AI-assisted diagnosis, the AI's final judgment was compared with the cytodiagnosis conducted by three pathologists and four medical students in 20 new cases. Furthermore, we investigated whether AI assistance improved the accuracy and speed of human diagnosis.
Case Selection and Data Preparation
Ethical approval for this study was granted by the Institutional Review Board at the Nippon Medical School (approval number: 23K08900).
From April 2017 to March 2023, at Nippon Medical School Hospital, we selected endometrial cytological slides from 146 cases, including 72 cases labeled as 'malignant', consisting of endometrial cancer, and 74 cases labeled as 'benign', consisting of nonmalignant endometrial lesions such as leiomyoma. All cases were pathologically confirmed using hysterectomy specimens. All endometrial cytology specimens were prepared using the smear technique and stained with Papanicolaou stain. For the purposes of our study, case selection and data preparation were conducted as follows: 96 cases (comprising 49 benign and 47 malignant cases) were used to develop YOLOv5x. Additionally, 30 cases (15 benign and 15 malignant) were used to assess the accuracy of real-time detection of abnormal cell clusters and to quantify these clusters at the slide level. The remaining 20 cases (10 benign and 10 malignant) were reserved for evaluating the performance metrics of the AI in real-time diagnostic scenarios and for comparative diagnostic evaluations performed by pathologists and medical students. The purpose of this evaluation was to assess diagnoses made by AI alone, humans alone, and AI-assisted humans, focusing on the accuracy of slide-level analysis for both positive and negative cases. The distribution of these patients is schematically represented in Fig. 2A. Table 1 shows the distribution of patients according to histological category and median age.
Table 1. Distribution and Median Age of Malignant and Benign Patients
| | AI model training: training, validation and test cases (n=96) | Real-time object detection under microscope: cases for cell-cluster and slide-level assessment (n=30) | Real-time object detection under microscope: cases for diagnostic concordance with/without AI (n=20) |
| --- | --- | --- | --- |
| Malignant | | | |
| Median age (range) | 57 (31-82) | 58 (38-83) | 54 (28-77) |
| Number of cases | 47 | 15 | 10 |
| Endometrioid carcinoma, Grade 1 | 24 | 10 | 5 |
| Endometrioid carcinoma, Grade 2 | 17 | 3 | 4 |
| Endometrioid carcinoma, Grade 3 | 5 | 0 | 0 |
| Serous carcinoma | 1 | 2 | 1 |
| Benign | | | |
| Median age (range) | 47 (37-73) | 46 (30-57) | 45.5 (38-51) |
| Number of cases | 49 | 15 | 10 |
| Leiomyoma | 49 | 15 | 10 |
Acquisition of Digital Images Using a Smartphone-Based Diagnostic Imaging Device
Digital images were acquired using a smartphone-based imaging system. Specifically, an iPhone SE (Apple Inc., Cupertino, CA, USA) was mounted on an Olympus BX53 microscope (EVIDENT/Olympus, Tokyo, Japan) using a specialized adapter (i-NTER LENS; MICRONET Co., Kawaguchi, Saitama, Japan), as shown in Fig. 2B. The captured images had a resolution of 4,032 × 3,024 pixels. While examining the cytological slides, the focus was adjusted manually, and images were taken with a 20× objective lens. For malignant cases, images were taken such that abnormal cell clusters identified by a gynecologic pathologist were positioned in the center of the image. For benign cases, randomly selected fields containing visible cells were captured and considered normal.
Dataset and Annotation
In our study, the dataset comprised 3,151 endometrial cytology images, including both benign and malignant images (Fig. 2A). Malignant images were annotated for abnormal cell clusters using LabelImg (version 1.8.6), a Python-based graphical annotation tool, and all abnormal cell clusters were labeled "malignant" regardless of their type. All other material, in both malignant and benign images, was treated as "background" and left unannotated. In addition to atypical cell clusters, various materials, such as benign cell clusters, mucus, and inflammatory cells, appear in endometrial cytology specimens, and labeling each of them individually is an extremely complex task. Because the purpose of this model is to detect malignant cell clusters, detecting these other substances is not strictly necessary; we therefore prioritized simplicity of annotation and labeled only the atypical cell clusters. The dataset was refined to 1,579 malignant and 1,572 benign images, with the final training, validation, and testing split set at an 8:1:1 ratio (Fig. 2A).
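The 8:1:1 split described above can be sketched as follows; the helper function and fixed seed are illustrative, not the study's actual pipeline.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split items into train/val/test by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# With 3,151 images, an 8:1:1 split yields roughly 2,520/315/316 images.
train, val, test = split_dataset(range(3151))
```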
Architecture of the Object Detection Model
Our study employed YOLOv5x14, an object detection model chosen for its proven image recognition performance and capability for high-speed analysis, which is critical for the real-time processing demands of endometrial cytology image analysis. YOLOv5x operates as a one-stage detector, optimizing the efficiency of the model for detecting cellular anomalies with high precision 15. The general architecture of YOLOv5 is illustrated in Fig. 3. The YOLOv5x architecture is structured around three main components: the backbone, neck, and head. The backbone incorporates a cross-stage partial network with DarkNet53 to enhance feature map processing efficiency by splitting the input into two paths, conserving computational resources while maintaining information diversity. The neck employs a path aggregation network for improved feature propagation and spatial pyramid pooling fusion to handle inputs of varying sizes, thereby facilitating robust detection across different object scales and contexts. The head adopts the YOLOv3 detection architecture to finalize the detection process 15.
Pretrained on the extensive Microsoft Common Objects in Context (COCO) dataset 2017, YOLOv5x leveraged a wide-ranging preexisting knowledge base, allowing it to adapt to the nuanced challenges of cytological imagery. The primary hyperparameters were carefully selected based on extensive preliminary experiments aimed at maximizing training efficiency and model performance. The hyperparameters used, including the number of epochs, batch size, and learning rate, are listed in Table 2. Other parameters were kept at their default values, as provided in the YOLOv5x GitHub repository14 (e.g., no frozen layers), to fully exploit the learning capacity of the model in recognizing the specific features of endometrial cytology images.
Table 2. Training conditions for model optimization

| Settings | batch size | epoch | optimizer | learning rate | weight decay |
| --- | --- | --- | --- | --- | --- |
| YOLOv5x | 4 | 200 | SGD | 0.01 | 0.0005 |
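With these settings, training could be launched via the YOLOv5 repository's train.py script. The invocation below is a sketch: the dataset config name endometrial.yaml is hypothetical, and the learning rate of 0.01 and weight decay of 0.0005 correspond to YOLOv5's default hyperparameter file, so they need not be passed explicitly.

```shell
# Sketch only: "endometrial.yaml" is an illustrative dataset config name.
python train.py --img 640 --batch 4 --epochs 200 \
    --data endometrial.yaml --weights yolov5x.pt --optimizer SGD
```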
Image Preprocessing
Following the acquisition and annotation of the dataset, we proceeded with an image preprocessing phase. The original high-resolution images, measuring 4,032 × 3,024 pixels, were resized to 640 × 640 pixels to conform to the YOLOv5x input requirements. The training process used YOLOv5x's built-in data augmentation features, including mosaic, rotation, flipping, and color adjustments, to enhance the robustness of the model against variations.
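As a sketch of the resizing step, the helper below computes YOLOv5-style letterbox dimensions, scaling the longer side to 640 pixels and padding the shorter one; the function name is illustrative, and the study's pipeline used YOLOv5's built-in preprocessing.

```python
def letterbox_dims(w, h, target=640):
    """Scale so the longer side equals `target`, then pad the shorter side."""
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    return new_w, new_h, target - new_w, target - new_h

# A 4,032 x 3,024 image scales to 640 x 480, leaving 160 pixels of
# vertical padding to fill the 640 x 640 input.
```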
Computational Setup and Software Environment
Our computational setup included an Iiyama Sense 15F161 laptop PC powered by an Intel Core i7 CPU and an NVIDIA GeForce RTX3060 GPU. The software environment was established using Anaconda Distribution (version 2022.10), with Python 3.10.9 as the programming language and PyTorch 1.13.1 as the deep learning framework.
Model Performance Evaluation Using Static Images for Object Detection
Following the previously described model training and optimization process, we evaluated the performance of each model using static images from the validation and test datasets. For this evaluation, we employed several metric standards in object detection tasks, including precision, recall, F1 score, and mean average precision (mAP).
Precision quantifies the fraction of accurate positive identifications relative to the total number of positive identifications made by the model and is defined as
$$Precision = \frac{TP}{TP+FP}$$
Here, TP and FP denote the true positives and false positives, respectively. Recall, or sensitivity, measures the fraction of accurate positive identifications relative to the total number of actual positive instances and is defined as
$$Recall = \frac{TP}{TP+FN}$$
Here, FN denotes false negatives. The F1 score is the harmonic mean of precision and recall and is defined as
$$F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
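The three metrics above translate directly into code; this is a minimal sketch operating on raw TP/FP/FN counts.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Example: 8 true positives, 2 false positives, 2 false negatives.
p, r = precision(8, 2), recall(8, 2)
```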
With precision on the vertical axis and recall on the horizontal axis, a precision-recall (PR) curve can be drawn. The area under the PR curve is the average precision (AP). AP is a standard metric in object detection evaluation and offers a comprehensive measure by averaging the maximum precision values across varying recall thresholds. This is defined as follows:
$$AP={\sum }_{k=1}^{m}p\left(k\right)\Delta r\left(k\right)$$
where m is the number of abnormal cell clusters that have been detected, p(k) is the precision at cutoff k in the ranked list, and Δr(k) is the change in recall from item k − 1 to k. The mean average precision (mAP) is the average of the AP values over all classes and is defined as follows:
$$mAP=\frac{1}{n}{\sum }_{i=1}^{n}AP\left(i\right)$$
where n is the number of classes and AP(i) is the AP value for a given class. In this study, because only one class of objects was trained, the AP and mAP values matched.
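The AP summation above can be sketched as a running sum over ranked detections; the precision and recall lists are assumed to be ordered by descending confidence, with r(0) = 0.

```python
def average_precision(precisions, recalls):
    """AP = sum over k of p(k) * (r(k) - r(k-1)), with r(0) = 0."""
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

With a single trained class, as in this study, this value is also the mAP.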
During the model training phase, performance was assessed using the validation set, with the testing set applied posttraining. The model was evaluated using an intersection over union (IoU) threshold of 0.5, which measures the congruence between the ground-truth bounding box and the model's predicted bounding box, providing insights into the precision of the model regarding object location and size. Because benign cases were used as background during training, true negatives could not be measured and a receiver operating characteristic (ROC) curve could not be generated; we therefore evaluated YOLOv5x with precision-recall (PR) curves. The performance of the model on the static-image test set was evaluated, and the PR curves were plotted using Matplotlib.
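As a sketch, the IoU criterion can be computed for axis-aligned boxes given in (x1, y1, x2, y2) form:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A predicted box counts as a true positive when its IoU with a ground-truth box is at least 0.5.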
Microscope Setup for Real-time Object Detection
For real-time detection, we connected a microscope (ECLIPSE Ci, Nikon Co., Tokyo, Japan) with a C-mount to a charge-coupled device (CCD) camera (JCS-HR5U, Canon Inc., Tokyo, Japan). The trained model was set to inference mode in an integrated development environment (Visual Studio Code, Microsoft Co., WA, USA), with the input set to the video stream from the CCD camera so that bounding boxes and confidence scores for detected abnormal cell clusters were displayed instantly (Fig. 4). The confidence score threshold for displaying bounding boxes was set to 0.01 for YOLOv5x. We evaluated the detection speed of the model in frames per second (FPS), the number of images processed per second. The higher this value, the smoother and more natural the image on the monitor appears. Generally, an FPS greater than 30 is considered to provide real-time detection.
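FPS can be measured by timing the inference loop; the sketch below times an arbitrary per-frame callable, standing in for the actual YOLOv5x inference on CCD frames.

```python
import time

def measure_fps(process_frame, frames):
    """Return the average frames per second for processing the given frames."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# A sustained rate above 30 FPS is generally considered real-time.
fps = measure_fps(lambda frame: None, range(100))
```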
Evaluation of Abnormal Cell Cluster Detection Under a Microscope by a Real-Time Detection Method
In line with the case selection detailed earlier, 20 cases (10 benign and 10 malignant) were earmarked for real-time detection accuracy assessment. A total of 100 points were marked (50 for malignant cases and 50 for benign cases), with five points per slide placed near abnormal or randomly selected benign cell clusters, for comprehensive evaluation by the trained YOLOv5x model. Bounding boxes with confidence scores for the detected cell clusters were recorded in real time under a microscope. To establish the optimal confidence score threshold, ROC curves were generated from the recorded confidence scores, and the area under the curve (AUC) was calculated. This allowed the computation of performance metrics for detecting abnormal cell clusters above the confidence score threshold, providing a measure of the model's detection accuracy.
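The ROC AUC over the recorded confidence scores equals the probability that a randomly chosen abnormal point scores higher than a randomly chosen benign one. The rank-based sketch below illustrates that equivalence; it is not the curve-plotting procedure actually used.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as P(positive score > negative score), with ties counting half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))
```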
Evaluation of Slide-Level Real-Time Diagnostic Performance Using YOLOv5x
At our institution, endometrial cytopathology is classified according to the categorization of cells into five classes: class I for normal cells, class II for benign changes, class III for indeterminate or atypical cells that require further assessment, class IV for cells suspicious of malignancy, and class V for cells that are unequivocally malignant. In this study, we translated this system into a binary schema to streamline the training and evaluation of a deep learning model. Classes I and II were labeled 'no abnormality', while classes III to V were grouped as 'requires further examination', establishing a binary gold standard (GS) for comparative analysis. To evaluate the diagnostic accuracy of the trained YOLOv5x model at the slide level using real-time detection, each slide was moved at a consistent speed from one end to the other over a duration of 4 min, and the number of detected abnormal cell clusters was counted. A new set of 10 slides (five benign and five malignant) was analyzed for this purpose. ROC curves were plotted based on the number of clusters detected by YOLOv5x and the GS, and the AUC was calculated to determine the optimal threshold for slide-level detection. Slides with bounding box counts exceeding this threshold were designated as 'requires further examination', while those below it were considered 'no abnormality'.
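One common way to pick such a threshold from an ROC analysis is to maximize Youden's J (sensitivity + specificity − 1) over candidate cluster counts. The sketch below illustrates this under assumed conventions (counts at or above the threshold are called positive, both classes are present in the data); the counts and labels in the test are illustrative, not the study's.

```python
def best_count_threshold(counts, labels):
    """Return the cluster-count threshold maximizing Youden's J.

    labels: 1 = 'requires further examination' (GS), 0 = 'no abnormality'.
    Assumes both classes are represented, so the denominators are nonzero.
    """
    best_t, best_j = None, -1.0
    for t in sorted(set(counts)):
        tp = sum(c >= t and y == 1 for c, y in zip(counts, labels))
        fn = sum(c < t and y == 1 for c, y in zip(counts, labels))
        tn = sum(c < t and y == 0 for c, y in zip(counts, labels))
        fp = sum(c >= t and y == 0 for c, y in zip(counts, labels))
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t
```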
Diagnostic Performance Evaluation with and without the AI Assist
Human Evaluators
The human evaluators in this study comprised three pathologists with different specialties and four medical students at various stages of their education. Pathol-1 is a pathologist who specializes in gynecology, whereas Pathol-2 and Pathol-3 have expertise in nongynecological areas. Medical students were divided based on their exposure to the AI model and their academic year: Stud-1 (4th year) and Stud-2 (5th year) were directly involved in the AI model's annotation and training process. Conversely, Stud-3 (3rd year) and Stud-4 (1st year) were provided with a 20-minute lecture on endometrial cytology to familiarize them with the subject before their participation in the study. This setup aimed to assess diagnostic performance across a spectrum of experience levels and determine the impact of AI assistance on diagnostic accuracy and speed.
Cohen's kappa coefficient analysis
To establish a comprehensive understanding of the diagnostic process, the baseline performance of the classifiers was first assessed using Cohen's kappa coefficient. This statistical measure accounts for chance agreement and provides a robust method for gauging concordance in binary-classification tasks. By comparing the predictions of the AI model and each human evaluator with the gold standard, we obtained a baseline diagnostic accuracy that reflected the true performance without the influence of AI. The kappa scores were calculated using Scikit-learn, and heatmaps for visualization of agreement levels were generated with Matplotlib, illustrating the alignment between the gold standard and the evaluators' diagnoses.
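Cohen's kappa compares observed agreement with the agreement expected by chance. The study computed it with Scikit-learn's cohen_kappa_score; the following is a dependency-free minimal sketch of the same quantity for label lists.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label lists (not both constant)."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_observed - p_chance) / (1 - p_chance)
```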
AI-assisted Implementation
Following the baseline assessment, AI assistance was used to evaluate its impact on diagnostic performance. The YOLOv5x model was integrated into the diagnostic workflow, providing real-time assistance by presenting predictions and bounding boxes for abnormal cell clusters along with cytological slides. This enabled a direct comparison of the evaluators' performance metrics with and without the aid of AI. Accuracy, the ratio of correctly predicted observations to the total number of observations, is defined as
$$Accuracy = \frac{TP +TN}{TP + TN + FP + FN}$$
Precision, recall, and F1 score were defined in the 'Model Performance Evaluation Using Static Images for Object Detection' section. A paired t test using SciPy was conducted to statistically compare these metrics, and the performance differences were visualized in a boxplot format using Matplotlib.
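The paired t test used here is SciPy's scipy.stats.ttest_rel; as a dependency-free sketch, the test statistic alone is the mean paired difference divided by its standard error. The example scores are illustrative.

```python
from statistics import mean, stdev

def paired_t_statistic(x, y):
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = x - y."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / len(d) ** 0.5)

# Example with illustrative paired scores (e.g., with vs. without AI).
t = paired_t_statistic([2, 3, 5], [1, 1, 2])
```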
Time Measurement for Diagnosis
The efficiency of the diagnostic process, with and without AI assistance, was quantified by measuring the time taken to diagnose a set of patients. This study focused on the practical application of AI in a clinical setting, aiming to identify any significant time savings offered by the integration of AI technology. The total diagnostic time for all patients was recorded in a controlled environment, and a paired t test was applied to determine the significance of the time differences, thereby highlighting the potential for AI to streamline the diagnostic workflow.