Study design
This research was a retrospective experimental observational study using a multi-institutional intraoperative video dataset. A total of 5238 images, randomly extracted from 128 intraoperative videos, were utilized. Images were included only when the target surgical instrument was clearly visible; out-of-focus images and images obscured by mist were excluded. The video dataset comprised 112 laparoscopic colorectal resection (LCRR), 5 laparoscopic distal gastrectomy (LDG), 5 laparoscopic cholecystectomy (LC), and 6 laparoscopic partial hepatectomy (LPH) cases.
This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines [20]. The protocol for this study was reviewed and approved by the Ethics Committee of National Cancer Center Hospital East, Chiba, Japan (Registration No.: 2020–315). Informed consent was obtained in the form of an opt-out on the study website, and data from those who rejected participation were excluded. The study conformed to the provisions of the Declaration of Helsinki established in 1964 (and revised in Brazil in 2013).
Training and test sets
The training set contained 4074 images randomly extracted from 85 intraoperative videos of LCRR, and each image captured one of three types of recognition target surgical instruments: T1) Harmonic shears (Ethicon Inc., Somerville, NJ, USA), T2) endoscopic surgical electrocautery (Olympus Co., Ltd., Tokyo, Japan), and T3) Aesculap AdTec atraumatic universal forceps (B. Braun AG, Melsungen, Germany). Representative images of T1–3 are shown in Fig. 1A. Every intraoperative video was recorded using an Endoeye laparoscope and a Visera Elite II system (both Olympus Co., Ltd., Tokyo, Japan).
The validation set contained 345 images from nine intraoperative videos; the conditions, including the type of laparoscopic recording system, recognition target surgical instruments, and type of surgery, were the same as those of the training set.
Test set 1 contained 369 images from 10 intraoperative videos, and the conditions were the same as those of the training set.
Test set 2 contained 103 images capturing the target surgical instruments, extracted from five intraoperative videos. Although the recognition target surgical instruments and type of surgery were the same as in the training set, the videos were recorded with different laparoscopic systems: a 1488 HD 3-Chip camera system (Stryker Corp., Kalamazoo, MI, USA) and an Image 1 S camera system (Karl Storz SE & Co. KG, Tuttlingen, Germany).
Test set 3 contained 124 images capturing surgical instruments, extracted from three intraoperative videos. Although the laparoscopic recording system and type of surgery were the same as in the training set, the recognition target laparoscopic surgical forceps, namely T4) Maryland (Olympus Co., Ltd., Tokyo, Japan), T5) Croce-Olmi (Karl Storz SE & Co. KG, Tuttlingen, Germany), and T6) needle holder (Karl Storz SE & Co. KG, Tuttlingen, Germany), were types not included in the training set. Representative images of T4–6 are shown in Fig. 1B.
Test set 4 contained 223 images capturing surgical instruments, extracted from 16 intraoperative videos of different types of surgery: LDG, LC, and LPH. The other conditions, including the laparoscopic recording system and recognition target surgical instruments, were the same as those of the training set.
The characteristics of the training set, validation set, and each test set are summarized in Table 1.
Table 1. Characteristics of the training set, validation set, and test sets

| Data set | Number of videos | Number of annotated images | Laparoscopic recording system | Recognition target surgical instruments | Type of surgery |
| --- | --- | --- | --- | --- | --- |
| Training set | 85 | 4074 | Olympus | T1–3 | LCRR |
| Validation set | 9 | 345 | Olympus | T1–3 | LCRR |
| Test set 1 | 10 | 369 | Olympus | T1–3 | LCRR |
| Test set 2 | 5 | 103 | Stryker, Karl Storz | T1–3 | LCRR |
| Test set 2.1 | 2 | 40 | Stryker | T1–3 | LCRR |
| Test set 2.2 | 3 | 63 | Karl Storz | T1–3 | LCRR |
| Test set 3 | 3 | 124 | Olympus | T4–6 | LCRR |
| Test set 3.1 | 1 | 31 | Olympus | T4 | LCRR |
| Test set 3.2 | 1 | 74 | Olympus | T5 | LCRR |
| Test set 3.3 | 1 | 19 | Olympus | T6 | LCRR |
| Test set 4 | 16 | 223 | Olympus | T1–3 | LDG, LC, LPH |
| Test set 4.1 | 5 | 65 | Olympus | T1–3 | LDG |
| Test set 4.2 | 5 | 81 | Olympus | T1–3 | LC |
| Test set 4.3 | 6 | 77 | Olympus | T1–3 | LPH |
T1: Harmonic shears; T2: endoscopic surgical electrocautery; T3: Aesculap AdTec atraumatic universal forceps; T4: Maryland; T5: Croce-Olmi; T6: needle holder; LCRR: laparoscopic colorectal resection; LDG: laparoscopic distal gastrectomy; LC: laparoscopic cholecystectomy; LPH: laparoscopic partial hepatectomy
Data and model optimization
Every intraoperative video was converted into MP4 format with a display resolution of 1280 × 720 pixels at a frame rate of 30 frames per second (fps); neither upsampling nor downsampling was performed. The data were split at the case level rather than the frame level; thus, no image extracted from an intraoperative video in the training set appeared in any test set.
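For illustration, such a leakage-free case-level split can be implemented along the following lines (a minimal sketch; the function name and data layout are assumptions, not the authors' actual code):

```python
import random
from collections import defaultdict

def split_by_case(frame_records, train_ratio=0.8, seed=0):
    """Split annotated frames so that all frames from a given video
    (surgical case) land in the same partition, preventing near-duplicate
    frames from leaking between the training and test sets.

    frame_records: iterable of (video_id, frame_path) pairs.
    Returns (train_frames, test_frames).
    """
    by_video = defaultdict(list)
    for video_id, frame_path in frame_records:
        by_video[video_id].append(frame_path)

    videos = sorted(by_video)
    random.Random(seed).shuffle(videos)  # deterministic shuffle of cases
    n_train = int(len(videos) * train_ratio)

    train = [f for v in videos[:n_train] for f in by_video[v]]
    test = [f for v in videos[n_train:] for f in by_video[v]]
    return train, test
```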
Annotation was performed by 14 annotators who were not medical doctors, under the supervision of surgeons; the annotation labels were manually assigned pixel by pixel by drawing directly on the area of each surgical instrument in the images using a Wacom Cintiq Pro and a Wacom Pro Pen 2 (both Wacom Co., Ltd., Saitama, Japan). Representative annotated images are shown in Supplementary Fig. 1.
Mask R-CNN with deformable convolutions [14, 21] was utilized as the instance segmentation model, with ResNet50 [22] as the backbone network, and every annotated image in the training set was input into the model. The network weights were initialized from weights pre-trained on the ImageNet [23] and COCO [24] datasets, and fine-tuning was then performed on the training set.
ImageNet is a large visual database designed for use in visual object recognition tasks; it contains more than 14 million images labelled with more than 20,000 typical categories, such as “balloon” and “strawberry.” COCO is a large-scale dataset for object detection, segmentation, and captioning; it contains more than 120,000 images with more than 880,000 labelled instances across 80 object types. The model from the epoch with the best performance on the validation set was selected. Horizontal and vertical flips were used for data augmentation. The hyperparameters used for model training are listed in Supplementary Table 1.
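A minimal MMDetection-style (v2.x) configuration expressing this setup might look as follows; the base config, checkpoint path, and exact placement of the deformable convolutions are illustrative assumptions, not the authors' published configuration:

```python
# Sketch of an MMDetection 2.x config: Mask R-CNN, ResNet-50 backbone with
# deformable convolutions, three instrument classes, and flip augmentation.
_base_ = 'configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py'  # illustrative base

model = dict(
    backbone=dict(
        # Enable deformable convolutions in the later ResNet stages
        # (which stages were deformed is an assumption).
        dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True)),
    roi_head=dict(
        bbox_head=dict(num_classes=3),   # T1-3
        mask_head=dict(num_classes=3)))

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(1280, 720), keep_ratio=True),
    # Horizontal and vertical flips, as described in the text.
    dict(type='RandomFlip', flip_ratio=0.5,
         direction=['horizontal', 'vertical']),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]

# Initialize from a COCO-pretrained checkpoint (whose ResNet-50 backbone was
# itself pre-trained on ImageNet), then fine-tune; the path is illustrative.
load_from = 'checkpoints/mask_rcnn_r50_fpn_dcn_coco.pth'
```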
Code and computer specification
The code was written in Python 3.6 (Python Software Foundation, Wilmington, DE, USA), and the model was implemented based on MMDetection [25], an open-source Python library for object detection and instance segmentation.
A computer equipped with an NVIDIA Quadro GP100 GPU with 16 GB of VRAM (NVIDIA, Santa Clara, CA, USA) and an Intel® Xeon® CPU E5-1620 v4 @ 3.50 GHz with 32 GB of RAM was utilized for network training.
Model performance
The intersection over union (IoU) and average precision (AP) were utilized as metrics to assess the model performance for the surgical instrument segmentation task.
The IoU was calculated for each pair of X (the area annotated as the ground truth) and Y (the predicted area output by the model); it measures the overlap of the two areas divided by their union:

IoU(X, Y) = |X ∩ Y| / |X ∪ Y|
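As a concrete illustration, the IoU for a pair of binary masks can be computed as follows (a sketch; NumPy arrays are assumed as the mask representation):

```python
import numpy as np

def mask_iou(x, y):
    """IoU between a ground-truth mask x and a predicted mask y,
    both given as boolean (or 0/1) arrays of the same shape."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    union = np.logical_or(x, y).sum()
    if union == 0:          # both masks empty: define IoU as 0
        return 0.0
    return float(np.logical_and(x, y).sum()) / float(union)
```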
The mean AP (mAP) is a metric that is widely used for object detection and instance segmentation tasks [23, 24, 26]. It is calculated from the area under the precision–recall curve, which is constructed from the numbers of true positives (TP), false negatives (FN), and false positives (FP). An assigned pair of X and Y was counted as a TP when its IoU was greater than 0.75 and as an FN when its IoU was less than 0.75, and predictions for which no pair could be assigned were counted as FPs.
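Under this matching rule, the TP, FN, and FP counts at the 0.75 threshold can be tallied roughly as follows (a sketch reusing mask_iou from above; greedy best-IoU assignment is an assumption, as the exact pairing procedure is not specified):

```python
def count_matches(gt_masks, pred_masks, thr=0.75):
    """Count TPs, FNs, and FPs at a given IoU threshold.

    Each prediction is greedily assigned to the unmatched ground truth
    with the highest IoU; an assigned pair counts as a TP above the
    threshold and an FN below it, and unassigned predictions count as FPs.
    """
    used = set()
    tp = fn = assigned = 0
    for pred in pred_masks:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_masks):
            if j in used:
                continue
            iou = mask_iou(gt, pred)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None:        # some overlap found: pair assigned
            used.add(best_j)
            assigned += 1
            if best_iou > thr:
                tp += 1
            else:
                fn += 1
    fp = len(pred_masks) - assigned   # predictions with no assigned pair
    return tp, fn, fp
```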
To confirm the reproducibility of the results, we trained five models with different random seeds and, for each test set, reported the metrics as the mean (± standard deviation) over the five models.
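A sketch of how such repeated runs might be made reproducible and summarized (the helper names are illustrative; the actual seed values are not reported):

```python
import random
import numpy as np
import torch

def set_seed(seed):
    """Fix the Python, NumPy, and PyTorch RNGs for one training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def summarize(scores):
    """Report a metric as mean (± standard deviation) across runs."""
    a = np.asarray(scores, dtype=float)
    return '{:.3f} (±{:.3f})'.format(a.mean(), a.std(ddof=1))
```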