Patient characteristics
In this retrospective study, a total of 187 patients who underwent endoscopic examination at the Hospital of Chengdu Office of People’s Government of Tibetan Autonomous Region between January 2015 and June 2019 were included. The BE conditions were confirmed by pathological examination. All data were anonymized, and ethics approval was granted by the Ethics Committee of the Hospital of Chengdu Office of People’s Government of Tibetan Autonomous Region (No. 201920).
Image acquisition
We obtained 443 endoscopic images from the 187 clinical cases. The instruments used in the examinations were Olympus GIF-HQ290 and GIF-Q260 gastroscopes (Olympus, Japan). The esophagus was cleaned and examined with white light, narrow-band imaging, and staining endoscopy. The BE extent was recorded according to the Prague classification system. The endoscope was positioned proximal to the gastroesophageal junction (GEJ), and the endoscopic image was taken. Meanwhile, biopsy samples were obtained using biopsy forceps, and the final diagnoses were confirmed by pathologists.
Image annotation
To obtain the ground truth of the BE extent in the images, we invited two senior endoscopists, each with over 15 years of experience, to manually draw the outlines using in-house developed software. More specifically, the rims of the GEJ and the squamocolumnar junction (SCJ) were delineated to define the BE extent. The experts were trained to follow the same quality standard before conducting the tasks. The first expert annotated all images, and the results were reviewed by the second expert. For any disagreement, the two experts discussed and revised the annotations until consensus was reached. The annotation information was then extracted to generate segmentation masks as the ground truth for subsequent DL algorithm training and evaluation.
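To illustrate this step, the following is a minimal sketch of how a delineated rim could be rasterized into a binary mask with NumPy and OpenCV; the function and variable names are hypothetical and do not reflect our in-house software:

    import numpy as np
    import cv2  # OpenCV, used here only to rasterize polygons

    def outline_to_mask(outline_points, height, width):
        # Fill a closed outline (a list of (x, y) vertices) into a binary mask.
        mask = np.zeros((height, width), dtype=np.uint8)
        polygon = np.asarray(outline_points, dtype=np.int32).reshape(-1, 1, 2)
        cv2.fillPoly(mask, [polygon], color=1)
        return mask

    # One possible way to derive the BE region: if the filled GEJ region contains
    # the filled SCJ region (or vice versa), the area between the two rims is
    # their symmetric difference, e.g.:
    # be_mask = np.logical_xor(gej_mask, scj_mask).astype(np.uint8)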
Deep learning algorithm
In this study, we developed a DL algorithm based on the fully convolutional network (FCN) architecture [27]. As shown in Figure 2, the network uses a stack of convolutional layers to extract abstract feature maps from an input image. After this downsampling, deconvolutional layers are appended to upsample the feature maps and generate an output of the same size as the input image. Skip connections fuse deep and shallow layers to achieve semantic segmentation at the pixel level. Furthermore, the FCN can process images of arbitrary size, which makes it well suited to medical images of varying dimensions. In the training stage, each image was fed into the FCN, and a corresponding mask was generated to indicate the segmentation. The segmentation was compared against the expert-annotated ground truth, and the resulting loss was used to update the network. After all images in the training set had been used to update the network, the trained FCN was obtained and could subsequently generate segmentations for new inputs.
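As an illustration of this structure, the following is a minimal FCN-style sketch in PyTorch; the layer sizes, channel counts, and loss choice are illustrative assumptions rather than the exact configuration used in this study:

    import torch
    import torch.nn as nn

    class MiniFCN(nn.Module):
        # A toy FCN: convolutional downsampling, deconvolutional upsampling,
        # and a skip connection that fuses shallow and deep features.
        def __init__(self, num_classes=1):
            super().__init__()
            self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
            self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
            self.dec2 = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)  # upsample x2
            self.dec1 = nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1)

        def forward(self, x):
            f1 = self.enc1(x)           # shallow features at 1/2 resolution
            f2 = self.enc2(f1)          # deep features at 1/4 resolution
            fused = self.dec2(f2) + f1  # skip connection: fuse deep and shallow
            return self.dec1(fused)     # per-pixel logits at input resolution

    model = MiniFCN()
    logits = model(torch.randn(1, 3, 256, 256))  # any size divisible by 4 works
    # Training compares sigmoid(logits) with the expert mask via a loss such as
    # binary cross-entropy and backpropagates to update the network weights.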
Since the BE region has two rims, namely the GEJ and the SCJ, we considered two approaches. First, we trained two FCNs independently, each segmenting one rim; in other words, two independent networks were trained and evaluated using the annotations to obtain the rims of the GEJ and SCJ. Second, we segmented both the GEJ and SCJ using a single trained network. We reported and compared the performance of the two approaches, and the obtained segmentations were further visualized for examination.
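The two approaches could be instantiated with the toy network above as follows; the two-channel design for the single-network approach is our illustrative assumption:

    # Approach 1: two independent networks, one per rim.
    gej_model = MiniFCN(num_classes=1)  # trained on GEJ annotations only
    scj_model = MiniFCN(num_classes=1)  # trained on SCJ annotations only

    # Approach 2: one network with a separate output channel per rim.
    joint_model = MiniFCN(num_classes=2)
    logits = joint_model(torch.randn(1, 3, 256, 256))
    gej_logits, scj_logits = logits[:, 0], logits[:, 1]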
To train and test the developed DL algorithm, we randomly divided the 443 collected images from 187 patients into two independent subsets at the patient level. This ensured that no images from any individual patient appeared in both the training and test sets. As a result, we obtained a training set (n = 150 patients; 354 images; 80%) and a test set (n = 37 patients; 89 images; 20%). The DL algorithm was first trained using the annotated images in the training set. Afterward, the trained DL algorithm was evaluated on the test set.
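A patient-level split can be sketched as follows; the variable names and the fixed random seed are hypothetical:

    import random

    def split_by_patient(patient_ids, test_fraction=0.2, seed=42):
        # Split patient IDs so that all images of a patient fall into one subset.
        ids = sorted(set(patient_ids))
        random.Random(seed).shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        return ids[n_test:], ids[:n_test]  # train IDs, test IDs

    train_ids, test_ids = split_by_patient(range(187))
    # Images are then assigned to a subset according to their patient ID,
    # guaranteeing that no patient contributes images to both subsets.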
The FCN was implemented in Python (3.7.3) using the publicly available libraries PyTorch (1.1.0), CUDA (10.1), and NumPy (1.16.2). The algorithm was trained and evaluated on a DL server equipped with a Tesla P40 graphics processing unit (GPU) running the CentOS Linux operating system (7.6.1810). Although a DL server was utilized in this study, we believe a conventional present-day workstation could deploy the trained DL algorithm and generate segmentations within an acceptable time.
Statistical analysis
In line with previous studies of image segmentation, the metric of IOU was used to measure the performance of the DL algorithms. Intuitively, the IOU indicates how well the predicted segmentation overlaps with the ground truth: an IOU closer to one indicates more favorable segmentation performance. We also reported the Dice similarity coefficient (DSC), which is similar to the IOU and widely appears in the literature [28]; however, we used the IOU as the primary measure for its simplicity and wide acceptance. The overall performance was reported as the average IOU and DSC over the test set.
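For reference, both metrics can be computed on binary masks as IOU = |A∩B| / |A∪B| and DSC = 2|A∩B| / (|A| + |B|); a minimal NumPy sketch follows:

    import numpy as np

    def iou_and_dice(pred, truth):
        # Compute IOU and DSC for two binary masks of equal shape.
        pred, truth = pred.astype(bool), truth.astype(bool)
        intersection = np.logical_and(pred, truth).sum()
        union = np.logical_or(pred, truth).sum()
        total = pred.sum() + truth.sum()
        iou = intersection / union if union else 1.0  # both masks empty: perfect match
        dice = 2 * intersection / total if total else 1.0
        return iou, dice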