A total of 3932 oral photos were captured from 625 patients admitted to the Departments of Periodontics, Orthodontics, and Endodontics of Nanjing Stomatological Hospital, Nanjing University, between January 2018 and December 2019. All photos were captured by postgraduate dental students and dentists, and the patients' ages ranged from 14 to 60 years. The project was approved by the Ethical Review Board at Nanjing University (approval 2019NL-065(KS)). The methods were conducted in accordance with the approved guidelines, and written informed consent was obtained from each participant. For participants under the age of 18, written informed consent was obtained from their parents/guardians. To approximate the image quality of practical scenarios, the photos were collected with various devices, including an iPhone 8, iPhone 7, Samsung Galaxy S8, and Canon 6D. No inclusion or exclusion criteria regarding image properties, e.g., lighting or resolution, were applied. Three dental conditions were considered in this study: gingivitis, dental calculus, and soft deposit. Within the dataset, 3,175 photos show gingivitis, 921 show dental calculus, and 746 show soft deposits; note that each photo can show none, one, or more of these conditions. All photos were pseudonymized, and no other image processing steps were performed.
We split the data into training, validation, and testing subsets by randomly partitioning the patients into three independent groups, so that all photos from a given patient appear in only one of the three subsets. Table 1 shows the patient and photo split. Table 2 shows the distribution of photos with positive findings across the whole dataset and the three subsets.
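The patient-level split described above can be sketched as follows; the record structure, helper name, and seed handling are illustrative, not from the original pipeline:

```python
import random

def split_by_patient(photo_records, ratios=(0.55, 0.15, 0.30), seed=0):
    """Split photos into train/val/test subsets so that all photos of a
    given patient land in exactly one subset (hypothetical helper)."""
    patients = sorted({r["patient_id"] for r in photo_records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n = len(patients)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    # Assign every photo to the subset that owns its patient
    return {
        name: [r for r in photo_records if r["patient_id"] in ids]
        for name, ids in groups.items()
    }
```

Because patients, not photos, are shuffled, no patient's photos can leak between subsets.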
Table 1
Numbers of Patients and Images Assigned to Training, Validation, and Testing Subsets.
| | Training | Validation | Testing | Total |
| --- | --- | --- | --- | --- |
| Patients | 344 (55.04%) | 94 (15.04%) | 187 (29.92%) | 625 (100%) |
| Images | 2138 (54.37%) | 608 (15.46%) | 1186 (30.16%) | 3932 (100%) |
Table 2
Distribution of images with positive findings among training, validation, and testing subsets for each type of diagnosis. The table shows the numbers of images with positive findings, as well as their proportions among all images of the corresponding subset.
| | Gingivitis | Dental Calculus | Soft Deposit |
| --- | --- | --- | --- |
| Training | 1726 (80.73%) | 514 (24.04%) | 424 (19.83%) |
| Validation | 469 (77.14%) | 175 (28.78%) | 116 (19.08%) |
| Testing | 980 (82.63%) | 232 (19.56%) | 206 (17.37%) |
| Total | 3175 (80.75%) | 921 (23.42%) | 746 (18.97%) |
Ground Truth Annotations.
We collected reference annotations of the three dental conditions for all photos from three board-certified dentists. Specifically, the dataset was evenly split and assigned to the three dentists, and each image was independently labeled by one of them, with reference to the corresponding clinical report, using the labeling software Labelbox (Labelbox, Inc, CA). For gingivitis and dental calculus, we collected bounding-box annotations indicating the locations of the conditions. Since disease regions may lack well-defined boundaries in some cases, we followed a common approach16 by instructing the dentists to focus on the correctness of the box centers. For soft deposit, we collected only image-level classification labels, since this condition typically spreads across the oral cavity and its locations are labor-intensive to annotate.
Problem Formulation and Model Architecture.
We formulated the problem as a mixture of object localization and image classification. Specifically, we developed a CNN with Multi-task Learning (MTL)17,18 to solve both tasks jointly, in order to increase the generalization19 and compactness of the model. Figure 1 shows the overall architecture of our MTL model, which takes oral images (Figure 1(a)) as input and outputs both the diagnoses and the locations of the detected conditions (Figure 1(e)). The model consists of three subnets: (i) FNet (feature extraction subnet), (ii) LNet (localization subnet), and (iii) CNet (classification subnet). FNet (Figure 1(b)) extracts deep features from the input image through a stack of convolutional layers and was trained to be discriminative for both the localization and classification tasks. LNet (Figure 1(c)) regresses over the feature maps derived from FNet to produce a set of location vectors, where each vector y encodes one bounding box by its coordinates, height, width, and probabilities for gingivitis and dental calculus. Similar to20, the proposed bounding boxes are aligned to the nearest ground-truth boxes during training to approximate the localizations, and are filtered with Non-Maximum Suppression (NMS)21 during inference to reduce overlapping findings. CNet (Figure 1(d)) applies fully connected operations to the extracted feature maps to output a single scalar, whose value represents the probability of the presence of soft deposits.
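A minimal PyTorch sketch of the three-subnet layout might look as follows; the layer sizes, channel counts, and per-cell box encoding are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MTLDentalNet(nn.Module):
    """Sketch of the FNet/LNet/CNet multi-task design (hypothetical sizes)."""

    def __init__(self, num_box_classes=2, boxes_per_cell=3):
        super().__init__()
        # FNet: shared convolutional feature extractor
        self.fnet = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # LNet: per-cell location vectors (x, y, w, h, objectness,
        # plus class probabilities for gingivitis / dental calculus)
        self.lnet = nn.Conv2d(64, boxes_per_cell * (5 + num_box_classes), 1)
        # CNet: image-level probability of soft deposit
        self.cnet = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.fnet(x)          # shared features feed both heads
        return self.lnet(feats), self.cnet(feats)
```

Both heads consume the same FNet features, which is what couples the two tasks during training.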
To help users comprehensively understand the diagnosis results, we aim to highlight the spatial locations of the detected dental conditions. For gingivitis and dental calculus, the bounding boxes from the model already localize the regions of interest. However, for soft deposit, the model only produces classification results, since ground-truth location maps are labor-costly to annotate. Thus, gradient-based class activation maps22 were used to reveal the areas of the image that are most indicative of the classification.
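The gradient-based class activation mapping (Grad-CAM) referenced above can be sketched generically as follows (not the authors' code): channel weights are the spatially averaged gradients of the class score with respect to the feature maps, and the map is the ReLU of the channel-weighted sum.

```python
import torch

def grad_cam(feature_maps, score):
    """Generic Grad-CAM sketch for one image.
    feature_maps: (1, C, H, W) activations in the autograd graph;
    score: scalar class score computed from those activations."""
    grads, = torch.autograd.grad(score, feature_maps)
    weights = grads.mean(dim=(2, 3), keepdim=True)   # GAP of gradients
    cam = torch.relu((weights * feature_maps).sum(dim=1))
    return cam / (cam.max() + 1e-8)                  # normalize to [0, 1]
```

The normalized map can then be upsampled to the input resolution and rendered as a heat-map over the photo.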
Figure 1(e) shows an example result from our system. The detected gingivitis and dental calculus are pinpointed with boxes, and soft deposits are hinted at with a heat-map, where a higher temperature indicates stronger relevance of a region.
Implementation and Training Strategy.
To train the model, we defined the loss as an equally weighted sum of the smooth L1 loss for bounding box regression20 and the cross-entropy loss for classification23. We employed intensive augmentations of the input images24, including random shifts, crops, rotations, scaling, and color channel shifts (random changes of hue, saturation, and exposure). These augmentations are intended to increase the robustness of the model for in-the-wild application. Moreover, we employed transfer learning by initializing FNet from VGG-1625 pre-trained on a large-scale image recognition task to speed up the training process26.
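The combined loss can be sketched as below; the matching of predicted boxes to ground-truth boxes is assumed to have been done already, and using binary cross-entropy for the single soft-deposit output is a simplifying assumption:

```python
import torch
import torch.nn.functional as F

def mtl_loss(pred_boxes, gt_boxes, pred_cls, gt_cls):
    """Equally weighted sum of smooth-L1 box regression loss and
    cross-entropy classification loss (sketch).
    pred_boxes/gt_boxes: matched (N, 4) box tensors;
    pred_cls: soft-deposit probabilities in (0, 1); gt_cls: 0/1 labels."""
    loc_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)
    cls_loss = F.binary_cross_entropy(pred_cls, gt_cls)
    return loc_loss + cls_loss   # equal weighting, as in the text
```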
The CNN model was developed using the PyTorch framework and trained with a mini-batch size of 16 per GPU on three Nvidia 1080 Ti GPUs. The validation set was used to determine the early stopping of the training process. Parameter updates were calculated using the Adam algorithm, with the learning rate set to 1e-4 and the decay rate set to 5e-4.
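The optimizer configuration and a simple validation-based early-stopping rule might look as follows; interpreting the quoted decay rate as Adam's weight decay, and the patience value, are assumptions:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the full CNN
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,           # learning rate from the text
    weight_decay=5e-4, # assumed reading of the quoted "decay rate"
)

def should_stop(val_losses, patience=5):
    """Stop when the best validation loss has not improved for
    `patience` consecutive epochs (patience value is illustrative)."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses)
    return best not in val_losses[-patience:]
```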
Evaluation Metrics and Statistical Analysis.
We evaluated the model from two aspects: (i) classification performance in determining the existence of a condition, and (ii) localization performance in indicating the image regions related to a diagnosis.
In terms of classification performance, we utilized the Receiver Operating Characteristic (ROC) curve, which plots the true-positive rate (TPR), or sensitivity, against the false-positive rate (FPR), or 1 − specificity, as a function of varying discrimination thresholds. For gingivitis and dental calculus, we took the highest probability among the detected bounding boxes as the predicted classification probability of an image. The area under the ROC curve (AUC) was used to compare models.
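Taking the maximum box probability as the image-level score, and computing AUC via the rank-sum (Mann-Whitney) formulation, can be sketched as:

```python
def image_probability(box_probs):
    """Image-level score for a box-producing class: the highest
    probability among that image's detected boxes (0 if no boxes)."""
    return max(box_probs, default=0.0)

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney)
    formulation: the fraction of positive/negative pairs ranked
    correctly, with ties counted as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every positive image scores above every negative one; 0.5 is chance level.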
In terms of localization performance, we utilized the Free-Response ROC (FROC) curve, which plots the bounding-box-wise TPR, or sensitivity, against the average number of false-positive (FP) boxes per image, as a function of varying thresholds on the box probabilities. Following the practice of van Ginneken, a predicted box was counted as a hit if its center falls within a ground-truth box. Analogously, the Area Under the FROC curve (FAUC) was used as the index of performance, defined as the average sensitivity at false-positive rates of 1/2, 1, 2, and 3 per image; a larger value indicates better performance. To measure the quality of the soft deposit predictions, we followed Selvaraju et al.22 by collecting agreement ratings from three board-certified dentists for the localization heat-map of each testing image. Specifically, we showed the dentists the images in which soft deposits were detected, together with the localization heat-maps visualized as in Figure 1(e). Each dentist then gave a rating on a scale from 1 (strongly disagree) to 5 (strongly agree), evaluating whether the heat-map demonstrates the regions of the condition in their opinion.
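The center-based hit criterion and the FAUC average can be sketched as follows (a generic illustration; the sensitivities at each FP rate are assumed to be computed elsewhere on the FROC curve):

```python
def box_center_hit(pred_box, gt_box):
    """A predicted box counts as a hit if its center falls inside a
    ground-truth box. Boxes are (x_min, y_min, x_max, y_max)."""
    cx = (pred_box[0] + pred_box[2]) / 2
    cy = (pred_box[1] + pred_box[3]) / 2
    return gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]

def fauc(sensitivity_at_fp, fp_rates=(0.5, 1, 2, 3)):
    """FAUC as defined in the text: the average sensitivity at the
    given FP-per-image rates. `sensitivity_at_fp` maps each FP rate
    to the sensitivity at that operating point."""
    return sum(sensitivity_at_fp[r] for r in fp_rates) / len(fp_rates)
```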