Ethical approval
The project was approved by the Ethical Review Board of the local university (approval number NJSH-2022NL-069).
Subjects and dataset
We built an in-house dataset of oral photos collected in the Department of Periodontics, Orthodontics and Endodontics, Stomatological Hospital, from January 2020 to December 2022. The dataset contained 683 images captured by postgraduate dentists from 134 gingivitis patients and healthy individuals. The images cover a wide age range, from 14 to 60 years old. Images of teeth with severe cervical caries and of periodontitis with severe gingival recession were excluded. The methods were conducted in accordance with the relevant guidelines and regulations, and written informed consent was obtained from each participant. The diagnosis of chronic gingivitis requires that two criteria be met: (1) clinical symptoms including bleeding with tooth brushing, blood in the saliva, and gingival swelling and redness (Figure 1); and (2) no attachment loss on periodontal probing and no loss of supporting structures on radiographic analysis.
Implicit ordering may exist in the initially gathered data, which could negatively affect the DCNN model’s accuracy. Therefore, before training, we shuffled the data to disrupt any such order.
We divided the dataset into training, validation, and testing subsets by randomly splitting the photos into three groups. The training set was used to update the model, hyperparameters were tuned on the validation set, and the test set was used to evaluate the model’s performance. To make full use of the limited data when training models with large numbers of parameters, we additionally employed cross-validation. Table 1 provides details about the distribution of the dataset.
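As an illustrative sketch (the seed and exact splitting procedure are assumptions; only the split sizes come from Table 1), the shuffle-then-split step might look like:

```python
import random

def shuffle_and_split(items, seed=42):
    """Shuffle to break any implicit ordering, then split into
    roughly 4/6 training, 1/6 validation, and 1/6 test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_val = n_test = n // 6
    train = items[: n - n_val - n_test]
    val = items[n - n_val - n_test : n - n_test]
    test = items[n - n_test :]
    return train, val, test

train, val, test = shuffle_and_split(range(590))
print(len(train), len(val), len(test))  # 394 98 98, matching Table 1
```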
Table 1 Numbers of images with positive and negative findings assigned to the training, validation, and testing subsets
|          | Training |       | Validation |       | Test |       | Total |       |
|----------|----------|-------|------------|-------|------|-------|-------|-------|
|          | N        | P (%) | N          | P (%) | N    | P (%) | N     | P (%) |
| Positive | 346      | 66.8  | 86         | 16.6  | 86   | 16.6  | 518   | 100   |
| Negative | 48       | 66.8  | 12         | 16.6  | 12   | 16.6  | 72    | 100   |
| Total    | 394      | 66.8  | 98         | 16.6  | 98   | 16.6  | 590   | 100   |
Note: Numbers of images and their percentages within the whole dataset are represented as N and P, respectively.
1) Training dataset
Identifying gingivitis is a complex process that involves the color of the gums, the degree of swelling, and the bleeding condition. Accordingly, we selected models with large numbers of layers to extract these complex features. The model was trained over many iterations on the training data to update its parameters.
2) Validation dataset
The model’s hyperparameters, including the numbers of layers and neurons, affect the final recognition accuracy. In this study, we tested numerous configurations of each model and chose the one that performed best on the validation set as the final version. As the number of training epochs increases, the model accumulates considerable irrelevant knowledge, such as image brightness, the placement and size of the teeth, and features of other oral conditions such as black stains and dental calculus. We therefore also used the validation set to halt model updates early.
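A minimal sketch of such an early-stopping criterion (the patience value and the use of validation accuracy as the monitored metric are assumptions):

```python
class EarlyStopping:
    """Stop updating the model once validation accuracy stops improving."""
    def __init__(self, patience=5):
        self.patience = patience        # epochs to tolerate without improvement
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, val_accuracy):
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `should_stop` would be called once per epoch with the current validation accuracy, and training halts when it returns `True`.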
3) Test dataset
We used the test set to evaluate the final performance of the model based on accuracy and analysis of the receiver operating characteristic (ROC) curves.
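For reference, both metrics can be computed from labels and predicted scores; the following is a plain-Python sketch (AUC here is the rank-based equivalent of the area under the ROC curve, with ties counting half):

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive example is scored
    above a random negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```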
4) Cross-validation method
Because of the complexity of the task and the small amount of data, we needed to make full use of the collected data. We therefore used a cross-validation method to train and test multiple groups of models with different training/test splits, as shown in Figure 2.
We randomly and evenly divided the data into six parts and selected four of them as the training dataset. After making five separate selections, we chose the best outcome as the final result.
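The splitting described above can be sketched as follows; the exact rotation of folds between validation and test roles is an illustrative assumption, since the text specifies only six parts, four of which train the model, over five selections:

```python
import random

def six_fold_splits(items, n_folds=6, seed=0):
    """Divide the data randomly and evenly into six parts; in each selection,
    four parts train the model and the remaining two serve as validation
    and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds - 1):        # five separate selections
        test = folds[k]
        val = folds[k + 1]
        train = [x for i, fold in enumerate(folds)
                 if i not in (k, k + 1) for x in fold]
        yield train, val, test
```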
ConvNet models
We selected several models, trained them, and compared their performances to identify the best one. Finally, we combined them through ensemble learning, which outperformed any single model. Figure 3 shows the overall framework of the model.
1) AlexNet
AlexNet uses ReLU as the activation function, which keeps only the most salient elements of a region. As an illustration, consider the color of the gingiva: only the components whose color depth exceeds the threshold are kept, and the remaining features are discarded (Table 2).
Table 2 AlexNet uses the ReLU activation function to make the update process more stable.
| Layer | Kernel | Dimension | Activation |         |
|-------|--------|-----------|------------|---------|
| Conv  | 11     | 48        | ReLU       | MaxPool |
| Conv  | 5      | 128       | ReLU       | MaxPool |
| Conv  | 3      | 192       | ReLU       |         |
| Conv  | 3      | 192       | ReLU       |         |
| Conv  | 3      | 128       | ReLU       | MaxPool |
| FC    |        | 2048      | ReLU       | Dropout |
| FC    |        | 2048      | ReLU       | Dropout |
| FC    |        | 2         |            |         |
Note: “Conv” is short for convolutional, and “FC” stands for the fully connected layer.
To increase the diversity of the learned representations, we randomly eliminated some of the intermediate results using the dropout function.
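The two operations just described can be sketched in plain Python (the dropout probability and the inverted-scaling convention are standard assumptions, not values from the paper):

```python
import random

def relu(features):
    """Keep only features above the zero threshold; discard the rest."""
    return [max(0.0, x) for x in features]

def dropout(features, p=0.5, rng=None, training=True):
    """Randomly eliminate intermediate results during training,
    scaling the survivors by 1/(1-p) to preserve the expected value."""
    if not training:
        return list(features)
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < p else x / (1 - p) for x in features]

print(relu([-0.3, 1.2, -2.0, 0.7]))  # [0.0, 1.2, 0.0, 0.7]
```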
2) VGG
For the model to quickly capture the overall information of an image, information from distant regions must be aggregated rapidly. We used stacked 3×3 convolution kernels to gather tooth and gingival information over large distances, obtaining a general assessment without a precise division of tooth position (Table 3).
Table 3 The VGG model can quickly aggregate information from long distances in images
| Layer | Kernel | Dimension | Activation |         |
|-------|--------|-----------|------------|---------|
| Conv  | 3      | 64        | ReLU       | Block1  |
| Conv  | 3      | 64        | ReLU       | MaxPool |
| Conv  | 3      | 128       | ReLU       | Block2  |
| Conv  | 3      | 128       | ReLU       | MaxPool |
| Conv  | 3      | 256       | ReLU       | Block3  |
| Conv  | 3      | 256       | ReLU       |         |
| Conv  | 3      | 256       | ReLU       | MaxPool |
| Conv  | 3      | 512       | ReLU       | Block4  |
| Conv  | 3      | 512       | ReLU       |         |
| Conv  | 3      | 512       | ReLU       | MaxPool |
| Conv  | 3      | 512       | ReLU       | Block5  |
| Conv  | 3      | 512       | ReLU       |         |
| Conv  | 3      | 512       | ReLU       | MaxPool |
| FC    |        | 4096      | ReLU       | Dropout |
| FC    |        | 4096      | ReLU       | Dropout |
| FC    |        | 2         |            |         |
Note: “Conv” is short for convolutional, and “FC” stands for the fully connected layer.
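The receptive-field growth behind this design can be checked with simple arithmetic (stride 1, ignoring the pooling layers, which enlarge the field further):

```python
def receptive_field(num_convs, kernel=3):
    """Receptive field of a stack of stride-1 convolutions:
    each extra layer adds (kernel - 1) pixels of context."""
    rf = 1
    for _ in range(num_convs):
        rf += kernel - 1
    return rf

# Two stacked 3x3 convs see as far as one 5x5; three see as far as one 7x7,
# with fewer parameters and more nonlinearities.
print(receptive_field(2), receptive_field(3))  # 5 7
```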
3) GoogLeNet
Each model layer’s dimension may correspond to higher-dimensional attributes of the tooth. To reduce the feature dimension, we used a 1×1 convolution kernel (Table 4).
Table 4 The GoogLeNet model uses a 1×1 convolution kernel to reduce the dimensions
| Layer | Kernel | Dimension | Activation |         |
|-------|--------|-----------|------------|---------|
| Conv  | 7      | 64        | ReLU       | MaxPool |
| Conv  | 1      | 64        | ReLU       |         |
| Conv  | 3      | 192       | ReLU       | MaxPool |

| Layer     | Dimension | Conv1 | Conv3 + Conv1 | Conv5 + Conv1 | Conv1 + MaxPool |
|-----------|-----------|-------|---------------|---------------|-----------------|
| Inception | 192       | 64    | 128 / 96      | 32 / 16       | 32              |
| Inception | 256       | 128   | 192 / 128     | 96 / 32       | 64              |
| Inception | 480       | 192   | 208 / 96      | 48 / 16       | 64              |
| Inception | 512       | 160   | 224 / 112     | 64 / 24       | 64              |
| Inception | 512       | 128   | 256 / 128     | 64 / 24       | 64              |
| Inception | 512       | 112   | 288 / 144     | 64 / 32       | 64              |
| Inception | 528       | 256   | 320 / 160     | 128 / 32      | 128             |
| Inception | 832       | 256   | 320 / 160     | 128 / 32      | 128             |
| Inception | 832       | 384   | 384 / 192     | 128 / 48      | 128             |

In the Inception rows, “a / b” gives the branch’s output channels (a) and the channels of its 1×1 reduction (b).

| Layer        | Dimension | Output |                  |
|--------------|-----------|--------|------------------|
| InceptionAux | 512       | 2      |                  |
| InceptionAux | 512       | 2      |                  |
| FC           | 1024      | 2      | AvgPool, Dropout |
Note: “Conv” is short for convolutional, and “FC” stands for the fully connected layer.
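The dimension reduction can be demonstrated with a single 1×1 convolution; the channel counts below follow the 480-dimensional Inception row of Table 4, while the spatial size is an illustrative assumption:

```python
import torch
import torch.nn as nn

# A 1x1 convolution mixes channels at each pixel without looking at
# neighbors, so it can shrink the feature dimension cheaply before a
# larger kernel runs.
reduce = nn.Conv2d(in_channels=480, out_channels=96, kernel_size=1)

x = torch.randn(1, 480, 14, 14)   # feature map entering an Inception module
y = reduce(x)
print(tuple(y.shape))  # (1, 96, 14, 14): channels reduced, spatial size unchanged
```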
4) ResNet
We used a 34-layer version of the ResNet model (Table 5). We found that the low-level information and position information in the image were helpful throughout the recognition process. Basic knowledge of structure and color is learned in the first few layers and remains useful in subsequent layers. ResNet adds the results of previous layers to subsequent layers by using residual structures that skip intermediate steps, so the knowledge learned in an earlier layer can be transferred directly forward.
Table 5 The ResNet model adds the results of previous layers to subsequent layers by using residual structures.
| Layer | Kernel | Dimension X | Dimension Y | Residual Network |
|-------|--------|-------------|-------------|------------------|
| Conv  | 7      | 112         | 112         |                  |
| Conv  | 3      | 56          | 56          |                  |
| Conv  | 3      | 56          | 56          | 3                |
| Conv  | 3      |             |             |                  |
| Conv  | 3      | 28          | 28          | 1                |
| Conv  | 3      |             |             |                  |
| Conv  | 3      | 28          | 28          | 3                |
| Conv  | 3      |             |             |                  |
| Conv  | 3      | 14          | 14          | 1                |
| Conv  | 3      |             |             |                  |
| Conv  | 3      | 14          | 14          | 5                |
| Conv  | 3      |             |             |                  |
| Conv  | 3      | 7           | 7           | 1                |
| Conv  | 3      | 7           | 7           | 2                |
| Conv  | 3      |             |             |                  |
| FC    |        | 1000        | 2           | AvgPool          |
Note: “Conv” is short for convolutional, and “FC” stands for the fully connected layer.
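A minimal sketch of one such residual unit (the basic two-conv block of ResNet-34; the channel count and input size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A residual unit: two 3x3 convolutions whose output is added to the
    unchanged input, so knowledge from earlier layers skips straight ahead."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)     # residual addition: input is carried forward

block = BasicBlock(64)
x = torch.randn(1, 64, 56, 56)
print(tuple(block(x).shape))  # (1, 64, 56, 56): shape preserved through the block
```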
5) Ensemble Learning
The AlexNet, GoogLeNet, and ResNet models performed well in terms of accuracy and area under the ROC curve (AUC), whereas the VGGNet model performed poorly. Because different models fail on different inputs, we combined the findings of the three well-performing models using ensemble learning. AlexNet, GoogLeNet, and ResNet each made a prediction for an image, and the ensemble output was the label predicted by the largest number of the three models. The ensemble learning process is shown in Figure 3.
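The voting step reduces to a small function; the label strings below are illustrative placeholders:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

# AlexNet, GoogLeNet, and ResNet each classify the same image:
print(majority_vote(["gingivitis", "healthy", "gingivitis"]))  # gingivitis
```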
Training strategies
To train the models quickly, we employed transfer learning, along with a small number of hyperparameters for the optimization method. Transfer learning [15] was used for initialization: all four models were initialized with parameters developed on an open dataset. The loss in the early stage of training was larger than in later stages because the tasks on which the transferred parameters were learned differed from ours. The GoogLeNet and ResNet models showed substantial loss values in the first few epochs, whereas the VGG and AlexNet models displayed smaller ones. As the number of epochs increased, the loss values of all four models converged to lower values [16].