3.1 Statistical description of baseline data
Subjects with many missing values or obvious data errors were excluded, leaving a final sample of 4106 subjects: 149 patients with type 2 diabetes and 3957 people from the normal population. Because the two groups differed greatly in size, the SMOTE method was used to balance the data, with parameters perc.over=2600 and perc.under=103 (sampling ratios of 2600% and 103%, respectively). After resampling, the type 2 diabetes group contained 4023 samples and the control group 3990. 70% of the data were randomly selected as the model training set, and the remaining 30% were used as the test set. The specific results are shown in Table 1.
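The group sizes after resampling can be checked with a short calculation. The sketch below assumes the perc.over/perc.under semantics of the SMOTE function in the R package DMwR (the section does not name the implementation): perc.over/100 synthetic cases are generated per original minority case, and the majority class is then sampled, with replacement, to perc.under% of the number of synthetic cases. The function name `smote_counts` is illustrative.

```python
def smote_counts(n_minority, perc_over, perc_under):
    """Class sizes after SMOTE, assuming DMwR-style perc.over/perc.under semantics."""
    # perc_over/100 synthetic minority cases are created per original case.
    n_synthetic = n_minority * perc_over // 100
    # The majority class is sampled (with replacement, so it may exceed its
    # original size) to perc_under% of the number of synthetic cases.
    n_majority_sampled = n_synthetic * perc_under // 100
    return n_minority + n_synthetic, n_majority_sampled


cases, controls = smote_counts(149, perc_over=2600, perc_under=103)
print(cases, controls, cases + controls)  # 4023 3990 8013
```

Under these assumptions the calculation reproduces the 4023 and 3990 reported above.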
3.2 Screening results of variables in univariate analysis
Univariate analysis was performed on the balanced samples at a significance level of α=0.1. Normality testing showed that most characteristic attributes were skewed in both groups, so the Mann-Whitney U rank sum test and the chi-square test in SPSS were used to analyze the quantitative and qualitative data, respectively. The results of the univariate analysis are shown in Table 2. Except for educational level, the distributions of the other 23 characteristic variables differed significantly between the case group and the control group.
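For readers without SPSS, the two tests can be illustrated in a few lines. This is a minimal sketch rather than the SPSS procedure: the Mann-Whitney test uses the normal approximation without tie or continuity correction, the chi-square test is the uncorrected Pearson statistic for a 2×2 table, and all data values are made up for illustration.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no corrections)."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the tied ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, math.erfc(abs(z) / math.sqrt(2))


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


ALPHA = 0.1  # screening threshold used in the univariate analysis
u_stat, p_quant = mann_whitney_u([1, 2, 3], [4, 5, 6])      # hypothetical values
chi2_stat, p_qual = chi_square_2x2([[10, 20], [20, 10]])    # hypothetical counts
print(p_quant < ALPHA, p_qual < ALPHA)
```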
3.3 Parameter tuning results
In this study, 10-fold cross-validation was applied to the training set (70% of the data), and the characteristic variables included were those that were statistically significant in the univariate analysis. Model parameters were adjusted repeatedly and the corresponding 10-fold cross-validation results were compared. For SVM, the linear, radial basis function, polynomial and sigmoid kernel functions were each evaluated by 10-fold cross-validation; the linear kernel gave the best prediction. For the BP neural network, the maximum number of iterations was set to 3000, and each number of hidden layer neurons in the range 5-20 was evaluated by 10-fold cross-validation; prediction was best with 20 hidden layer neurons. For the deep neural network (DNN), the number of hidden layers was varied over 8-12 and the number of neurons per hidden layer over 25-35, with the same number of neurons in every hidden layer. Prediction was best with 9 hidden layers and 33 neurons in each hidden layer.
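The tuning procedure follows one pattern for every model: score each candidate parameter value by 10-fold cross-validation on the training set and keep the best. The sketch below shows that pattern only. The toy 1-D nearest-neighbour learner, the synthetic data and all function names are stand-ins introduced for illustration; they are not the SVM, BP or DNN models of this study.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy(X, y, fit, k=10, seed=0):
    """Mean held-out accuracy over k folds; fit(X, y) returns a predict function."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(predict(X[i]) == y[i] for i in held_out) / len(held_out))
    return sum(scores) / k


def make_knn(n_neighbors):
    """Toy learner whose single hyperparameter is the number of neighbours."""
    def fit(X_train, y_train):
        def predict(x):
            nearest = sorted(zip(X_train, y_train), key=lambda p: abs(p[0] - x))
            votes = sum(label for _, label in nearest[:n_neighbors])
            return int(2 * votes > n_neighbors)  # majority vote
        return predict
    return fit


# Synthetic, perfectly separable data (illustrative only).
X = [i / 10 for i in range(100)]
y = [int(x >= 5.0) for x in X]

grid = [1, 3, 5]  # candidate hyperparameter values
cv_scores = {p: cv_accuracy(X, y, make_knn(p)) for p in grid}
best = max(cv_scores, key=cv_scores.get)
print(best, cv_scores)
```

The same loop applies when the grid holds kernel names, hidden layer sizes, or pairs of (number of layers, neurons per layer).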
3.4 Model construction results
3.4.1 Logistic regression model
A logistic regression model was fitted with all variables except education level, and stepwise regression based on the Akaike information criterion (AIC) was then used to screen variables. A total of 16 variables were retained, as shown in Table 3: age, alcohol consumption, consumption frequency of cereals, potatoes, beans, fruits, eggs, dairy, poultry and fish, DBP, FPG, TC, TG, HDL-C, and LDL-C. The variables retained by stepwise regression were applied to the training set to build the logistic model, as shown in Table 4. In this model, the factors with the greatest influence on T2DM were potato consumption frequency, fish consumption frequency, TC, FPG and HDL-C. In addition, the frequency of cereal consumption and TC were negatively correlated with the incidence of T2DM, while the other variables were positively correlated with it.
The logistic model equation is: logit(P) = −17.486 + 0.027 Age + 0.173 Drinking − 0.236 Cereals + 0.442 Potatoes + 0.176 Beans + 0.199 Fruits + 0.294 Eggs + 0.154 Milk + 0.373 Poultry + 0.491 Fish + 0.026 DBP + 2.112 FPG − 0.724 TC + 0.249 TG + 0.573 HDL-C + 0.303 LDL-C.
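To show how the equation turns a subject's values into a predicted probability, the sketch below evaluates it with P = 1/(1 + exp(−logit)). The coefficients are those of the equation above. The subject's values are hypothetical, and because the coding of the categorical variables is not given in this section, the frequency and drinking codes used here are placeholders.

```python
import math

INTERCEPT = -17.486
COEFFICIENTS = {
    "Age": 0.027, "Drinking": 0.173, "Cereals": -0.236, "Potatoes": 0.442,
    "Beans": 0.176, "Fruits": 0.199, "Eggs": 0.294, "Milk": 0.154,
    "Poultry": 0.373, "Fish": 0.491, "DBP": 0.026, "FPG": 2.112,
    "TC": -0.724, "TG": 0.249, "HDL-C": 0.573, "LDL-C": 0.303,
}


def predicted_probability(subject):
    """Inverse-logit of the fitted linear predictor."""
    logit = INTERCEPT + sum(b * subject[name] for name, b in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical subject; category codes are placeholders, not the study's coding.
subject = {
    "Age": 50, "Drinking": 1, "Cereals": 1, "Potatoes": 1, "Beans": 1,
    "Fruits": 1, "Eggs": 1, "Milk": 1, "Poultry": 1, "Fish": 1,
    "DBP": 80, "FPG": 5.0, "TC": 4.5, "TG": 1.5, "HDL-C": 1.2, "LDL-C": 2.8,
}
p_base = predicted_probability(subject)
p_high_fpg = predicted_probability({**subject, "FPG": 7.0})
print(round(p_base, 3), round(p_high_fpg, 3))
```

Holding the other variables fixed, each 1 mmol/L increase in FPG multiplies the odds by exp(2.112), which is why FPG dominates the prediction.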
The model was applied to the test set for verification, yielding the confusion matrix and ROC curve of the logistic model. As shown in Table 5, Table 6 and Figure 3(A), the accuracy of this model is 89.4%, the recall is 86.0%, the precision is 93.0%, and the area under the ROC curve (AUC) is 0.962.
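The three rates reported for each model all derive from the confusion matrix. The sketch below assumes the third reported rate is precision, TP/(TP+FP), and uses made-up counts; they are not the values in Table 5.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, recall and precision from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)        # share of true cases that were detected
    precision = tp / (tp + fp)     # share of predicted cases that were true
    return accuracy, recall, precision


# Hypothetical counts for illustration only.
accuracy, recall, precision = classification_metrics(tp=80, fn=20, fp=10, tn=90)
print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
# accuracy=0.850 recall=0.800 precision=0.889
```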
3.4.2 Support vector machine model
Using the linear kernel function, the 23 characteristic variables that were significant in the univariate analysis were entered into the SVM model on the training set. The constructed SVM model was applied to the test set for verification, and the confusion matrix and ROC curve of the SVM model were obtained. As shown in Table 5, Table 6 and Figure 3(B), the accuracy of this model is 91.2%, the recall is 89.0%, the precision is 93.3%, and the area under the ROC curve (AUC) is 0.911.
3.4.3 BP neural network
A three-layer neural network structure was adopted, with 20 neurons in the hidden layer and a maximum of 3000 iterations. The 23 characteristic variables that were significant in the univariate analysis were entered into the BP neural network model on the training set. The final model was applied to the test set for verification, and the corresponding confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(C), the accuracy of this model is 93.7%, the recall is 92.8%, the precision is 94.6%, and the area under the ROC curve (AUC) is 0.977.
3.4.4 Decision tree model
(1) CART decision tree
The 23 characteristic variables that were significant in the univariate analysis were entered into the CART decision tree model on the training set, and the resulting tree is shown in Figure 1. When FPG≥5.6 mmol/L, the subject was classified directly as type 2 diabetes; when FPG<5.6 mmol/L and Potatoes=0 or 1, as non-type 2 diabetes; when FPG<5.6 mmol/L, Potatoes≠0,1 and Age<34, as non-type 2 diabetes; when FPG<5.6 mmol/L, Potatoes≠0,1 and Age≥34, as type 2 diabetes. The constructed CART decision tree model was applied to the test set for verification, and the corresponding confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(D), the accuracy of this model is 88.7%, the recall is 84.8%, the precision is 93.3%, and the area under the ROC curve (AUC) is 0.906.
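The CART rules above amount to three nested checks. The sketch below encodes them directly; the function name and the 1/0 return coding (1 for type 2 diabetes, 0 for non-type 2 diabetes) are choices made here for illustration, and the Potatoes argument is the consumption frequency code used in the tree.

```python
def cart_predict(fpg, potatoes, age):
    """Classify one subject with the CART rules (1 = T2DM, 0 = non-T2DM)."""
    if fpg >= 5.6:            # FPG in mmol/L
        return 1
    if potatoes in (0, 1):    # potato consumption frequency code
        return 0
    return 1 if age >= 34 else 0


# Hypothetical subjects.
print(cart_predict(6.1, 0, 25))  # 1: FPG alone decides
print(cart_predict(5.0, 1, 60))  # 0: low potato frequency code
print(cart_predict(5.0, 3, 60))  # 1: higher code and Age >= 34
print(cart_predict(5.0, 3, 30))  # 0: higher code but Age < 34
```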
(2) C4.5 decision tree
The 23 characteristic variables that were significant in the univariate analysis were entered into the C4.5 decision tree model on the training set. As shown in Figure 2, the decision tree output by the C4.5 algorithm includes 6 internal nodes (including the root) and 9 leaf nodes. According to the model, type 2 diabetes was diagnosed when FPG>5.61 mmol/L. When FPG≤5.61 mmol/L: ① Potatoes=0 was classified as non-type 2 diabetes; ② Potatoes=1, Age≤54 as non-type 2 diabetes; ③ Potatoes=1, Age>54, TC≤5.11 mmol/L as type 2 diabetes; ④ Potatoes=1, Age>54, TC>5.11 mmol/L as non-type 2 diabetes; ⑤ Potatoes=2, DBP≤81 mmHg as non-type 2 diabetes; ⑥ Potatoes=2, DBP>81 mmHg as type 2 diabetes; ⑦ Potatoes=3, Age≤34 as non-type 2 diabetes; ⑧ Potatoes=3, Age>34 as type 2 diabetes. The C4.5 model was applied to the test set for verification, and the corresponding confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(E), the accuracy of this model is 88.6%, the recall is 84.9%, the precision is 92.7%, and the area under the ROC curve (AUC) is 0.888.
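The C4.5 rules can be written out the same way, which also makes the count of nine leaves easy to verify. In this sketch rule ④ is read as TC>5.11 mmol/L, the complement of rule ③ (an assumption about the intended threshold). The function name and 1/0 coding are illustrative, and Potatoes is assumed to take only the codes 0-3 listed in the rules.

```python
def c45_predict(fpg, potatoes, age, tc, dbp):
    """Classify one subject with the C4.5 rules (1 = T2DM, 0 = non-T2DM)."""
    if fpg > 5.61:                        # FPG in mmol/L
        return 1
    if potatoes == 0:                     # rule 1
        return 0
    if potatoes == 1:
        if age <= 54:                     # rule 2
            return 0
        return 1 if tc <= 5.11 else 0     # rules 3 and 4 (TC in mmol/L)
    if potatoes == 2:
        return 1 if dbp > 81 else 0       # rules 5 and 6 (DBP in mmHg)
    if potatoes == 3:
        return 1 if age > 34 else 0       # rules 7 and 8
    raise ValueError(f"unexpected Potatoes code: {potatoes!r}")


# Hypothetical subject: FPG below the cut-off, Potatoes=1, older, low TC.
print(c45_predict(fpg=5.2, potatoes=1, age=60, tc=4.8, dbp=78))  # 1 (rule 3)
```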
3.4.5 Deep neural network model construction
The 23 characteristic variables that were significant in the univariate analysis were entered into the DNN model on the training set. The model had 9 hidden layers with 33 neurons in each layer. It was applied to the test set, and the corresponding confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(F), the accuracy of this model is 84.5%, the recall is 82.9%, the precision is 86.1%, and the area under the ROC curve (AUC) is 0.845.
3.5 Comparison of model performance
The DeLong test in RStudio was used to compare the AUC values of the models, as shown in Table 7 and Figure 4. Considering both model robustness and predictive performance for type 2 diabetes on this data set, the BP neural network model performed best, with an accuracy of 93.7%, a recall of 92.8%, a precision of 94.6% and an AUC of 0.977, followed by the logistic regression model, the SVM model, the CART decision tree model, the C4.5 decision tree model and the deep neural network model. The predictive performance of SVM and CART was similar, and the difference was not statistically significant.
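For reference, the DeLong comparison of two correlated AUCs (both models scored on the same test subjects) can be sketched in plain Python. This is an illustrative implementation of the standard placement-value formulation, not the R routine used in the study; the labels and scores are made up, and the function name `delong_test` is hypothetical.

```python
import math


def _psi(x, y):
    """Pairwise kernel: 1 if the case outranks the control, 0.5 on ties."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _sample_var(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def delong_test(labels, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two-sided p) for two models on the same subjects."""
    pos = [i for i, t in enumerate(labels) if t == 1]
    neg = [i for i, t in enumerate(labels) if t == 0]
    placements = []
    for s in (scores_a, scores_b):
        # Placement values: one per case (v10) and one per control (v01).
        v10 = [sum(_psi(s[i], s[j]) for j in neg) / len(neg) for i in pos]
        v01 = [sum(_psi(s[i], s[j]) for i in pos) / len(pos) for j in neg]
        placements.append((v10, v01))
    (v10_a, v01_a), (v10_b, v01_b) = placements
    auc_a = sum(v10_a) / len(pos)
    auc_b = sum(v10_b) / len(pos)
    # Variance of the AUC difference from the paired placement differences.
    d10 = [a - b for a, b in zip(v10_a, v10_b)]
    d01 = [a - b for a, b in zip(v01_a, v01_b)]
    var = _sample_var(d10) / len(pos) + _sample_var(d01) / len(neg)
    diff = auc_a - auc_b
    if var == 0:
        z = 0.0 if diff == 0 else math.copysign(math.inf, diff)
    else:
        z = diff / math.sqrt(var)
    return auc_a, auc_b, z, math.erfc(abs(z) / math.sqrt(2))


# Made-up labels and scores for two hypothetical models.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
scores_b = [0.9, 0.4, 0.7, 0.3, 0.5, 0.6, 0.2, 0.1]
auc_a, auc_b, z, p = delong_test(labels, scores_a, scores_b)
print(auc_a, auc_b, round(z, 3), round(p, 3))
```

A p-value above the chosen significance level, as reported for SVM versus CART, means the two AUCs cannot be distinguished on the test set.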