In recent years, advances in computer technology have fueled the rapid growth of machine learning, and an increasing number of researchers are applying machine learning techniques to improve the diagnosis and treatment of diabetes. Saxena et al. [6] preprocessed the data with a feature selection algorithm, outlier rejection, and missing-value imputation, and then tuned the hyperparameters of a K-nearest neighbors classifier; among the models compared, random forest achieved the highest accuracy of 79.8%. Similarly, Krishnamoorthi et al. [7] applied missing-value handling, outlier removal, and normalization, and classified the processed data using their proposed logistic regression model.
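The preprocess-then-tune workflow described above can be sketched as follows. This is a minimal illustration, not the pipeline of [6]: the synthetic data, the z-score outlier rule, and the parameter grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy stand-in for the clinical data: 200 samples, 8 features, ~5% missing
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan

# crude outlier rejection: drop rows with any |z|-score above 3
z = np.abs((X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0))
keep = ~(z > 3).any(axis=1)
X, y = X[keep], y[keep]

# impute remaining gaps, scale, then tune k for K-nearest neighbors
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Wrapping the imputer and scaler in the pipeline keeps every cross-validation fold's statistics computed on its own training split, avoiding leakage during the grid search.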
Butt et al. [8] investigated diabetes classification and prediction using a range of classifiers and models: three classifiers (random forest, multilayer perceptron, and logistic regression) in conjunction with three predictive models (LSTM, MA, and LR). Their findings revealed that the multilayer perceptron yielded the most accurate classification, with an accuracy of 86.06%, while the LSTM model was the most effective predictor, with an accuracy of 87.26%.
Garcia-Ordas et al. [9] addressed data imbalance by using a variational autoencoder for data augmentation, followed by a sparse autoencoder for feature augmentation, expanding the PIMA dataset from the original 8 features to 400. Joint training of a convolutional neural network with the sparse autoencoder achieved 92.31% accuracy, outperforming traditional models.
Hasan et al. [10] applied various data preprocessing techniques to improve data quality, followed by ensemble classifiers such as AdaBoost and gradient boosting. Bukhari et al. [11] proposed an improved ANN model based on an artificial back-propagation scaled conjugate gradient neural network (ABP-SCGNN) algorithm, which achieved a high accuracy of 93% without any data preprocessing.
In [12], the authors handled missing data by filling in the mean of each column and then trained six models: naive Bayes (NB), linear regression (LR), random forest (RF), AdaBoost (AB), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost). The XGBoost model achieved the highest accuracy, 77.54%. In another study, Maniruzzaman et al. [13] addressed missing data and outliers through group-median and median interpolation techniques. They then performed feature extraction and optimization using six feature selection techniques: random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio. Combining these with ten classifiers (linear discriminant analysis, quadratic discriminant analysis, naive Bayes, Gaussian process classification, support vector machine, artificial neural network, AdaBoost, logistic regression, decision tree, and random forest) in experiments on the PIMA dataset, they reported a remarkably high accuracy of 92.26%.
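The impute-then-compare protocol of [12] can be sketched as below. The data are synthetic, and scikit-learn's gradient boosting stands in for XGBoost, which is a separate third-party package; the model list is therefore an approximation, not a reproduction.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing entries

# fill each column's missing entries with that column's mean
col_mean = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_mean, X)

models = {
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "AB": AdaBoostClassifier(random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Mean imputation is the simplest baseline; it preserves each column's average but shrinks its variance, which is one reason later work ([13]) moved to group-wise medians.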
Zou et al. [14] used a series of machine learning algorithms such as decision trees, random forests, and neural networks to predict diabetes, with PCA and mRMR for dimensionality reduction. The models were evaluated using five-fold cross-validation and independent testing experiments; random forest achieved the highest accuracy of 80.84% when all features were used.
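The evaluation design in [14] — reduced versus full feature sets under five-fold cross-validation — can be sketched as follows. mRMR is omitted here because it requires a third-party package; the synthetic data and the choice of five principal components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(250, 14))
y = (X[:, :3].sum(axis=1) > 0).astype(int)  # signal in the first 3 features

results = {}
for label, n_components in (("all", None), ("pca5", 5)):
    # with n_components=None we skip PCA and use all 14 features
    steps = ([PCA(n_components=n_components)] if n_components else []) + [
        RandomForestClassifier(n_estimators=100, random_state=0)
    ]
    results[label] = cross_val_score(make_pipeline(*steps), X, y, cv=5).mean()
    print(label, round(results[label], 3))
```

Putting PCA inside the pipeline refits the projection on each training fold, so the cross-validation estimate stays honest.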
Hayashi and Yukita [15] proposed the rule extraction algorithm Re-RX with J48graft, combined with a sampling selection technique, on the Pima Indian Diabetes (PID) dataset to obtain highly accurate, concise, and interpretable classification rules; the average accuracy over ten runs of 10-fold cross-validation was 83.83%.
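Re-RX and J48graft are Weka-side algorithms with no scikit-learn counterpart, but the end product — human-readable classification rules — can be illustrated by printing the rules of a shallow decision tree. This analogy is an assumption of the sketch, not the authors' method.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# small synthetic binary problem standing in for the PID data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# a depth-2 tree yields at most four rules, keeping them concise
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=[f"f{i}" for i in range(4)])
print(rules)
```

Each root-to-leaf path in the printed tree reads as one if-then rule, which is the interpretability property the rule-extraction line of work optimizes alongside accuracy.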
Alneamy et al. [16] proposed an algorithm based on the Teaching-Learning-Based Optimization (TLBO) algorithm and a new classification technique combining a Fuzzy Wavelet Neural Network (FWNN) with a Functional Link Neural Network (FLNN). TLBO was used to train the hybrid Functional Fuzzy Wavelet Neural Network (FFWNN) and optimize its learning parameters, achieving an accuracy of 88.67% on the PIDD dataset.
Chang et al. [17] trained and tested three interpretable supervised machine learning models, namely a naive Bayes classifier, a random forest classifier, and a J48 decision tree, on the Pima Indians diabetes dataset. By analyzing the performance and decision-making process of each algorithm, they concluded that naive Bayes is better suited to refined binary feature selection, while random forest performs better when more features are involved.
Maniruzzaman et al. [18] applied Gaussian process (GP) classification with three kernel functions: linear, polynomial, and radial basis. They compared GP classification against existing techniques such as LDA, QDA, and NB, evaluating performance with accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), and the receiver operating characteristic (ROC) curve. In experiments on the PIMA dataset, the GP model reached an accuracy of 81.97%. Joshi and Dhakal [19] focused on predicting type 2 diabetes in Pima Indian women using a logistic regression model and a decision tree algorithm. Their analysis identified glucose, pregnancy, body mass index (BMI), diabetes pedigree function, and age as the main predictors of type 2 diabetes. The classification tree confirmed the importance of glucose, BMI, and age, while also highlighting pregnancy and the diabetes pedigree function. The model achieved a prediction accuracy of 78.26% with a cross-validation error rate of 21.74%.
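The kernel-based comparison in [18] can be sketched with scikit-learn's GP classifier against LDA on synthetic data. The RBF kernel, the data-generating rule, and the train/test split below are assumptions of this illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # nonlinear (XOR-like) boundary
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# GP with a radial basis kernel vs. a linear discriminant baseline
gp = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=0).fit(Xtr, ytr)
lda = LinearDiscriminantAnalysis().fit(Xtr, ytr)

gp_acc = accuracy_score(yte, gp.predict(Xte))
lda_acc = accuracy_score(yte, lda.predict(Xte))
print("GP ", round(gp_acc, 3))
print("LDA", round(lda_acc, 3))
```

On a boundary like this, the kernel choice is doing the real work: a linear kernel would collapse to roughly the LDA result, while the RBF kernel lets the GP bend around the interaction.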
Ejiyi et al. [20] proposed robust frameworks for predictive diabetes diagnosis from limited medical data on women aged 21 to 81. The frameworks include data augmentation, attribute analysis, and missing-data imputation as preliminary steps. Using SHAP, glucose, age, and BMI were identified as the most important features for prediction. XGBoost and AdaBoost performed best among the ML algorithms tested, with an accuracy of 94.67% and F1 scores of 95.27 and 95.95, respectively.
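The feature-ranking step can be illustrated without the third-party `shap` package by using scikit-learn's permutation importance as a plainly named stand-in: like SHAP, it attributes predictive contribution to individual features, though by a different mechanism. The synthetic data, in which only the first two features carry signal, is an assumption of the sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)  # features 0 and 1 carry the signal

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# shuffle each feature in turn and measure the drop in accuracy
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print([int(i) for i in ranking])
```

Feature 0, with twice the weight of feature 1, should top the ranking, mirroring how glucose dominated the SHAP attributions reported in [20].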
Although the aforementioned scholars applied a range of data processing methods, some of these methods, despite their complexity, did not yield the desired results. Moreover, some studies relied on machine learning models that were too simplistic, resulting in suboptimal accuracy. In this article, we therefore analyze and improve upon these methods and establish a high-performance intelligent diagnosis framework for diabetes.