In recent years, advances in computer technology have fueled the rapid growth of machine learning, and an increasing number of researchers are applying machine learning techniques to improve the diagnosis and treatment of diabetes. Saxena et al. [6] preprocessed the data with a feature selection algorithm, outlier rejection, and missing-value imputation, and then tuned the hyperparameters of a K-nearest neighbors classifier; among the models compared, random forest achieved the highest accuracy of 79.8%. Similarly, Krishnamoorthi et al. [7] applied missing-value handling, outlier removal, and normalization, and classified the processed data using their proposed logistic regression model.
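The preprocess-then-tune workflow described above can be sketched as follows. This is a minimal illustration, not the pipeline of [6]: the synthetic data, the z-score outlier rule, and the parameter grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy stand-in for the clinical data: 200 samples, 8 features, ~5% missing
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan

# crude outlier rejection: drop rows with any |z|-score above 3
z = np.abs((X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0))
keep = ~(z > 3).any(axis=1)
X, y = X[keep], y[keep]

# impute remaining gaps, scale, then tune k for K-nearest neighbors
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Wrapping the imputer and scaler in the pipeline keeps every cross-validation fold's statistics computed on its own training split, avoiding leakage during the grid search.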
Butt et al. [8] investigated diabetes classification and prediction using a range of classifiers and models: three classifiers (random forest, multilayer perceptron, and logistic regression) in conjunction with three predictive models (LSTM, MA, and LR). Their findings revealed that the multilayer perceptron yielded the most accurate classification, with an accuracy of 86.06%, while the LSTM model was the most effective predictor, with an accuracy of 87.26%.
Garcia-Ordas et al. [9] addressed data imbalance by using a variational autoencoder for data augmentation, followed by a sparse autoencoder for feature augmentation, expanding the PIMA dataset from the original 8 features to 400. Joint training of a convolutional neural network with the sparse autoencoder achieved 92.31% accuracy, outperforming traditional models.
Hasan et al. [10] applied various data preprocessing techniques to improve data quality, followed by ensemble classifiers such as AdaBoost and gradient boosting. Bukhari et al. [11] proposed an improved ANN model based on an artificial back-propagation scaled conjugate gradient neural network (ABP-SCGNN) algorithm, which achieved a high accuracy of 93% without any data preprocessing.
In [12], the authors handled missing data by filling in the mean of each column and then trained six models: naive Bayes (NB), linear regression (LR), random forest (RF), AdaBoost (AB), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost). The XGBoost model achieved the highest accuracy, 77.54%. In another study, Maniruzzaman et al. [13] addressed missing data and outliers through group-median and median interpolation techniques. They then performed feature extraction and optimization using six feature selection techniques: random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio. Combining these with ten classifiers (linear discriminant analysis, quadratic discriminant analysis, naive Bayes, Gaussian process classification, support vector machine, artificial neural network, AdaBoost, logistic regression, decision tree, and random forest) in experiments on the PIMA dataset, they reported a remarkably high accuracy of 92.26%.
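The impute-then-compare protocol of [12] can be sketched as below. The data are synthetic, and scikit-learn's gradient boosting stands in for XGBoost, which is a separate third-party package; the model list is therefore an approximation, not a reproduction.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing entries

# fill each column's missing entries with that column's mean
col_mean = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_mean, X)

models = {
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "AB": AdaBoostClassifier(random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Mean imputation is the simplest baseline; it preserves each column's average but shrinks its variance, which is one reason later work ([13]) moved to group-wise medians.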
Zou et al. [14] used a series of machine learning algorithms such as decision trees, random forests, and neural networks to predict diabetes, with PCA and mRMR for dimensionality reduction. The models were evaluated using five-fold cross-validation and independent testing experiments; random forest achieved the highest accuracy of 80.84% when all features were used.
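The evaluation design in [14] — reduced versus full feature sets under five-fold cross-validation — can be sketched as follows. mRMR is omitted here because it requires a third-party package; the synthetic data and the choice of five principal components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(250, 14))
y = (X[:, :3].sum(axis=1) > 0).astype(int)  # signal in the first 3 features

results = {}
for label, n_components in (("all", None), ("pca5", 5)):
    # with n_components=None we skip PCA and use all 14 features
    steps = ([PCA(n_components=n_components)] if n_components else []) + [
        RandomForestClassifier(n_estimators=100, random_state=0)
    ]
    results[label] = cross_val_score(make_pipeline(*steps), X, y, cv=5).mean()
    print(label, round(results[label], 3))
```

Putting PCA inside the pipeline refits the projection on each training fold, so the cross-validation estimate stays honest.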
Hayashi and Yukita [15] proposed the rule extraction algorithm Re-RX with J48graft, combined with a sampling selection technique, on the Pima Indian Diabetes (PID) dataset to obtain highly accurate, concise, and interpretable classification rules; the average accuracy over ten runs of 10-fold cross-validation was 83.83%.
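Re-RX and J48graft are Weka-side algorithms with no scikit-learn counterpart, but the end product — human-readable classification rules — can be illustrated by printing the rules of a shallow decision tree. This analogy is an assumption of the sketch, not the authors' method.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# small synthetic binary problem standing in for the PID data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# a depth-2 tree yields at most four rules, keeping them concise
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=[f"f{i}" for i in range(4)])
print(rules)
```

Each root-to-leaf path in the printed tree reads as one if-then rule, which is the interpretability property the rule-extraction line of work optimizes alongside accuracy.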
Alneamy et al. [16] proposed an algorithm based on the Teaching-Learning-Based Optimization (TLBO) algorithm and a new classification technique combining a Fuzzy Wavelet Neural Network (FWNN) with a Functional Link Neural Network (FLNN). TLBO was used to train the hybrid Functional Fuzzy Wavelet Neural Network (FFWNN) and optimize its learning parameters, achieving an accuracy of 88.67% on the PIDD dataset.
Chang et al. [17] trained and tested three interpretable supervised machine learning models, namely a naive Bayes classifier, a random forest classifier, and a J48 decision tree, on the Pima Indians diabetes dataset. By analyzing the performance and decision-making process of each algorithm, they concluded that naive Bayes is better suited to refined binary feature selection, while random forest performs better when more features are involved.
Maniruzzaman et al. [18] applied Gaussian process (GP) classification with three kernel functions: linear, polynomial, and radial basis. They compared GP classification against existing techniques such as LDA, QDA, and NB, evaluating performance with accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), and the receiver operating characteristic (ROC) curve. In experiments on the PIMA dataset, the GP model reached an accuracy of 81.97%. Joshi and Dhakal [19] focused on predicting type 2 diabetes in Pima Indian women using a logistic regression model and a decision tree algorithm. Their analysis identified glucose, pregnancy, body mass index (BMI), diabetes pedigree function, and age as the main predictors of type 2 diabetes. The classification tree confirmed the importance of glucose, BMI, and age, while also highlighting pregnancy and the diabetes pedigree function. The model achieved a prediction accuracy of 78.26% with a cross-validation error rate of 21.74%.
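The kernel-based comparison in [18] can be sketched with scikit-learn's GP classifier against LDA on synthetic data. The RBF kernel, the data-generating rule, and the train/test split below are assumptions of this illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # nonlinear (XOR-like) boundary
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# GP with a radial basis kernel vs. a linear discriminant baseline
gp = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=0).fit(Xtr, ytr)
lda = LinearDiscriminantAnalysis().fit(Xtr, ytr)

gp_acc = accuracy_score(yte, gp.predict(Xte))
lda_acc = accuracy_score(yte, lda.predict(Xte))
print("GP ", round(gp_acc, 3))
print("LDA", round(lda_acc, 3))
```

On a boundary like this, the kernel choice is doing the real work: a linear kernel would collapse to roughly the LDA result, while the RBF kernel lets the GP bend around the interaction.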
Ejiyi et al. [20] proposed robust frameworks for predictive diabetes diagnosis from limited medical data on women aged 21 to 81. The frameworks include data augmentation, attribute analysis, and missing-data imputation as preliminary steps. Using SHAP, glucose, age, and BMI were identified as the most important features for prediction. XGBoost and AdaBoost performed best among the ML algorithms tested, with an accuracy of 94.67% and F1 scores of 95.27 and 95.95, respectively.
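The feature-ranking step can be illustrated without the third-party `shap` package by using scikit-learn's permutation importance as a plainly named stand-in: like SHAP, it attributes predictive contribution to individual features, though by a different mechanism. The synthetic data, in which only the first two features carry signal, is an assumption of the sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)  # features 0 and 1 carry the signal

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# shuffle each feature in turn and measure the drop in accuracy
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print([int(i) for i in ranking])
```

Feature 0, with twice the weight of feature 1, should top the ranking, mirroring how glucose dominated the SHAP attributions reported in [20].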
Although the aforementioned scholars applied a range of data processing methods, some of these methods, despite their complexity, did not yield the desired results. Moreover, some studies relied on machine learning models that were too simplistic, resulting in suboptimal accuracy. In this article, we therefore analyze and improve upon these methods and establish a high-performance intelligent diagnosis framework for diabetes.