To determine whether or not a patient has diabetes, we used six different predictive models: Logistic Regression [27], K-Nearest Neighbors [28], Classification Tree [29], Random Forest [30], AdaBoost Classifier [31], and ANN [32].
4.1. Logistic Regression
A full model was constructed, with Outcome as the response variable and the remaining eight variables as predictors. The most important variables were identified through stepwise variable selection, using the AIC as the selection criterion. The final logistic regression model achieved the lowest AIC value of 593.85, as shown in the table below.
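As an illustration, stepwise selection by AIC can be sketched in Python with statsmodels. The file name diabetes.csv, the backward-elimination strategy, and the helper function below are assumptions for illustration, not the authors' exact procedure.

```python
# A minimal sketch of backward stepwise selection by AIC; the file name
# and the helper function are illustrative, not the authors' exact code.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("diabetes.csv")                 # hypothetical file name
y = df["Outcome"]
X = df.drop(columns=["Outcome"])

def backward_stepwise_aic(X, y):
    """Drop one predictor at a time while doing so lowers the AIC."""
    features = list(X.columns)
    best_aic = sm.Logit(y, sm.add_constant(X[features])).fit(disp=0).aic
    improved = True
    while improved and len(features) > 1:
        improved = False
        for f in list(features):
            trial = [c for c in features if c != f]
            aic = sm.Logit(y, sm.add_constant(X[trial])).fit(disp=0).aic
            if aic < best_aic:                   # removing f improves the fit
                best_aic, features = aic, trial
                improved = True
                break                            # restart the scan
    return features, best_aic

selected, aic = backward_stepwise_aic(X, y)
print(f"Selected predictors: {selected}, AIC = {aic:.2f}")
```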
4.2. K-Nearest Neighbors (k-NN)
The K-Nearest Neighbors (k-NN) algorithm is simple but provides excellent results. It is a lazy, nonparametric, instance-based method that applies equally to classification and regression problems. In classification, k-NN assigns a new, unclassified object to a class by examining its 'k' nearest neighbors, where 'k' is typically chosen to be odd to avoid tied votes. The distance between the new object and the training points is computed with a metric such as the Euclidean, Hamming, Manhattan, or Minkowski distance, and the class of the new object is decided by a majority vote of its 'k' closest neighbors. k-NN predicts the outcome with a high level of accuracy.
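A minimal k-NN sketch with scikit-learn is shown below; the choice of k = 5, the Euclidean metric, the 80/20 split, and the diabetes.csv file name are illustrative assumptions, not values reported in the paper.

```python
# A minimal k-NN sketch; k=5 and the Euclidean metric are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")                 # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Scaling matters for k-NN because it is distance-based.
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```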
4.3. Decision Tree
A decision tree is a prediction model commonly used in operations research, particularly in decision analysis, and in machine learning. It captures the relationships between the values of the features and the target by arranging conditions in a tree-like model of conditional control statements, which makes it possible to display different conditions and their possible consequences.
An example of a decision tree model is shown below. Each node represents a test on a single attribute, and each branch represents one of its possible values. A decision tree has three types of nodes: decision nodes, chance nodes, and end nodes. Each path from a decision node to an end node represents one possible scenario, with each variable taking a specific value along that path. Traversing the tree therefore amounts to performing a sequence of "tests" on the attributes, with each branch representing one test outcome.
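The tree-traversal idea can be sketched as follows with scikit-learn; max_depth=4 and the file name are illustrative assumptions, and export_text prints the attribute test at each node.

```python
# A minimal decision tree sketch; max_depth=4 is an illustrative cap
# to limit the tree's growth, not a value taken from the paper.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("diabetes.csv")                 # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
# Each internal node tests one attribute; each branch is one test outcome.
print(export_text(tree, feature_names=list(X.columns)))
```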
4.4. Random forests
The decision tree is a well-known technique in the field of machine learning. However, because a decision tree can expand indefinitely, it can achieve low bias at the cost of very high variance, resulting in overfitting of datasets. Random forests were developed as a classification method that combines multiple decision trees in order to correct for the tendency of individual trees to overfit their training data.
Random forests are constructed by training each tree on a random sample of the data and a randomly sampled subset of the candidate variables, producing a large number of decision trees with relatively uncorrelated models. Because of this low correlation, the group of decision trees outperforms any of the individual constituent models: the combination of uncorrelated models produces more accurate predictions than any single prediction because the models protect each other from their individual errors. Even if some trees are incorrect, a large number of other trees will be correct, so the ensemble as a whole trends toward the right prediction. Because of the design of the random forests method, it can handle extremely high-dimensional (many-feature) data even when no dimensionality reduction or feature selection is performed, and it can do so at a relatively fast rate. Furthermore, it can estimate the relative importance of different features and the mutual influence between them. Since the trees within a forest are independent, the method is straightforward to parallelize. More importantly, even if a significant portion of the feature values is missing, accuracy can still be maintained in most cases.
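A minimal random forest sketch under the same illustrative assumptions (hyper-parameter values and file name are not from the paper) is given below; feature_importances_ corresponds to the relative importance of features mentioned above.

```python
# A minimal random forest sketch; n_estimators=100 and max_features="sqrt"
# are illustrative defaults, not values reported in the paper.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")                 # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Each tree sees a bootstrap sample and a random subset of features,
# which keeps the individual trees relatively uncorrelated.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
# Relative importance of each predictor, as discussed above.
for name, imp in zip(X.columns, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```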
4.5. Adaptive Boosting classifier
The AdaBoost (Adaptive Boosting) classifier is one of the most straightforward boosting algorithms available. Initially, AdaBoost assigns equal weights to every training observation in order to minimize bias. It then trains a sequence of weak models, giving higher weights to the observations that were misclassified in previous iterations. Because it makes use of multiple weak models, it can combine the decision boundaries reached across those iterations. As the misclassified observations receive more attention, the accuracy of the ensemble improves with each iteration. Diabetes is a disease that develops as a result of a sustained high concentration of sugar in the blood. In this paper, various classifiers are discussed, and a decision support system is proposed that uses the AdaBoost algorithm with a decision stump as the base classifier. It is worth noting the accuracy obtained by AdaBoost with a decision stump as its base classifier.
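The decision-stump configuration can be sketched as follows; n_estimators=50 and the file name are illustrative assumptions, and the estimator keyword was named base_estimator in scikit-learn versions before 1.2.

```python
# A minimal AdaBoost sketch with a decision stump (a depth-1 tree) as the
# base classifier, as described above; n_estimators=50 is illustrative.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("diabetes.csv")                 # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

stump = DecisionTreeClassifier(max_depth=1)      # the decision stump
# "estimator" was called "base_estimator" in scikit-learn < 1.2.
ada = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```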
4.6. Artificial neural network (ANN)
The present study also employed an artificial neural network (ANN), which mimics certain functions of the human brain. In simple terms, an ANN can be understood as a collection of nodes known as artificial neurons, each of which can pass information to other nodes in the network. A neuron's activation can be visualized as a value such as 0 or 1, and each connection carries a weight that represents its relative strength or importance in the overall system. The structure of an ANN is organized into multiple layers, starting with the input layer and continuing through the hidden layers to the output layer, where each layer processes the data and passes on a useful output.
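A minimal ANN sketch using scikit-learn's MLPClassifier is shown below; the single hidden layer of 16 units and the other settings are illustrative assumptions, not the architecture used in the paper.

```python
# A minimal ANN sketch: input layer -> one hidden layer -> output layer,
# as described above. The architecture is illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")                 # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                                  random_state=42))
ann.fit(X_train, y_train)
print("Test accuracy:", ann.score(X_test, y_test))
```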
4.7. Stepwise proposed method
Step 1: Dataset/Inputs: First, the dataset was checked for missing or null values as a starting point for data cleaning; no null or empty values were found. The next step was to determine whether there were any outliers in the data. To this end, a joint grid plot was created for each feature in the dataset. The grid plot revealed that the features Blood-Pressure, BMI, Glucose, and Skin-Thickness contained values of zero, which is physiologically implausible. These outlier values were removed directly from the dataset, and the resulting modified dataset was provided as input and divided into training and testing datasets.
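A minimal sketch of this cleaning step is given below; the file name and exact column names are assumptions based on the standard Pima Indians Diabetes dataset.

```python
# A minimal cleaning sketch for Step 1; zero-removal mirrors the outlier
# handling described above, and the column names are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")                 # hypothetical file name
print(df.isnull().sum())                         # confirm no null values

# Zero is not a physiologically valid reading for these features.
invalid_zero = ["Glucose", "BloodPressure", "SkinThickness", "BMI"]
df = df[(df[invalid_zero] != 0).all(axis=1)]     # drop rows with zeros

X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
```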
Step 2: Tuning the hyper-parameters: Tuning the hyper-parameters is critical in building the model because it has the potential to make or break the model [33]. To obtain the optimal set of parameter values, GridSearchCV was employed: grid search trains a model with every possible combination of hyper-parameters and extracts the best-performing one [34]. Table 1 lists the parameters tuned by the grid search and their selected values.
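A sketch of the grid search is shown below; the candidate grid values are illustrative (the actual tuned values are those listed in Table 1), and X_train and y_train are assumed from the Step 1 sketch.

```python
# A minimal GridSearchCV sketch for Step 2; the candidate values in the
# grid are illustrative, not the paper's.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "n_estimators": [50, 100, 200],              # illustrative candidates
    "learning_rate": [0.01, 0.1, 1.0],
}
search = GridSearchCV(
    AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                       random_state=42),
    param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)                     # split from Step 1
print("Best parameters:", search.best_params_)
```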
Step 3: Fitting the classifiers: The classifiers, including AdaBoost, were then fitted using the optimal values obtained through the hyper-parameter tuning, and the diabetes prediction analysis was conducted.
Step 4: AdaBoost: At each iteration, the weights of the misclassified observations are increased, which improves the overall accuracy across iterations [35]. The AdaBoost classifier trains a series of models on the re-weighted samples and, based on each model's error, assigns it an Alpha confidence coefficient. A low error results in a large Alpha, meaning that model's vote carries greater weight.
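For reference, the Alpha coefficient in standard AdaBoost (a textbook formula, not one quoted in the paper) is computed at iteration $t$ from the weighted error $\varepsilon_t$ of that iteration's weak model:

$$\alpha_t = \frac{1}{2}\,\ln\!\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)$$

As $\varepsilon_t$ approaches zero, $\alpha_t$ grows, so a model with low error carries more weight in the final vote, consistent with the description above.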
Step 5: Results prediction: The test outcomes are estimated as a number between 0 and 1, with 0 denoting non-diabetic and 1 denoting diabetic.
Step 6: Evaluation of prediction: Performance metrics such as accuracy, the confusion matrix, and the classification report are used to assess the model's overall performance. The confusion matrix reports the counts of true-positive, true-negative, false-positive, and false-negative predictions [30]. Accuracy is the percentage of cases classified correctly out of the total number of cases. The classification report displays the model's precision, recall, F1, and support scores.
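A minimal evaluation sketch for this step, assuming the fitted classifier ada and the test split from the earlier sketches:

```python
# A minimal evaluation sketch for Step 6.
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_pred = ada.predict(X_test)                     # ada from Step 4 sketch
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
# Precision, recall, F1, and support per class.
print(classification_report(y_test, y_pred))
```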
Step 7: Apply K-Fold Validation: This works by dividing the dataset into k parts, each called a "fold." The classifier is trained on k-1 folds and evaluated on the remaining held-out fold, and the process is repeated so that each fold serves as the test set exactly once. The value of k used in this case is 5. The result is a more reliable estimate of the algorithm's performance on new data, because the algorithm is trained and tested on different portions of the data several times.
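A minimal 5-fold cross-validation sketch, assuming the tuned classifier ada and the cleaned X and y from the Step 1 sketch:

```python
# A minimal 5-fold cross-validation sketch for Step 7.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(ada, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```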