The present study used data from a prospective cohort study to identify the most important predictors of CKD using ML models. The following models were analyzed: LR, GLM, DL, DT, RF, ANN, NB, GBT, and SVM. In recent years, various studies have used different ML models to predict the risk of CKD 22,23. Yadav et al. (2021) explored a dataset containing 26 CKD-related parameters and combined the ANN classifier with four feature-based algorithms (Extra Tree, Pearson correlation, Lasso model, and chi-square) to identify CKD predictors 24. Emon et al. (2021) also used eight machine learning classifiers in Weka software and compared their performance in predicting CKD: Logistic Regression (LR), Naive Bayes (NB), Multilayer Perceptron (MLP), Stochastic Gradient Descent (SGD), Adaptive Boosting (AdaBoost), Bagging, Decision Tree (DT), and Random Forest (RF) 25.
In this study, the ensemble Bayesian Boosting (BB) algorithm was used to enhance the performance of the proposed models. Comparing the primary and BB-enhanced models showed that BB enhancement improved the models in terms of sensitivity, specificity, accuracy, and error rate. Srivastava et al. (2022) proposed an algorithm to predict CKD from diagnostic medical data available in the UCI repository, combining an array of physiological parameters with ML techniques. In that study, the researchers employed the Ranking Weighted Ensemble algorithm to boost the performance of the proposed models and reported that this algorithm could be used to develop an electronic diagnostic system for determining the severity of CKD, with an accuracy, sensitivity, specificity, and F1 score of 98.75%, 100%, 96.55%, and 99.03%, respectively 26. Moreover, Wang et al. (2020) first estimated serum creatinine levels using a regression model with eight predictors. They then combined the predicted creatinine level with 23 main characteristics to predict the risk of CKD in patients. They further boosted their findings using an ensemble of three models (RF, XGBoost (a boosting tree), and ResNet (a neural network-based model)), among which XGBoost performed best, although its AUC was only 0.76 27.
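For readers less familiar with the performance measures compared above, the following minimal sketch shows how sensitivity, specificity, accuracy, error rate, and F1 score are derived from the four cells of a binary confusion matrix. The class name `ConfusionCounts`, the function name `classification_metrics`, and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the cohort data or from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix (CKD is the positive class)."""
    tp: int  # CKD patients classified as CKD
    fn: int  # CKD patients classified as non-CKD
    tn: int  # non-CKD individuals classified as non-CKD
    fp: int  # non-CKD individuals classified as CKD


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the standard metrics; assumes both classes and the set of
    positive predictions are non-empty (no zero denominators)."""
    total = c.tp + c.fn + c.tn + c.fp
    sensitivity = c.tp / (c.tp + c.fn)   # true positive rate (recall)
    specificity = c.tn / (c.tn + c.fp)   # true negative rate
    accuracy = (c.tp + c.tn) / total
    precision = c.tp / (c.tp + c.fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(ConfusionCounts(tp=40, fn=10, tn=45, fp=5))
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because the error rate is simply one minus the accuracy, an enhancement that raises accuracy necessarily lowers the error rate by the same amount, whereas sensitivity and specificity can move independently of each other.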
According to our findings, the final model developed in this study could reliably distinguish CKD patients from healthy individuals, with 100% sensitivity and 96.6% specificity. In another study, Qin et al. (2019) employed an ML approach to diagnose CKD and, after removing missing data, used six ML algorithms (logistic regression, random forest, support vector machine, k-nearest neighbor, naive Bayes classifier, and feed-forward neural network). In that study, the random forest model performed best, with a diagnostic accuracy of 99.75% 28. Dritsas et al. (2022) evaluated several models, including SVM, LR, SGD, ANN, and k-NN, to predict the risk of CKD; the Rotation Forest (RotF) model, with an AUC of 100% and an accuracy and F-measure of 99.2%, was designated the best model 22. Priyanka et al. (2019) also used the Naïve Bayes, KNN, SVM, decision tree, and ANN algorithms to predict the risk of CKD, among which the Naïve Bayes model performed best, with an accuracy of 94.6% 29.
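Several of the studies cited here report the area under the ROC curve (AUC). As a brief illustration of what this quantity measures, the sketch below computes the AUC as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the labels and scores in the usage lines are hypothetical and are not drawn from any of the cited datasets.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores.

    labels: 1 for the positive class (e.g. CKD), 0 for the negative class.
    scores: model scores, where a higher score means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Under this definition, an AUC of 1.0 (100%) means every positive case is scored above every negative case, while 0.5 corresponds to ranking no better than chance.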
In the present study, the best-performing final model was the GLM, in which serum creatinine level, place of residence, waist circumference, and age carried the greatest weight in the diagnosis of CKD. Of these variables, serum creatinine level had the greatest weight. By comparison, Chiu et al. (2021) identified BUN and UA as the first and second most important predictors in the risk stratification of CKD 30. Shih et al. (2020) observed that the C4.5 model performed better than other models in predicting CKD and suggested the urine protein-to-creatinine ratio (UPCR), proteinuria, age, RBC, GLU, triglyceride level, total cholesterol, and gender as the most important predictors of CKD, while variables such as HDL, LDL, and ALB seemed to be less important according to this model 31.
The model developed in the present study suggested that SC, AIP, and gender were the strongest predictors of CKD in our participants. In another study by Chiu et al. (2021), SBP, SGPT, SGOT, and LDL-C were identified as the most important risk factors associated with the incidence of CKD (29). Also, Jarad et al. 32 reported that reduced albumin levels were strongly associated with impaired renal function, which is in line with the report of Lang et al. that urinary levels of albumin and creatinine were strongly associated with impaired renal function 33.
According to the final model proposed in the present study, the most important predictors contradicting the non-CKD classification were serum sodium level, SGOT, and DBP. Samsuria et al. (2019) investigated the relationship between renal dysfunction and the serum levels of sodium and potassium and observed a significant relationship between potassium and urea levels 34.
Noteworthy strengths of the present study include the small amount of missing data (0.8%) in some attributes, the absence of outliers in the data (indicating the high quality of the dataset), and the use of the state-of-the-art Bayesian Boosting technique to improve the performance of the learning algorithms. A limitation of this study was the lack of information in the cohort data on some other variables, such as urine specific gravity, albumin, bacteria, urine protein, and lower extremity edema.
Strengths and limitations of study
The results of this big data study are drawn from the PERSIAN cohort study, in which censoring is minimal and variables are recorded with high accuracy. Of the total cohort, 10,065 individuals were eligible, and 9984 (98.92%) remained. Procedures for data access, information on collaborations, publications, and other details can be found at http://persiancohort.com. Like all cohort studies, this study is limited by selection bias. Individuals who are willing to participate in long-term research may be more concerned about their health than others and may adopt lifestyles that they believe address these concerns.