Prediction of Heart Failure in Type 2 Diabetes Mellitus Subjects Using Machine Learning: A Cross-Sectional Study

doi:10.21203/rs.3.rs-3971385/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

The frequent coexistence of Type 2 Diabetes Mellitus (T2DM) and heart failure has prompted investigation into ways of predicting the onset of heart failure in T2DM. Machine learning (ML) techniques have been shown to help with the prediction of heart disease, and several algorithms have performed well. This study aimed to predict heart failure in T2DM subjects using machine learning techniques. A total of 123 blood samples from 59 healthy subjects without T2DM (controls) and 63 T2DM subjects (tests) were analyzed for biochemical parameters [troponin I (TnI), electrolytes, lactate dehydrogenase (LDH), aspartate aminotransferase (AST), alanine transaminase (ALT), AST/ALT ratio, creatine kinase-MB (CK-MB), fasting blood sugar (FBS), cholesterol, triglyceride, and B-type natriuretic peptide (BNP)] using standard procedures. Demographic data and biochemical results were subjected to machine learning algorithms. The Random Forest algorithm was the best model for heart failure prediction, with 87% accuracy. SHAP values (impact on model output) across all possible feature combinations identified glucose (FBG), BNP, systolic and diastolic blood pressure, and waist circumference as important features in the prediction of heart failure in T2DM. The permutation importance scores showed that systolic BP, BNP, MUAC, and troponin I, in that order, had the highest positive importance for the prediction of heart failure in T2DM. Height, weight, and waist circumference had small negative importance values, meaning they slightly decreased model performance. The study concluded that CK-MB, BNP, and troponin I alone may not be early indicators of heart failure in T2DM subjects. However, subjecting them to ML and combining them with the key features identified would improve prediction.

Machine learning

Heart Failure

Type 2 Diabetes Mellitus

Prediction

BNP

Random Forest

The prevalence of diabetes is rising around the world. Type 2 diabetes is the most common type of diabetes among older adults, and its increasing occurrence in younger people has made it a major threat to global health in the 21st century (WHO, 2021). According to the 2021 International Diabetes Federation (IDF) study, 537 million adults aged 20 to 79 (10.5%) were living with T2DM worldwide, and this number is expected to rise to 783 million (12%) by 2045 (Sun et al., 2022). In Africa, 24 million individuals were living with T2DM in 2021 (Adamu et al., 2023). According to projections, this figure will rise by 129% to 55 million by 2045 (Sun et al., 2022). The prevalence of T2DM in Nigeria has been reported as 3.9%, that is, 3.9 million adults aged 20–79 years have type 2 diabetes (IDF, 2019).

Congestive Heart Failure (CHF), as defined by the American College of Cardiology (ACC) and the American Heart Association (AHA), is “a complex clinical syndrome that results from any structural or functional impairment of ventricular filling or ejection of blood.” CHF is a common disorder worldwide with high morbidity and mortality. With an estimated prevalence of 26 million people worldwide, CHF increases healthcare costs, reduces functional capacity, and significantly affects quality of life. Diagnosing and effectively treating the disease is imperative to prevent recurrent hospitalizations, decrease morbidity and mortality, and improve patient outcomes (Heidenreich et al., 2022). Diabetes has been identified as an important risk factor for Heart Failure (HF), alongside age and obesity (Groenewegen et al., 2020). HF often manifests as the first cardiovascular (CV) event in people with type 2 diabetes (T2D) (Birkeland et al., 2020). Even individuals with pre-diabetes, as defined by the criteria of the World Health Organization (WHO) and the American Diabetes Association (ADA), are at a 9–58% greater risk of developing HF (Cai et al., 2021).

Artificial intelligence (AI), the imitation of human cognition by technology, can guide clinical care and decision-making without human involvement in the process. One sub-field of AI is ML, which allows computers to evaluate data beyond programmatic algorithms, identify patterns within data, map learned patterns to unseen data, and improve the performance of computational tasks beyond human capabilities (Rajkomar et al., 2019). In HF, ML has demonstrated its superiority in mortality prediction compared to the benchmark models (Shin et al., 2021; König et al., 2021; Negassa et al., 2021; Jing et al., 2020). Preventive CV medicine increasingly implements modeling techniques to estimate the individual's absolute risk of a CV event. Modern ML techniques extract useful patterns from large datasets to answer clinical questions and have demonstrated significant promise for risk stratification across various populations (Ross et al., 2021; Cho et al., 2021). ML can help in identifying predictors and relationships between them that may not be identified by traditional models (Ward et al., 2020), thus new risk factors may emerge (Nabrdalik et al., 2023).

Several recent studies have addressed disease prediction, and some physicians now use machine-learning models to forecast various diseases (Modern, 2019; Kaur & Kumari, 2020; Pradhan et al., 2020). It is therefore essential to create a diabetes classifier that is practical, reliable, and economical and, most importantly, one that can predict the development of heart failure in diabetic patients. This research will contribute to the body of knowledge, specifically in the Nigerian health sector and the African health sector at large, by predicting heart failure in diabetic patients and thereby helping to reduce this comorbidity, mortality, and the rate of complications of diabetes. Data generated from this study will also be useful for further studies on predictive model building.

Study Area

This study was carried out at Oba Adejuyigbe General Hospital and Primary Health Care, Okeyinmi, Ado-Ekiti.

Study Design

The study employed a case-control research design.

Ethical Approval

Ethical approval was obtained from the Ethics Committee, College of Medicine and Health Sciences, Afe Babalola University, Ado Ekiti, Nigeria. Informed consent was obtained from all participants before the commencement of the study.

Sample Size Determination

The sample size (n) was calculated using Fisher’s formula (Safranek, 2018):

$$n = \frac{Z^{2}\,p\,q}{d^{2}}$$

Where: n = the desired sample size (when the population is greater than 10,000)

Z = the standard normal deviate, a constant given as 1.96 (or 2.0 for a larger sample), which corresponds to the 95% confidence level

p = prevalence (3.9%) (IDF, 2021)

q = 1.0 – p

d = acceptable error (5%)

However, 123 participants were recruited for this study, comprising 59 apparently healthy subjects without T2DM (controls) and 63 T2DM subjects (tests).
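As a minimal sketch, Fisher’s formula can be evaluated with the inputs stated above (Z = 1.96, p = 0.039, d = 0.05). The function name `fisher_sample_size` is hypothetical, and rounding up to the next whole participant is an assumption, not a step described in the study.

```python
import math


def fisher_sample_size(z: float, p: float, d: float) -> int:
    """Minimum sample size n = Z^2 * p * q / d^2, rounded up (assumed rounding)."""
    q = 1.0 - p  # proportion without the condition
    return math.ceil((z ** 2) * p * q / d ** 2)


# Inputs taken from the definitions above
n_min = fisher_sample_size(z=1.96, p=0.039, d=0.05)
n_min_z2 = fisher_sample_size(z=2.0, p=0.039, d=0.05)

print(n_min)     # 58
print(n_min_z2)  # 60
```

Under these assumptions the formula gives a minimum of about 58 participants (60 with Z = 2.0), so the 123 participants recruited exceed the calculated minimum.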

Inclusion and Exclusion Criteria

Participants were eligible if they were aged 20–95 years, gave their consent, and, in the case of women, were neither pregnant nor lactating. Pregnant women, nursing mothers, and those who did not give their consent were excluded.

Sample Collection and Statistical Analysis

Blood (5 mL) was collected from each subject by venipuncture into lithium heparin and fluoride oxalate tubes. Fasting blood glucose was analyzed immediately. Electrolytes were estimated using an ion-selective electrode (ISE). BNP and troponin I were determined by enzyme-linked immunosorbent assay (ELISA; ELK Biotechnology) according to the manufacturer’s instructions. Lactate dehydrogenase (LDH), aspartate aminotransferase (AST), alanine transaminase (ALT), creatine phosphokinase (CPK), fasting blood sugar (FBS), total cholesterol, and triglyceride were quantified using the enzymatic rate method following the manufacturer’s directions.

Machine Learning Analysis

The dataset was modeled on previously published datasets that reported conventional risk factors such as age, glycemic parameters, lipid profile, and demographic data. Additional biochemical parameters, including biomarkers of heart failure, were added. In the present study, the dataset consisted of 123 samples with 34 attributes. Data preprocessing was done by label encoding. Lemmatization was used to transform the data into categorical variables, and factorize was used to group the dataset into T2DM patients and non-T2DM patients. The T2DM group was labeled 1 and the non-T2DM group was labeled 0. The data were then thoroughly checked for missing and incorrect values, which affect the quality of the model. To reduce the influence of missing and incorrect values on model performance, the mean of the data was applied.
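The encoding and mean-imputation steps can be sketched as follows. This is an illustrative example only: the column names and values are hypothetical toy data, not the study dataset, and the exact preprocessing code used by the authors is not given in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; names and values are illustrative only.
df = pd.DataFrame({
    "group": ["T2DM", "control", "T2DM", "control"],
    "systolic_bp": [150.0, 118.0, np.nan, 122.0],
    "bnp": [310.0, np.nan, 280.0, 95.0],
})

# Label encoding with an explicit mapping: T2DM -> 1, non-T2DM -> 0.
df["label"] = df["group"].map({"control": 0, "T2DM": 1})

# pandas.factorize assigns integer codes in order of first appearance,
# so here "T2DM" would receive code 0 because it appears first.
codes, uniques = pd.factorize(df["group"])

# Mean imputation: replace each missing value with its column mean.
numeric_cols = ["systolic_bp", "bnp"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print(df)
```

Because `factorize` codes depend on row order, an explicit mapping is the safer way to guarantee that T2DM is labeled 1 and non-T2DM is labeled 0.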

Following data preprocessing, several machine learning classification models, namely K-nearest neighbors (K-NN), Logistic Regression (LR), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), Naive Bayes (NB), and Random Forest, were applied to the dataset and implemented in Python. To carry out the ML analysis, the data were split into training (70%), validation (20%) and test sets (30%). The SHAP framework (shap version 0.37.0) was used to interpret additive feature importance in the ensemble Random Forest model. SHAP is a method that explains predictions made by complex ML and DL models (Christoph, 2019). The SHAP values explain the relative contribution of each feature to the ML model prediction (Jangili et al., 2023). To assess the predictive power of these models, we considered the confusion matrix, from which Precision, Accuracy, F1 score and Recall are derived. The “confusion_matrix” function from the Python library “sklearn.metrics” was used for model evaluation (Pedregosa et al., 2011). This function takes actual and predicted values as inputs and returns a confusion matrix.
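To illustrate what a SHAP value measures, the sketch below computes exact Shapley values by brute force for a hypothetical three-feature risk score. It is not the algorithm implemented in the shap library for tree ensembles, and the toy model, feature values, and baseline are assumptions chosen only for illustration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    A feature that has not yet been 'added' keeps its baseline value.
    Each feature's marginal contribution is averaged over all orderings.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]          # add feature i to the coalition
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [value / len(orderings) for value in phi]


def toy_model(features):
    """Hypothetical risk score with a glucose x blood-pressure interaction."""
    glucose, bnp, sbp = features
    return 0.02 * glucose + 0.01 * bnp + 0.0001 * glucose * sbp


x = [180.0, 300.0, 150.0]         # hypothetical subject
baseline = [90.0, 100.0, 120.0]   # hypothetical reference values
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for the subject and the prediction for the baseline, which is the additive property that SHAP plots rely on.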

A total of 123 subjects were recruited in this study: 63 diabetes subjects and 59 non-diabetes subjects. Table 4.1 shows the demographic characteristics of the subjects studied. Table 4.2 shows the accuracy of classification models. Table 4.3 shows the 5-fold cross-validation scores for different machine learning models for predicting heart failure in diabetes patients. Table 4.4 shows the predictive performance of the Random Forest model. Table 4.5 shows the permutation importance scores for features from a Random Forest classifier model predicting heart failure in type 2 diabetes patients.

Figure 4.1 shows ROC (Receiver Operating Characteristic) curves for multiple machine learning algorithms. Figure 4.2 shows a confusion matrix for the Random Forest classification model for the prediction of heart failure in type 2 diabetes patients. Figure 4.3 shows the ROC and AUC of Random Forest. Figure 4.4 shows the SHAP bar plot of features that increase and decrease model output. Figure 4.5 shows the bee swarm plot of each feature to visualize the SHAP values. Figure 4.6 shows the SHAP bar plot of the overall impact of features on model output.

Table 4.1: Demographic Characteristics of the Subjects

Variable	Category	Percent (%)
Gender	Male	22.00
	Female	78.00
Age (years)	<50	20.00
	51 – 60	38.00
	61 & above	42.00
Marital status	Married	14.00
	Widowed	86.00
Educational status	Tertiary	70.00
	Secondary	24.00
	No formal	6.00
Occupation	Civil servants	62.00
	Entrepreneurs	28.00
	Retired	10.00
Religion	Christian	86.00
	Islam	14.00

Table 4.2 Accuracy of classification models

Table 4.3: 5-fold Cross-validation scores for models

MODELS	MEAN CROSS-VALIDATION SCORE	STANDARD DEVIATION
Logistic Regression	0.825000	0.144482
Linear Discriminant Analysis	0.777500	0.242629
k-nearest neighbor	0.675000	0.256174
Decision Tree	0.732500	0.218103
Naive Bayes	0.842500	0.164526
Support Vector Machine	0.502500	0.222191
Random Forest Classifier	0.837500	0.190312
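A table of this kind can be produced with scikit-learn’s `cross_val_score`. The sketch below uses synthetic data from `make_classification` and three of the model families named in the methodology. The data, hyperparameters, and the use of stratified shuffled folds are assumptions, so the scores it prints will not match Table 4.3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the study data (123 samples, 12 features).
X, y = make_classification(
    n_samples=123, n_features=12, n_informative=5, random_state=0
)

# 5-fold cross-validation; stratification keeps class balance in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}\t{scores.mean():.6f}\t{scores.std():.6f}")
```

With only 123 samples each fold holds about 25 subjects, which is one reason the standard deviations across folds can be large.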

Table 4.4: Predictive performance of the Random Forest model

	PRECISION	RECALL	F1-SCORE	SUPPORT
CLASS 0	0.83	0.95	0.88	20
CLASS 1	0.93	0.78	0.85	18
OVERALL ACCURACY			0.87	38
MACRO AVG	0.88	0.86	0.87	38
WEIGHTED AVG	0.88	0.87	0.87	38
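The figures in Table 4.4 can be checked against the definitions of precision, recall, and F1. The confusion-matrix counts below are inferred from the reported supports and rounded metrics. They are an assumption, since the text does not list the counts, but they reproduce every value in the table.

```python
# Rows = actual class, columns = predicted class.
# Counts inferred from Table 4.4 (assumed, not reported in the text).
cm = [
    [19, 1],   # actual class 0: 19 correct, 1 predicted as class 1
    [4, 14],   # actual class 1: 4 predicted as class 0, 14 correct
]


def class_metrics(matrix, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = matrix[k][k]
    fp = sum(matrix[r][k] for r in range(len(matrix)) if r != k)
    fn = sum(matrix[k][c] for c in range(len(matrix)) if c != k)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

report = {k: tuple(round(v, 2) for v in class_metrics(cm, k)) for k in (0, 1)}
print(report)
print(round(accuracy, 2))
```

Under these assumed counts, 33 of the 38 test subjects are classified correctly, which corresponds to the reported accuracy of 87%.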

Table 4.5: The permutation importance scores for features from a Random Forest classifier model predicting heart failure in type 2 diabetes patients.

The present study used a multi-marker approach to predict heart failure in T2DM subjects via machine learning techniques. The anthropometric data (age, systolic blood pressure, diastolic blood pressure, weight, height, MUAC, waist circumference) and biochemical data (fasting blood glucose, CK-MB, troponin, and BNP) were subjected to 18 model algorithms, and the random forest classifier was identified as the best model for predicting heart failure (Accuracy = 87%). Several studies have also reported that random forest is a useful tool for identifying important predictors from a mixture of variables.

Yuan et al. (2019) reported random forest as the best algorithm for predicting heart failure using four cardiac biomarkers, namely BNP, sST2, Gal-3, and CK-MB, and also used it to rank the importance of those biomarkers. Kavitha et al. (2021) found that random forest (Accuracy = 81%) and a hybrid model (random forest and decision tree) are good algorithms for the prediction of heart disease. Jangili et al. (2023) reported that random forest (Accuracy = 76%) is a good predictor of T2DM, coronary artery diseases, and the risk factors for the occurrence of each condition. It is worth noting that Jangili et al. (2023) used age, glycemic parameters, lipid profile, 27 traditional cytokines, 10 metabolic hormones, 3 adipokines, and 6 apolipoproteins. Ansari et al. (2023) also found random forest to be a good model for the prediction of heart disease (Accuracy = 99%). These and other studies have identified the random forest classifier as a good algorithm for predicting heart failure and heart disease, which corroborates the result of this study.

Random forests were employed to classify patients and identify predictors. The method builds a set of decision trees, against which the accuracy of the predictors is tested. At each node in a tree, a random subset of the predictor variables is selected and considered as split candidates (Yuan et al., 2019). The dataset was repeatedly divided into subtrees, and predictor variables were assessed by importance based on the change in classification error caused by their presence or absence in the subset. The random forest also combines the predictions of multiple decision trees, which improves the power of the algorithm (Sylvester et al., 2017).

SHAP values (impact on model output) across all possible feature combinations identified glucose (FBG), BNP, systolic and diastolic blood pressure, and waist circumference as important features in the prediction of heart failure in T2DM. The permutation importance score of the 12 features studied shows the effect that each feature has on the performance of the model when its values are reshuffled. Systolic BP, BNP, MUAC, and troponin I, in that order, have the highest positive importance for the prediction of heart failure in T2DM. Height, weight, and waist circumference have small negative importance values, meaning they slightly decrease model performance. The most useful predictor of heart failure based on the RF model in this dataset is therefore systolic blood pressure.
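Permutation importance can be sketched with scikit-learn’s `permutation_importance`, which shuffles one feature at a time on held-out data and records the resulting drop in score. The example below uses synthetic data and assumed hyperparameters, so the ranking it prints is illustrative and unrelated to the study’s features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the study data (123 samples, 12 features).
X, y = make_classification(
    n_samples=123, n_features=12, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 20 times on the test set and measure the accuracy drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=0, scoring="accuracy"
)

# Rank features from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}\t{mean:+.4f}\t{std:.4f}")
```

A negative mean importance indicates that shuffling the feature did not hurt, and on average slightly improved, the test score. With a small test set this often reflects noise rather than a genuinely harmful feature.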

This study has a few limitations. First, it is a cross-sectional study with a relatively small sample size that was performed at multiple centers, and the generalizability of the results to other sites has yet to be assessed. Second, the study focused on three cardiac markers (CK-MB, BNP, troponin I) out of several available cardiac markers (Gal-3, sST2, etc.) and did not include other biomarkers involved in the pathophysiology of HF, such as markers of oxidative stress and of renal salt and water retention.

The present study is the first to provide a framework for the exploration of the random forest algorithm in the prediction of heart failure. This study provided supporting evidence that with Systolic BP, BNP, MUAC and troponin I, an accurate predictive model for the biochemical prediction of heart failure in T2DM subjects can be built. The random forest algorithm was once again proven to have the ability to provide relevant information about the importance of different biomarkers and anthropometric variables. However, this needs further investigation.

FBS – Fasting blood sugar

CK-MB – Creatine kinase-MB

BNP – B-type natriuretic peptide

sST2 – Soluble suppression of tumorigenesis-2

Gal-3 – Galectin 3

HF – Heart Failure

RF – Random Forest

MUAC – Mid Upper arm Circumference

T2DM – Type 2 Diabetes Mellitus

ROC – Receiver Operating Characteristic

AUC – Area Under Curve

SHAP – SHapley Additive exPlanations

BMI – Body Mass Index

ML – Machine learning

KNN – K-nearest neighbors

LR – Logistic Regression

SVM – Support Vector Machine

LDA – Linear Discriminant Analysis

CART – Classification and Regression Trees

NB – Naive Bayes

Ethics approval and consent to participate

Ethical approval was obtained from the Ethics Committee, College of Medicine and Health Sciences, Afe Babalola University, Ado Ekiti, Nigeria. Informed consent was obtained from all participants before the commencement of the study.

Consent for publication

Not applicable.

Availability of data and materials

Data are available from the corresponding author on reasonable request.

Competing interests

None declared

Funding

None

Author’s contributions

All auth

Acknowledgments

Adamu, U. G., Mpanya, D., Patel, A., & Tsabedze, N. (2023). Beyond HbA1c cardiovascular protection in type 2 diabetes mellitus. Journal of Endocrinology, Metabolism, and Diabetes of South Africa, 28(1), 7-13.
Ansari, G. A., Bhat, S. S., Ansari, M. D., Ahmad, S., Nazeer, J., & Eljialy, A. E. M. (2023). Performance Evaluation of Machine Learning Techniques (MLT) for Heart Disease Prediction. Computational and Mathematical Methods in Medicine, 2023.
Birkeland, K. I., Bodegard, J., Eriksson, J. W., Norhammar, A., Haller, H., Linssen, G. C., ... & Kadowaki, T. (2020). Heart failure and chronic kidney disease manifestation and mortality risk associations in type 2 diabetes: a large multinational cohort study. Diabetes, obesity and metabolism, 22(9), 1607-1618.
Cai, X., Liu, X., Sun, L., He, Y., Zheng, S., Zhang, Y., & Huang, Y. (2021). Prediabetes and the risk of heart failure: a meta‐analysis. Diabetes, Obesity and Metabolism, 23(8), 1746-1753.
Cho, S. Y., Kim, S. H., Kang, S. H., Lee, K. J., Choi, D., Kang, S., ... & Chae, I. H. (2021). Pre-existing and machine learning-based models for cardiovascular risk prediction. Scientific reports, 11(1), 8886.
Christoph, M. (2019). Interpretable machine learning: A guide for making black box models explainable. Lulu.com.
International Diabetes Federation. (2019). IDF Diabetes Atlas (9th ed.). Brussels, Belgium: International Diabetes Federation.
Jangili, S., Vavilala, H., Boddeda, G. S. B., Upadhyayula, S. M., Adela, R., & Mutheneni, S. R. (2023). Machine learning-driven early biomarker prediction for type 2 diabetes mellitus associated coronary artery diseases. Clinical Epidemiology and Global Health, 24, 101433.
Jing, L., Ulloa Cerna, A. E., Good, C. W., Sauers, N. M., Schneider, G., Hartzel, D. N., ... & Fornwalt, B. K. (2020). A machine learning approach to management of heart failure populations. Heart Failure, 8(7), 578-587.
Kaur, H., & Kumari, V. (2022). Predictive modeling and analytics for diabetes using a machine learning approach. Applied computing and informatics, 18(1/2), 90-100.
Kavitha, M., Gnaneswar, G., Dinesh, R., Sai, Y. R., & Suraj, R. S. (2021, January). Heart disease prediction using hybrid machine learning model. In 2021 6th international conference on inventive computation technologies (ICICT) (pp. 1329-1333). IEEE.
Modern, S. (2019). A critical review of machine learning algorithms and their applications in pure sciences. Research Journal of Recent Sciences, 8(1), 14-29.
Negassa, A., Ahmed, S., Zolty, R., & Patel, S. R. (2021). Prediction model using machine learning for mortality in patients with heart failure. The American journal of cardiology, 153, 86-93.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12, 2825-2830.
Pradhan, N., Rani, G., Dhaka, V. S., & Poonia, R. C. (2020). Diabetes prediction using artificial neural network. In Deep Learning Techniques for Biomedical and Health Informatics, 327-339. Academic Press.
Rajkomar, A., Dean, J., & Kohane, I. (2019). Machine learning in medicine. New England Journal of Medicine, 380(14), 1347-1358.
Ross, E. G., Jung, K., Dudley, J. T., Li, L., Leeper, N. J., & Shah, N. H. (2019). Predicting future cardiovascular events in patients with peripheral artery disease using electronic health record data. Circulation: Cardiovascular Quality and Outcomes, 12(3), e004741.
Shin, S., Austin, P. C., Ross, H. J., Abdel‐Qadir, H., Freitas, C., Tomlinson, G., ... & Lee, D. S. (2021). Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC heart failure, 8(1), 106-115.
Sun, H., Saeedi, P., Karuranga, S., Pinkepank, M., Ogurtsova, K., Duncan, B. B., Stein, C., Basit, A., Chan, J. C. N., Mbanya, J. C., Pavkov, M. E., Ramachandaran, A., Wild, S. H., James, S., Herman, W. H., Zhang, P., Bommer, C., Kuo, S., Boyko, E. J., & Magliano, D. J. (2022). IDF Diabetes Atlas: Global, regional, and country-level diabetes prevalence estimates for 2021 and projections for 2045. Diabetes research and clinical practice, 183, 109119. https://doi.org/10.1016/j.diabres.2021.109119
Sylvester, E. V., Bentzen, P., Bradbury, I. R., Clément, M., Pearce, J., Horne, J., & Beiko, R. G. (2018). Applications of random forest feature selection for fine‐scale genetic population assignment. Evolutionary applications, 11(2), 153-165.
Ward, A., Sarraju, A., Chung, S., Li, J., Harrington, R., Heidenreich, P., ... & Rodriguez, F. (2020). Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ digital medicine, 3(1), 125.
Yancy, C. W., Jessup, M., Bozkurt, B., Butler, J., Casey, D. E., Drazner, M. H., ... & Wilkoff, B. L. (2013). 2013 ACCF/AHA guideline for the management of heart failure: a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines. Journal of the American college of cardiology, 62(16), e147-e239.
Yuan, H., Fan, X. S., Jin, Y., He, J. X., Gui, Y., Song, L. Y., ... & Chen, W. (2019). Development of heart failure risk prediction models based on a multi-marker approach using random forest algorithms. Chinese medical journal, 132(07), 819-826.
