3.1 Attributes description
A dataset comprising 611 entries across seven attributes (Age, Gender, Glucose, Systolic Blood Pressure (Sys BP), Diastolic Blood Pressure (Dia BP), Body Mass Index (BMI), and Outcome) is summarized statistically in Table 3. Each attribute is characterized by its count, mean, standard deviation (Std), minimum (Min), and maximum (Max). Ages ranged from 13 to 90 years, with a mean of 55.08 years and a standard deviation of 15.79. Gender was almost evenly distributed; the mean of 0.54 indicates a slight preponderance of males. Blood glucose levels ranged from 1.38 to 32.86 mmol/L, with an average of 10.38 and considerable variability (Std 4.63). The average systolic and diastolic blood pressures were 130.39 (Std 16.64) and 82.64 (Std 12.95) mmHg, respectively. BMI averaged 23.65 with a standard deviation of 3.61. The Outcome variable is a binary indicator taking the value 0 or 1.
Table 3 Data set description
| Attributes | Count | Mean | Std | Min | Max |
|------------|-------|--------|-------|-------|-------|
| Age | 611 | 55.08 | 15.79 | 13 | 90 |
| Gender | 611 | 0.54 | 0.49 | 0 | 1 |
| Glucose | 611 | 10.38 | 4.63 | 1.38 | 32.86 |
| Sys BP | 611 | 130.39 | 16.64 | 30 | 180 |
| Dia BP | 611 | 82.64 | 12.95 | 50 | 130 |
| BMI | 611 | 23.65 | 3.61 | 14.88 | 37.71 |
| Outcome | 611 | 0.52 | 0.5 | 0 | 1 |
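As a sketch, summary statistics of this form can be produced directly with pandas. The toy records below are illustrative stand-ins (the authors' 611-record dataset is not reproduced here); only the column names mirror Table 3's attributes.

```python
# Reproducing the layout of Table 3 with pandas describe().
# The five records here are hypothetical sample values, not study data.
import pandas as pd

df = pd.DataFrame({
    "Age":     [55, 70, 38, 64, 45],
    "Gender":  [1, 0, 0, 1, 1],                 # 0 = female, 1 = male
    "Glucose": [10.4, 14.75, 6.88, 26.7, 12.5],  # mmol/L
    "SysBP":   [130, 140, 120, 150, 120],
    "DiaBP":   [82, 90, 70, 100, 80],
    "BMI":     [23.6, 21.86, 22.73, 25.7, 23.8],
    "Outcome": [1, 1, 0, 1, 0],
})

# Keep only the count / mean / std / min / max rows, matching Table 3
summary = df.describe().loc[["count", "mean", "std", "min", "max"]].T
print(summary)
```

`describe()` also reports quartiles; selecting only count, mean, std, min, and max reproduces the columns used in Table 3.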
As shown in Fig. 4, the histogram demonstrates a normal distribution pattern for the age attribute within the designated range of values. The x-axis indicates the age values, while the y-axis represents the density of the age attribute. The analysis reveals that within the dataset, the minimum age recorded was 13 years, while the maximum age reached 90 years. Furthermore, the fitted distribution line indicates a mean value of 55.08 years, shedding light on the central tendency of the age distribution within the studied population.
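The fitted-histogram view of Fig. 4 can be sketched as follows. The ages are simulated to mimic the reported statistics (mean 55.08, std 15.79, range 13–90), not drawn from the study data, so the fitted parameters will only approximate the reported values.

```python
# Sketch of a Fig. 4-style density histogram of Age with a fitted normal curve.
# `ages` is a simulated sample, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ages = np.clip(rng.normal(55.08, 15.79, size=611), 13, 90)

# Fit a normal distribution, as the overlaid curve in Fig. 4 does
mu, sigma = stats.norm.fit(ages)

# Density histogram values (matplotlib would draw these as bars)
density, bin_edges = np.histogram(ages, bins=20, density=True)
print(f"fitted mean = {mu:.2f}, fitted std = {sigma:.2f}")
```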
3.2 Effect of factors on hypertension
Tables 4 and 5 present the results of a logistic regression analysis examining the relationship between various factors and hypertension. The variables under evaluation included age, gender (male), body mass index (BMI ≥ 25), systolic blood pressure (Sys BP ≥ 140), diastolic blood pressure (Dia BP ≥ 90), and glucose levels (diabetic, ≥ 7.8 mmol/L). The crude odds ratios (COR) and adjusted odds ratios (AOR) for each factor are provided, along with the corresponding p-values and 95% confidence intervals (CI). Age exhibited a significant COR (1.0350, p = 0.000), indicating a rise in risk with each additional year, although its adjusted effect did not reach statistical significance (AOR 0.9680, p = 0.063). Likewise, males showed a notable COR (1.6239, p = 0.003), but the adjusted effect (AOR 0.6197, p = 0.321) was not significant. Strong associations were observed with elevated glucose levels in both the unadjusted (COR 5.8999, p = 0.000) and adjusted (AOR 6.6786, p = 0.001) models, identifying a significant association between diabetes and hypertension. The COR of 5.8999 (95% CI: 4.0965–8.4973) indicates that individuals with diabetes had 5.8999-fold higher odds of hypertension than those without diabetes. Following adjustment for the other risk factors, Model II displayed an AOR of 6.6786 (95% CI: 2.2475–19.846; p = 0.001), signifying 6.6786-fold higher odds of hypertension among individuals with diabetes. Previous studies have also highlighted the substantial influence of diabetes on hypertension prevalence [23]. A strong association exists between high systolic blood pressure and hypertension (COR 333.19, p = 0.000; AOR 255.16, p = 0.000), and similarly for high diastolic blood pressure (COR 1345.6, p = 0.000; AOR 1507.0, p = 0.000).
Moreover, BMI ≥ 25 showed a robust positive association in both the adjusted (AOR 7.8157, p = 0.000) and unadjusted (COR 4.2696, p = 0.000) models. Individuals with a BMI ≥ 25 were 4.2696 times more likely to have hypertension than those with a BMI below 25, highlighting a significant relationship between hypertension and being overweight or obese. Other studies have likewise indicated that being overweight, defined as a BMI exceeding 25, significantly influences hypertension prevalence [24]. These findings underscore the importance of blood pressure, BMI, and glucose levels as key predictors of hypertension, even after accounting for the other variables.
Table 4 Effect of factors on hypertension (Crude OR, Model 1)

| Factors | COR | P-value | 95% CI Lower | 95% CI Upper |
|---------|-----|---------|--------------|--------------|
| Age | 1.0350 | 0.000 | 1.0238 | 1.0464 |
| Gender (Male) | 1.6239 | 0.003 | 1.1793 | 2.2359 |
| Glucose (Diabetic, ≥ 7.8 mmol/L) | 5.8999 | 0.000 | 4.0965 | 8.4973 |
| Sys BP ≥ 140 | 333.19 | 0.000 | 130.58 | 850.19 |
| Dia BP ≥ 90 | 1345.6 | 0.000 | 185.03 | 9785.9 |
| BMI ≥ 25 | 4.2696 | 0.000 | 2.9998 | 6.0769 |
Table 5 Effect of different factors on hypertension (Adjusted OR, Model 2)

| Factors | AOR | P-value | 95% CI Lower | 95% CI Upper |
|---------|-----|---------|--------------|--------------|
| Age | 0.9680 | 0.063 | 0.9354 | 1.0017 |
| Gender (Male) | 0.6197 | 0.321 | 0.2409 | 1.5941 |
| Glucose (Diabetic, ≥ 7.8 mmol/L) | 6.6786 | 0.001 | 2.2475 | 19.846 |
| Sys BP ≥ 140 | 255.16 | 0.000 | 66.299 | 982.06 |
| Dia BP ≥ 90 | 1507.0 | 0.000 | 138.18 | 16435 |
| BMI ≥ 25 | 7.8157 | 0.000 | 2.8636 | 21.331 |
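For context, a crude odds ratio and its Wald-type 95% confidence interval can be computed from a 2×2 exposure-by-outcome table as sketched below. The counts are hypothetical, chosen only to illustrate the arithmetic; the adjusted ORs of Table 5 would instead come from a multivariable logistic regression.

```python
# Crude odds ratio (COR) with a Wald 95% CI from a 2x2 table.
import math

def crude_or(a, b, c, d):
    """a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR)
    lo = math.exp(math.log(or_) - 1.96 * se_log)
    hi = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lo, hi

# Hypothetical counts: diabetic vs. non-diabetic by hypertension status
or_, lo, hi = crude_or(180, 60, 85, 170)
print(f"COR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```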
3.3 Train:test ratio-based comparison
3.3.1 60% Training 40% Testing Dataset
Table 6 assesses the efficacy of various classification algorithms in terms of Accuracy, Precision, Recall, and F1 Score, with and without the integration of blood pressure (BP) as a predictor, at a train:test ratio of 60:40. Including BP led to a substantial improvement across all measured metrics. For example, the accuracy of the Logistic Regression (LR) algorithm increased from 0.6694 without BP to 0.9673 with BP. Similarly, the precision of the Random Forest (RF) algorithm rose from 0.7164 to 0.9764 when BP was included, and the recall of the Support Vector Machine (SVM) algorithm climbed from 0.8189 to 0.9685. These results demonstrate that BP data enhanced the predictive accuracy, precision, and recall of the algorithms. Specifically, with BP included, SVM achieved the highest accuracy (0.9755, tied with RF), while LR and RF also displayed notable performance improvements. In the absence of BP, the AdaBoost model excelled in accuracy (0.7429), and SVM achieved the highest recall (0.8189). These outcomes underscore the significance of including BP as a predictor to enhance the predictive capabilities of the algorithms.
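The with-BP versus without-BP protocol at a 60:40 split can be sketched with scikit-learn as below. The data are synthetic (via `make_classification`), so the numbers will not reproduce Table 6, and treating columns 3 and 4 as the Sys BP / Dia BP features is an illustrative assumption.

```python
# Sketch of the with-BP vs. without-BP comparison at a 60:40 train:test split.
# Synthetic data stand in for the study's 611 records.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=611, n_features=6, n_informative=4,
                           random_state=42)
X_no_bp = X[:, [0, 1, 2, 5]]          # drop the two stand-in "BP" columns

results = {}
for name, feats in [("with BP", X), ("without BP", X_no_bp)]:
    X_tr, X_te, y_tr, y_te = train_test_split(feats, y, test_size=0.40,
                                              random_state=42)
    pred = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).predict(X_te)
    results[name] = {"accuracy": accuracy_score(y_te, pred),
                     "f1": f1_score(y_te, pred)}

for name, scores in results.items():
    print(name, scores)
```

The same loop extended over the paper's ten model classes and three split ratios would generate Tables 6–8.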
Table 6 Test 40%, Train 60% (with BP and without BP)
| Model Name | With BP: Accuracy | Precision | Recall | F1 Score | Without BP: Accuracy | Precision | Recall | F1 Score |
|------------|-------------------|-----------|--------|----------|----------------------|-----------|--------|----------|
| LR | 0.9673 | 0.9837 | 0.9528 | 0.968 | 0.6694 | 0.6742 | 0.7008 | 0.6873 |
| NB | 0.9469 | 0.9523 | 0.9449 | 0.9486 | 0.6489 | 0.6847 | 0.5984 | 0.6387 |
| SVM | 0.9755 | 0.984 | 0.9685 | 0.9762 | 0.702 | 0.6753 | 0.8189 | 0.7402 |
| KNN | 0.9592 | 0.9606 | 0.9606 | 0.9606 | 0.7143 | 0.705 | 0.7717 | 0.7368 |
| DT | 0.9592 | 0.9756 | 0.9449 | 0.96 | 0.6571 | 0.6617 | 0.6929 | 0.6769 |
| RF | 0.9755 | 0.9764 | 0.9764 | 0.9764 | 0.7184 | 0.7164 | 0.7559 | 0.7356 |
| Bagging | 0.9633 | 0.9758 | 0.9528 | 0.9641 | 0.6857 | 0.6984 | 0.6929 | 0.6957 |
| AdaBoost | 0.9641 | 0.9758 | 0.9528 | 0.9641 | 0.7429 | 0.75 | 0.7559 | 0.7529 |
| GB | 0.9633 | 0.9758 | 0.9528 | 0.9641 | 0.7184 | 0.7197 | 0.748 | 0.7336 |
| ET | 0.9306 | 0.9741 | 0.8898 | 0.93 | 0.6 | 0.6074 | 0.6457 | 0.6259 |
3.3.2 70% Training 30% Testing Dataset
Table 7 presents a comparative assessment of the ten machine learning algorithms, reporting Accuracy, Precision, and Recall both with and without the blood pressure (BP) attribute at a train:test ratio of 70:30. AdaBoost showed the highest accuracy (98.37%) and precision (100%) when using the BP feature, while Random Forest (RF) achieved the highest recall (97.89%). Conversely, Extra Trees (ET) had the lowest accuracy (89.67%) and precision (87.25%), with Decision Tree (DT) and ET sharing the minimum recall (93.68%). Without the BP attribute, AdaBoost continued to perform well, with the highest accuracy (73.37%) and precision (73.96%), while the Support Vector Machine (SVM) obtained the highest recall (77.89%). Naive Bayes (NB) recorded the lowest accuracy (63.04%) and recall (58.95%), whereas SVM exhibited the lowest precision (64.35%). Clearly, incorporating the BP attribute substantially improved model performance across all metrics, positioning AdaBoost as the leading algorithm in both scenarios, followed by Random Forest (RF) and Gradient Boosting (GB). Removing the BP attribute caused a noticeable decline in accuracy, precision, and recall for all models, emphasizing the critical role of BP in predictive modelling.
Table 7 Test 30%, Train 70% (with BP and without BP)
| Model Name | With BP: Accuracy | Precision | Recall | F1 Score | Without BP: Accuracy | Precision | Recall | F1 Score |
|------------|-------------------|-----------|--------|----------|----------------------|-----------|--------|----------|
| LR | 0.9619 | 0.9783 | 0.9474 | 0.9626 | 0.6359 | 0.6458 | 0.6526 | 0.6492 |
| NB | 0.9457 | 0.9474 | 0.9474 | 0.9474 | 0.6304 | 0.6588 | 0.5895 | 0.6222 |
| SVM | 0.9674 | 0.9785 | 0.9579 | 0.9681 | 0.663 | 0.6435 | 0.7789 | 0.7048 |
| KNN | 0.9511 | 0.9479 | 0.9579 | 0.9529 | 0.7011 | 0.6961 | 0.7474 | 0.7208 |
| DT | 0.9565 | 0.978 | 0.9368 | 0.9569 | 0.663 | 0.6701 | 0.6842 | 0.6771 |
| RF | 0.9728 | 0.9688 | 0.9789 | 0.9738 | 0.7011 | 0.7041 | 0.7263 | 0.715 |
| Bagging | 0.9619 | 0.9681 | 0.9579 | 0.9629 | 0.6957 | 0.7053 | 0.7053 | 0.7053 |
| AdaBoost | 0.9837 | 1 | 0.9684 | 0.9839 | 0.7337 | 0.7396 | 0.7474 | 0.7435 |
| GB | 0.9674 | 0.9684 | 0.9684 | 0.9684 | 0.7283 | 0.7272 | 0.7579 | 0.7423 |
| ET | 0.8967 | 0.8725 | 0.9368 | 0.9036 | 0.6793 | 0.6875 | 0.6947 | 0.6912 |
3.3.3 80% Training 20% Testing Dataset
Table 8 contrasts the performance of the machine learning algorithms with and without BP under an 80:20 train:test ratio, using Accuracy, Precision, and Recall as evaluation metrics. Models incorporating BP performed markedly better across all metrics. For instance, the Logistic Regression (LR) model achieved an accuracy of 0.9675 with BP but only 0.6748 without it. Similarly, the Support Vector Machine (SVM) achieved perfect precision (1.000) with BP, which fell to 0.6585 without BP. This consistent trend across all models highlights the critical role of BP in enhancing model efficacy. Notably, with BP present, SVM and AdaBoost shared the highest accuracy (0.9756), while the Extra Trees (ET) model exhibited the lowest (0.9268). Without BP, the AdaBoost model recorded the highest accuracy (0.7886), whereas the NB and ET models displayed the lowest (0.6585).
3.4 F1 score-based comparison
The F1 score is widely regarded as an effective metric for evaluating classification performance because it incorporates both false positives and false negatives. While plain accuracy has its merits in certain scenarios, the F1 score is generally deemed more appropriate, particularly when the repercussions of false positives and false negatives vary significantly.
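Because F1 is the harmonic mean of precision and recall, the reported F1 columns can be cross-checked directly from the precision and recall columns, e.g. the LR "with BP" row of Table 6:

```python
# F1 is the harmonic mean of precision and recall, so it penalizes
# false positives and false negatives jointly.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Cross-check against Table 6, LR with BP: precision 0.9837, recall 0.9528
print(round(f1(0.9837, 0.9528), 4))  # 0.968, matching the reported F1
```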
The graph depicted in Fig. 5 effectively showcases the models that have achieved the highest F1 scores, such as RF (0.9764) and SVM (0.9762), following an assessment
Table 8 Train 80% , test 20% (with BP and without BP)
| Model Name | With BP: Accuracy | Precision | Recall | F1 Score | Without BP: Accuracy | Precision | Recall | F1 Score |
|------------|-------------------|-----------|--------|----------|----------------------|-----------|--------|----------|
| LR | 0.9675 | 0.9839 | 0.9531 | 0.9683 | 0.6748 | 0.6667 | 0.75 | 0.7059 |
| NB | 0.9431 | 0.9524 | 0.9375 | 0.9449 | 0.6585 | 0.6774 | 0.6563 | 0.6667 |
| SVM | 0.9756 | 1 | 0.9531 | 0.976 | 0.6912 | 0.6585 | 0.8438 | 0.7397 |
| KNN | 0.9349 | 0.9242 | 0.9531 | 0.9385 | 0.7561 | 0.7429 | 0.8125 | 0.7761 |
| DT | 0.9431 | 0.9523 | 0.9375 | 0.9449 | 0.6748 | 0.6579 | 0.7813 | 0.7143 |
| RF | 0.9675 | 0.9545 | 0.9844 | 0.9692 | 0.7317 | 0.7183 | 0.7969 | 0.7556 |
| Bagging | 0.9593 | 0.9594 | 0.9688 | 0.9612 | 0.7073 | 0.7 | 0.7656 | 0.7313 |
| AdaBoost | 0.9756 | 0.9841 | 0.9688 | 0.9764 | 0.7886 | 0.7714 | 0.8438 | 0.8059 |
| GB | 0.9675 | 0.9545 | 0.9844 | 0.9692 | 0.7724 | 0.7647 | 0.8125 | 0.7879 |
| ET | 0.9268 | 0.9231 | 0.9375 | 0.9302 | 0.6585 | 0.6964 | 0.6094 | 0.65 |
procedure that employed a 60:40 partition, allocating 60% of the dataset for training and 40% for testing. These models exhibited optimal performance when BP attributes were considered. Even the strongest models without blood pressure attributes, AdaBoost (0.7529) and SVM (0.7402), fell well short of their counterparts that included BP. Eliminating BP features leads to a notable decrease in the F1 score; the Logistic Regression model, for example, dropped from 0.968 to 0.6873. This observation indicates the pivotal role of BP attributes in the prediction of hypertension. The effectiveness of the hypertension prediction model was heightened when BP measurements were incorporated: the high F1 scores of the "with BP" models emphasize the crucial significance of BP data for accurate predictions, while the reduced F1 scores of the "without BP" models underscore the limitations of predictions made without this essential information.
Fig. 6 illustrates the F1 scores achieved by the ten machine learning models in predicting hypertension within the selected region at a train:test ratio of 70:30. As in Fig. 5, both scenarios, with and without blood pressure (BP) attributes, are covered. AdaBoost consistently outperformed the other models in both scenarios. When BP variables were included, AdaBoost attained the highest F1 score of 0.9839, showcasing its adeptness at utilizing this information for hypertension prediction. Remarkably, AdaBoost maintained a respectable F1 score of 0.7435 even in the absence of BP attributes, underscoring its robustness and its capacity to leverage other significant features. The collective performance of all models improved when BP attributes were part of the analysis, emphasizing the crucial role of these attributes in hypertension prediction; AdaBoost, Gradient Boosting (GB), and Support Vector Machine (SVM) in particular exhibited enhanced performance when BP attributes were considered. Conversely, Extra Trees (ET) experienced a significant performance decline when BP attributes were excluded.
Fig. 7 depicts the F1 scores (with and without BP) for the 80:20 train:test split. Here the Support Vector Machine (SVM) and AdaBoost techniques attained the highest F1 scores, close to 97.6%. In contrast, the Extra Trees (ET) approach with BP yielded the lowest F1 score, approximately 93.02%. Without BP, the ET method again showed the lowest F1 score, around 65%, while AdaBoost once more achieved the highest, approximately 80.59%.
3.5 Evaluation of prediction
Table 9 Predict result
| Sl. No. | Age | Gender* | Glucose | Sys BP | Dia BP | BMI | Outcome** | Accurate |
|---------|-----|---------|---------|--------|--------|-----|-----------|----------|
| 1 | 70 | 0 | 14.75 | 140 | 90 | 21.86 | 1 | Yes |
| 2 | 64 | 1 | 26.7 | 150 | 100 | 25.7 | 1 | Yes |
| 3 | 38 | 0 | 6.88 | 120 | 70 | 22.73 | 0 | Yes |
| 4 | 45 | 1 | 12.5 | 120 | 80 | 23.8 | 0 | Yes |
| 5 | 44 | 1 | 9.9 | 150 | 100 | 26.65 | 1 | Yes |
| 6 | 41 | 1 | 7.2 | - | - | 25.23 | 1 | Yes |
| 7 | 55 | 0 | 7.9 | - | - | 28.43 | 0 | No |
| 8 | 73 | 0 | 5.6 | - | - | 19.83 | 0 | Yes |
| 9 | 66 | 1 | 9.9 | - | - | 27 | 1 | Yes |
| 10 | 60 | 1 | 11 | - | - | 26.65 | 1 | Yes |
Gender*: 0 = female, 1 = male. Outcome**: 0 = non-hypertensive, 1 = hypertensive.
Upon evaluation of all ten models across the various train:test ratios, with and without blood pressure (BP), the AdaBoost model emerged as the best predictor in the current investigation. After identifying the optimal model, a set of ten manually entered records was used to predict hypertension. The predictive performance of AdaBoost is detailed in Table 9.
When predicting without the BP features, 80% of the dataset was assigned to the training subset and the remaining 20% to testing; when BP was factored into the prediction, a 70:30 train:test ratio was used. In this phase of the research, age, gender, glucose level, systolic blood pressure (Sys BP), diastolic blood pressure (Dia BP), and body mass index (BMI) were considered for each individual. The first five entries in Table 9 included BP, whereas the latter five excluded it. Notably, all entries comprising BP data were predicted correctly, while the five entries without BP yielded an accuracy of 80%.
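The prediction step behind Table 9 can be sketched as below. The training data and the labeling rule (hypertensive iff Sys BP ≥ 140 or Dia BP ≥ 90) are entirely synthetic assumptions, so this only illustrates the mechanics of querying a fitted AdaBoost model with a manually entered record; it does not reproduce the paper's model.

```python
# Querying an AdaBoost classifier with a manually entered record, as in
# Table 9. Training data are synthetic; feature order follows the paper:
# Age, Gender, Glucose, Sys BP, Dia BP, BMI.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
n = 611
X = np.column_stack([
    rng.integers(13, 91, n),        # Age
    rng.integers(0, 2, n),          # Gender (0 = female, 1 = male)
    rng.uniform(1.4, 32.9, n),      # Glucose (mmol/L)
    rng.integers(90, 181, n),       # Sys BP
    rng.integers(50, 131, n),       # Dia BP
    rng.uniform(15, 38, n),         # BMI
])
# Made-up label rule: hypertensive if Sys BP >= 140 or Dia BP >= 90
y = ((X[:, 3] >= 140) | (X[:, 4] >= 90)).astype(int)

model = AdaBoostClassifier(random_state=42).fit(X, y)

# Record 1 of Table 9: 70-year-old female, glucose 14.75, BP 140/90, BMI 21.86
record = np.array([[70, 0, 14.75, 140, 90, 21.86]])
print("predicted outcome:", model.predict(record)[0])
```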
The first record in the table pertained to a 70-year-old female (Gender = 0), with a glucose concentration of 14.75 mmol/L, systolic BP of 140, diastolic BP of 90, and a BMI of 21.86. The predicted outcome was 1, signifying hypertension, and this prediction was correct ("Yes"). In contrast, the seventh record involved a 55-year-old female (Gender = 0), with a glucose concentration of 7.9 mmol/L, no BP data, and a BMI of 28.43. Here the predicted outcome was 0, suggesting the absence of hypertension, but this prediction proved inaccurate ("No"). This investigation highlights that hypertension forecasts were more prone to error in scenarios where blood pressure readings were omitted than in cases where such data were integrated into the predictive models.