Experimental Environment
All experiments in this study were conducted on a computer equipped with an Intel Core i7-14700K processor, 64 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU. The CUDA version was 12.6 and the PyTorch version was 2.1.2; the Python environment was managed with Anaconda and the code was developed in Visual Studio Code.
DNNCM
Data sources
The Deep Neural Network Classification Model (DNNCM) was trained and evaluated using the UCI heart disease dataset. This dataset is a compilation of four sub-datasets: "Cleveland," "Hungary," "Switzerland," and "Long Beach VA." Each sub-dataset originates from a different medical institution: the Cleveland Clinic, the Hungarian Institute of Cardiology, university hospitals in Switzerland, and the Veterans Administration Medical Center in Long Beach, California. The dataset consists of a total of 1,025 samples, 13 feature fields, and 1 target field.
Model Architecture
As shown in Fig. 2, the first fully connected layer of the model transforms the input features into 64 hidden nodes. This is followed by a ReLU activation function, which introduces non-linearity and allows the model to learn complex feature representations. ReLU is computationally simple and has a constant gradient in the positive interval, which helps mitigate the vanishing gradient problem. Next is a Dropout layer, which randomly drops nodes with a probability of 0.3. Another fully connected layer then reduces the 64 nodes to a single output node, used to generate the final prediction. Finally, a Sigmoid activation function maps the output to the (0, 1) range, making it suitable for probability predictions in binary classification tasks.
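The layer sequence described above can be sketched in PyTorch as follows. This is a minimal reconstruction from the text, not the published implementation; the input size of 13 is taken from the dataset's 13 feature fields.

```python
import torch
import torch.nn as nn

# Minimal sketch of the DNNCM architecture described in the text.
# The input width of 13 matches the UCI dataset's 13 feature fields.
class DNNCM(nn.Module):
    def __init__(self, num_features: int = 13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 64),  # input features -> 64 hidden nodes
            nn.ReLU(),                    # non-linearity
            nn.Dropout(p=0.3),            # drop nodes with probability 0.3
            nn.Linear(64, 1),             # 64 nodes -> single output node
            nn.Sigmoid(),                 # map output to the (0, 1) range
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

A forward pass on a batch of shape `(batch, 13)` yields probabilities of shape `(batch, 1)`.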
Training
Small-scale exploratory training showed that the model can consistently achieve an accuracy above 0.9. Based on these results, the model was trained in 100,000 runs of 200 epochs each to search for the optimal parameters. During training, the accuracy was tracked after each epoch; whenever a new peak accuracy was reached, it was recorded and the model parameters were saved. After a total of 20 million epochs of training, the model achieved a peak accuracy of 0.9655.
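The "save on new peak accuracy" strategy above can be sketched as follows. The optimizer, loss function, and data loaders are assumptions for illustration; the source does not specify them, and `train_loader`/`val_loader` are placeholders.

```python
import copy
import torch
import torch.nn as nn

# Sketch of the checkpointing strategy described in the text: track accuracy
# after every epoch and save the parameters whenever a new peak is reached.
# Adam and BCELoss are assumptions, not choices stated in the source.
def train_with_checkpointing(model, train_loader, val_loader, epochs=200):
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.BCELoss()  # binary cross-entropy for sigmoid outputs
    best_acc, best_state = 0.0, None

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        # Evaluate after each epoch; checkpoint on a new accuracy high.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                preds = (model(x) > 0.5).float()
                correct += (preds == y).sum().item()
                total += y.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())

    return best_acc, best_state
```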
HDRPMα
Data sources
The development of HDRPMα is primarily based on a modified version of the UCI dataset.
As mentioned earlier, the main objective of HDRPS is to reduce the cost of screening and warning for heart disease, and to improve the efficiency of heart disease prediction and prevention. Therefore, the data accepted by HDRPMα must be low-cost and easy-to-obtain health data to align with the initial design goals of the system. The dataset is preprocessed according to this standard.
Initially, the dataset contains 13 feature fields, some of which require hospital visits and even professional medical diagnosis to obtain. After careful examination, we identified five features that are relatively difficult to obtain: restecg, oldpeak, slope, ca, and thal. By removing these five features, the remaining eight features are easily accessible, thus facilitating users' everyday use of HDRPS. The rest of the dataset remains unchanged, and this modified 8-feature UCI dataset will be used for the development of HDRPMα.
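The feature-reduction step can be sketched as below, assuming the UCI dataset is loaded as a pandas DataFrame with its conventional column names; the remaining eight features would then be age, sex, cp, trestbps, chol, fbs, thalach, and exang, plus the target field.

```python
import pandas as pd

# The five hard-to-obtain fields identified in the text.
HARD_TO_OBTAIN = ["restecg", "oldpeak", "slope", "ca", "thal"]

def reduce_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the five hard-to-obtain fields, keeping the 8 accessible
    features (and the target column) for the HDRPMα dataset."""
    return df.drop(columns=HARD_TO_OBTAIN)
```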
Model Architecture
The first fully connected layer accepts 8 input features and outputs 16 nodes. This is followed by a ReLU activation function, which introduces non-linearity and helps capture complex relationships within the input data.
The first Dropout layer randomly drops 30% of the nodes to reduce overfitting. The second fully connected layer maps the output of the first layer to 32 nodes; its ReLU activation function and Dropout layer serve the same purposes as before. The third fully connected layer expands the features to 64 nodes, followed by a ReLU activation function.
The fourth fully connected layer maps the 64-node features to 2 output nodes, which is suitable for binary classification problems.
The Sigmoid activation function compresses the output of the fourth fully connected layer to the (0, 1) range, which is typically interpreted as a probability, making it appropriate for outputs in binary classification tasks.
Fig. 3 shows the architecture of HDRPMα.
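A minimal sketch of this layer sequence in PyTorch is given below. It reconstructs only the layers described above; details of the published model (initialization, any additional components) may differ.

```python
import torch
import torch.nn as nn

# Sketch of the HDRPMα layer sequence described in the text (Fig. 3).
class HDRPMAlpha(nn.Module):
    def __init__(self, num_features: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 16),  # 8 input features -> 16 nodes
            nn.ReLU(),
            nn.Dropout(p=0.3),            # drop 30% to reduce overfitting
            nn.Linear(16, 32),            # 16 -> 32 nodes
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(32, 64),            # expand to 64 nodes
            nn.ReLU(),
            nn.Linear(64, 2),             # 64 -> 2 output nodes
            nn.Sigmoid(),                 # compress outputs to (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```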
Training
After finalizing the model structure, extensive training was conducted. HDRPMα consists of 2,994 learnable parameters. Throughout the development process, the model was trained for approximately 130 million epochs, achieving a maximum accuracy of 0.917, a strong result given that only the reduced 8-feature UCI dataset was used.
HDRPMβ
Data sources
As previously mentioned, research on heart disease prediction using electrocardiograms (ECGs) often relies on highly specialized ECG signal data that is difficult for the average person to access and understand. Fortunately, current smart wearable devices can perform mobile ECG monitoring and export the data as PDFs, allowing people to obtain ECG images conveniently. Therefore, the development and training of HDRPMβ must utilize an ECG image dataset. We selected ECG images extracted from the MIT-BIH Arrhythmia Database to represent the diseased portion of the training data, and 50 ECG images from the European ST-T Database to represent the healthy portion.
Model Architecture
HDRPMβ is a convolutional neural network model designed for heart disease prediction, consisting of multiple convolutional, pooling, and fully connected layers, as shown in Fig. 4, with activation functions and Dropout regularization included to enhance performance and robustness. The model input is a single-channel (grayscale) ECG image. Initially, the input passes through a convolutional layer with 16 filters of size 3×3, a stride of 1, and padding of 1, to extract preliminary features. This is immediately followed by a ReLU activation function to introduce non-linearity; a 2×2 max pooling layer is then applied for down-sampling, followed by a 30% Dropout to reduce overfitting. The second convolutional layer increases the feature map to 32 channels, using the same 3×3 filter size, stride of 1, and padding of 1, followed by ReLU activation, max pooling, and another 30% Dropout. The third convolutional layer further extracts features, outputting 64 channels, with the same filter and pooling configuration. After the convolution and pooling stages, the three-dimensional feature maps are flattened into a one-dimensional vector and passed through a fully connected layer with 128 nodes, with a ReLU activation function and a 50% Dropout. In the output layer, a fully connected layer maps the 128 nodes to 2 output nodes, employing a Sigmoid activation function to compress the outputs into the range of 0 to 1, suitable for probability predictions in binary classification tasks.
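The CNN described above can be sketched as follows. The input resolution is not stated in the text; 96×96 here is an assumption for illustration, and the flattened size (and hence the fully connected layer's parameter count) scales with whatever resolution the published model actually used.

```python
import torch
import torch.nn as nn

# Sketch of the HDRPMβ CNN described in the text (Fig. 4).
# The 96x96 input resolution is an assumption, not stated in the source.
class HDRPMBeta(nn.Module):
    def __init__(self, input_size: int = 96):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),   # 16 filters
            nn.ReLU(),
            nn.MaxPool2d(2),   # 2x2 max pooling halves each spatial dimension
            nn.Dropout(p=0.3),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),  # 32 channels
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(p=0.3),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),  # 64 channels
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Three 2x2 poolings shrink each side by a factor of 8.
        flat = 64 * (input_size // 8) ** 2
        self.classifier = nn.Sequential(
            nn.Flatten(),            # 3-D feature maps -> 1-D vector
            nn.Linear(flat, 128),    # fully connected layer with 128 nodes
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(128, 2),       # 128 nodes -> 2 output nodes
            nn.Sigmoid(),            # compress outputs to (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```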
Training
HDRPMβ consists of approximately 1.45 million learnable parameters, and the development process involved training for a total of about 5 million epochs. The model achieved a maximum accuracy of 0.95.