This research employs a pipeline that segments an electrocardiogram (ECG) and extracts its features in order to detect atrial and right bundle branch block (RBBB) arrhythmias; a random forest machine learning model is then used to identify these abnormal heart rhythms in real time. Electrocardiograms are widely used by physicians as a diagnostic tool to evaluate a patient's heart activity, since they allow continuous monitoring of the heart's electrical signals. The resulting time-based readings can be compared with normal values to help find problems such as cardiac arrhythmias.

This investigation uses ECG records from the MIT-BIH arrhythmia database, in which every patient's signal is sampled at 360 Hz, a rate that also keeps the computational load manageable. Training a machine learning model on these data, reviewing its learned predictions, and then applying the model makes it possible to detect the positions of R peaks within beats, and long-term success depends on the ability to detect variations in rhythm. This study uses the labelling scheme provided with the MIT-BIH database to distinguish sinus (normal) beats, atrial beats, and RBBB beats; because these beats exhibit recognisable rhythmic patterns, they are comparatively easy for a model to learn. Each class in the MIT-BIH database is mapped to one of AAMI's five super-classes. RBBB and atrial beats are the most prevalent abnormal classes, although atrial arrhythmias can occasionally be supraventricular in character. Table 1 summarises the samples drawn from the MIT-BIH records used for training and testing. Example records are available for each of the selected categories: 242 recordings labelled "atrial arrhythmia", 115 labelled "RBBB arrhythmia", 205 labelled "normal", and 118 labelled "other". The records are also accompanied by numerous annotation notes.
Table 1
Data samples drawn from the training and test sets

| Dataset  | Samples |
|----------|---------|
| Training | 2222911 |
| Test     | 469311  |
3.1 Data Pre-processing
To guarantee accurate arrhythmia diagnosis and analysis, the data must first be preprocessed, and several steps have to be completed before the data can be processed and classified. In real-time analysis there are no offline preparation stages; instead, the output of each stage flows directly into the next. As part of this project's online workflow, Apache Spark was used to prepare the ECG data, and pandas UDFs, a Spark facility for running vectorised pandas code on Spark DataFrames, allow standard data mining techniques to be applied inside the pipeline. After noise reduction, the R peaks are detected, the data are segmented into individual beats, and features are extracted. Once the data have been processed [35], the resulting features must be classified to identify arrhythmias. Figure 1 shows how the denoised, R-peak-annotated ECG data are combined into single beats. In the final stage, the significant characteristics of each segment are extracted, reducing each beat to 25 samples, and these extracted features are then passed to the arrhythmia classifiers.
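As an illustration of how such a stage can be expressed in Spark, the sketch below wraps a placeholder denoising routine in a pandas UDF and applies it to an ECG DataFrame; the column names and the denoise helper are illustrative assumptions, not the implementation used in this work.

```python
# Minimal sketch of a preprocessing stage as a Spark pandas UDF.
# Assumptions: a column "raw_samples" holding one ECG window per row and a
# placeholder denoise() helper; neither comes from the paper itself.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, DoubleType

spark = SparkSession.builder.appName("ecg-preprocessing").getOrCreate()

def denoise(samples):
    # Placeholder for the FIR band-pass filtering step described below.
    return [float(x) for x in samples]

@pandas_udf(ArrayType(DoubleType()))
def denoise_udf(raw: pd.Series) -> pd.Series:
    # Apply the denoising routine to every ECG window in the micro-batch.
    return raw.apply(denoise)

# Hypothetical usage, assuming ecg_df holds one window per row:
# denoised_df = ecg_df.withColumn("denoised", denoise_udf("raw_samples"))
```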
ECG denoising improves the signal by reducing background noise. Several problems with the acquired data must be corrected before further processing: if the patient's ECG signal is distorted by external interference, the ECG analysis becomes less accurate than it could be. To obtain reliable signal data that neither distorts nor omits the original information, the noise must first be removed from the ECG signal; only then can correct information be acquired. Here, ECG noise was reduced using a finite impulse response (FIR) band-pass filter, a standard tool in digital signal processing applications. This filter stands out for its linear phase and remarkable stability; the phase response must be linear so that every amplitude-frequency component of the digital signal is delayed equally when processed in real time.
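A minimal sketch of such an FIR band-pass denoising step with SciPy is shown below; the 0.5-40 Hz pass band and the 101-tap filter length are illustrative choices, not values reported in this section.

```python
# Minimal sketch of FIR band-pass denoising for an ECG trace (assumed cutoffs).
import numpy as np
from scipy.signal import firwin, lfilter

FS = 360          # MIT-BIH sampling rate in Hz
NUM_TAPS = 101    # odd tap count keeps the FIR filter linear-phase (Type I)

def bandpass_fir(ecg: np.ndarray, low_hz: float = 0.5, high_hz: float = 40.0) -> np.ndarray:
    """Suppress baseline wander and high-frequency noise in an ECG trace."""
    taps = firwin(NUM_TAPS, [low_hz, high_hz], pass_zero=False, fs=FS)
    # A causal FIR filter delays every frequency equally by (NUM_TAPS - 1) / 2 samples.
    return lfilter(taps, [1.0], ecg)
```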
3.2 R-Peak Detection
Every arrhythmia alters these waves, so recognising and interpreting the changes aids in the detection and evaluation of arrhythmias. The QRS complex is formed by the Q, R, and S waves occurring together within a single heartbeat. By locating the R points online, we can identify the specific pulse that produced the peak and trough of the QRS complex, which lets us concentrate on more precise results; using the methods described above, individual heartbeats can be isolated from a long ECG recording. Detecting heartbeats and diagnosing cardiac arrhythmias both rely on identifying the R points in the ECG signal. Filtering reduces the ECG background noise, and the method shown in Fig. 2 then locates the R peaks (the first and second steps of data preparation). Each heartbeat is accompanied by a wave-like electrical cardiac cycle, and these waves are produced by the electrical stimulation that drives every pulse. Irregular heartbeats can disrupt these waves at any time, so cardiac arrhythmias can be detected and identified by looking for characteristic patterns in them [36]. When the three QRS waves merge, the amplitude rises, and each wave plays an essential role in maintaining a steady pulse. The detected R points are used in the segmentation step to separate a long ECG signal file into individual beats; they help distinguish one beat from another, which in turn informs decisions about a patient's cardiac arrhythmia.
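The exact peak detector is not specified here, but a minimal sketch of R-peak localisation on a denoised trace, using a generic peak-finding routine with assumed amplitude and refractory-period thresholds, might look like this:

```python
# Minimal sketch of R-peak detection; the thresholds are illustrative assumptions.
import numpy as np
from scipy.signal import find_peaks

FS = 360  # sampling rate in Hz

def detect_r_peaks(ecg: np.ndarray) -> np.ndarray:
    """Return the sample indices of candidate R peaks in a denoised ECG trace."""
    # Require peaks to clear an amplitude threshold and to be at least ~0.4 s apart,
    # which corresponds to a maximum heart rate of roughly 150 bpm.
    height = np.mean(ecg) + 1.5 * np.std(ecg)
    peaks, _ = find_peaks(ecg, height=height, distance=int(0.4 * FS))
    return peaks
```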
3.3 Segmentation
Segmentation separates a long ECG record into individual heartbeats. This preprocessing phase uses the filtered ECG data and the R peaks found in the previous steps to split the signal into beats. At a sampling rate of 360 Hz, the number of samples per beat ranges from 144 to 432. In this study, 200 samples were taken for each beat, 130 of them after the R peak (and the remaining 70 before it). If an inappropriate number of samples per beat causes critical beat information to be lost, arrhythmias become difficult to identify, so correct segmentation is essential. Figure 3 shows that the signal's important information and features were preserved in the 200 samples obtained with this constant segment size.
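A minimal sketch of this segmentation step is given below; it assumes 70 samples before and 130 after each R peak, inferred from the 200-sample beat length and the 130 post-peak samples stated above.

```python
# Minimal sketch of beat segmentation around detected R peaks (200 samples per beat).
import numpy as np

PRE, POST = 70, 130  # samples kept before and after each R peak

def segment_beats(ecg: np.ndarray, r_peaks: np.ndarray) -> np.ndarray:
    """Return an (n_beats, 200) array; peaks too close to either end are skipped."""
    beats = [
        ecg[r - PRE : r + POST]
        for r in r_peaks
        if r - PRE >= 0 and r + POST <= len(ecg)
    ]
    return np.asarray(beats)
```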
3.4 Feature Extraction
During each cardiac cycle the heart produces a characteristic pattern of P, QRS, and T waves. The durations and amplitudes of these waves determine the clinical information that can be obtained from an ECG [37], which covers both morphological and temporal characteristics: the QRS complex duration, the PR interval, and the T segment describe the morphology, while the temporal features form statistical vectors. Feature extraction aims to capture as few ECG signal features as possible while still revealing the underlying problem, so it is important to choose a technique that collects the ECG characteristics quickly and reliably. To diagnose cardiac arrhythmias, the ECG features must be identified correctly, and for an ECG-based arrhythmia diagnosis as few parameters as possible should be extracted.
The discrete wavelet transform approach used in this work [38] involves four levels of decomposition. Although fewer of the original beat samples are retained, no meaningful information is lost, which speeds up the subsequent classification stage. Despite the reduced sampling rate, the wavelet filter does a good job of limiting how much noise passes through it. Following Nyquist's theorem, half of the samples produced at each stage can be discarded while still representing the required frequency content of the filtered signal. Decomposing progressively coarser versions of the signal in this way is known as subband coding; each decomposition level halves both the number of samples and the frequency range [39], and the output of the low-pass filter is used as the input to the next stage. With the four-level decomposition strategy, the number of samples gathered during each 5-second epoch was reduced from 200 to 25 per beat, which saved a considerable amount of computation time. These epochs are then used to train a classification model [40].
Figure 4 depicts the feature extraction process. The method preserves the heartbeat's characteristics using only 25 samples: there are four levels of decomposition, with fewer signal samples remaining at each level, and although only 25 samples of the heartbeat remain after the fourth level, the core beat information is retained.
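A minimal sketch of such a four-level wavelet decomposition with PyWavelets follows; the choice of mother wavelet (here Daubechies-2) is an assumption, and the exact number of retained coefficients depends on that choice and on the boundary handling.

```python
# Minimal sketch of DWT-based feature extraction (assumed db2 mother wavelet).
import numpy as np
import pywt

def extract_features(beat: np.ndarray, levels: int = 4) -> np.ndarray:
    """Reduce a 200-sample beat to the approximation coefficients of a 4-level DWT."""
    coeffs = pywt.wavedec(beat, wavelet="db2", level=levels)
    # coeffs[0] is the coarsest approximation band; the detail bands are discarded.
    return coeffs[0]
```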
3.5 Online Classification Using Apache Spark
This pipeline uses Apache Spark components to preprocess and classify streaming ECG data. The Structured Streaming application programming interface (API) allows data frames to be constructed over data in transit. Because Structured Streaming reads sequential input as an unbounded table, it produces the same results on all incoming streaming data as a batch operation would; Spark SQL's ability to handle relational data makes this possible. Treating the stream as an endless table means the same data mining operations can be applied to streaming data. Figure 5 depicts the structure of an Apache Spark Structured Streaming data stream.
3.6 File Source
Structured Streaming accepts data from several sources, including Kafka and files. In this work the ECG results were saved in files: when an ECG record arrives, the file source is read every 5 seconds and the received beats are processed. Because the incoming data change continuously, traditional batch-processing systems respond slowly to those changes, whereas a streaming system lets us query the data continually and react quickly. The query interval affects execution speed and is therefore often tuned; the preprocessing and classification stages of our continuous queries are described as pipeline phases. This study evaluates query execution over a five-second interval, meaning the query is run on the streaming data every five seconds.
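A minimal sketch of the file-source read in Spark Structured Streaming is shown below; the schema, file format, and landing directory are illustrative assumptions, and the five-second trigger itself is configured when the query is started (see the next sketch).

```python
# Minimal sketch of reading ECG files as a structured stream (assumed schema/path).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

spark = SparkSession.builder.appName("ecg-streaming").getOrCreate()

# Streaming file sources require an explicit schema; this one is illustrative.
schema = StructType([
    StructField("record_id", StringType()),
    StructField("samples", ArrayType(DoubleType())),
])

# New files dropped into the hypothetical landing directory are picked up
# automatically and treated as rows appended to an unbounded table.
ecg_stream = spark.readStream.schema(schema).json("ecg_incoming/")
```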
The real-time data processing workflow in this project has three steps: reading data from a file source, evaluating how well the pipeline performs, and presenting the results. Structured Streaming provides real-time computation by repeatedly executing the query at a predefined interval (here, every five seconds). The random forest model is loaded before any computation begins so that the final classification can be performed. When the timer expires, five seconds' worth of test data are read from the file source and held in memory, and processing starts immediately; unless the query is stopped, this operation repeats indefinitely. The next step executes the query's instructions: the data are sorted and organised according to the predefined pipeline stages, and after classification the predicted arrhythmia class labels are produced and appear in the query output. When a new file is read in, the process restarts and continues until there are no more files to read. Figure 6 illustrates this online technique for detecting cardiac arrhythmias. The pipeline uses a Spark model to provide structured streaming machine learning.
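Continuing the sketch above, the fragment below loads a previously trained random forest model and scores each five-second micro-batch; the model path, the console sink, and the assumption that the earlier stages have already produced a "features" vector column are all illustrative, not the authors' exact configuration.

```python
# Minimal sketch of streaming classification with a pre-trained random forest.
from pyspark.ml.classification import RandomForestClassificationModel

# Load the pre-trained model before any streaming computation starts.
rf_model = RandomForestClassificationModel.load("models/ecg_rf")  # hypothetical path

# Placeholder: in the full job, the stages from Sections 3.1-3.4 turn the raw
# stream into a DataFrame with a "features" vector column expected by the model.
feature_df = ecg_stream

predictions = rf_model.transform(feature_df)

query = (
    predictions.select("record_id", "prediction")
               .writeStream
               .outputMode("append")
               .format("console")                    # replace with the desired sink
               .trigger(processingTime="5 seconds")  # matches the query interval above
               .start()
)
query.awaitTermination()
```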