Study Population
In a prospective study, we enrolled consecutive participants with no indications of cardiovascular disease (CVD). The subjects were enrolled from November 2019 to November 2022 at two hospitals: a tertiary university hospital and a community general hospital. Hypertensive patients were recruited from the outpatient clinics of the respective centers. Normotensive healthy individuals had been referred either for the investigation of atypical chest pain or for the modification of cardiovascular risk factors such as hyperlipidemia. The diagnosis of hypertension was based on office BP > 140/90 mmHg, measured at three consecutive visits, or at one visit when the diagnosis was confirmed by out-of-office measurements [24]. In addition, out-of-office measurements were performed to exclude masked or white-coat hypertension [24, 25].
A physical examination and routine laboratory tests were performed on all subjects before inclusion, and all underwent a routine echocardiographic examination. Height and weight were measured during the same visit as the ECG acquisition, and individuals were classified according to the World Health Organization (WHO) body mass index (BMI) categories.
Emphasizing the importance of data quality over quantity, we had every ECG carefully reviewed by cardiologists to exclude subjects with conditions that could confound the ML model. Subjects with any of the following characteristics were excluded: tachy- or bradyarrhythmia; permanent atrial fibrillation; RBBB, LBBB or other conduction abnormalities on ECG; coronary artery disease; moderate or severe valvular heart disease; cardiomyopathy; cerebrovascular, liver or renal disease; history of acute coronary syndrome or myocarditis; ejection fraction < 55%; history of drug or alcohol abuse; any chronic inflammatory or other infectious disease during the last 6 months; thyroid gland disease; pregnancy or lactation. Vascular or neoplastic conditions were also ruled out by careful examination of the history and routine laboratory tests. Functional tests for myocardial ischemia, coronary computed tomography angiography or invasive coronary angiography were performed according to the physician's judgement, in order to exclude coronary artery disease.
The study was conducted in accordance with the Declaration of Helsinki, the protocol was approved by the Hospital Ethics Committee, and patients gave written informed consent to their participation in the study.
Electrocardiography
A 12-lead resting ECG of 10 seconds duration was acquired from each subject using a digital 6-channel machine (Biocare iE 6, Shenzhen, P.R. China) and stored as an eXtensible Markup Language (XML) file. The sampling rate was 1000 Hz. To simulate devices that record only a single lead, we kept only the lead I tracing in our digital files for further processing. Automated measurements of wave/complex duration and wave amplitude, calculated by Biocare's software, were then extracted from the digital files. These measurements were based on representative complexes (corresponding to individual heart beats) of 1-second duration, which, according to the manufacturer, are calculated by breaking the 10-second signal into ten 1-second segments and averaging them into one. The final beat signals were verified and adjusted where needed.
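Assuming the raw lead I samples have already been parsed out of the XML file into a NumPy array, the manufacturer's averaging scheme can be approximated with a short sketch (the function name and array layout here are our own illustration, not Biocare's implementation):

```python
import numpy as np

FS = 1000  # sampling rate in Hz, as stated in the text

def representative_beat(lead_i: np.ndarray, fs: int = FS) -> np.ndarray:
    """Average a 10-second single-lead signal into one 1-second
    representative complex by splitting it into ten 1-second
    segments and taking their element-wise mean."""
    n = 10 * fs
    segments = lead_i[:n].reshape(10, fs)  # ten 1-second windows
    return segments.mean(axis=0)

# Example on a synthetic 10-second lead-I trace
rng = np.random.default_rng(0)
signal = rng.standard_normal(10 * FS)
beat = representative_beat(signal)
print(beat.shape)  # (1000,)
```

Note that this simple segment-and-average step does not align segments on the R-peaks; the vendor's actual procedure may differ in detail.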
Classification with Random Forests
Classification, a type of supervised ML, assigns a subject to one of two or more categories based on features used as training input to the model. Our models were trained to discriminate whether a person is hypertensive or not, based on a number of ECG-derived and anthropometric features.
A Random Forest (RF) classifier is an ML model consisting of an ensemble of decision trees [26]. Each decision tree performs a series of binary decisions (splits) on a subset of the input features (such as age, gender, BMI), effectively trying out different feature orders and combinations. An RF builds a large collection of de-correlated trees and then averages their votes for the predicted class [27]. RFs are good predictors even on smaller datasets thanks to this technique, known as bootstrap aggregating (bagging): each tree is trained on an overlapping, randomly selected subset of the data, and the final decision is made by a vote of the trees. We implemented the RF using the RandomForestClassifier from the scikit-learn library [28]. We optimized the model hyperparameters by minimizing the RF's built-in out-of-bag (OOB) error estimate, which is almost identical to that obtained by N-fold cross-validation [27]; this technique enables an RF to be trained and cross-validated in one pass. Additionally, an RF can handle non-linear interactions as well as identify correlations among features.
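A minimal sketch of this OOB-based tuning strategy, using synthetic data in place of our feature table (the candidate `max_depth` grid is illustrative only, not the grid used in the study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the ECG/anthropometric feature table
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Tune a hyperparameter by minimizing the out-of-bag (OOB) error;
# with oob_score=True, each tree's held-out (non-bootstrapped)
# samples act as a built-in cross-validation set.
best_err, best_depth = 1.0, None
for depth in (3, 5, None):
    rf = RandomForestClassifier(
        n_estimators=200, max_depth=depth,
        oob_score=True, random_state=42,
    ).fit(X, y)
    oob_error = 1.0 - rf.oob_score_  # OOB misclassification rate
    if oob_error < best_err:
        best_err, best_depth = oob_error, depth
print(best_depth, round(best_err, 3))
```

Because the OOB estimate comes for free with bagging, no separate cross-validation loop over the training data is needed.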
Feature engineering
We calculated additional ECG waveform measurements with custom Python [29] code from the 1-second representative beat produced by the electrocardiograph. Starting from the automated measurements provided by the machine, we calculated areas under curves, slopes, and heights of waveforms. Electrocardiographic terms are consistent with the AMA Manual of Style (11th edition, 2019). We chose to include ECG measurements adjusted for BMI based on studies showing that larger body mass decreases the amplitude of the R- and S-waves in specific leads, because the electrical currents travel longer distances [18].
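As an illustration of the kind of derived measurements involved, a toy sketch on a synthetic beat (the function and its interval arguments are hypothetical; the actual features were computed from the machine's wave onset/offset annotations):

```python
import numpy as np

FS = 1000  # Hz

def wave_features(beat, start_ms, end_ms, fs=FS):
    """Derive simple waveform features over a wave's interval
    (e.g. between the onset and offset reported by the machine):
    area under the curve, mean slope, and peak-to-peak height."""
    seg = beat[int(start_ms * fs / 1000):int(end_ms * fs / 1000)]
    t = np.arange(seg.size) / fs
    area = seg.sum() / fs               # rectangle-rule area (mV*s)
    slope = np.polyfit(t, seg, 1)[0]    # linear-fit slope (mV/s)
    height = seg.max() - seg.min()      # peak-to-peak height (mV)
    return area, slope, height

beat = np.sin(np.linspace(0, np.pi, FS))  # toy 1-second "beat"
area, slope, height = wave_features(beat, 0, 1000)
```

A BMI adjustment of an amplitude feature would then be a simple division, e.g. `height / bmi`, though the exact adjustment used in the study is not reproduced here.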
Feature selection
We removed features that exhibited high correlation (> 85%) with one of the retained features, as measured by Spearman's rank correlation; highly correlated features contribute the same information, and including both in an RF model might not affect performance, but it divides, and thus lessens, each feature's apparent predictive significance. We chose the Spearman test because of the possibility of non-linear relationships in the data. We then calculated the rank-sum statistic for each remaining feature and ranked the features by p-value. This feature selection was performed during pre-processing (before model training). We used the spearmanr function from the scipy.stats library to measure correlation. All numerical plots in this paper were created directly from the data, using Python's plotting library matplotlib.
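The two-stage selection can be sketched as follows on synthetic data (the 85% threshold is as stated above; `ranksums` from scipy.stats stands in for the rank-sum statistic, with the two groups playing the role of hypertensive vs. normotensive subjects):

```python
import numpy as np
from scipy.stats import spearmanr, ranksums

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(200)  # near-duplicate feature
y = (X[:, 0] + rng.standard_normal(200) > 0).astype(int)

# Stage 1: drop any feature whose |Spearman rho| with an already
# retained feature exceeds 0.85
rho, _ = spearmanr(X)
keep = []
for j in range(X.shape[1]):
    if all(abs(rho[j, k]) <= 0.85 for k in keep):
        keep.append(j)

# Stage 2: rank the surviving features by rank-sum test p-value
pvals = [ranksums(X[y == 1, j], X[y == 0, j]).pvalue for j in keep]
ranked = [j for _, j in sorted(zip(pvals, keep))]
print(keep, ranked)
```

Here feature 1 is discarded as a near-duplicate of feature 0, and feature 0, which drives the outcome, ranks first.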
Feature Importance
Explaining predictions from tree models is always desirable and is particularly important in medical applications, where the patterns uncovered by a model are often more important than the model's prediction performance [30]. We explored feature importance in our RF model. The scikit-learn library provides a tree ensemble implementation that allows measures of feature importance to be computed. These measures aspire to provide insight into which features drive the model's predictions. Mean Decrease in Impurity (MDI), an approach popular among medical researchers, computes each feature's importance as the impurity reduction summed over all splits on that feature, across all trees. It has been shown that impurity-based feature importance can inflate the role of numerical features and understate the contribution of categorical, low-cardinality ones [31]. Furthermore, these importances are computed from training-set statistics and therefore do not reflect a feature's usefulness for predictions that generalize to the test set.
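In scikit-learn, MDI importances are exposed as the fitted model's `feature_importances_` attribute; a brief sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=2, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Mean Decrease in Impurity: per feature, the impurity reduction
# summed over all splits on that feature and averaged across trees;
# normalized so the values sum to 1. Computed from training-set
# statistics only.
mdi = rf.feature_importances_
print(mdi.round(3))
```

The caveats above apply: these values are training-set quantities and say nothing directly about generalization.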
We go one step further and calculate a recent feature importance metric, SHapley Additive exPlanations (SHAP), a game-theoretic approach to explaining the output of any machine learning model. SHAP connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions [30, 32]. Feature importance visualized with SHAP values is considered more accurate, both globally and locally (i.e., for each individual prediction rather than averaged over the whole dataset). SHAP values have already been used in medical papers [33].
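The underlying Shapley idea can be illustrated without the shap library by exhaustively averaging each feature's marginal contribution over all feature orderings; a toy sketch with a hand-made two-feature "model" (in practice, shap's TreeExplainer computes these values efficiently for tree ensembles):

```python
import itertools
import math

def shapley_values(features, value):
    """Exact Shapley values: each feature's marginal contribution
    averaged over all orderings in which features are revealed."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for perm in itertools.permutations(features):
        coalition = set()
        for f in perm:
            before = value(frozenset(coalition))
            coalition.add(f)
            phi[f] += value(frozenset(coalition)) - before
    for f in phi:
        phi[f] /= math.factorial(n)
    return phi

# Toy "model output" as a function of which features are known:
# 'a' contributes 2, 'b' contributes 1, and together a bonus of 1.
def v(s):
    return 2 * ('a' in s) + 1 * ('b' in s) + 1 * ('a' in s and 'b' in s)

phi = shapley_values(['a', 'b'], v)
print(phi)  # {'a': 2.5, 'b': 1.5}
```

Note the additivity property: the values sum to the full model output (2.5 + 1.5 = 4), which is what makes SHAP a local, per-prediction explanation.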
Datasets
The dataset was randomly partitioned into a training set of 877 data points (70%), used directly to learn the parameters of the model, and a test set of 377 (30%), which the model had not seen before and which was used exclusively for the final performance evaluation. Stratification by sex and history of hypertension during the partition ensured that the two sets contained the same proportions of these two features. The ratio of hypertensive to normotensive patients in the total dataset was about 2:1, which, in our view, did not necessitate techniques for imbalanced datasets other than adjusting the decision threshold. For validation while training the RF we used the model's internal out-of-bag set. All reported performance results are on the hold-out test set. Feature importance graphs are also computed on the test set, since using the training set inflates the importance of some features that may not be as important in predicting the outcome. We also ensured that data from the same patient was not included in both the training and test sets. The area under the receiver operating characteristic curve (ROC AUC) was measured using the trapezoidal rule.
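A sketch of the split-and-evaluate pipeline on synthetic data of the same size and approximate class balance (for brevity we stratify only on the class label here, whereas the study stratified on sex and hypertension history; scikit-learn's `auc` applies the trapezoidal rule):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1254 subjects, roughly 2:1 class ratio
X, y = make_classification(n_samples=1254, weights=[0.33, 0.67],
                           random_state=7)

# 70/30 split, stratified so both sets keep the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=7)

rf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_tr, y_tr)

# ROC AUC on the hold-out test set via the trapezoidal rule
fpr, tpr, _ = roc_curve(y_te, rf.predict_proba(X_te)[:, 1])
auc_value = auc(fpr, tpr)
print(round(auc_value, 3))
```

With these sizes the split reproduces the 877/377 partition quoted above; all evaluation uses only `X_te`/`y_te`, which the model never saw during fitting.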