Baseline characteristics
According to the inclusion and exclusion criteria, among the 11,631 patients diagnosed with H-type hypertension at Beijing Anzhen Hospital between January 2022 and December 2023, 4,632 suffered an ischemic stroke. A total of 3,305 had medical records in the same hospital between January 2018 and December 2021; of these, 2,340 were assigned to the training set and 965 to the testing set (Supplementary Fig. 1). Another 103 patients with H-type hypertension, including 61 without ischemic stroke and 42 with ischemic stroke, were enrolled from the China-Japan Friendship Hospital as an external validation cohort (Supplementary Fig. 2). The characteristics of patients in the total cohort and in the training and internal validation sets are shown in Table 1 and Supplementary Table 1, respectively. As shown in Table 1, compared with non-stroke patients, patients with ischemic stroke were older, had higher SBP, and had higher proportions of smokers and of patients with a history of cardiovascular disease (all P < 0.05).
Table 1
Baseline clinical and biochemical characteristics of all patients
Variable | Total (n = 3408) | Non-stroke (n = 1951) | Ischemic stroke (n = 1457) | P value |
Age, years, median (IQR) | 56 (42–66) | 46 (37–58) | 65 (57–74) | < 0.001 |
Male, n (%) | 2435 (71.4%) | 1382 (70.8%) | 1053 (72.3%) | 0.358 |
BMI, kg/m², median (IQR) | 26.26 (24.00–28.98) | 26.84 (24.38–29.59) | 25.61 (23.44–28.04) | < 0.001 |
SBP, mmHg, median (IQR) | 142 (130–155) | 140 (130–152) | 145 (132–159) | < 0.001 |
DBP, mmHg, median (IQR) | 86 (78–97) | 90 (80–100) | 82 (74–92) | < 0.001 |
Smoking, n (%) | 1605 (47.1%) | 882 (45.2%) | 723 (49.6%) | 0.011 |
Drinking, n (%) | 1482 (43.5%) | 873 (44.7%) | 609 (41.8%) | 0.086 |
Diabetes mellitus, n (%) | 1043 (30.6%) | 405 (20.8%) | 638 (43.8%) | < 0.001 |
Hyperlipidemia, n (%) | 2796 (82.0%) | 1413 (72.4%) | 1383 (94.9%) | < 0.001 |
Coronary heart disease, n (%) | 729 (21.4%) | 315 (16.1%) | 414 (28.4%) | < 0.001 |
Atrial fibrillation, n (%) | 178 (5.2%) | 16 (0.8%) | 162 (11.1%) | < 0.001 |
Antihypertensive drugs, n (%) | 2287 (67.1%) | 1366 (70.0%) | 921 (63.2%) | < 0.001 |
Antiplatelet drugs, n (%) | 218 (6.4%) | 123 (6.3%) | 95 (6.5%) | 0.799 |
Family history of cerebral infarction, n (%) | 216 (6.3%) | 118 (6.0%) | 98 (6.7%) | 0.422 |
hs-CRP, mg/L, median (IQR) | 1.34 (0.65–3.18) | 1.17 (0.60–2.65) | 1.72 (0.78–4.83) | < 0.001 |
Na, mmol/L, median (IQR) | 140.6 (139.0–142.0) | 140.4 (138.9–141.7) | 140.9 (139.2–142.4) | < 0.001 |
K, mmol/L, median (IQR) | 4.05 (3.80–4.27) | 4.13 (3.92–4.33) | 3.90 (3.68–4.15) | < 0.001 |
Mg, mmol/L, median (IQR) | 0.91 (0.86–0.95) | 0.92 (0.87–0.96) | 0.89 (0.84–0.94) | < 0.001 |
TBil, µmol/L, median (IQR) | 13.00 (10.05–16.82) | 13.20 (10.52–17.00) | 12.51 (9.40–16.60) | < 0.001 |
DBil, µmol/L, median (IQR) | 4.20 (3.05–5.57) | 4.21 (3.06–5.47) | 4.20 (3.03–5.71) | 0.202 |
Urinary protein, n (%) | | | | < 0.001 |
- | 2831 (83.1%) | 1804 (92.5%) | 1027 (70.5%) | |
+ | 425 (12.5%) | 95 (4.9%) | 330 (22.6%) | |
++ | 108 (3.3%) | 38 (1.9%) | 70 (5.0%) | |
+++ | 40 (1.2%) | 13 (0.7%) | 27 (1.9%) | |
Hcy, µmol/L, n (%) | | | | < 0.001 |
< 15 | 2092 (61.4%) | 1260 (64.6%) | 832 (57.1%) | |
15–30 | 1051 (30.8%) | 541 (27.7%) | 510 (35.0%) | |
> 30 | 265 (7.8%) | 150 (7.7%) | 115 (7.9%) | |
Carotid artery stenosis, n (%) | 282 (8.3%) | 144 (7.4%) | 138 (9.5%) | 0.028 |
IQR, interquartile range; BMI, body mass index; SBP, systolic blood pressure; DBP, diastolic blood pressure; hs-CRP, high-sensitivity C-reactive protein; Na, sodium; K, potassium; Mg, magnesium; TBil, total bilirubin; DBil, direct bilirubin; Hcy, homocysteine.
Predictor selection
Univariate logistic regression identified 16 variables with P < 0.05 (Supplementary Table 2). After stepwise regression, 13 variables were retained: age, antihypertensive therapy, hyperlipidemia, atrial fibrillation (AF), diabetes mellitus (DM), BMI, SBP, DBP, hs-CRP, K, Mg, Hcy, and proteinuria. In best subset selection, the Bayesian information criterion (BIC) reached its minimum when the model included eight variables: age, antihypertensive therapy, hyperlipidemia, AF, hs-CRP, K, Mg, and proteinuria (Fig. 1A and B). In LASSO regression, 17 variables were selected at the lambda within 1 standard error (SE) of the minimum cross-validated error: age, gender, antihypertensive therapy, antiplatelet therapy, hyperlipidemia, AF, DM, coronary artery disease (CAD), BMI, SBP, DBP, hs-CRP, K, Mg, Hcy, proteinuria, and carotid artery stenosis (Fig. 1C and D). Ultimately, eight variables were included in model development: age, antihypertensive therapy, hyperlipidemia, AF, hs-CRP, K, Mg, and proteinuria (Fig. 2).
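The one-standard-error rule used for the LASSO penalty can be sketched as follows. This is a minimal illustration, assuming the cross-validation results are available as three parallel lists; the function name `lambda_one_se` and the numeric values are hypothetical and are not taken from the study.

```python
def lambda_one_se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation results.

    lambda_min minimizes the mean cross-validated error; lambda_1se is the
    largest lambda whose mean error is within one SE of that minimum.
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    limit = cv_mean[i_min] + cv_se[i_min]
    within = [lam for lam, err in zip(lambdas, cv_mean) if err <= limit]
    return lambdas[i_min], max(within)


# Hypothetical cross-validation curve (illustrative values only).
lambdas = [0.001, 0.005, 0.01, 0.05, 0.1]
cv_mean = [0.640, 0.600, 0.610, 0.615, 0.700]
cv_se = [0.020, 0.020, 0.020, 0.020, 0.030]
lam_min, lam_1se = lambda_one_se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

A larger lambda shrinks more coefficients to zero, so the 1-SE choice favors a sparser model at a small cost in cross-validated error.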
Model development and validation
The eight variables were entered into a multivariable logistic regression model, a linear-kernel SVM model, a random forest model, and an XGBoost model. On the testing set, the four models yielded AUCs of 0.905 (95% CI: 0.887–0.924), 0.896 (95% CI: 0.876–0.915), 0.893 (95% CI: 0.872–0.914), and 0.909 (95% CI: 0.890–0.927), respectively, for the risk of ischemic stroke (Fig. 3 and Table 2). The difference in AUC between the logistic regression model and the XGBoost model was not significant (DeLong test, P = 0.406). Based on the maximal Youden’s index, the thresholds of the four models were 55%, 46%, 37%, and 43%, respectively. The XGBoost model had the highest sensitivity (0.825), with a specificity of 0.860.
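Threshold selection by the maximal Youden's index (J = sensitivity + specificity − 1) can be sketched as below. This is an illustrative implementation only, assuming binary labels and predicted probabilities as plain lists; the function name `youden_threshold` and the toy data are hypothetical.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is classified as positive when its predicted probability is
    greater than or equal to the threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # keep the first threshold reaching the maximum
            best_t, best_j = t, j
    return best_t, best_j


# Toy data (hypothetical, not from the study cohort).
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9]
threshold, j = youden_threshold(y_true, y_prob)
print(threshold, round(j, 3))
```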
Table 2
Predictive performance of the four models on the testing set
Model | AUC (95% CI) | Sensitivity | Specificity | Accuracy | PPV | NPV |
Logistic model | 0.905 (0.887–0.924)# | 0.745 | 0.905 | 0.833 | 0.860 | 0.816 |
SVM model | 0.896 (0.876–0.915)* | 0.778 | 0.851 | 0.819 | 0.806 | 0.828 |
Random forest model | 0.893 (0.872–0.914) | 0.820 | 0.840 | 0.831 | 0.803 | 0.854 |
XGBoost model | 0.909 (0.890–0.927) | 0.825 | 0.860 | 0.845 | 0.825 | 0.860 |
#: There was no significant difference in AUC between the logistic model and the XGBoost model by DeLong test; *: There was no significant difference in AUC between the SVM model and the random forest model by DeLong test; AUC: area under curve; CI: confidence interval; PPV: positive predictive value; NPV: negative predictive value; SVM: support vector machine.
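For reference, the threshold-dependent metrics reported in Table 2 follow from the confusion matrix at the chosen cut-off. The sketch below uses hypothetical counts, not the study's testing-set counts, and the function name `classification_metrics` is an assumption made for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```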
Calibration plots were used to assess the calibration of the models. As shown in Fig. 4, all four models were well calibrated. Among them, the predicted probabilities of the logistic regression model and the XGBoost model were close to the observed probabilities (Fig. 4A and D). All four models yielded a high net benefit, especially the logistic regression model and the XGBoost model (Fig. 5). In summary, the logistic regression model and the XGBoost model exhibited excellent discrimination and calibration. Considering the visualization and scalability of the prediction model, we ultimately chose the logistic regression model as the optimal model. The weight coefficients of the eight variables in the logistic regression model are shown in Supplementary Fig. 3. Serum magnesium, serum potassium, AF, and hyperlipidemia had higher weights in the optimal model.
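Net benefit, as commonly defined in decision curve analysis, weighs true positives against false positives at a given threshold probability p_t: NB = TP/n − (FP/n) × p_t/(1 − p_t). The sketch below is a minimal illustration under that standard definition; the function names and toy data are hypothetical, and the code is not the analysis pipeline used in the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (hypothetical, not from the study cohort).
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9]
for pt in (0.2, 0.5, 0.7):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    print(pt, round(model_nb, 3), round(all_nb, 3))
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero.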
In the external cohort, the logistic regression model achieved an AUC of 0.872 (95% CI: 0.805–0.939), showing good discrimination (Supplementary Fig. 4 and Supplementary Table 3). The logistic regression model was also well calibrated and had a high net benefit in the external cohort (Supplementary Fig. 5 and Supplementary Fig. 6).
Model visualization
The eight variables, namely age (A), antihypertensive therapy (A), biomarkers (B) (serum magnesium, serum potassium, proteinuria, and high-sensitivity C-reactive protein), and comorbidities (C) (atrial fibrillation and hyperlipidemia), were fitted in a logistic regression model to predict the risk of ischemic stroke in patients with H-type hypertension. The model was termed the A2BC ischemic stroke model and is presented as a nomogram (Fig. 6). The variables are listed separately, and the cumulative score is matched to a predicted risk.
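The way a nomogram converts a logistic regression model into points can be sketched as follows. The intercept, coefficients, and value ranges below are hypothetical placeholders chosen for illustration; they are not the fitted A2BC coefficients, and only three predictors are shown. The scaling convention (the predictor with the widest effect spans 0 to 100 points) is a common one and is assumed here.

```python
import math

# HYPOTHETICAL intercept, coefficients, and ranges, for illustration only.
# These are NOT the fitted A2BC model parameters.
INTERCEPT = -1.0
PREDICTORS = {
    # name: (beta, minimum, maximum)
    "age_years": (0.05, 20.0, 90.0),
    "atrial_fibrillation": (2.0, 0.0, 1.0),
    "serum_potassium": (-1.0, 3.0, 5.5),
}


def _reference(beta, low, high):
    """Value that scores zero points: the lowest-risk end of the range."""
    return low if beta >= 0 else high


def _scale():
    """Points per unit of linear predictor; widest predictor spans 100."""
    widest = max(abs(b) * (hi - lo) for b, lo, hi in PREDICTORS.values())
    return 100.0 / widest


def points(name, value):
    """Nomogram points contributed by one predictor value."""
    beta, low, high = PREDICTORS[name]
    return beta * (value - _reference(beta, low, high)) * _scale()


def risk_from_points(total_points):
    """Map the cumulative score back to a predicted probability."""
    baseline = INTERCEPT + sum(
        b * _reference(b, lo, hi) for b, lo, hi in PREDICTORS.values()
    )
    linear_predictor = baseline + total_points / _scale()
    return 1.0 / (1.0 + math.exp(-linear_predictor))


patient = {"age_years": 55.0, "atrial_fibrillation": 1.0, "serum_potassium": 4.0}
total = sum(points(name, value) for name, value in patient.items())
print(round(total, 1), round(risk_from_points(total), 3))
```

Because the points are a linear rescaling of each term in the model, the risk read from the total score equals the probability obtained by evaluating the logistic regression equation directly.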