Participant characteristics and baseline demographics
We identified 177,009 patients with newly diagnosed diabetes between 2008 and 2020 in the TMUCRD dataset. We excluded 43,097 patients with type 1 diabetes, 31,731 patients with type 2 diabetes without complications, and 46,166 individuals who had not received antidiabetic medications. An additional 2,390 patients younger than 40 years were excluded, as were 10,004 patients with no recorded medical history or follow-up treatment at the hospitals (shown in Fig. 1). Finally, we selected a cohort of 39,646 patients with type 2 diabetes (equivalent to 1,818,138 visits) to develop and validate the prediction models. A total of 1,216,987 visits were used for model development and validation, whereas 601,151 visits were used for external testing.
The training cohort included 25,399 patients, and the testing cohort included 14,294 patients. The mean patient age was 62.5 (±13.9) years in the training cohort and 61.7 (±13.6) years in the testing cohort.
Regarding antidiabetic drugs, a combination of oral blood glucose lowering drugs was most commonly used in both the training (n=792,726 visits, 65.1%) and testing (n=383,010 visits, 63.7%) cohorts. This was followed by metformin (n=199,515, 16.4% and n=90,541, 15.1% in the training and testing cohorts, respectively) and insulin, including analogs (n=111,610, 9.2% and n=68,111, 11.3%, respectively). The other drugs were dipeptidyl peptidase 4 (DPP-4) inhibitors and glucagon-like peptide-1 receptor agonists (GLP-1 RAs).
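The percentages for antidiabetic drugs appear to use visits rather than patients as the denominator: dividing each count by the 1,216,987 development visits or the 601,151 external testing visits reported above reproduces them. A minimal Python sketch of that check is given below; the dictionary layout and the helper name `share` are illustrative assumptions and not part of the study's code.

```python
# Visit totals reported for the development (training) and external testing sets.
TOTAL_VISITS = {"training": 1_216_987, "testing": 601_151}

# Visit counts per drug class, copied from the text above.
DRUG_VISITS = {
    "combination of oral drugs": {"training": 792_726, "testing": 383_010},
    "metformin": {"training": 199_515, "testing": 90_541},
    "insulin (incl. analogs)": {"training": 111_610, "testing": 68_111},
}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Hypothetical helper structure: percentage of visits per drug class and cohort.
PERCENTAGES = {
    drug: {cohort: share(n, TOTAL_VISITS[cohort]) for cohort, n in counts.items()}
    for drug, counts in DRUG_VISITS.items()
}

for drug, pct in PERCENTAGES.items():
    print(f"{drug}: training {pct['training']}%, testing {pct['testing']}%")
```

Keeping the denominators in one place makes it explicit that these shares describe visits, which is worth bearing in mind when comparing them with the patient-level comorbidity percentages.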
The most prevalent comorbidity was hyperlipidemia (n=7,764, 30.6% and n=2,439, 17.0% in the training and testing cohorts, respectively), followed by hypertension (n=7,286, 28.8% and n=1,439, 10.1%, respectively), cardiovascular diseases (n=2,098, 8.3% and n=626, 4.4%, respectively), and prior stroke (n=760, 3.0% and n=536, 3.7%, respectively) (shown in Table 1).
The associations between predictors and ischemic stroke
Table 2 shows the associations between clinical features and 1-year and 3-year ischemic stroke incidence in the univariate logistic regression analysis. Odds ratios (ORs) with 95% confidence intervals (CIs) were used to identify risk factors. For the 1-year incidence, the features significantly associated with the outcome were as follows: age (OR 1.04 [95% CI: 1.03, 1.05]), metformin (OR 0.18 [95% CI: 0.12, 0.29]), sulfonylureas (OR 0.3 [95% CI: 0.14, 0.62]), combination of oral blood glucose lowering drugs (OR 0.25 [95% CI: 0.17, 0.36]), hyperlipidemia (OR 0.51 [95% CI: 0.35, 0.76]), prior stroke (OR 5.91 [95% CI: 3.87, 9.02]), cardiovascular disease (OR 1.87 [95% CI: 1.22, 2.87]), dementia (OR 3.78 [95% CI: 1.54, 9.31]), Parkinson’s disease (OR 5.77 [95% CI: 2.1, 15.86]), Charlson Comorbidity Index (CCI) score (OR 1.23 [95% CI: 1.13, 1.33]), beta-blocking agents (OR 0.42 [95% CI: 0.2, 0.9]), calcium channel blockers (OR 0.23 [95% CI: 0.09, 0.63]), renin-angiotensin system agents (OR 0.26 [95% CI: 0.12, 0.54]), lipid-modifying agents (OR 0.32 [95% CI: 0.16, 0.63]), and AC glucose (OR 1.0029 [95% CI: 1.0019, 1.0039]).
For the 3-year incidence, the significant features were as follows: age (OR 1.04 [95% CI: 1.04, 1.05]), metformin (OR 0.26 [95% CI: 0.19, 0.35]), other blood glucose lowering drugs, excluding insulin (OR 0.26 [95% CI: 0.08, 0.82]), combination of oral blood glucose lowering drugs (OR 0.42 [95% CI: 0.33, 0.54]), hyperlipidemia (OR 0.72 [95% CI: 0.57, 0.91]), prior stroke (OR 6.39 [95% CI: 4.82, 8.46]), cardiovascular disease (OR 2.85 [95% CI: 2.22, 3.65]), dementia (OR 3.35 [95% CI: 1.76, 6.37]), CCI score (OR 1.28 [95% CI: 1.21, 1.35]), renin-angiotensin system agents (OR 0.7 [95% CI: 0.5, 0.97]), lipid-modifying agents (OR 0.7 [95% CI: 0.5, 0.96]), and AC glucose (OR 1.0021 [95% CI: 1.0012, 1.003]) (shown in Table 2). Parkinson’s disease was not significantly associated with the 3-year incidence, as its CI included 1 (OR 2.47 [95% CI: 0.9, 6.73]).
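For readers who want to relate these ORs to the underlying regression coefficients, the sketch below assumes Wald-type intervals, that is, OR = exp(β) with 95% CI = exp(β ± 1.96·SE). The text does not state how the intervals were computed, so this is an assumption, and the function names are hypothetical. The example round-trips the reported 1-year estimate for prior stroke and checks whether an interval excludes 1.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta: float, se: float) -> tuple[float, float, float]:
    """Convert a logistic regression coefficient and its standard error
    into an odds ratio with a Wald 95% confidence interval."""
    return (
        math.exp(beta),
        math.exp(beta - Z_95 * se),
        math.exp(beta + Z_95 * se),
    )


def coef_from_or(odds_ratio: float, lower: float, upper: float) -> tuple[float, float]:
    """Back out (beta, se) from a reported OR and its Wald 95% CI."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return beta, se


def excludes_one(lower: float, upper: float) -> bool:
    """A 95% CI that excludes 1 corresponds to significance at the 5% level."""
    return not (lower <= 1.0 <= upper)


# Reported 1-year estimate for prior stroke: OR 5.91 [95% CI: 3.87, 9.02].
beta, se = coef_from_or(5.91, 3.87, 9.02)
or_, lo, hi = odds_ratio_ci(beta, se)
print(f"beta={beta:.3f}, se={se:.3f}, OR={or_:.2f} [{lo:.2f}, {hi:.2f}]")

# Reported 3-year estimate for Parkinson's disease: 95% CI 0.9 to 6.73.
print("Parkinson's (3-year) CI excludes 1:", excludes_one(0.9, 6.73))
```

Under this assumption, an OR below 1 (for example, metformin) corresponds to a negative coefficient, and an interval that spans 1 corresponds to a coefficient whose interval spans 0.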
Model development and performance
Table 3a shows our prediction models in two different modes. For 1-year ischemic stroke, the LR, LDA, LGBM, GBM, RF, XGBoost, AB, and Voting algorithms showed AUCs of 0.65, 0.754, 0.907, 0.812, 0.97, 0.895, 0.801, 0.65, and 0.95, respectively, in the training set, and 0.65, 0.77, 0.77, 0.77, 0.71, 0.78, 0.76, and 0.65, respectively, in the testing set. For 3-year ischemic stroke, the LR, LDA, LGBM, GBM, RF, XGBoost, AB, and Voting algorithms showed AUCs of 0.657, 0.74, 0.869, 0.796, 0.95, 0.874, 0.773, and 0.657, respectively, in the training set, and 0.676, 0.76, 0.755, 0.744, 0.689, 0.756, 0.753, and 0.676, respectively, in the testing set (shown in Table 3a and Fig. 2).
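As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The toy labels and scores are synthetic and purely illustrative. The second part tabulates the gap between training and testing AUCs for the 3-year models, using the values quoted above in the order the algorithms are listed; the variable names are assumptions, and this is not the study's evaluation code.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count one half); equivalent to the Mann-Whitney U statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy data, for illustration only (not from the study).
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("toy AUC:", toy_auc)

# 3-year AUCs as quoted in the text, in the order the algorithms are listed.
ALGORITHMS = ["LR", "LDA", "LGBM", "GBM", "RF", "XGBoost", "AB", "Voting"]
TRAIN_3Y = [0.657, 0.74, 0.869, 0.796, 0.95, 0.874, 0.773, 0.657]
TEST_3Y = [0.676, 0.76, 0.755, 0.744, 0.689, 0.756, 0.753, 0.676]

# Training-minus-testing gap per algorithm.
GAPS = {
    name: round(train - test, 3)
    for name, train, test in zip(ALGORITHMS, TRAIN_3Y, TEST_3Y)
}
largest_gap = max(GAPS, key=GAPS.get)

for name, gap in GAPS.items():
    print(f"{name}: train-test gap {gap:+.3f}")
print("largest gap:", largest_gap)
```

A large positive gap, such as the one for RF here, may indicate that a model fits the training data more closely than it generalizes, which is one reason the testing AUCs are the more informative figures.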
Feature importance
Figure 3 lists the top 20 features contributing to the prediction models for the 1-year and 3-year follow-up periods. The highest-ranked features in the 1-year follow-up model were stroke history, BMI, age, HbA1c, antidiabetic drugs, creatinine, triglycerides, hypertension, lipid-modifying drugs, HDL, LDL, and disease duration. Meanwhile, the top 10 features in the 3-year model were prior stroke, age, diabetes duration, antithrombotic agents, triglycerides, diuretics, CCI score, creatinine, lipid-modifying agents, and HbA1c (shown in Fig. 3).