An examination of even a few seizures in a well-established and reproducible model reveals the variability and diversity of these events (Fig. 1a). This variability becomes even more apparent in the extracted features (See methods – feature selection) (Fig. 1b). Although the extracted features robustly increase during seizure events, they are also far more variable during seizures than in the periods before and after them (Fig. 1b). Given this inherent variability, we trained machine learning models on the extracted features and examined their seizure detection performance.
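One feature examined later in this section, line length, can illustrate how a simple extracted feature tracks seizure activity. Below is a minimal sketch computing line length over 5-second windows at 100 Hz (the window parameters match the pipeline described at the end of this section; the synthetic signal and "seizure" burst are purely illustrative):

```python
import numpy as np

def line_length(x):
    """Sum of absolute sample-to-sample differences; this feature
    rises sharply during high-amplitude, high-frequency activity."""
    return np.abs(np.diff(x)).sum()

fs = 100                                   # sampling rate (Hz)
win = 5 * fs                               # 5-second windows
rng = np.random.default_rng(0)
signal = rng.normal(size=fs * 60)          # 1 min of synthetic LFP
signal[2000:3000] *= 10                    # exaggerated "seizure" burst

features = np.array([line_length(signal[i:i + win])
                     for i in range(0, len(signal) - win + 1, win)])
# the windows overlapping the burst have a much larger line length
```

As the comment notes, the windows covering the burst dominate the feature trace, which is why such features form a usable basis for classification despite their variability.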
To achieve this, we split the data into training (11 mice, 4224 hours) and testing (15 mice, 5511 hours) datasets. We then selected 5 feature-sets (Table 1) by removing redundant features and quantifying their relevance (Table 2, See methods – feature selection). After feature selection, we chose four models (Fig. 2A) for seizure detection: decision tree (DT), Gaussian naïve Bayes (GNB), passive-aggressive classifier (PAC), and stochastic gradient descent classifier (SGD), based on model interpretability and the ability to train efficiently on our dataset (See methods – model selection). Each model was tuned (Table 3, See methods – hyperparameter selection) and then trained 5 times for each feature-set (Table 2) to account for model variability and to obtain a better estimate of performance.
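The four-model, five-repeat training scheme could be sketched with scikit-learn as follows (the synthetic data, feature count, and hyperparameters here are illustrative assumptions, not the study's exact configuration):

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))            # 4 extracted features per segment
y = rng.integers(0, 2, size=1000)         # 1 = seizure segment (placeholder)

models = {
    "DT": DecisionTreeClassifier(max_depth=5),
    "GNB": GaussianNB(),
    "PAC": PassiveAggressiveClassifier(max_iter=1000),
    "SGD": SGDClassifier(max_iter=1000),
}

# Train each model 5 times per feature-set to capture run-to-run variability
fitted = {name: [clone(model).fit(X, y) for _ in range(5)]
          for name, model in models.items()}
```

Repeated fits of freshly cloned estimators give a distribution of scores per model/feature-set pair rather than a single point estimate.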
We first compared the 4 models across all feature combinations (Fig. 2A-B). The PAC model had a substantially lower F1 score (Fig. 2B) – a combined measure of model precision and recall/sensitivity (See methods – model metrics) – and a higher false detection rate than the other 3 models, independent of the feature combination (Fig. 2C). Additionally, the PAC-trained models had lower precision and specificity across all feature combinations (Fig. 2D). Therefore, the PAC model was not considered for further analysis. For each of the three remaining models, the feature combination that yielded the highest balanced accuracy was selected for further examination; specifically, feature combinations 4, 5, and 4 were chosen for DT, GNB, and SGD, respectively (Fig. 2E).
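The metrics used in this comparison follow their standard definitions, sketched here in plain Python (the counts passed in any usage are arbitrary):

```python
def precision(tp, fp):
    """Fraction of positive predictions that are true seizures."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true seizure segments that were detected."""
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def false_detection_rate(tp, fp):
    """Fraction of positive predictions that are false alarms."""
    return fp / (tp + fp)
```

Because F1 is a harmonic mean, it is dragged down by whichever of precision or recall is lower, which is why a model with many false alarms scores poorly even at high recall.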
Next, we compared the DT, GNB, and SGD models across several metrics (Fig. 3A). Overall, the DT model had the lowest number of false negatives – seizure segments incorrectly classified as non-seizure (Fig. 3B; DT = 1.00 ± 0.013 ×10³, GNB = 1.46 ± 0.001 ×10³, SGD = 1.38 ± 0.003 ×10³) – resulting in the highest recall (Fig. 3C; DT = 0.84 ± 0.002, GNB = 0.77 ± 0.000, SGD = 0.78 ± 0.001) among the three models. However, it also had the highest number of false positives – non-seizure segments incorrectly classified as seizure (Fig. 3D; DT = 147.87 ± 12.40 ×10³, GNB = 52.87 ± 0.16 ×10³, SGD = 65.69 ± 0.81 ×10³) – resulting in the lowest precision (Fig. 3E; DT = 0.03 ± 0.003, GNB = 0.08 ± 0.000, SGD = 0.07 ± 0.001). These results indicate that the DT model was the most sensitive but the least precise of the three. In contrast, the GNB model was the most precise (Fig. 3E) and had the highest F1 score (Fig. 3F; DT = 0.07 ± 0.005, GNB = 0.15 ± 0.000, SGD = 0.13 ± 0.001). The SGD model had intermediate performance overall, with a lower F1 score than the GNB model (Fig. 3F). Even though the DT model had the highest recall, its performance at seizure detection (See methods – model metrics) was similar to, if not slightly worse than, that of the GNB and SGD models, which detected all seizures in the test dataset (Fig. 3G; DT = 99.80 ± 0.033%, GNB = 100.00 ± 0.000%, SGD = 100.00 ± 0.000%). These results demonstrate that simple and interpretable machine learning models can be very efficient for seizure detection but vary in their reliability and prediction accuracy.
Even though the DT model had significantly higher recall than the GNB model, it did not detect more seizures (DT: 99.80 ± 0.03%, GNB: 100.00 ± 0.00%). Given that the recall of all three models was lower than the proportion of seizures detected, we investigated how the predicted seizure bins compared across time between the three models and the ground truth data. When comparing the predicted seizure bins to ground truth data, we observed that the models detected the center of each seizure with higher accuracy than the seizure boundaries (Fig. 4A-C). This is not surprising given that the features used to train these models do not increase as robustly at the designated seizure boundaries (Supplementary Fig. 1). This observation could explain why the proportion of segments correctly predicted as seizure is lower than the proportion of detected seizures. Interestingly, the increased recall of the DT model appears to arise from its high detection of seizure offset segments. However, the DT model also dramatically overestimates seizure offset (Fig. 4A, D), which likely accounts for its decreased precision. This was not specific to the feature-set chosen to train the DT model or to the depth of the tree, as all DT models tested here overestimated seizure offset predictions (Supplementary Fig. 2).
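The distinction between bin-level recall and event-level detection can be made concrete with a small helper (a hypothetical sketch; the exact event-matching rule is described in the methods):

```python
def event_detection_rate(pred, events):
    """A seizure event counts as detected if at least one of its
    bins is predicted as seizure, so event-level detection can be
    perfect even when bin-level recall is not."""
    hits = sum(any(pred[start:stop]) for start, stop in events)
    return hits / len(events)

pred   = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]    # bin-level predictions
events = [(1, 5), (6, 9)]                  # (start, stop) bins of 2 seizures
# Both events are detected (rate 1.0), yet only 3 of the 7 true
# seizure bins are predicted correctly (bin-level recall ~0.43)
```

This is exactly the pattern reported above: boundary bins are missed, lowering recall, while the well-detected seizure centers keep event-level detection near 100%.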
Manual inspection of EEG datasets to create training labels is costly and laborious. To examine the dataset size required to achieve good model performance and seizure detection, we trained models on increasingly large portions of the data (Fig. 5). The GNB model detected all seizure events from just 1% of the training data, even though its performance, based on F1 score and balanced accuracy, seems to stabilize at 10% of the training data (Fig. 5A-B). The SGD model detected 99.74% of all seizures at 1% of the training data and all seizures at 2.5%, whereas its F1 score seems to have stabilized around 10% of the training data, although it remained consistently below that of the GNB model (Fig. 5A-B). The DT model detected 99.93% of all seizures at 1% of the training data and all seizures at 2.5%, but its detection dropped to 99.84% at 100% of the training data (Fig. 5A). This likely resulted from the DT model optimizing to increase precision and reduce false positives (Fig. 5C-D), at the cost of a higher number of false negatives (Fig. 5E). In addition, the F1 score of the DT model kept improving as the training data size increased, up to the full dataset. However, its F1 score remained lower, and its false detection rate higher, than those of the GNB and SGD models across training data sizes (Fig. 5D, F). As observed before, the DT model had a much lower number of false negatives than the SGD and GNB models, even though its overall seizure detection was reduced (Fig. 5A, E). Therefore, the GNB and SGD models perform well even with smaller training data sizes and quickly achieve stable performance, whereas the DT model requires a large amount of data to improve its precision, at the cost of reduced seizure detection.
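A training-size sweep of this kind could be sketched as follows (the fractions, synthetic data, and choice of GNB are illustrative; the study evaluated all three models):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] > 1.5).astype(int)            # rare positive class (~7%)

scores = {}
for frac in (0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0):
    n = max(2, int(frac * len(X)))         # train on a growing prefix
    model = GaussianNB().fit(X[:n], y[:n])
    scores[frac] = model.score(X, y)       # evaluate on the full set
```

Plotting `scores` against `frac` gives the stabilization curves described above; in practice each fraction would be repeated several times to estimate variability.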
To further understand how these models classify EEG segments, we extracted metrics quantifying the influence of each feature on model predictions (See methods – feature contributions; DT: feature importance, SGD: feature weight, GNB: feature separation score). This analysis revealed that, in the DT model, the line length of the vHPC was by far the most important feature, with a value of 0.80. The second most important feature was the envelope amplitude of the vHPC, with a value of only 0.14, while the two other features had negligible importance, each scoring less than 0.05 (Fig. 6A). In contrast, the SGD model had more balanced weights across features, with the line length of the vHPC again having the highest weight score of 0.40. The envelope amplitude of the vHPC was a close second, with a feature weight of 0.36, while the other two features had a combined score of 0.24 (Fig. 6B). Lastly, the GNB model does not have a built-in metric for feature importance; we therefore calculated a feature separation score based on the distribution of each feature in the trained GNB models (See methods – feature contributions). This analysis indicated that most features had comparable scores, although the line length of the vHPC had a marginally higher score (Fig. 6C). Thus, the GNB model appears to have the most balanced feature contributions to its predictions. Overall, this analysis reveals that the line length of the vHPC is a key contributor to seizure detection in this dataset, whereas feature contributions varied across models. Intriguingly, the models with more balanced feature contributions also had superior performance.
Finally, to test the validity of these models on multi-channel datasets from human EEG recordings, we utilized the Children's Hospital Boston-MIT (CHB-MIT) dataset (Shoeb & Guttag, 2010), as it has been extensively used for benchmarking ML models (Siddiqui et al., 2020). This dataset consists of 24 recordings from 23 human subjects, where each recording has 23–26 channels. We selected 18 channels that were common across subjects, divided the data into 5-second bins, and extracted features (See methods: Human Data and Human Feature Selection).
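Selecting the shared channels and binning the multi-channel data into 5-second windows might look like this (the channel lists are small illustrative examples of CHB-MIT bipolar montage labels, not the actual per-recording lists; CHB-MIT is sampled at 256 Hz):

```python
import numpy as np

# Hypothetical channel lists for three recordings; keep only the
# channels present in every recording
recordings = [
    ["FP1-F7", "F7-T7", "T7-P7", "P7-O1"],
    ["FP1-F7", "F7-T7", "T7-P7"],
    ["FP1-F7", "F7-T7", "T7-P7", "FZ-CZ"],
]
common = sorted(set.intersection(*map(set, recordings)))

# Bin a multi-channel signal (channels x samples) into 5-second windows
fs = 256                                   # CHB-MIT sampling rate (Hz)
data = np.zeros((len(common), fs * 30))    # 30 s of placeholder data
win = 5 * fs
bins = data.reshape(len(common), -1, win)  # channels x bins x samples
```

Features are then extracted per channel and per bin, giving one feature vector per 5-second segment.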
We first used 8-fold cross-validation to assess the performance of models that were trained and tested on the same patient (intra-subject classification). All models achieved similar recall scores of at least 0.86 (DT: 0.86 ± 0.011, PAC: 0.88 ± 0.010, SGD: 0.87 ± 0.009), with the GNB model reaching a slightly lower score of 0.79 ± 0.014 (Fig. 7A). Interestingly, the SGD model achieved a much better precision score than all other models (SGD: 0.17 ± 0.012 vs DT: 0.09 ± 0.006, PAC: 0.08 ± 0.008, GNB: 0.09 ± 0.006) (Fig. 7B). Indeed, across scores the SGD model achieved the best performance, whereas the GNB and PAC models performed worst for intra-subject classification (Supplementary Fig. 3). As expected, we observed some variability in the classification scores between subjects (Fig. 7C).
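An 8-fold intra-subject evaluation can be sketched with scikit-learn's cross-validation utilities (synthetic single-subject data and the GNB model serve as stand-ins here):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 4))              # one subject's feature segments
y = (X[:, 0] > 1.0).astype(int)            # placeholder seizure labels

# 8-fold cross-validation within a single subject's recording;
# stratified splitting keeps seizure segments in every fold
scores = cross_val_score(GaussianNB(), X, y, cv=8, scoring="recall")
mean, sem = scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))
```

Reporting the mean and standard error across folds yields per-subject scores of the form "recall ± SEM" quoted above.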
To further examine how model predictions generalize between subjects, we trained models on all subjects but one and used the excluded subject for testing (leave-one-out classification). We found that model performance varied substantially across subjects (Fig. 7E), and that subjects fell into two groups when clustered by recall and precision (Fig. 7F). Moreover, the seizure features of low-scoring subjects (cluster 2 in Fig. 7F) were similar, as indicated by a PCA plot (Supplementary Fig. 4B). These findings indicate that subjects in the low-score group did not exhibit robust alterations of EEG waveforms during seizures, as also illustrated by example seizure traces from subjects of the two groups (Supplementary Fig. 4C).
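Leave-one-subject-out evaluation maps directly onto scikit-learn's `LeaveOneGroupOut` splitter (synthetic subjects and the GNB model are used as illustrative stand-ins):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 1.0).astype(int)            # placeholder seizure labels
subjects = np.repeat(np.arange(6), 100)    # 6 subjects, 100 segments each

# Train on all subjects except one; test on the held-out subject
scores = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    held_out = subjects[test_idx][0]
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    scores[held_out] = recall_score(y[test_idx], model.predict(X[test_idx]))
```

Clustering the resulting per-subject recall/precision pairs is what separates the high- and low-scoring groups described above.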
Finally, to test how well the models can detect whole seizure events rather than seizure segments, models were trained on 75% of the subjects and tested on the remaining 25% (inter-subject classification; See methods: Human Model Training). Additionally, we trained models in which we excluded subjects with low, medium, or high scores (average of recall and precision; N = 8 excluded subjects per group) and examined their performance. We found that exclusion of subjects with high or medium scores had little effect on seizure detection compared to models trained on the full dataset (Fig. 7G). However, exclusion of subjects with low scores dramatically increased seizure detection to above 98% for all models (DT: 98.86 ± 0.613%, GNB: 98.68 ± 0.687%, SGD: 98.68 ± 0.687%) except the PAC model, which reached a detection rate of only 92.30 ± 5.076% (Fig. 7G). Indeed, the PAC model had the lowest F1 score and the highest false detection rate, whereas the GNB model had the highest F1 score and the lowest false detection rate, with the DT and SGD models falling in between (Fig. 7H-J; Supplementary Fig. 5). Overall, these data suggest that interpretable ML models can reliably detect electrographic seizures from multi-channel human EEG recordings with high sensitivity.
Here we observed that interpretable ML models with simple feature extraction were very effective at detecting seizures in a well-established model of chronic epilepsy in mice (Basu et al., 2022). To couple the high model sensitivity with enhanced accuracy, we created an open-source application for semi-automated seizure detection, SeizyML, which combines model predictions with manual curation of the detected seizure events. The pipeline is outlined in Fig. 8. Before the app can be used, the raw LFP/EEG data must be downsampled (100 Hz, 5-second windows) and converted from their native format (which depends on the recording apparatus) to HDF5. A small training dataset also needs to be prepared to train and calibrate the model. Then, using the command-line interface of SeizyML, the data are preprocessed, features are extracted, and model predictions are generated. A simple GUI then allows the user to accept or reject the detected seizures. Lastly, seizure properties can be extracted from the detected seizures using the SeizyML CLI. Importantly, the app can easily be extended to use any machine learning (ML) model, number of channels, or set of features. However, care should be taken, since some ML models do not scale well to large datasets, especially with a large number of features (including the decision tree models used here).
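The preprocessing step described above (downsampling to 100 Hz and cutting into 5-second windows) can be sketched as follows; the native 1000 Hz sampling rate is an assumption for illustration, and a real pipeline would low-pass filter before decimating (e.g. with `scipy.signal.decimate`) to avoid aliasing:

```python
import numpy as np

fs_raw, fs_target = 1000, 100              # native and target sampling rates
rng = np.random.default_rng(0)
raw = rng.normal(size=fs_raw * 60)         # 1 min of raw LFP (placeholder)

# Naive decimation by simple subsampling (no anti-alias filter shown)
downsampled = raw[:: fs_raw // fs_target]

win = 5 * fs_target                        # 5-second windows at 100 Hz
n_win = downsampled.size // win
segments = downsampled[: n_win * win].reshape(n_win, win)
# `segments` (windows x samples) is the shape used for feature
# extraction; this array would then be written to an HDF5 file
```

From here, the resulting array is stored in HDF5 and handed to the SeizyML CLI for feature extraction and prediction.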