Developing Probabilistic Ensemble Machine Learning Models for Home-Based Sleep Apnea Screening using Overnight SpO2 Data at Varying Data Granularity

doi:10.21203/rs.3.rs-4358408/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


Purpose

This study aims to develop sleep apnea screening models using a large clinical sleep dataset of SpO2 data, with the goal of achieving better performance and generalizability compared to existing models.

Methods

We used SpO2 recordings from the Sleep Heart Health Study database (N = 5667). Probabilistic ensemble machine learning was employed to predict sleep apnea status at three AHI cutoff points: ≥5, ≥15, and ≥30 events/hour. To investigate the impact of data granularity, SpO2 data were resampled to 1/30, 1/60, and 1/300 Hz. Model performance was evaluated across decision boundaries ranging from 0.05 to 0.95.

Results

Our models demonstrated good to excellent performance, with AUC values of 0.82, 0.85, and 0.90 for cutoffs ≥5, ≥15, and ≥30, respectively. Both sensitivity (0.76, 0.84, 0.89) and specificity (0.87, 0.86, 0.90) ranged from good to excellent. Positive predictive values (PPV) ranged from fair to excellent (0.97, 0.83, 0.66), and negative predictive values (NPV) ranged from low to excellent (0.43, 0.87, 0.98). Both decision boundaries and data granularity had a significant impact on model performance, with optimal decision boundaries aligning with the prevalence of positive cases in the cohort. Lower data granularity resulted in decreased model performance.

Conclusion

Our models demonstrated superior performance across all three AHI cutoff thresholds compared to existing large sleep apnea screening models, even when considering varying SpO2 data granularity. Probabilistic ensemble machine learning shows promise for developing generalizable sleep apnea screening models with overnight SpO2 data.

Subject: Biomedical Engineering

Keywords: sleep apnea, SpO2, oximeter, machine learning, probabilistic learning, ensemble learning, decision boundary

1 Introduction

Sleep apnea is a prevalent sleep disorder characterized by short episodes of complete or partial blockage of the upper respiratory airway [1]. Sleep apnea is often accompanied by loud snoring, gasping during sleep, morning headaches, and excessive daytime sleepiness, although individuals with mild forms may remain asymptomatic. Scientific studies have established a correlation between sleep apnea and increased risks of cardiovascular diseases, metabolic disorders, and cognitive impairment [2]. Despite its global prevalence, many sleep apnea patients remain undiagnosed, partly due to the absence of symptoms in mild cases or the prohibitive cost of clinical diagnostic tests. The standard diagnostic method for sleep apnea involves nocturnal polysomnography (PSG) conducted in a sleep laboratory. This test monitors various physiological signals during sleep, including brain activity, breathing patterns, heart rate, chest and abdominal movements, limb movements, and blood oxygen levels. Subsequently, registered sleep technicians manually score these signals to generate a comprehensive sleep report. This report enables sleep doctors to identify any aberrations in the patient's physiological state during sleep. However, the PSG test is expensive and time-consuming, which greatly restricts its accessibility and affordability for sleep apnea diagnosis.

Current research endeavors are actively exploring alternative methods for screening sleep apnea that are more cost-effective and user-friendly. Previous studies in this domain can be broadly categorized into two groups based on whether they utilize validated psychometric questionnaires or physiological sensing techniques. The STOP-Bang questionnaire, a validated psychometric tool for obstructive sleep apnea screening, comprises eight items that assess perceivable symptoms of sleep apnea alongside demographic characteristics. Demonstrating high sensitivity and negative predictive value (NPV) across various apnea-hypopnea index (AHI) cutoffs, the STOP-Bang questionnaire offers a reliable screening approach [3]. However, STOP-Bang scores cannot be calculated for asymptomatic patients. To address this limitation, research has explored apnea screening methods grounded in physiological sensing. Numerous studies have leveraged a subset of signals derived from PSG, such as electroencephalogram (EEG), electrocardiogram (ECG), airflow, and blood oxygen saturation levels (SpO2), for automated sleep apnea screening. Performance outcomes vary based on signal modality and computational models utilized [4–13]. Yet, most sensing modalities, including EEG, ECG, and airflow, are not readily available for home use, hindering widespread adoption for at-home apnea screening. Moreover, incorporating multiple sensing modalities further exacerbates feasibility challenges for at-home screening protocols.

This study aims to develop and evaluate a computational approach for screening sleep apnea using only overnight SpO2 signals, which can be conveniently acquired with home-use sensors. The prevalence of portable and wearable oximeters has surged, particularly during the pandemic. Many consumer-grade smartwatches and activity trackers now feature built-in photoplethysmography (PPG) sensors capable of continuously measuring SpO2 levels throughout the day [14, 15]. These devices have increasingly contributed to promoting sleep health among the general population due to their enhanced accuracy and user-friendly nature [14, 16]. Such advancements present an opportunity to develop innovative sleep apnea screening methods that are accessible, cost-effective, and easy to use. While previous studies have explored sleep apnea screening models based on SpO2 signals [4, 17, 18], most were evaluated on small datasets comprising only dozens to hundreds of subjects [4]. Consequently, the reported model performance may be inflated, and the generalizability of the models is uncertain. Furthermore, the granularity of the SpO2 data can vary significantly across devices. Clinical pulse oximeters typically provide high-frequency data sampled at 1 Hz, whereas consumer wearables may offer more sparse data. For instance, Fitbit devices only allow users to retrieve the SpO2 data at 1-minute intervals. However, the impact of data granularity on the screening accuracy of previous models remains unexplored.

In this study, we employ a probabilistic ensemble machine learning approach that combines multiple base machine learning classifiers to predict the probability of sleep apnea status through soft voting. Each base classifier predicts the probability of sleep apnea, and the final prediction is obtained by averaging these probabilities. An individual is classified as apnea-positive if the averaged probability exceeds a predefined decision boundary. We develop screening models for three AHI cutoff points (\(\ge\)5, \(\ge\)15, and \(\ge\)30) using one of the largest clinical sleep datasets to reduce the risk of overfitting. We evaluate and validate the models using multiple performance measures and with statistical rigor. Our evaluation seeks to answer two key research questions: (1) what decision boundaries optimize the performance of the probabilistic ensemble models at each AHI cutoff? and (2) how does the granularity of SpO2 data impact model performance?

To our knowledge, this study represents the first comprehensive investigation into the influence of decision boundaries and data granularity on machine-learning-based sleep apnea screening. We discuss how the decision boundary incorporates the pretest sleep apnea prevalence into model tuning, thereby improving the models’ clinical relevance as well as their transferability across diverse populations. Our findings offer novel insights for developing large sleep apnea screening models with enhanced generalizability.

2 Methods

2.1 Database

The Sleep Heart Health Study (SHHS) database served as the primary data source for training, validating, and testing our models [19]. This dataset comprised subjects aged 40 years and above with no prior history of sleep apnea treatment. SpO2 signals were collected from 5667 subjects using a Nonin XPOD 3011 with an 8000J sensor attached to a finger. Data collection took place unattended in the subjects’ homes, at a sampling rate of 1 Hz. Recordings shorter than 4 hours were excluded from the database. SpO2 data, along with demographic information and labelled ground truth, were obtained from the SHHS database following approval by the National Sleep Research Resource (NSRR) [20]. Preprocessing of the SpO2 signals involved removing zero readings, values below 50% or above 100%, and sudden changes exceeding 4% between consecutive readings [21]. To explore the impact of data granularity, SpO2 signals were resampled to 1/30, 1/60, and 1/300 Hz to mimic the lower granularity of data typically obtained from consumer wearable devices. Features were derived from both the resampled SpO2 signals and the original 1 Hz signals. Demographic information, including subjects' age, sex, and BMI, was also used as features. Ground truth labels were based on the harmonized variable "nsrr_ahi_ph3r_aasm15", which adheres to the recommended annotation rules for sleep-related breathing events in the latest sleep scoring guideline [22]. Specifically, the AHI was calculated as the number of apnea and hypopnea events per hour of sleep, with criteria including more than 30% nasal airflow reduction and more than 3% oxygen desaturation, with or without arousal.
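The cleaning and resampling steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `clean_spo2` and `resample` are hypothetical, a "sudden change" is assumed to mean a difference of more than 4 percentage points from the immediately preceding in-range reading, and resampling is assumed to be a block average over consecutive windows (the paper does not state which resampling method was used).

```python
from statistics import mean


def clean_spo2(samples, low=50.0, high=100.0, max_jump=4.0):
    """Mark invalid 1 Hz SpO2 readings as None.

    A reading is dropped if it is zero, outside [low, high], or differs
    from the preceding in-range reading by more than max_jump (assumption).
    """
    cleaned = []
    prev = None  # preceding raw reading, only if it was in range
    for value in samples:
        in_range = value != 0 and low <= value <= high
        jump = in_range and prev is not None and abs(value - prev) > max_jump
        cleaned.append(value if in_range and not jump else None)
        prev = value if in_range else None
    return cleaned


def resample(cleaned, period_s):
    """Average the valid readings in consecutive windows of period_s seconds.

    period_s = 30, 60 or 300 corresponds to 1/30, 1/60 or 1/300 Hz.
    A window with no valid reading yields None.
    """
    resampled = []
    for start in range(0, len(cleaned), period_s):
        window = [v for v in cleaned[start:start + period_s] if v is not None]
        resampled.append(mean(window) if window else None)
    return resampled


raw = [96, 96, 0, 95, 90, 90, 101, 94]
print(resample(clean_spo2(raw), 4))
```

Block averaging was chosen here because consumer devices typically report aggregated values; simple decimation (keeping every n-th sample) would be an equally plausible reading of the text.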

We applied three commonly used AHI cutoff points and developed apnea screening models corresponding to each cutoff. Table 1 presents the demographic and sleep-related characteristics of the subjects at each cutoff. Apnea prevalence is high (82.8%) at cutoff AHI\(\ge\)5, primarily due to the intentional oversampling of snorers in the SHHS database, whereas the prevalence of severe sleep apnea as defined by AHI\(\ge\)30 is low (17.4%). As depicted in Table 1, the dataset exhibits class imbalance for AHI cutoffs 5 and 30, with a higher proportion of positive samples for the former and more negative samples for the latter. Conversely, the dataset is more balanced for AHI cutoff 15.

Table 1

Demographics of Subjects Screened as Positive/Negative at Three AHI Cutoff Points.
	AHI\(\ge\)5		AHI\(\ge\)15		AHI\(\ge\)30
	Positive	Negative	Positive	Negative	Positive	Negative
No. subjects	4693	974	2525	3142	986	4681
No. males (%)	2460 (52.4%)	241 (24.7%)	1596 (63.2%)	1105 (35.2%)	668 (67.7%)	2033 (42.3%)
Age	64.2 (10.9)	59.0 (11.3)	65.6 (10.6)	61.5 (11.3)	66.0 (10.4)	62.8 (11.2)
BMI (kg/m2)	28.6 (5.1)	25.9 (4.2)	29.6 (5.3)	27.0 (4.6)	30.6 (5.6)	27.7 (4.8)
BPs (mmHg)	128.2 (19.2)	123.4 (19.5)	130.0 (19.2)	125.3 (19.2)	130.6 (18.9)	126.6 (19.3)
BPd (mmHg)	73.8 (11.8)	73.0 (10.6)	74.4 (12.2)	73.1 (11.1)	75.3 (12.7)	73.3 (11.3)
AHI (events/h)	21.1 (16.0)	2.9 (1.3)	31.0 (16.1)	7.6 (4.0)	46.7 (15.8)	12.1 (7.6)

2.1 Feature Engineering

A set of 67 features was constructed for model development, encompassing 63 features extracted from the cleaned SpO2 signals and 4 demographic features (age, male, female, and BMI). The variable ‘sex’ was transformed into two binary features, male and female, via one-hot encoding. Numerical features were normalized using min-max scaling, which makes no assumption about the underlying distribution. Because many machine learning algorithms are sensitive to feature interdependencies, we applied independent component analysis (ICA) [23], which transforms the original feature set into a new set with minimal statistical dependence among the transformed features. The new feature set retained the same dimensionality as the original set, as the primary objective of ICA here was to mitigate inter-feature dependencies rather than to reduce dimensionality.
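As a rough illustration of the encoding and scaling steps, the sketch below implements one-hot encoding of sex and min-max scaling in plain Python. The helper names `one_hot_sex` and `min_max_scale` and the example values are hypothetical, and the ICA step is not shown because the paper does not say which implementation was used.

```python
def one_hot_sex(sex_values):
    """Encode 'male'/'female' as two binary features (male, female)."""
    return [(int(s == "male"), int(s == "female")) for s in sex_values]


def min_max_scale(column):
    """Rescale a numeric column to [0, 1] using its own minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


ages = [40, 60, 80]                 # illustrative values only
sexes = ["male", "female", "male"]  # illustrative values only
print(min_max_scale(ages))   # [0.0, 0.5, 1.0]
print(one_hot_sex(sexes))    # [(1, 0), (0, 1), (1, 0)]
```

In a train/test workflow the minimum and maximum would normally be taken from the training set only and then reused for the test set, so that no test information leaks into the scaling.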

2.2 Development of Probabilistic Ensemble Models for Apnea Screening

For each AHI cutoff point, we constructed a probabilistic ensemble model comprising three base classifiers: support vector machine (SVM), logistic regression (LR), and light gradient boosting machine (LGBM). These machine learning algorithms have exhibited promising performance in sleep apnea screening in previous studies [9]. SVM operates by mapping a low-dimensional input feature space to a higher-dimensional domain, where it identifies a hyperplane positioned as far as possible from the marginal samples of each class (i.e., the support vectors). LR, on the other hand, employs a logistic function to map input features to the probability of one of the two output classes. LGBM is an ensemble algorithm based on decision trees and is known for its efficiency and effectiveness. It builds the model sequentially, adding weak classifiers that correct the errors of preceding iterations to yield an increasingly robust classifier.

The model development process employed a nested cross-validation approach. Eighty percent of the dataset was allocated to a training set, with the remaining 20% designated as a test set. Hyperparameter tuning for SVM and LR involved iterating through parameter combinations via a 5-fold cross-validation grid search. The average score across the five folds determined the performance of each parameter combination. For LGBM models, hyperparameter tuning proceeded incrementally. Initially, grid search with 5-fold cross-validation focused on parameters with a significant impact on model performance, such as learning rate and maximum depth [24]. Subsequently, parameters including minimum child weight, subsample, and colsample_bytree were fine-tuned to control overfitting. AUC served as the evaluation metric during grid search. Following hyperparameter tuning, the best combination of parameters was identified, and a model was fitted on the entire training set.
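The grid search with 5-fold cross-validation described above follows a simple pattern, sketched here in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: `score_fn` is a hypothetical callback standing in for "fit a classifier on the training folds and return the validation AUC", the folds are contiguous and unshuffled for brevity, and the parameter names in the example are placeholders.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search(param_grid, score_fn, n_samples, k=5):
    """Return (best_params, best_mean_score) over every combination in param_grid.

    score_fn(params, train_idx, val_idx) is a hypothetical callback that fits a
    model on train_idx and returns a validation score (e.g. AUC) on val_idx.
    """
    folds = k_fold_indices(n_samples, k)
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i, val_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(score_fn(params, train_idx, val_idx))
        avg = mean(scores)  # average over the k folds
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score


# Toy scoring function that peaks at C = 1.0 and gamma = 0.1 (placeholder names).
def toy_score(params, train_idx, val_idx):
    return -abs(params["C"] - 1.0) - abs(params["gamma"] - 0.1)


best, score = grid_search({"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}, toy_score, 100)
print(best)
```

The incremental tuning described for LGBM would correspond to calling such a search twice: first over the high-impact parameters, then over the regularization parameters with the first result held fixed.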

The fine-tuned base classifiers were calibrated using logistic regression in conjunction with 5-fold cross-validation to obtain unbiased base classifiers. These calibrated base classifiers were subsequently combined via soft voting, with equal weight assigned to each one. The ensemble model output the probabilities of a subject being sleep apnea positive and negative at the corresponding AHI cutoff. A decision boundary was set to convert these probabilities into dichotomous labels.
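Soft voting followed by thresholding can be illustrated with a short sketch. The function names `soft_vote` and `classify` and the probability values are hypothetical; the only things taken from the text are the equal weights, the averaging of calibrated probabilities, and the rule that a subject is positive when the averaged probability exceeds the decision boundary.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of positive-class probabilities, one list per classifier."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # equal vote weight
    n_subjects = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists))
        for i in range(n_subjects)
    ]


def classify(probabilities, decision_boundary):
    """Label a subject positive (1) if the probability exceeds the boundary."""
    return [int(p > decision_boundary) for p in probabilities]


# Illustrative calibrated probabilities for two subjects from three classifiers.
svm_probs = [0.9, 0.2]
lr_probs = [0.8, 0.4]
lgbm_probs = [0.7, 0.3]

ensemble = soft_vote([svm_probs, lr_probs, lgbm_probs])
print(classify(ensemble, 0.45))
print(classify(ensemble, 0.85))
```

The same ensemble probabilities yield different labels at different boundaries, which is why the boundary is treated as a tunable quantity in the evaluation.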

The entire process, including data splitting, hyperparameter tuning, model calibration, and soft-voting ensemble, was repeated 50 times with different random seed values. This approach was adopted to ensure a robust evaluation of the model from a statistical perspective.
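The repetition scheme can be sketched as a loop over seeds that re-splits the data and aggregates the resulting scores. The names `split_indices`, `repeated_evaluation`, and `run_once` are hypothetical; `run_once` stands in for the whole tune, calibrate, ensemble, and score pipeline, and the split shown is a plain random 80/20 split because the paper does not say whether stratification was used.

```python
import random
from statistics import mean, stdev


def split_indices(n_samples, test_fraction, seed):
    """Randomly split sample indices into (train, test) for a given seed."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_evaluation(n_samples, run_once, n_repeats=50, test_fraction=0.2):
    """Run the pipeline once per seed and return (mean, standard deviation).

    run_once(train_idx, test_idx, seed) is a hypothetical callback that returns
    one performance score (e.g. AUC on the test split).
    """
    scores = []
    for seed in range(n_repeats):
        train_idx, test_idx = split_indices(n_samples, test_fraction, seed)
        scores.append(run_once(train_idx, test_idx, seed))
    return mean(scores), stdev(scores)


train_idx, test_idx = split_indices(5667, 0.2, seed=0)
print(len(train_idx), len(test_idx))
```

Reporting the mean and spread over repetitions, as in the "value ± value" entries of the results tables, reduces the dependence of the reported performance on any single random split.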

2.3 Performance Evaluation

The performance of the apnea screening models was evaluated using multiple measures: the area under the ROC curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). For each combination of AHI cutoff and data resolution, we systematically examined model performance by adjusting the decision boundary of the probabilistic ensemble models across a range from 0.05 to 0.95. This analysis allowed a detailed examination of model behavior across decision boundaries and of the models' sensitivity to the choice of boundary. All data analysis was conducted using Python 3.10.5.
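The threshold-dependent measures can be computed from a confusion matrix at each decision boundary, as in the sketch below. The function names and the toy labels are hypothetical, and the step of 0.05 between boundaries is an assumption (the text gives only the range 0.05 to 0.95). AUC is left out of this sketch.

```python
def screening_metrics(y_true, y_prob, boundary):
    """Sensitivity, specificity, PPV and NPV at one decision boundary."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted_positive = prob > boundary
        if predicted_positive and truth == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif truth == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Boundaries 0.05, 0.10, ..., 0.95 (the 0.05 step is an assumption).
BOUNDARIES = [round(0.05 * k, 2) for k in range(1, 20)]


def sweep(y_true, y_prob, boundaries=BOUNDARIES):
    """Evaluate the metrics at every decision boundary."""
    return {b: screening_metrics(y_true, y_prob, b) for b in boundaries}


y_true = [1, 1, 1, 0, 0, 0]               # toy labels
y_prob = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1]   # toy ensemble probabilities
print(sweep(y_true, y_prob)[0.5])
```

Raising the boundary can only turn predicted positives into predicted negatives, so sensitivity can never increase and specificity can never decrease as the boundary rises.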

3 Results

The AUC of the developed models is depicted in Fig. 1. A consistent trend was observed as the decision boundary increased from 0.05 to 0.95 across all data granularities: the AUC first increased, reached a peak, and then gradually decreased. The turning point occurred when the decision boundary was approximately 0.80–0.85 for AHI cutoff \(\ge\)5, 0.45–0.50 for \(\ge\)15, and 0.15–0.20 for \(\ge\)30. Table 2 presents a summary of the optimal decision boundary, identified by the best AUC, and the other performance measures averaged across the 50 repetitions, for varying data granularity at each AHI cutoff point. Although the best model performance was achieved at a data granularity of 1 second, the differences were not significant when compared to a data granularity of 30 seconds. However, regardless of the cutoff point, model performance deteriorated further when the data granularity dropped to 60 seconds, with the poorest performance observed at a data granularity of 300 seconds. Nevertheless, even at the lowest granularity, the worst average AUC remained above 0.70.

Meanwhile, as the decision boundary increased, sensitivity and NPV decreased monotonically, while specificity and PPV increased monotonically, as illustrated in Figs. 2–5. Sensitivity remained relatively stable when the data granularity was 1 second. However, as the data granularity decreased to 30 and 60 seconds, the impact of the decision boundary on sensitivity became more pronounced, with the most significant effect observed at 300 seconds. In contrast, specificity exhibited the opposite trend, being most affected by the decision boundary at the 1-second granularity and showing decreasing influence as the data granularity decreased.

4 Discussion

In this study, we constructed probabilistic ensemble models for sleep apnea screening at three AHI cutoff points: 5, 15, and 30 events/h. The models use only overnight SpO2 signals, making them highly compatible with consumer oximeters for home use. We examined the performance of the screening models by varying the decision boundaries between 0.05 and 0.95. The results indicated significant effects of both the decision boundary and the resolution of the SpO2 data. In what follows, we discuss our results in relation to previous studies, highlight their clinical implications, and point to directions for future studies.

4.1 Comparison to Previous Studies

Numerous systems and algorithms have been developed for home sleep apnea screening. Many prior studies used small datasets of fewer than 1000 entries [4]. To date only a few studies have utilized sufficiently large samples (N > 1000) for developing large models that ensure better generalizability. Table 3 provides a comparison between our models and existing large apnea screening models. For fairness, we used our models built with data granularity 1 Hz. A recent deep learning sleep apnea model OxiNet developed on a large dataset (N = 12,923) also relies solely on single channel overnight SpO2 [25]. However, OxiNet was designed for multiclass classification, making a direct comparison with our binary models impractical. Hence the model was not included for comparison.

Table 2

Optimal decision boundaries and corresponding model performance.
AHI cutoff point	Data granularity (s)	Optimal decision boundary	AUC	Sensitivity	Specificity	PPV	NPV
\(\ge\)5	1	0.85	0.82\(\pm\)0.01	0.76\(\pm\)0.01	0.87\(\pm\)0.03	0.97\(\pm\)0.01	0.43\(\pm\)0.02
	30	0.85	0.83\(\pm\)0.01	0.76\(\pm\)0.01	0.89\(\pm\)0.02	0.97\(\pm\)0.01	0.44\(\pm\)0.02
	60	0.85	0.80\(\pm\)0.01	0.75\(\pm\)0.01	0.86\(\pm\)0.02	0.96\(\pm\)0.01	0.41\(\pm\)0.02
	300	0.85	0.74\(\pm\)0.01	0.68\(\pm\)0.02	0.80\(\pm\)0.02	0.94\(\pm\)0.01	0.34\(\pm\)0.02
\(\ge\)15	1	0.45	0.85\(\pm\)0.01	0.84\(\pm\)0.02	0.86\(\pm\)0.01	0.83\(\pm\)0.01	0.87\(\pm\)0.01
	30	0.45	0.84\(\pm\)0.01	0.84\(\pm\)0.02	0.85\(\pm\)0.01	0.81\(\pm\)0.02	0.87\(\pm\)0.01
	60	0.45	0.81\(\pm\)0.01	0.81\(\pm\)0.02	0.81\(\pm\)0.01	0.78\(\pm\)0.02	0.84\(\pm\)0.01
	300	0.45	0.74\(\pm\)0.01	0.74\(\pm\)0.02	0.74\(\pm\)0.02	0.70\(\pm\)0.02	0.78\(\pm\)0.02
\(\ge\)30	1	0.20	0.90\(\pm\)0.01	0.89\(\pm\)0.02	0.90\(\pm\)0.01	0.66\(\pm\)0.03	0.98\(\pm\)0.01
	30	0.15	0.85\(\pm\)0.01	0.87\(\pm\)0.02	0.83\(\pm\)0.01	0.52\(\pm\)0.02	0.97\(\pm\)0.01
	60	0.15	0.83\(\pm\)0.01	0.85\(\pm\)0.03	0.81\(\pm\)0.01	0.48\(\pm\)0.02	0.96\(\pm\)0.01
	300	0.15	0.76\(\pm\)0.01	0.80\(\pm\)0.03	0.71\(\pm\)0.02	0.37\(\pm\)0.02	0.95\(\pm\)0.01

Table 3

Comparison with existing large sleep apnea screening models.
AHI cutoff	Models	N	Prevalence	AUC	Sensitivity	Specificity	PPV	NPV
\(\ge\)5	This study	5,667	82.8%	0.82	0.76	0.87	0.97	0.43
	LR [9]	5,786	82.7%	0.75	0.75	0.74	0.93	0.40
	SVM [27]	6,875	82.5%	0.82	0.74	0.75	0.93	0.38
\(\ge\)15	This study	5,667	44.6%	0.85	0.84	0.86	0.83	0.87
	SVM [27]	6,875	61.3%	0.80	0.75	0.68	0.79	0.64
	ANN [26]	17,448	53.2%	0.68	0.74	0.51	0.63	0.64
\(\ge\)30	This study	5,667	17.4%	0.90	0.89	0.90	0.66	0.98
	SVM [9]	5,786	17.0%	0.82	0.82	0.83	0.50	0.96
	LR [27]	6,875	40.6%	0.81	0.80	0.62	0.72	0.72
	GBM [28]	1,656	61.5%	0.86	0.80	0.73	/	/

As shown in Table 3, the largest study to date (N = 17,448) developed an apnea screening model exclusively at an AHI cutoff point of \(\ge\)15 using an artificial neural network (ANN) [26]. This model relied only on four basic demographic features (age, sex, BMI, and race) and achieved reasonable sensitivity (0.74), PPV (0.63), and NPV (0.64) but low specificity (0.51). In another study (N = 6,875) [27], three models were developed at all three AHI cutoffs, achieving reasonably high AUCs (0.80–0.82), good sensitivity (0.74–0.80), and slightly low specificity (0.62–0.75). Similarly, another study (N = 5,786) that built large models at cutoffs of \(\ge\)5 and \(\ge\)30 achieved reasonably high AUC (0.75–0.82), good sensitivity (0.75–0.82), and good specificity (0.74–0.83) [9].

Our models, optimized at the best decision boundaries, outperform existing models by a substantial margin on nearly every performance measure. Specifically, for cutoff \(\ge\)30, our model achieved excellent AUC (0.90), sensitivity (0.89), specificity (0.90), and NPV (0.98), while PPV (0.66) was reasonably good. This underscores the advantage of utilizing overnight SpO2 signals over traditional EHR features and demonstrates the efficacy of probabilistic ensemble learning techniques for sleep apnea screening.

4.2 Impact of Decision Boundary and Data Granularity

Our study marks the first systematic exploration of the impact of decision boundary and data resolution on the performance of the probabilistic ensemble models for sleep apnea screening. We discovered a substantial impact of decision boundary on model performance, with the optimal boundary varying across the three AHI cutoff points examined. Selecting the correct decision boundary is crucial, as deviations from the optimal value (either lower or higher) result in diminished model performance. Our findings revealed that model performance reached its peak when the decision boundary was set at 0.85, 0.45, and 0.15 for AHI cutoff points \(\ge\)5, \(\ge\)15, \(\ge\)30, respectively. Interestingly, the prevalence of the positive class of the dataset at each AHI cutoff was 0.83, 0.45, and 0.17, suggesting that the optimal decision boundary aligns with the prevalence of sleep apnea within the cohort.
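The correspondence between prevalence and optimal boundary can be checked directly from the subject counts in Table 1 and the boundaries quoted in this paragraph. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Positive-subject counts per AHI cutoff (Table 1) and the cohort size.
positives = {5: 4693, 15: 2525, 30: 986}
total_subjects = 5667

# Optimal decision boundaries quoted in the text for each cutoff.
optimal_boundary = {5: 0.85, 15: 0.45, 30: 0.15}

prevalence = {cutoff: n / total_subjects for cutoff, n in positives.items()}

for cutoff in sorted(positives):
    gap = abs(optimal_boundary[cutoff] - prevalence[cutoff])
    print(f"AHI >= {cutoff}: prevalence {prevalence[cutoff]:.2f}, "
          f"boundary {optimal_boundary[cutoff]:.2f}, gap {gap:.2f}")
```

In each case the quoted boundary lies within about 0.03 of the cohort prevalence, which is the alignment the paragraph describes.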

On the other hand, the impact of data granularity exhibits a more consistent trend. While it has been assumed that lower data granularity would lead to decreased model performance, our study is the first to demonstrate this effect quantitatively. Generally, higher data granularity correlates with improved model performance. As data granularity decreased from 1 Hz to 1/300 Hz, we observed a decline in model performance across all three AHI cutoff points. Nevertheless, even models constructed with low-granularity data of 1/300 Hz still exhibit performance comparable to existing models, further highlighting the effectiveness of the probabilistic ensemble learning approach.

4.3 Clinical Implications

Our findings have significant clinical implications. The results demonstrate that our model performs exceptionally well for AHI cutoff \(\ge\)30, making it particularly suitable for screening severe sleep apnea. Given that patients with severe apnea often require timely treatment, our model holds great promise in ensuring that those in need receive timely screening and intervention.

An important contribution of this study is the integration of the pretest probability into model performance tuning. Our findings reveal that the optimal decision boundary aligns with the prevalence of positive sleep apnea cases in the cohort. According to Bayes’ theorem, the performance of a medical screening model depends not only on its sensitivity and specificity but also on the prevalence of the disease in the population, or pretest probability. However, existing models typically neglect pretest probability and carry a high risk of overfitting and low generalizability. The probabilistic ensemble learning approach employed in this study incorporates pretest probability into model performance. Probabilistic learning techniques have previously been used in the medical field to provide probabilistic predictions about the likelihood of different diseases or treatment outcomes [29]. By leveraging prior knowledge, it becomes feasible to calibrate the decision boundary of the model for use in populations with varying prevalence of sleep apnea.
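The dependence on prevalence follows from Bayes' theorem and can be made concrete with a short calculation. The function name `predictive_values` is hypothetical; the sensitivity, specificity, and prevalence for AHI \(\ge\)30 are taken from Tables 1 and 2, while the 5% prevalence in the second call is an invented value used only to show the effect of a lower pretest probability.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' theorem for a given pretest probability."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# AHI >= 30 model at 1-second granularity, cohort prevalence 17.4%.
ppv, npv = predictive_values(0.89, 0.90, 0.174)
print(f"cohort prevalence: PPV {ppv:.2f}, NPV {npv:.2f}")

# Same sensitivity and specificity in a hypothetical population with 5% prevalence.
ppv_low, npv_low = predictive_values(0.89, 0.90, 0.05)
print(f"5% prevalence:     PPV {ppv_low:.2f}, NPV {npv_low:.2f}")
```

At the cohort prevalence the computed values land close to the PPV of 0.66 and NPV of 0.98 reported in Table 2, and the PPV falls sharply when the same test is applied where severe apnea is rarer.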

Our study also underscores that model performance is compromised when sleep apnea screening models are constructed from low-granularity data. Despite the high sampling rate of the PPG sensors embedded in consumer smartwatches and wristbands, the data retrieved from these devices is often aggregated over a time span, resulting in low data granularity (e.g., 1/60 Hz). This limitation renders existing models developed with high-resolution data inapplicable and poses a significant challenge to the development of accurate and reliable sleep apnea screening models with these devices. However, our results demonstrate that our models maintain reasonable performance even when the data granularity is as low as 1/300 Hz, indicating the robustness of our approach and its compatibility with a diverse range of data resolutions.

4.4 Study Limitations

This study has a few limitations that warrant future consideration. Firstly, the sampling criteria of the SHHS study may restrict the generalizability of the dataset itself. The cohort comprises individuals residing in the US, all aged above 40 years (median 60 years), predominantly white, and includes an over-sampled proportion of snorers. Thus, the external validity of the models developed using the SHHS dataset needs to be assessed in other populations. Secondly, the SpO2 data were collected using an oximeter placed on a finger, whereas most consumer wearable devices are worn on the wrist. Although we resampled the original 1 Hz SpO2 signals to simulate the low-granularity data retrievable from consumer devices, the resulting data may not fully capture the patterns observed in consumer SpO2 data due to fundamental differences in sensor placement. Thirdly, subtypes of sleep apnea and comorbidities were not considered in this study. Future research should prioritize external validation of the models in diverse populations, use SpO2 data native to consumer wearables, and explore differentiation among subtypes of sleep apnea.

Declarations

Conflict of Interest Statement

The authors certify that there is no conflict of interest involved in this manuscript and this study.

Acknowledgement

This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI (Grant Number 21K17670). The author would like to express gratitude to the National Sleep Research Resource (NSRR) for granting access to the SHHS dataset.

Data Availability Statement

The SHHS dataset used in this study is available at https://sleepdata.org/datasets/shhs with the permission of NSRR.

References

[1] Senaratna CV et al (2017) Prevalence of obstructive sleep apnea in the general population: A systematic review. Sleep Med Rev 34:70–81
[2] Benjafield AV et al (2019) Estimation of the global prevalence and burden of obstructive sleep apnoea: a literature-based analysis. Lancet Respir Med 7(8):687–698
[3] Pivetta B, Chen L, Nagappa M et al (2021) Use and performance of the STOP-Bang questionnaire for obstructive sleep apnea screening across geographic regions: a systematic review and meta-analysis. JAMA Netw Open 4(3):e211009. https://doi.org/10.1001/jamanetworkopen.2021.1009
[4] Mendonça F, Mostafa SS, Ravelo-García AG, Morgado-Dias F, Penzel T (2018) Devices for home detection of obstructive sleep apnea: A review. Sleep Med Rev 41:149–160
[5] Rodrigues J, Pepin JL, Goeuriot L, Amer-Yahia S (2020) An extensive investigation of machine learning techniques for sleep apnea screening. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland
[6] Wang S, Xuan W, Chen D et al (2023) Machine learning assisted wearable wireless device for sleep apnea syndrome diagnosis. Biosensors 13(4):483. https://doi.org/10.3390/bios13040483
[7] Li Z, Li Y, Zhao G, Zhang X et al (2021) A model for obstructive sleep apnea detection using a multi-layer feed-forward neural network based on electrocardiogram, pulse oxygen saturation, and body mass index. Sleep Breath 25(4):2065–2072. https://doi.org/10.1007/s11325-021-02302-6
[8] Wei K, Zou L, Liu G, Wang C (2023) MS-Net: Sleep apnea detection in PPG using multi-scale block and shadow module one-dimensional convolutional neural network. Comput Biol Med 155:106469. https://doi.org/10.1016/j.compbiomed.2022.106469
[9] Liang Z (2023) Novel method combining multiscale attention entropy of overnight blood oxygen level and machine learning for easy sleep apnea screening. Digital Health 2023:9. https://doi.org/10.1177/20552076231211550
[10] Xie B, Minn H (2012) Real-time sleep apnea detection by classifier combination. IEEE Trans Inf Technol Biomed 16(3):469–477. https://doi.org/10.1109/TITB.2012.2188299
[11] Lin CY, Wang YW, Setiawan F et al (2021) Sleep apnea classification algorithm development using a machine-learning framework and bag-of-features derived from electrocardiogram spectrograms. J Clin Med 11(1). https://doi.org/10.3390/jcm11010192
[12] Bhattacharjee A, Saha S, Fattah SA et al (2019) Sleep apnea detection based on Rician modeling of feature variation in multiband EEG signal. IEEE J Biomed Health Inf 23(3):1066–1074. https://doi.org/10.1109/JBHI.2018.2845303
[13] Bahrami M, Forouzanfar M (2022) Sleep apnea detection from single-lead ECG: a comprehensive analysis of machine learning and deep learning algorithms. IEEE Trans Instrum Meas 71:1–11. https://doi.org/10.1109/TIM.2022.3151947
[14] Liang Z, Ploderer B (2020) How does Fitbit measure brainwaves: a qualitative study into the credibility of sleep-tracking technologies. Proc ACM Interact Mob Wearable Ubiquitous Technol 4(1):1–29. https://doi.org/10.1145/3380994
[15] Liang Z, Chapa-Martell MA (2021) A multi-level classification approach for sleep stage prediction with processed data derived from consumer wearable activity trackers. Front Digit Health 3:665946. https://doi.org/10.3389/fdgth.2021.665946
[16] Liang Z, Chapa-Martell MA (2019) Accuracy of Fitbit wristbands in measuring sleep stage transitions and the effect of user-specific factors. JMIR mhealth uhealth 7(6):e13384. https://doi.org/10.2196/13384
[17] Lin HC, Su CL, Ong JH et al (2020) Pulse oximetry monitor feasible for early screening of obstructive sleep apnea (OSA). J Med Biol Eng 40:62–70. https://doi.org/10.1007/s40846-019-00479-6
[18] Rodrigues Filho JC, Neves DD, Velasque L et al (2020) Diagnostic performance of nocturnal oximetry in the detection of obstructive sleep apnea syndrome: a Brazilian study. Sleep Breath Physiol Disorders 24:1487–1494. https://doi.org/10.1007/s11325-019-02000-4
[19] Quan SF, Howard BV, Iber C et al (1997) The Sleep Heart Health Study: design, rationale, and methods. Sleep 20(12):1077–1085
[20] Zhang GQ, Cui L, Mueller R et al (2018) The National Sleep Research Resource: towards a sleep data commons. J Am Med Inf Assoc 25(10):1351–1358. https://doi.org/10.1093/jamia/ocy064
[21] Bernardini A, Brunello A, Gigli GL et al (2022) OSASUD: A dataset of stroke unit recordings for the detection of obstructive sleep apnea syndrome. Sci Data 9:177. https://doi.org/10.1038/s41597-022-01272-y
[22] Berry R, Brooks R, Gamaldo C et al (2017) The AASM manual for the scoring of sleep and associated events: rules, terminology and technical specifications. Version 2, 4 edn. American Academy of Sleep Medicine, Darien, IL
[23] Kwak N, Choi CH (2003) Feature extraction based on ICA for binary classification problems. IEEE Trans Knowl Data Eng 15(6):1374–1388. https://doi.org/10.1109/TKDE.2003.1245279
[24] van Rijn J, Hutter F (2018) Hyperparameter importance across datasets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18), 2367–2376. https://doi.org/10.1145/3219819.3220058
[25] Levy J, Alvarez D, del Campo F, Behar J (2023) Deep learning for obstructive sleep apnea diagnosis based on single channel oximetry. Nat Commun 14(1):4881. https://doi.org/10.1038/s41467-023-40604-3
[26] Holfinger SJ, Lyons MM, Keenan BT et al (2022) Diagnostic performance of machine learning-derived OSA prediction tools in large clinical and community-based samples. Chest 161(3):807–817. https://doi.org/10.1016/j.chest.2021.10.023
[27] Huang WC, Lee PL, Liu YT, Chiang AA, Lai F (2020) Support vector machine prediction of obstructive sleep apnea in a large-scale Chinese clinical sample. Sleep 43(7):zsz295. https://doi.org/10.1093/sleep/zsz295
[28] Shi Y, Zhang Y, Cao Z et al (2023) Application and interpretation of machine learning models in predicting the risk of severe obstructive sleep apnea in adults. BMC Med Inf Decis Mak 23:1–15. https://doi.org/10.1186/s12911-023-02331-z
[29] Banerjee I, Gensheimer MF, Wood DJ et al (2018) Probabilistic prognostic estimates of survival in metastatic cancer patients (PPES-Met) utilizing free-text clinical narratives. Sci Rep 8(1):10037. https://doi.org/10.1038/s41598-018-27946-5

Additional Declarations

The authors declare no competing interests.
