Dataset
In this research, we used a publicly available dataset consisting of consumer wrist-worn wearable and medical-grade polysomnography (PSG) measurements (18). Each subject was asked to wear an Apple Watch to capture daily activity data for one week. This one-week session was followed by a one-night sleep observation in the laboratory, during which wristband data collection, including acceleration and heart rate, continued. In total, 31 subjects were confirmed to have good-quality data after applying several inclusion and exclusion criteria, such as issues in data transmission and the presence of certain sleep disorders.
In this study, we used the processed features provided by the previous study: a motion count derived from the acceleration data, an HR measurement from the Apple Watch, and a circadian clock estimate calculated from the one-week ambulatory data. The motion count was obtained from fluctuations in the raw acceleration data, which can be interpreted as motion. HR was processed by standardizing each reading against the subject's own average heart rate, in units of its standard deviation. This approach removes individual heart rate bias, because each person has a unique heart rate pattern depending on age, gender, and other physical characteristics. All features were aggregated to match the sleep epochs (30 s) of the PSG data. Each sleep epoch was labeled 0 for the wake stage, 1-4 for non-Rapid Eye Movement (non-REM) sleep, and 5 for REM sleep. In total, 24,815 sleep epochs were included in this study.
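The per-subject HR standardization and 30-s epoch aggregation described above can be sketched as follows. This is a minimal illustration with hypothetical helper names; the original preprocessing code from the previous study is not reproduced here.

```python
import numpy as np

def normalize_hr(hr_samples):
    """Remove individual heart-rate bias by standardizing a subject's
    HR readings against that subject's own mean and standard deviation
    (z-score), one interpretation of the HR feature described above."""
    hr = np.asarray(hr_samples, dtype=float)
    return (hr - hr.mean()) / hr.std()

def aggregate_to_epochs(values, samples_per_epoch):
    """Aggregate a raw signal into fixed-length (30 s) PSG epochs by
    averaging; trailing samples that do not fill an epoch are dropped."""
    values = np.asarray(values, dtype=float)
    n_epochs = len(values) // samples_per_epoch
    trimmed = values[:n_epochs * samples_per_epoch]
    return trimmed.reshape(n_epochs, samples_per_epoch).mean(axis=1)
```

After normalization, each subject's HR series has zero mean regardless of their baseline heart rate, so the model sees deviations rather than absolute values.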
If we formulate the problem as binary classification, sleep stage classification can be considered a form of outlier detection, or anomaly detection, because of the imbalanced class proportions: around 90% of the sleep epochs are categorized as the sleep class (non-REM and REM). This extreme discrepancy between the minority (wake) and majority (sleep) classes is shown in Fig. 3. Ignoring this imbalance may limit the model's performance.
Classification Model
We employed two different types of ML methods. The first is a group of machine learning methods commonly used for tabular data, often called conventional machine learning; the other is a series of NN-based methods with relatively complex architectures. In total, five different supervised classification methods were compared with the best model from the previous study (18). That best model, a Multilayer Perceptron (MLP), is also considered conventional machine learning, even though it is based on a basic neural network technique. Support Vector Machine (SVM) (26), Random Forest (RF) (18), and Extreme Gradient Boosting Tree (XGB) (27) were among the existing methods selected for this study. These three methods offer a non-linear approach to mapping the input data to the desired output.
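Fitting the conventional models on the same feature set might look like the sketch below, using synthetic stand-in data for the three features (motion count, HR, circadian clock). The SVM and RF classes come from scikit-learn; `XGBClassifier` from the xgboost package would slot into the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the three-feature, imbalanced input data.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

# XGBClassifier from the xgboost package plugs into the same dict.
models = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)                  # fit on the (toy) training data
    scores[name] = model.score(X, y)
```

A shared loop like this makes it straightforward to compare all five classifiers under identical data splits and metrics.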
On the other hand, NN-based models can be differentiated by their hidden layer types. The first model was developed by stacking multiple DNN layers to perform a non-linear operation on the data, even though a single neuron in each layer merely performs a simple linear regression. Each layer was complemented by an activation function that selects which information is passed from one neuron to the next. We used the Rectified Linear Unit (ReLU) as the activation function in all dense layers except the output layer (28). This last layer, which consists of 2 neurons representing the number of classes (sleep and wake), was complemented by a Softmax function to generate the probability that a sample belongs to each class. While ReLU outputs a number between zero and infinity, as shown in Fig. 4, Softmax outputs a number within the range of 0 and 1 using the formula in Equation 1. In addition to this DNN model, we also developed a model with an LSTM layer to account for the time-series characteristics of the data.
$$\sigma {\left(\overrightarrow{z}\right)}_{i}= \frac{e^{z_{i}}}{\sum_{j=1}^{K}e^{z_{j}}} \tag{1}$$
Where:
\(\sigma\) = the softmax function
\(\overrightarrow{z}\) = the input vector
\({e}^{z_{i}}\) = the standard exponential function applied to the i-th element of the input vector
K = the number of classes in the multi-class classifier, 2 for binary classification
\({e}^{z_{j}}\) = the standard exponential function applied to the j-th element of the input vector, summed over all K classes in the denominator
Prior to training the models, the entire dataset was split into training, validation, and testing subsets with proportions of 60%, 20%, and 20%, respectively. Each model was trained on the training set and validated on the validation set. Finally, the trained model was evaluated on the test subset to measure its performance on new data that had not been seen during training. To preserve the data order within each sample, the split was done manually, based on the sample id. This prevents the data from being shuffled, which would break the temporal information within the data.
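A subject-wise split along these lines can be sketched as follows. The helper name is hypothetical (the exact split procedure is not published), and because whole subjects are assigned to one subset, the 60/20/20 proportions are approximate rather than exact.

```python
import numpy as np

def split_by_subject(subject_ids, train=0.6, val=0.2, seed=0):
    """Assign whole subjects to train/val/test so that each subject's
    epochs stay contiguous and their temporal order is never shuffled."""
    ids = np.asarray(subject_ids)
    unique = np.unique(ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)  # shuffle subjects, not epochs
    n_train = int(round(train * len(unique)))
    n_val = int(round(val * len(unique)))
    train_ids = set(unique[:n_train])
    val_ids = set(unique[n_train:n_train + n_val])
    return {
        "train": np.isin(ids, list(train_ids)),
        "val": np.isin(ids, list(val_ids)),
        "test": ~np.isin(ids, list(train_ids | val_ids)),
    }
```

Each boolean mask selects complete, ordered epoch sequences for its subset, which is what the LSTM model needs.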
Hyperparameter tuning was done for each model to boost its performance; this tuning was applied only on the training subset. Each model has specific parameters that need to be tuned. RF and XGB have similar tunable parameters, since both methods use decision trees as the main technique for ensemble learning.
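For the tree-based models, a grid search over their shared parameters might look like the sketch below. The grid values and synthetic data are illustrative, not the study's actual search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training subset only.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Tree parameters shared by RF and XGB (illustrative values).
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="balanced_accuracy")
search.fit(X, y)           # tuning touches the training subset only
best_params = search.best_params_
```

Scoring the search on balanced accuracy rather than plain accuracy keeps the tuning itself aligned with the imbalanced-data objective discussed below.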
Handling Imbalance Data
The main challenge in this classification task was the extreme imbalance between the wake and sleep classes: the wake class accounts for less than 10% of the whole dataset. A summary visualization of the stage proportion in each sample can be seen in Fig. 3. Such a class distribution is normal in certain topics, such as anomaly detection. The imbalance between these two groups causes a typical model to ignore the minority group and treat it as noise.
Consequently, the model's accuracy can look spectacular while showing a clear disparity between specificity and sensitivity. Specificity, in this case, is the proportion of correct wake predictions, while sensitivity is the proportion of correct sleep predictions, as illustrated in Table 3. Based on this problem formulation, the main objective of this study was to increase specificity while maintaining the sensitivity score. We applied two strategies for handling the imbalanced data: adding weights for each class and applying an under-sampling approach to the training data.
Table 3
Confusion matrix for binary sleep/wake classification, with sleep as the positive class.

|       | Predicted Wake      | Predicted Sleep     |
| Wake  | True Negative (TN)  | False Positive (FP) |
| Sleep | False Negative (FN) | True Positive (TP)  |
In the first approach, the basic intuition was to limit the loss contribution of the majority class while boosting that of the minority class, so that the model predicts the minority group more often. We applied different class weights for each model, as shown in Table 1 and Table 2. Complementing the class-weight approach, we also applied a sample-based approach aimed at balancing the amount of data between the two classes. To achieve this, we applied two strategies: reducing the amount of sleep-class data, and adding synthetic wake data based on the existing data distribution. Both strategies aim to balance the proportion of the two classes so that the model does not focus only on the majority class. The sample-based approach was not applied to the RNN model, since it contradicts that model's objective of exploiting the temporal characteristics of the data. In the under-sampling approach, we removed 50% of the majority class, while in the other strategy we added augmented data to the minority class, up to 50% of the total data in the majority class, using an implementation of the Synthetic Minority Over-Sampling Technique (SMOTE) method (29).
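The two imbalance-handling ideas can be sketched as follows. The under-sampling is implemented directly with hypothetical helper names; the SMOTE oversampling step would instead come from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`), which is not shown here.

```python
import numpy as np

def undersample_majority(X, y, majority=1, frac=0.5, seed=0):
    """Randomly drop a fraction of majority-class (sleep) epochs,
    keeping all minority (wake) epochs."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    keep = rng.choice(maj_idx, size=int(len(maj_idx) * (1 - frac)),
                      replace=False)
    idx = np.sort(np.concatenate([min_idx, keep]))
    return X[idx], y[idx]

def balanced_weights(y):
    """Class weights inversely proportional to class frequency, suitable
    for the class_weight arguments of scikit-learn and Keras models."""
    classes, counts = np.unique(y, return_counts=True)
    return {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
```

With these weights, each misclassified wake epoch contributes more to the loss than a misclassified sleep epoch, which is the booster effect described above.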
Data Evaluation
To measure the performance of each proposed model, we calculated four scores: accuracy, specificity, sensitivity, and balanced accuracy. These scores are based on the numbers of correct and incorrect predictions for each class in the confusion matrix, as shown in Table 3, where the formulas for obtaining these scores are:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Balanced accuracy = (Sensitivity + Specificity) / 2
In typical binary classification with balanced data, accuracy is the main performance score. However, with imbalanced data, this score alone cannot capture the overall performance of the model on both classes. As an illustration, using the dataset in this study, the number of epochs is 25,481 overall, with 2,152 wake epochs and 23,329 sleep epochs. If a model predicts every epoch as the sleep class, it achieves an accuracy of 91.55% (similar to the accuracy of the best model in the previous study), yet its specificity is zero, indicating that it ignores the minority class. Therefore, we focused on improving the specificity score compared with the previous model, while keeping the sensitivity at least as high as that of the previous best model. These two scores can be summarized into a single score, the balanced accuracy, which has been shown to be an effective metric for evaluating ML models on imbalanced data (30, 31).
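The all-sleep baseline above can be reproduced from standard confusion-matrix formulas, with sleep as the positive class:

```python
def binary_scores(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and balanced accuracy
    from confusion-matrix counts (sleep = positive class)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)       # proportion of correct sleep calls
    specificity = tn / (tn + fp)       # proportion of correct wake calls
    balanced = (sensitivity + specificity) / 2
    return accuracy, sensitivity, specificity, balanced

# Predict-everything-as-sleep baseline: 23,329 sleep vs 2,152 wake epochs.
acc, sens, spec, bal = binary_scores(tp=23329, tn=0, fp=2152, fn=0)
```

Despite 91.55% accuracy, the baseline's balanced accuracy is only 0.5, which is why balanced accuracy is the headline metric here.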
All model training was done in Python, using the SKLearn library for RF and SVM, the XGBoost library for XGB, and the Keras library for the NN-based models. Hyperparameter tuning used the grid search function from the SKLearn library. All plots were generated with the Matplotlib and Seaborn libraries. The computational operations were performed on a Linux-based personal computer with an 8-core Intel Core i5 CPU and a GeForce RTX 2060 GPU.