For the initial evaluation of the separability of the no-pain and pain classes, the data is processed in MATLAB R2016b through the steps shown in Fig. 3.
4.1 Preprocessing
EEG signals are contaminated by various noises and artifacts, including power-line noise (50 Hz) and EOG and EMG artifacts. To retain the EEG frequency bands of interest, we applied a third-order Butterworth bandpass filter with a 0.5-70 Hz passband to the data. This filter also removes most EMG artifacts, which lie in the 10-2500 Hz range [3]. A Butterworth bandstop filter (48-52 Hz) is used to remove the 50 Hz power-line contamination. Finally, the data is downsampled to 120 Hz to prepare it for wavelet-based feature extraction.
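A minimal MATLAB sketch of this preprocessing chain is given below. The original sampling rate is not stated in this section, so fs_orig = 1000 Hz is an assumption, as is the use of zero-phase filtering via filtfilt; x stands for a single-channel EEG segment.

```matlab
% Preprocessing sketch (Signal Processing Toolbox).
% fs_orig = 1000 Hz is an assumed original sampling rate.
fs_orig = 1000;
x = randn(1, 10*fs_orig);                 % placeholder single-channel segment

% Third-order Butterworth bandpass, 0.5-70 Hz passband
[bb, ab] = butter(3, [0.5 70]/(fs_orig/2), 'bandpass');
x = filtfilt(bb, ab, x);                  % zero-phase filtering (assumption)

% Butterworth bandstop, 48-52 Hz, removes the 50 Hz line component
[bs, as] = butter(3, [48 52]/(fs_orig/2), 'stop');
x = filtfilt(bs, as, x);

% Downsample to 120 Hz; resample applies its own anti-aliasing filter
fs = 120;
x = resample(x, fs, fs_orig);
```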
4.2 Feature extraction
Many biological signals, including EEG signals, are non-stationary: their properties change over time and frequency. Time-frequency analyses are therefore very useful for extracting features from these signals, and the wavelet transform is one of the most practical time-frequency methods for EEG feature extraction [20–22]. In this study, we applied the Discrete Wavelet Transform (DWT) to the EEG signals [23]. The number of decomposition levels and the choice of mother wavelet are critical parameters of the DWT. At each decomposition level, the DWT applies a high-pass and a low-pass filter to the signal, decomposing it into different frequency bands. The outputs of the low-pass and high-pass filters provide the approximation (A) and detail (D) information of the signal, respectively. In this study, the EEG signals were decomposed into five levels using the db6 mother wavelet. Since the data were downsampled to 120 Hz in the preprocessing step, a five-level decomposition maps the coefficients of each level onto the wavelet coefficients of the EEG frequency bands. Classification was performed with different mother wavelets, and given the acceptable performance of most classifiers, db6 was selected as the mother wavelet. The coefficients of the decomposition levels were taken as the features of the EEG signals. Figure 4 shows the decomposition levels of the DWT; HP and LP are the high-pass and low-pass filters applied to the EEG signals, respectively. The coefficients A5, D5, D4, D3, D2, and D1 represent the Delta (0.5-4 Hz), Theta (4-8 Hz), Alpha (8-12 Hz), Beta (12-30 Hz), low Gamma (30-60 Hz), and high Gamma (60-120 Hz) band coefficients, respectively. In total, 293 features related to the wavelet coefficients of the six frequency bands were extracted from the data of each channel, giving 293×18 features per subject, where 18 is the number of EEG channels.
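As a sketch, the five-level db6 decomposition described above can be computed with the Wavelet Toolbox; x is the 120 Hz preprocessed channel from the previous step, and concatenating the per-band coefficients into one vector is our illustration of how the per-channel features could be assembled.

```matlab
% Five-level DWT with the db6 mother wavelet (Wavelet Toolbox).
[c, l] = wavedec(x, 5, 'db6');

% Recover the per-band coefficients: A5 plus D1..D5
a5 = appcoef(c, l, 'db6', 5);             % approximation: Delta band
d  = detcoef(c, l, 1:5);                  % cell array {D1, ..., D5}

% One feature vector per channel: A5, D5, D4, D3, D2, D1 concatenated
features = [a5(:); d{5}(:); d{4}(:); d{3}(:); d{2}(:); d{1}(:)]';
```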
4.3 Feature selection
Feature selection is the process of finding an optimal feature set of reduced dimension from a high-dimensional feature set. It eliminates irrelevant data while retaining the data valuable for classification, and thereby improves classification performance [15].
In this study, we used the T-test as the first feature selection step. Among feature selection methods, the T-test is a filter method, which selects features independently of any classifier; filter methods rely on global statistical information. The T-test assumes normally distributed data samples, and it remains efficient when the distribution is only approximately normal. Before applying the T-test, we checked the distribution of each feature using a histogram. Here, the two-sample T-test is used to measure a feature's ability to distinguish the no-pain and pain classes. The two-sample (Student's) T-test is a parametric test that examines whether the means of two data sets differ significantly [24–25]. The value of α is set to 0.05. Note that the features rejected by the T-test do not contribute to classification, whereas the selected features are not guaranteed to be useful: they may or may not improve classification. Of the 293×18 features extracted in the previous step, 209 were selected by the T-test.
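A sketch of this screening step, assuming a feature matrix X (subjects × features) and a label vector yLab (0 = no-pain, 1 = pain); both names are ours, not the authors':

```matlab
% Two-sample T-test feature screening at alpha = 0.05.
alphaVal = 0.05;
keep = false(1, size(X, 2));
for j = 1:size(X, 2)
    % h = 1 when the class means of feature j differ significantly
    h = ttest2(X(yLab == 0, j), X(yLab == 1, j), 'Alpha', alphaVal);
    keep(j) = (h == 1);
end
Xsel = X(:, keep);                        % 209 features survived in this study
```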
4.4 Classification
Different types of classifiers have been used to classify pain-related EEG data [2, 4–6, 26–27]. In this study, we investigated the performance of the SVM, the NB, Discriminant Analysis (DA), comprising Linear Discriminant Analysis (LDA) and Diagonal-QDA, the K-Nearest Neighbor (KNN), and the Decision Tree (DT) in pain EEG signal classification.
4.4.1 SVM
The SVM is a supervised learning algorithm that has been widely used to classify EEG data. An SVM searches for a hyperplane that separates the classes while maximizing the margin, the region containing no observations, between them [27–28]. Computationally, finding the best location for the hyperplane is an optimization problem that uses a kernel function to create the boundaries separating the classes [28–30]. In this study, an SVM with a linear kernel has been used.
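In MATLAB this corresponds to fitcsvm; a minimal sketch on the selected features (standardization is our addition, not stated in the text):

```matlab
% Linear-kernel SVM (Statistics and Machine Learning Toolbox).
svmMdl = fitcsvm(Xsel, yLab, 'KernelFunction', 'linear', 'Standardize', true);
yPred  = predict(svmMdl, Xsel);           % use a held-out test set in practice
```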
4.4.2 NB
The NB classifies data based on probabilities, the assumption of feature independence, and Bayes' theorem. Under this assumption, the presence or absence of one feature does not influence the presence or absence of any other feature; consequently, adding redundant features that are not independent of each other negatively affects the learning process of this classifier. The input data is assigned to the class whose statistical distribution it matches most closely [26, 31–32]. Here, a normal distribution is assumed for the data, and Gaussian Naive Bayes (GNB) is used to classify the data.
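A minimal GNB sketch; the 'normal' per-feature densities reproduce the Gaussian assumption stated above:

```matlab
% Gaussian Naive Bayes: each feature gets a per-class normal density.
nbMdl = fitcnb(Xsel, yLab, 'DistributionNames', 'normal');
yPred = predict(nbMdl, Xsel);
```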
4.4.3 DA
The DA is a classifier that assumes a normal distribution for the data of each class. To determine the boundary between classes, DA estimates the normal distribution parameters for each class and projects the data into a new space in which the within-class scatter is minimized and the between-class scatter is maximized. Here, the performance of two types of DA classifiers, LDA and Diagonal-QDA, has been investigated for no-pain and pain EEG signal classification. LDA and QDA have linear and quadratic decision boundaries, respectively; LDA can only learn linear boundaries, whereas QDA can learn quadratic boundaries and is therefore more flexible. In LDA, the class models share the same covariance matrix and differ only in their means; in QDA, both the mean and the covariance of each class differ. Diagonal-QDA is similar to QDA but estimates a diagonal covariance matrix [33–36].
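Both DA variants are available through fitcdiscr; a sketch:

```matlab
% LDA: shared covariance matrix across classes, linear boundary.
ldaMdl  = fitcdiscr(Xsel, yLab, 'DiscrimType', 'linear');

% Diagonal-QDA: per-class diagonal covariance, quadratic boundary.
dqdaMdl = fitcdiscr(Xsel, yLab, 'DiscrimType', 'diagquadratic');
```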
4.4.4 KNN
The KNN is a local classifier suitable for multimodally distributed data. It is a simple statistical classification method whose main strategy is to identify the k samples nearest to an unknown input sample; the class of the input is then determined by the majority vote of those k samples. Here, the number of neighbors is set to k = 9 [5].
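A sketch with fitcknn; Euclidean distance is the MATLAB default and an assumption here, since the text does not specify the metric:

```matlab
% KNN with k = 9: class assigned by majority vote of the 9 nearest
% training samples (Euclidean distance by default).
knnMdl = fitcknn(Xsel, yLab, 'NumNeighbors', 9);
yPred  = predict(knnMdl, Xsel);
```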
4.4.5 DT
The DT is a supervised non-parametric learning method for classification. This classifier does not make any prior assumptions about the data distribution. In DT, the model is created using simple decision rules inferred from features. Creating a model in a DT is based on dividing a complex problem into several simple sub-problems, and this process is repeated recursively. DT consists of a root node, decision nodes, and a set of leaf nodes. Figure 5 shows an example of a DT.
Here, Gini's Diversity Index is used to select the features of the root node and decision nodes. The Gini index is one of the criteria used to build decision trees and can be viewed as a measure of node impurity. For two classes, the Gini index is a number between 0 and 0.5, and a split with a lower Gini index is preferable to one with a higher index. The Gini index of a data set D is calculated from Eq. (1):
$$Gini(D)=1 - \sum\limits_{i=1}^{y} {p^2(i)} \tag{1}$$
In Eq. (1), y is the number of classes, and p(i) is the fraction of observations of class i that reach the node. The minimum value of the Gini index occurs when the node contains only one class, and the maximum occurs when all classes in the data set are equally probable [37–39].
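A small sketch of Eq. (1) for the labels reaching a node:

```matlab
% Gini diversity index of Eq. (1): 1 - sum_i p(i)^2 over the class
% fractions p(i) of the observations at a node.
giniIndex = @(labels) 1 - sum((countcats(categorical(labels)) ./ numel(labels)).^2);

giniIndex([1 1 1 1])     % pure node            -> 0   (minimum)
giniIndex([1 1 2 2])     % equiprobable classes -> 0.5 (maximum, two classes)
```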
To reduce the risk of overfitting in the DT, a general strategy is pruning: removing some features of the training set or branches of the tree caused by noise. Pruning reduces a tree by converting some branch nodes into leaf nodes and removing the leaf nodes below them. When a validation set is available, the tree can be pruned according to the validation error. In this study we used post-pruning: after building the DT, a branch is removed if its removal reduces the validation error [39–40].
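A sketch of growing and post-pruning the tree. The paper prunes against a validation set; here cross-validated cost-complexity pruning (cvloss) stands in for that validation error, which is an assumption:

```matlab
% Decision tree grown with Gini's Diversity Index ('gdi').
dtMdl = fitctree(Xsel, yLab, 'SplitCriterion', 'gdi');

% Post-pruning: choose the pruning level with the lowest
% cross-validated error, then cut the tree back to that level.
[~, ~, ~, bestLevel] = cvloss(dtMdl, 'SubTrees', 'all', 'TreeSize', 'min');
dtPruned = prune(dtMdl, 'Level', bestLevel);
```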
4.5 Performance Evaluation
We calculated the classification performance measures, including accuracy, sensitivity, and specificity, from the confusion matrix. About 80% of the data was used for training and the rest for testing, with the split generated by k-fold cross-validation (K = 6); the training and testing procedure is thus repeated six times. The final performance measures were obtained by averaging the six confusion matrices from the six repetitions. The results of no-pain and pain classification using the six classifiers are shown in Table 1.
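A sketch of this evaluation loop for one classifier (the SVM here); treating pain as the positive class is our convention, not stated in the text:

```matlab
% 6-fold cross-validation with averaged confusion matrices.
K   = 6;
cvp = cvpartition(yLab, 'KFold', K);
cm  = zeros(2, 2);
for k = 1:K
    tr  = training(cvp, k);  te = test(cvp, k);
    mdl = fitcsvm(Xsel(tr, :), yLab(tr), 'KernelFunction', 'linear');
    cm  = cm + confusionmat(yLab(te), predict(mdl, Xsel(te, :)));
end
cm   = cm / K;                            % average over the six folds
acc  = 100 * trace(cm) / sum(cm(:));
sen  = 100 * cm(2, 2) / sum(cm(2, :));    % pain (label 1) as positive class
spec = 100 * cm(1, 1) / sum(cm(1, :));
```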
Table 1
The results of no-pain and pain EEG signal classification using six classifiers
Classifier | Accuracy | Sensitivity | Specificity |
SVM | 97.22 | 100 | 94.44 |
LDA | 94.44 | 100 | 88.89 |
QDA | 97.22 | 94.44 | 100 |
NB | 97.22 | 94.44 | 100 |
DT | 83.33 | 83.33 | 83.33 |
KNN | 69.44 | 100 | 38.89 |
The classification results show that, except for KNN, the classifiers have acceptable performance in separating the no-pain and pain EEG signals. This means the approach KNN takes, determining the class of an input from its closest neighboring samples, is not suitable for separating these features. Since SVM, NB, and DT apply entirely different approaches to data classification, we selected these three as the final classifiers in the proposed channel selection algorithm for identifying the brain areas active during pain.
4.6 Proposed channel selection approach
The method proposed in this study for selecting the electrodes active in separating the no-pain and pain classes of EEG signals is a vote over the outputs of the pseudo-SFFS algorithm applied to classifiers with completely different classification strategies. For the data of this study, the SVM, DT, and GNB classifiers were selected. Figure 6 shows the process flow of the proposed EEG channel selection algorithm for neonates' pain EEG data.
Classifiers follow different procedures for classification: it may be possible to separate features linearly but not probabilistically with a classifier such as NB, and vice versa. Since the goal is to select the brain's active area, the separability of the features must be checked from different aspects; therefore, classifiers with entirely different classification procedures were chosen for the proposed algorithm. SVM with a linear kernel and LDA both search for a linear separating boundary in the feature space, and the pseudo-SFFS feature selection algorithm shows a significant overlap between the features selected by these two classifiers; since the SVM accuracy is higher (Table 1), SVM was chosen over LDA. DT follows a completely different procedure from the other classifiers mentioned, while the accuracy obtained with it is acceptable, so DT is the second chosen classifier. The NB and QDA classifiers yielded similar accuracy (Table 1), and based on the pseudo-SFFS algorithm, all features selected by QDA were also selected by GNB. Hence, GNB was chosen as the third classifier for the proposed channel selection approach.
The T-test is applied before the subsequent pseudo-SFFS algorithm to reduce the number of iterations of the pseudo-SFFS loop and thus the computational cost. Since feature selection by pseudo-SFFS depends on the classifiers, to reduce this classifier effect only the features selected by at least two classifiers are considered the final effective features.
Pseudo-SFFS algorithm
SFFS is a feature selection method that determines a lower-dimensional set of features that are uncorrelated and most relevant to the problem [41]. It is an iterative greedy search algorithm whose fitness function is the classifier's performance, which makes it a wrapper method. The pseudocode of this method is described below [15, 42–43]:
1. Start with an empty set of features: \(Y_0 = \emptyset\), \(k = 0\)
2. Select the best feature: \(x^{+} = \arg\max_{x \notin Y_k} J(Y_k + x)\)
3. If \(J(Y_k + x^{+}) > J(Y_k)\), set \(Y_{k+1} = Y_k + x^{+}\) and \(k = k + 1\)
4. Go to step (2)
The flowchart of the pseudo-SFFS method that we introduce in this study is shown in Fig. 7. In this method, to reduce the computational cost, the following are considered:
In SFFS, one feature is added to the feature set in each iteration, and the accuracy obtained with the new feature set is compared with the previous one: if the classifier's performance improves, the feature remains in the set; otherwise, it is removed. Unlike SFFS, the proposed pseudo-SFFS algorithm compares the accuracy obtained in each iteration with a constant value; that is, a threshold is placed on the maximum accuracy. The accuracy threshold (\(acc_{th}\)) that ends the iterations is selected according to the classifier's performance without the second feature selection step: according to Table 1, the threshold is 97.22% for SVM and GNB and 83.33% for DT. Because of its greedy, iterative search, SFFS is an algorithm with a high computational cost for feature selection; thresholding the accuracy reduces the number of iterations and hence the cost.
In the feature selection process, feature sets may exist with equal accuracy but different sensitivity and specificity, so we also applied thresholds to sensitivity and specificity, using the same threshold value for both. In this way, a feature set with nearly the same performance in detecting the no-pain and pain classes is selected, which increases the probability of correctly detecting the pain class in cases where pain perception is weak. As shown in Fig. 7, the allowed range of sensitivity and specificity is determined by the value of Rs, which is calculated from the data in Table 1 using Eq. (2):
$$Rs = \left| acc_{th} - sen \right| = \left| acc_{th} - spec \right| = \frac{\left| sen - spec \right|}{2} \tag{2}$$
In Eq. (2), \(acc_{th}\), sen, and spec are the accuracy, sensitivity, and specificity of the classifiers, respectively, taken from Table 1. Rs is 2.78 for SVM and NB and 0 for DT.
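A sketch of the pseudo-SFFS loop for one classifier, using the SVM thresholds from Table 1. evalFeatures is a hypothetical helper that returns cross-validated accuracy, sensitivity, and specificity (in %) for a candidate feature set; the stopping rule below is our reading of Fig. 7 and Eq. (2), and the tie-breaking rule described below is omitted for brevity:

```matlab
% Pseudo-SFFS sketch: greedy forward selection that stops once the
% thresholded accuracy is reached and sen/spec lie within the Rs band.
accTh = 97.22;  Rs = 2.78;                % SVM thresholds from Table 1
Y = [];                                   % selected feature indices
remaining = 1:size(Xsel, 2);
done = false;
while ~done && ~isempty(remaining)
    bestAcc = -inf;  bestJ = remaining(1);
    for j = remaining                     % greedy forward step
        a = evalFeatures([Y j]);          % hypothetical evaluation helper
        if a > bestAcc, bestAcc = a; bestJ = j; end
    end
    Y = [Y bestJ];
    remaining(remaining == bestJ) = [];
    [a, sen, spec] = evalFeatures(Y);
    done = (a >= accTh) && (abs(sen - accTh) <= Rs) && (abs(spec - accTh) <= Rs);
end
```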
When adding a feature to the feature set, more than one candidate set may achieve the highest accuracy with the same sensitivity and specificity. In this case, the set whose newly added feature carries the most information on its own, i.e., yields the highest classifier performance when used alone, is selected.
Adding the second feature selection step would be expected to enhance the classifiers' performance. However, since the purpose is to find the brain areas active during pain while keeping the computational cost low, further performance enhancement was not pursued.
4.7 Channel evaluation
For further investigation, we performed a channel evaluation approach similar to [2] and [16]. The data of each channel was evaluated separately through the procedure shown in Fig. 3, so the process was repeated 18 times (once per channel). The channel evaluation results are shown in Table 2.
Table 2
Performance evaluation of each channel in no-pain and pain EEG signal separation. Acc, Sen, and Spec denote accuracy, sensitivity, and specificity, respectively.
Classifier | SVM | | | NB | | | DT | | | LDA | | | QDA | | | Average accuracy of the five classifiers |
Channel | Acc | Sen | Spec | Acc | Sen | Spec | Acc | Sen | Spec | Acc | Sen | Spec | Acc | Sen | Spec | |
Cz | 80.56 | 83.33 | 77.78 | 91.67 | 83.33 | 100 | 86.11 | 83.33 | 88.89 | 75 | 83.33 | 66.67 | 91.67 | 83.33 | 100 | 85.00 |
Cpz | 75 | 88.89 | 61.11 | 86.11 | 94.44 | 66.67 | 75 | 72.22 | 77.78 | 75 | 83.33 | 66.67 | 86.11 | 94.44 | 77.78 | 79.44 |
F8 | 75 | 83.33 | 66.67 | 80.56 | 83.33 | 77.78 | 58.33 | 66.67 | 50 | 66.67 | 83.33 | 50 | 80.56 | 83.33 | 77.78 | 72.22 |
T8 | 83.33 | 88.89 | 77.78 | 86.11 | 83.33 | 88.89 | 61.11 | 66.67 | 55.56 | 83.33 | 94.44 | 72.22 | 86.11 | 83.33 | 88.89 | 80.00 |
TP10 | 83.33 | 88.89 | 77.78 | 72.22 | 77.78 | 66.67 | 66.67 | 72.22 | 61.11 | 75 | 88.89 | 61.11 | 72.22 | 77.78 | 66.67 | 73.89 |
P8 | 88.89 | 94.44 | 83.33 | 80.56 | 77.78 | 83.33 | 55.56 | 61.11 | 50 | 72.22 | 88.89 | 55.56 | 80.56 | 77.78 | 83.33 | 75.56 |
O2 | 86.11 | 88.89 | 83.33 | 88.89 | 83.33 | 94.44 | 58.33 | 66.67 | 50 | 86.11 | 88.89 | 83.33 | 88.89 | 83.33 | 94.44 | 81.67 |
F4 | 83.33 | 88.89 | 77.78 | 69.44 | 83.33 | 55.56 | 66.67 | 66.67 | 66.67 | 69.44 | 77.78 | 61.11 | 69.44 | 83.33 | 55.56 | 71.66 |
C4 | 75 | 88.89 | 61.11 | 77.78 | 88.89 | 66.67 | 66.67 | 66.67 | 66.67 | 75 | 83.33 | 66.67 | 77.78 | 88.89 | 66.67 | 74.45 |
Cp4 | Removed by T-test |
F7 | 86.11 | 100 | 72.22 | 77.78 | 72.22 | 83.33 | 66.67 | 66.67 | 66.67 | 77.78 | 88.89 | 66.67 | 77.78 | 72.22 | 83.33 | 77.22 |
T7 | 83.33 | 94.44 | 72.22 | 75 | 72.22 | 77.78 | 58.33 | 55.56 | 61.11 | 69.44 | 83.33 | 55.56 | 75 | 72.22 | 77.78 | 72.22 |
TP9 | 63.89 | 83.33 | 44.44 | 69.44 | 55.56 | 83.33 | 61.11 | 94.44 | 27.78 | 75 | 77.78 | 72.22 | 69.44 | 55.56 | 83.33 | 67.78 |
P7 | 86.11 | 94.44 | 77.78 | 75 | 72.22 | 77.78 | 63.89 | 77.78 | 50 | 72.22 | 77.78 | 66.67 | 75 | 72.22 | 77.78 | 74.44 |
O1 | 83.33 | 83.33 | 83.33 | 91.67 | 94.44 | 88.89 | 66.67 | 66.67 | 66.67 | 83.33 | 77.78 | 88.89 | 91.67 | 94.44 | 88.89 | 83.33 |
F3 | 77.78 | 88.89 | 66.67 | 72.22 | 55.56 | 88.89 | 69.44 | 83.33 | 55.56 | 80.56 | 88.89 | 72.22 | 72.22 | 55.56 | 88.89 | 74.44 |
C3 | 86.11 | 100 | 72.22 | 86.11 | 88.89 | 83.33 | 66.67 | 72.22 | 61.11 | 61.11 | 61.11 | 61.11 | 86.11 | 88.89 | 83.33 | 77.22 |
Cp3 | Removed by T-test |