The proposed system takes video frames as input in real time. The output is the predicted yoga pose along with feedback for angle and pose correction. The system consists of three main phases: keypoint extraction, pose prediction, and pose correction. The keypoint extraction phase detects and extracts the locations of important keypoints based on the user's position [20]. The pose prediction phase defines the model architecture and classifies the pose being performed. The final phase is pose correction, in which the user is given feedback for correcting the pose and is shown the similarity percentage with respect to the reference pose. Figure 1 shows the proposed system architecture with all the above phases.
4.2. Pose prediction
The second step is to build a deep learning model that correctly classifies a real-time video into one of the six poses in the dataset. A hybrid model combining a CNN and an LSTM is used. In this work, the CNN performs feature extraction [22]. A CNN is a multilayered artificial neural network (ANN) designed specifically to work on images and is used for tasks such as object recognition and image classification [23]. The LSTM is useful for understanding the sequence of frames occurring in a particular yoga pose. An LSTM is a type of RNN that is capable of learning and retaining very long-term dependencies over long sequences of input data [24]. A TimeDistributed wrapper is used with the CNN layers, which is particularly useful when working with video frames or time-series data [25]. Finally, a Softmax layer, which applies a normalised exponential function, computes the probability of each yoga pose; the pose with the highest probability is predicted as the output.
The first layer is a CNN layer with 16 filters and a kernel (window) size of 3. The activation function used in this layer is ReLU. ReLU is a piecewise linear function that outputs 0 when the input is less than 0 and outputs the input itself otherwise [26, 27]. Eq. (1) shows the ReLU activation function.
$$\text{ReLU}\left(x\right)=\text{max}(0,x)$$
(1)
where \(x\in \mathbb{R}\).
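As a quick illustration, the ReLU of Eq. (1) can be written as a one-line NumPy function (a minimal sketch, not part of the original implementation):

```python
import numpy as np

def relu(x):
    """Eq. (1): element-wise max(0, x)."""
    return np.maximum(0.0, x)

# negatives are clamped to 0; non-negative inputs pass through unchanged
out = relu(np.array([-2.0, -0.5, 0.0, 1.5]))
```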
The second layer is a BatchNormalization layer, which mitigates the internal covariate shift problem and eases the flow of data through the layers [28]. The next layer is a dropout layer [29], a regularization technique for preventing overfitting, with a dropout rate of 0.5. The output of this layer is passed to a Flatten layer, which converts the data into a one-dimensional array. The next layer is an LSTM layer of 20 units, configured to return the output of every step and with the forget bias enabled. The output of an LSTM is generated through a series of gated operations: the forget gate, the input gate, and the output gate. The mathematical equations for an LSTM are shown in Eqs. (2)–(7). The cell state C stores the state information of the current and previous cells. The sigmoid output of the forget gate indicates which information to retain and which to forget; it depends mainly on the previous hidden state vector \({h}_{r-1}\) and the current input vector \({x}_{r}\). If they differ, the σ function does not allow the information to be retained. The forget bias shifts the output toward 1 (retain) or 0 (forget); the model performs best with a forget bias value of 0.5. In the input gate, the σ function decides which values to update and the tanh function generates new candidate values for the state. In the output gate, the σ function determines the value to be emitted as the output \({h}_{r}\).
$${l}_{r}=\sigma \left({W}_{l}\cdot \left[{h}_{r-1},{x}_{r}\right]+{b}_{l}\right)$$
(2)
$${m}_{r}=\sigma \left({W}_{m}\cdot \left[{h}_{r-1},{x}_{r}\right]+{b}_{m}\right)$$
(3)
$${\tilde{C}}_{r}=\text{tanh}\left({W}_{c}\cdot \left[{h}_{r-1},{x}_{r}\right]+{b}_{c}\right)$$
(4)
$${C}_{r}={l}_{r}*{C}_{r-1}+{m}_{r}*{\tilde{C}}_{r}$$
(5)
$${o}_{r}=\sigma \left({W}_{o}\cdot \left[{h}_{r-1},{x}_{r}\right]+{b}_{o}\right)$$
(6)
$${h}_{r}={o}_{r}*\text{tanh}\left({C}_{r}\right)$$
(7)
where \({l}_{r}\), \({m}_{r}\), \({o}_{r}\), \({h}_{r}\) = outputs of the forget gate, input gate, and output gate, and the current hidden state
\({W}_{l}\), \({W}_{m}\), \({W}_{c}\), \({W}_{o}\) = weights of the forget gate, input gate, cell candidate, and output gate
\({x}_{r}\) = current input
\({b}_{l}\), \({b}_{m}\), \({b}_{c}\), \({b}_{o}\) = biases of the forget gate, input gate, cell candidate, and output gate
\({\tilde{C}}_{r}\) = candidate cell state; \({C}_{r}\), \({C}_{r-1}\) = current and previous cell states
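To make the gated operations concrete, the following is a minimal NumPy sketch of a single LSTM cell step following Eqs. (2)–(7); the weights here are random placeholders, not the trained model's parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_r, h_prev, C_prev, W_l, W_m, W_c, W_o, b_l, b_m, b_c, b_o):
    """One LSTM step. Each W has shape (hidden, hidden + input);
    [h_{r-1}, x_r] is the concatenation of the previous hidden state and input."""
    z = np.concatenate([h_prev, x_r])      # [h_{r-1}, x_r]
    l_r = sigmoid(W_l @ z + b_l)           # forget gate, Eq. (2)
    m_r = sigmoid(W_m @ z + b_m)           # input gate, Eq. (3)
    C_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state, Eq. (4)
    C_r = l_r * C_prev + m_r * C_tilde     # new cell state, Eq. (5)
    o_r = sigmoid(W_o @ z + b_o)           # output gate, Eq. (6)
    h_r = o_r * np.tanh(C_r)               # hidden state output, Eq. (7)
    return h_r, C_r

# Tiny example: hidden size 2, input size 3, random weights, zero biases
rng = np.random.default_rng(0)
H, D = 2, 3
Ws = [rng.standard_normal((H, H + D)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, C = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), *Ws, *bs)
```

Because h is the product of a sigmoid output and a tanh output, each of its components always lies strictly between -1 and 1.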
The final layer is a Dense layer with Softmax [30] as the activation function, which assigns a probability to each pose based on the current input.
The mathematical equation of Softmax activation function is shown in Eq. (8).
$${\sigma \left(\mathbf{z}\right)}_{i}=\frac{{e}^{{z}_{i}}}{{\sum }_{j=1}^{K}{e}^{{z}_{j}}}$$
(8)
where \(\sigma\) = softmax function, \(\mathbf{z}\) = input vector, \({e}^{{z}_{i}}\) = exponential of the i-th element of the input vector, K = number of classes in the multi-class classifier, and \({e}^{{z}_{j}}\) = exponential of the j-th element, summed over all classes in the denominator.
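A minimal NumPy sketch of Eq. (8) (illustrative only; frameworks provide this built in). Subtracting the maximum logit before exponentiation is a standard trick to avoid numerical overflow and does not change the result:

```python
import numpy as np

def softmax(z):
    """Eq. (8), with max subtraction for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# probabilities sum to 1; the largest logit receives the highest probability
probs = softmax(np.array([2.0, 1.0, 0.1]))
```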
The output of this layer is then polled (majority-voted) over 45 frames to obtain the final prediction. The optimizer used is Adam [31] with a learning rate of 0.0001. Adam helps the model converge faster through the addition of a momentum term and a scaling term, as shown in Eq. (9). It combines the ideas of momentum and RMSprop and avoids the exponentially decaying learning rate issue.
$${\theta }_{new}={\theta }_{old}-\eta \,\hat{m}\oslash \sqrt{\hat{s}+\epsilon }$$
(9)
where \({\theta }_{new}\), \({\theta }_{old}\) = new and old weight values, η = learning rate, \(\hat{m}\) = momentum term, \(\hat{s}\) = scaling term, \(\epsilon\) = smoothing term to avoid division by zero, and \(\oslash\) = element-wise division.
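A minimal NumPy sketch of one Adam step corresponding to Eq. (9), with bias-corrected momentum and scaling terms; in practice the framework's built-in optimizer was used, so this is illustrative only:

```python
import numpy as np

def adam_update(theta, grad, m, s, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: update momentum (m) and scaling (s) accumulators,
    bias-correct them, then apply the element-wise update of Eq. (9)."""
    m = beta1 * m + (1 - beta1) * grad        # momentum term
    s = beta2 * s + (1 - beta2) * grad**2     # scaling term (RMSprop-style)
    m_hat = m / (1 - beta1**t)                # bias-corrected momentum
    s_hat = s / (1 - beta2**t)                # bias-corrected scaling
    theta = theta - lr * m_hat / np.sqrt(s_hat + eps)
    return theta, m, s

# One step from zero-initialized accumulators: weights move opposite the gradient
theta, m, s = adam_update(np.zeros(3), np.array([1.0, -1.0, 0.5]),
                          np.zeros(3), np.zeros(3), t=1)
```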
The loss function used is categorical cross-entropy which is very popular for multi-class classification tasks [32]. Eq. (10) depicts the mathematical equation used in categorical cross-entropy loss function.
$${E}_{CC}=-\frac{1}{N}{\sum }_{i=1}^{N}{\sum }_{c=1}^{C}{p}_{ic}\,\text{log}\left({y}_{ic}\right)$$
(10)
where \({E}_{CC}\) = categorical cross-entropy, N = number of examples in the training set, C = number of categories, \({p}_{ic}\) = a binary indicator that is 1 if the i-th training pattern belongs to the c-th category, and \({y}_{ic}\) = the predicted probability that the i-th observation belongs to class c.
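Eq. (10) can be sketched directly in NumPy (a small epsilon guards against log(0); illustrative only, as frameworks supply this loss built in):

```python
import numpy as np

def categorical_cross_entropy(p, y, eps=1e-12):
    """Eq. (10): p is one-hot targets (N, C), y is predicted probabilities (N, C)."""
    return -np.mean(np.sum(p * np.log(y + eps), axis=1))

p = np.array([[1, 0, 0], [0, 1, 0]])                 # true classes (one-hot)
y = np.array([[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]])   # predicted probabilities
loss = categorical_cross_entropy(p, y)               # -(ln 0.9 + ln 0.7) / 2
```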
The metric used to gauge the performance of the model is accuracy [33]. The model was trained for a total of 50 epochs; the accuracy grew rapidly at first and became steady after a few epochs. After each epoch, the current accuracy is compared against the best accuracy obtained so far, and the best accuracy is updated whenever the current accuracy exceeds it. All model parameters were tuned through hyperparameter tuning to obtain the most optimized results [34, 35].
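The per-frame predictions described earlier are polled over 45 frames; one plausible reading of this step is a simple majority vote, sketched below in NumPy (an assumed implementation detail, not confirmed by the original work):

```python
import numpy as np

def poll_predictions(frame_preds):
    """Majority vote over a window of per-frame predicted class indices."""
    classes, counts = np.unique(frame_preds, return_counts=True)
    return classes[np.argmax(counts)]

# e.g. 45 per-frame class predictions for a clip; the most frequent class wins
window = np.array([2] * 30 + [1] * 10 + [0] * 5)
final = poll_predictions(window)  # → 2
```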
4.3. Pose correction
After the pose is predicted, the user is given appropriate feedback with respect to the chosen pose, and a similarity percentage (computed using cosine similarity) is shown to the user. For each of the six yoga poses in the dataset, critical angles have been identified and rules have been formulated. For each rule, a threshold is set: the maximum deviation allowed from the standard pose. If the user exceeds this threshold, feedback is given in the form of text and speech. The angle between two keypoints can be found by calculating the inverse tangent of the slope with respect to the positive X-axis. Eq. (11) shows the formula for finding the angle given the coordinates of the two keypoints.
$$\theta ={\text{tan}}^{-1}\left(\frac{{y}_{2}-{y}_{1}}{{x}_{2}-{x}_{1}}\right)$$
(11)
where \(({x}_{1},{y}_{1})\) and \(({x}_{2},{y}_{2})\) are the coordinates of the two keypoints.
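A minimal sketch of the angle computation in NumPy; `arctan2` is used here instead of a plain inverse tangent so that a vertical segment (equal x-coordinates) does not cause a division by zero:

```python
import numpy as np

def keypoint_angle(p1, p2):
    """Angle (degrees) of the line p1 -> p2 with the positive X-axis.
    arctan2(dy, dx) is the inverse tangent of the slope, safe when dx == 0."""
    (x1, y1), (x2, y2) = p1, p2
    return np.degrees(np.arctan2(y2 - y1, x2 - x1))

angle = keypoint_angle((0.0, 0.0), (1.0, 1.0))  # → 45.0
```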
Feedback is first produced as text, which is then converted into speech using the pyttsx3 library [36], a text-to-speech converter that also works offline.
The user is also shown the cosine similarity, a measure that compares two vectors by calculating the cosine of the angle between them [37]. The mathematical formula for cosine similarity is given in Eq. (12).
$$\text{cos}\,\theta =\frac{A\cdot B}{\left|\left|A\right|\right|\,\left|\left|B\right|\right|}$$
(12)
Where, A and B are two vectors in a multidimensional space.
In this work, cosine similarity is calculated between the keypoints of the user's pose and those of the standard pose, indicating the degree of closeness to the actual pose. Since users may stand at different distances from the camera, all keypoints are first normalized to bring them onto a similar scale.
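The normalization and similarity steps can be sketched as follows in NumPy. The exact normalization scheme is not specified in the text, so the mean-centering and unit-scaling below is an assumption; it makes two identical poses at different camera distances score 100%:

```python
import numpy as np

def normalize_keypoints(kps):
    """Center keypoints at their mean and scale to unit spread, so users at
    different distances from the camera become comparable (assumed scheme)."""
    kps = kps - kps.mean(axis=0)
    scale = np.linalg.norm(kps)
    return kps / scale if scale > 0 else kps

def cosine_similarity(a, b):
    """Eq. (12) on flattened keypoint arrays, reported as a percentage."""
    a, b = a.ravel(), b.ravel()
    return 100.0 * np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

pose = np.array([[0.1, 0.2], [0.4, 0.8], [0.5, 0.1]])
user = normalize_keypoints(pose)
ref = normalize_keypoints(pose * 2.0)  # same pose, twice as far from the camera
sim = cosine_similarity(user, ref)     # → 100.0 (identical up to scale)
```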