Many researchers have explored this area. Pranali Loke et al. [1] presented a tool that converts hand movements used in ISL into natural language. Segmentation and classification procedures are applied to the data. The photos were taken with an Android app, and pattern recognition was performed in MATLAB. Fazlur Rahman Khan et al. [2] demonstrated a working prototype that can convert the 26 English alphabets from American Sign Language (ASL) into text. Hand motion tracking was performed using a Leap Motion Controller (LMC) as the interface, without any additional hardware. Cross-correlation, Artificial Neural Networks (ANNs), and Geometric Template Matching were used to test the effectiveness of the prototype. According to the test findings, Geometric Template Matching achieved the highest recognition accuracy among the detection algorithms compared. M. N. Pushpalatha et al. [3] presented a paper that translates sign language into text and voice with 92% accuracy using a feature extractor and PoseNet, facilitating communication for people with hearing and speech impairments. Images are captured with a webcam, and PoseNet and an ANN are used to classify common phrases used in daily life. The webcam tracks several body parts, whose movements are then converted into audio and text to show what the user is saying in real time. Nishi Intwala et al. [4] developed an ISL translator whose goal is to translate the sign gestures for the 26 English alphabets into their text equivalents and classify them into the corresponding letters. Preprocessing and feature selection were performed on the dataset, and the photographs were then fed to a CNN implemented in Python, yielding real-time recognition with an accuracy of 87.69%. A. J. Paul et al. [5] suggested a microcontroller architecture based on the ARM Cortex-M7 for detecting alphabets in American Sign Language. They significantly reduced accuracy loss by using interpolation as an augmentation alongside other methods, which allowed the framework to generalize well to previously unseen, noisy data. The inference speed of the model is 20 frames per second, and its post-quantization size is approximately 85 KB. A smart band featuring pressure sensors composed of nanocomposite materials was proposed by R. Ramalingame et al. [6]. The sensors are prepared with an improved synthesis procedure, and the band, worn on the arm, makes it possible to monitor active muscular contraction and relaxation. The smart band was tested on 10 individuals, each of whom performed the ASL numeral gestures from 0 to 9 ten times, producing 100 data sets per individual, all captured at 100 Hz. Feeding these data sets into a machine learning approach that selects the features, biases, and weights allows the gestures to be classified with an overall accuracy of 93%.
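Several of the systems above ([1] in particular, per Table 1's HSV entry) rely on color-space segmentation to isolate the hand before classification. The following is a minimal sketch of that step in Python with OpenCV; the threshold values and function name are illustrative assumptions, not details taken from the cited work.

```python
# A minimal sketch of HSV-based skin segmentation of the kind used in [1];
# the threshold values below are illustrative assumptions, not the paper's.
import cv2
import numpy as np

def segment_hand(image_bgr):
    """Isolate skin-colored regions as a first step before classification."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    # Rough skin-tone range in HSV; real systems tune these per dataset.
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Morphological opening/closing to remove speckle noise in the mask.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return cv2.bitwise_and(image_bgr, image_bgr, mask=mask)

# Usage: segmented = segment_hand(cv2.imread("gesture.jpg"))
```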
Shagun Gupta et al. [7] presented a technology that enables hand gestures to be used for communication. The framework captures a patient's hand motion using a webcam and decodes the signal it conveys; the signal is mapped to a message, which is then delivered as audio and text. Whole gestures, rather than individual alphabets, are used to converse. The device includes an alarm that goes off in an emergency and lets the patient choose their preferred communication language for comfort. L. Fernandes et al. [8] demonstrated a system for translating voice and text from sign language to spoken language and vice versa. When speech is provided, it is converted into the corresponding set of ASL gestures. The dataset is extensible, so additional languages can be added. Text acts as a bridge between voice and gesture in this system.
S. M. K. Hasan et al. [9] proposed a simple, inexpensive Bangla sign language translation (BSLT) system that can convert signs into written Bangla. They describe how the universal interpreter software (UIS) was developed and made available to users in Bangladesh and the US. An effective method for skin detection and feature extraction is proposed for this purpose. The technology can decode 11 Bengali numbers and 16 sentences, and achieves an accuracy of approximately 96.46% with the K-Nearest Neighbor technique. The method suggested by S. Masood et al. [10] employs sign language recognition to address the communication gap. A CNN (the Inception model) was trained for spatial features and a recurrent neural network (RNN) for temporal characteristics. The dataset consists of Argentinian Sign Language gestures from 46 categories. Over a large number of images, the suggested solution attained a high accuracy of 95.2%. R. J. Raghavan et al. [11] demonstrated a program that turns text input into an animated gesture sequence. Its three primary components are a linguistic synthesis system that transforms English text into ISL format, an interface where users can type words, and a virtual avatar that serves as an interpreter at the user interface. Both manual and non-manual motions, such as facial expressions and hand placement, are described using the LOTS notation. They also included the epenthesis movement, the inter-sign transition gesture, to reduce jitter when gesturing.
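The spatial/temporal split used in [10], a CNN per frame followed by an RNN over the frame sequence, can be outlined as below. This Keras sketch keeps the paper's 46-class output, but the layer sizes, frame count, and input resolution are illustrative assumptions rather than the authors' actual architecture (which used the Inception model).

```python
# Sketch of a CNN-for-spatial / RNN-for-temporal gesture classifier in the
# spirit of [10]; the architecture details here are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 46          # gesture categories in the Argentinian Sign Language set
FRAMES, H, W, C = 16, 64, 64, 3   # assumed clip length and frame size

# Per-frame CNN extracts spatial features.
cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(H, W, C)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

# The same CNN is applied to every frame, then an LSTM models the time axis.
model = models.Sequential([
    layers.TimeDistributed(cnn, input_shape=(FRAMES, H, W, C)),
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```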
CNNs have also been used extensively for converting sign images to text. A CNN was fully integrated into an HMM by O. Koller et al. [12], with the CNN outputs evaluated within a Bayesian framework. On three difficult benchmark continuous sign language recognition tasks, their embedding outperforms the previous state of the art by a 15–38% relative reduction in word error rate, and by up to 20% absolute. They examine the effects of network pretraining and CNN architecture, weigh the advantages of combining models, and compare hybrid modeling to a tandem strategy. A. Ojha et al. [13] implemented a fingerspelling sign language converter: using a CNN, the software recognizes ASL gestures tracked through the computer's webcam and a desktop program, and instantly transforms them into text and voice. A vision-based technique that translates sign language into text was created by K. Bantupalli et al. [14] to improve communication between signers and non-signers. The suggested method extracts temporal and spatial information from video sequences: a CNN first detects spatial features, and an RNN is then trained on the temporal features. An ASL dataset was used. The LMC serves as the core of the Arabic Sign Language Recognition (ArSLR) approach developed by M. Mohandes et al. [15]. The device detects and tracks the fingertips and palm to provide information on posture and action. They compare the capabilities of the Naive Bayes classifier to those of Multilayer Perceptron (MLP) neural networks. The proposed method yields a classification accuracy of 98% for the Arabic sign alphabets with the Naive Bayes classifier and greater than 99% with the MLP.
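The Naive Bayes versus MLP comparison in [15] amounts to fitting both classifiers on the same per-frame hand features and comparing held-out accuracy. A hedged sketch with scikit-learn follows; the synthetic features, their dimensionality, and the hyperparameters are assumptions for illustration only.

```python
# Sketch of the classifier comparison in [15]: Naive Bayes vs. an MLP on
# per-frame hand features (e.g., fingertip and palm positions from an LMC).
# The synthetic data and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2800, 15))          # 2800 frames, 15 assumed geometric features
y = rng.integers(0, 28, size=2800)       # 28 Arabic alphabet classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("MLP", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```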
Adversarial, multitask, and transfer learning were used by A. Orbay et al. [16] to search for semi-supervised tokenization techniques that do not require extra labeling. They conducted numerous experiments contrasting all the methods in various settings. Using only sentences as the target annotation, the suggested methodology produces scores of 36.28 ROUGE and 13.25 BLEU-4, a 4-point improvement in BLEU-4 and a 5-point improvement in ROUGE over the state of the art. K. K. Dutta et al. [17] proposed a system that aims to give speech to the mute. With the help of MATLAB, double-handed Indian Sign Language gestures are captured as images, processed, and then converted to the corresponding speech and text.
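The BLEU-4 scores reported for [16] follow the standard n-gram-precision definition and can be computed with an off-the-shelf scorer. The sketch below uses NLTK's corpus_bleu as an illustrative stand-in; the toy sentences are assumptions, and the paper's exact evaluation pipeline may differ.

```python
# Illustrative BLEU-4 computation with NLTK; not the evaluation code of [16].
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One hypothesis translation with one reference (toy example).
references = [[["the", "weather", "will", "be", "sunny", "tomorrow"]]]
hypotheses = [["the", "weather", "is", "sunny", "tomorrow"]]

# BLEU-4: uniform weights over 1- to 4-gram precision, smoothed for short texts.
score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {100 * score:.2f}")
```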
For statistical sign language translation and recognition, J. Forster et al. [18] introduced a large-vocabulary, video-based corpus of German Sign Language known as the RWTH-PHOENIX-Weather corpus. The collection contains weather forecasts acquired from German public television and manually glossed to distinguish between sign variants. Time boundaries have also been marked at the sentence and gloss levels. Additionally, a state-of-the-art automatic speech recognition system has been used to semi-automatically transcribe the spoken German weather forecasts. The glosses have also been translated into spoken German a second time to account for allowable translation variability. Along with the corpus, experimental baseline results are provided for head and hand tracking and for statistical recognition and translation of sign language.
Two real-time methods based on hidden Markov models (HMMs) were demonstrated by T. Starner et al. [19] for recognizing continuous ASL sentences while tracking the user's bare hands. The first device monitors the user with a camera mounted on a desk and achieves a word accuracy of 92 percent. The second device mounts the camera in the user's headgear and, employing a different technique, achieves an accuracy of 98 percent (97 percent with an unconstrained grammar). The lexicon used in both tests is 40 words.
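The HMM-based recognition in [19] can be approximated as training one HMM per vocabulary word and classifying a new clip by maximum likelihood. The sketch below uses the hmmlearn library; the 8-dimensional hand features, state count, and toy data are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of HMM-based isolated-word recognition in the style of [19], using
# hmmlearn; feature dimensionality and toy data are illustrative assumptions.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(1)

def train_word_model(sequences):
    """Fit one HMM per vocabulary word from its training clips."""
    X = np.concatenate(sequences)            # stack all frames
    lengths = [len(s) for s in sequences]    # per-clip frame counts
    model = GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

# Toy data: 5 clips per word, each a sequence of 8-D hand-feature vectors.
vocab = ["hello", "thanks"]
word_models = {w: train_word_model(
                   [rng.normal(loc=i, size=(30, 8)) for _ in range(5)])
               for i, w in enumerate(vocab)}

# Recognition: pick the word whose HMM assigns the clip the highest likelihood.
test_clip = rng.normal(loc=1, size=(30, 8))
print(max(word_models, key=lambda w: word_models[w].score(test_clip)))
```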
For individuals with hearing impairment, R. Kaur et al. [20] created an SMS generator in ISL. For this system they developed an advanced visual user interface, a Short Message Service (SMS)-to-speech translation module, and a sign language to English translator. The graphical interface enables a number of signs to be characterized through different services, and an animation module makes it possible to play the sign motions as GIF files.
Table 1. Comparison of the surveyed sign language translation and recognition systems.
Citation | Year of Publication | Sign Language | Technique Used | Dataset | Signs Converted | Accuracy
[1] | 2017 | ISL | Hue Saturation Value (HSV) model | Self-Dataset | 25 alphabets | Not mentioned
[2] | 2016 | ASL | ANN | Self-Dataset | 26 alphabets | 52.56%
[3] | 2022 | ASL | PoseNet, ANN, MobileNet | Self-Dataset | 25 alphabets | 92.8%
[4] | 2019 | ISL | CNN (MobileNet) | Self-Dataset | 26 alphabets | 87.69%
[5] | 2021 | ASL | CNN | 1. Kaggle Sign MNIST dataset 2. Kaggle ASL dataset 3. Self-Dataset 4. Kaggle ASL Alphabet Test dataset | 26 alphabets | Not mentioned
[6] | 2021 | ASL | SVM, ELM, LDA | Self-Dataset | 26 alphabets | 93% |
[7] | 2020 | ASL | OpenCV | Self-Dataset | 26 alphabets | 100%
[8] | 2020 | ASL | OpenCV, Neural Networks | 1. MNIST Dataset 2. Self-Dataset | 26 alphabets | Not mentioned
[9] | 2016 | Bangla Sign Language | PCA, LSVM, KNN | Self-Dataset | 16 Bengali words & 11 Bengali numbers. | 96.46% |
[10] | 2018 | Argentinian Sign Language | CNN, RNN, OpenCV | Argentinian Sign Language Dataset | 2300 videos for 46 gesture categories | 95.21%
[11] | 2014 | ISL | Support Vector Machine (SVM) | Self-Dataset | English sentences | Not mentioned |
[12] | 2018 | German Sign Language | CNN, HMM (Hidden Markov Model) | Three state-of-the-art continuous sign language datasets: RWTH-PHOENIX-Weather 2012, RWTH-PHOENIX-Weather 2014 and SIGNUM | 266, 1080 and 465 signs for PHOENIX 2012, PHOENIX 2014 and SIGNUM, respectively | Not mentioned
[13] | 2020 | ASL | CNN | Self-Dataset | 26 alphabets | 95% |
[14] | 2018 | ASL | CNN, RNN | Image-based dataset created by Neidle et al. | 2400 images | 99%
[15] | 2014 | Arabic Sign Language | Naive Bayes Classifier, MLP Neural Network | Self-Dataset | 2800 frames of data (28 alphabets) | 98.3% |
[16] | 2020 | German Sign Language | CNN, RNN | 1. Dataset of images with weak annotations collected from 3 sources, prepared by Koller et al. 2. Self-Dataset | 30 signs and seven signers | Not mentioned
[17] | 2015 | ISL | Minimum EigenValue Algorithm | Self-Dataset | 125 images | Not mentioned |
[18] | 2012 | German Sign Language | RASR | 1. The SIGNUM Database 2. RWTH-PHOENIX-Weather dataset 3. The ASL Lexicon Video Dataset | 369 images | Not mentioned
[19] | 1998 | ASL | HMM (Hidden Markov Model) | Self-Dataset | 500 sentences | Not mentioned |
[20] | 2017 | ISL | SMS, LFG (Lexical Functional Grammar) method | Self-Dataset | 250 sentences which include basic hand-shapes | Not mentioned |