As shown in Figure 1, the proposed method consists of four parts:
First, the input data undergoes a preprocessing procedure. The original datasets of handwritten digit images are typically provided at a small size, commonly 28x28 pixels, and simply up-sampling these images degrades image quality, which adversely affects classifier performance. To overcome this challenge, as our second step, we developed a robust, novel U-Net-based model named MRAE U-Net, which uses transfer learning to enhance image quality. Third, we incorporate a language recognition model to identify the language of each processed image. Finally, we employ a digit recognition model that classifies the individual digits of the identified language; it transfers weights from the language recognition model and is fine-tuned for digit recognition. Detailed implementation aspects of this system are provided in the subsequent sections.
A. Preprocessing
In this research, our focus was on classifying digits in 12 different languages. However, high-quality handwritten images at a size of 128x128 pixels were not readily available for training the MRAE U-Net model. To address this, we curated a diverse collection of more than 1,000 fonts covering all the languages, then applied data augmentation to ensure a balanced and varied dataset, particularly for languages with limited font availability. The super-resolution model also requires low-quality images as input. As shown in Figure 1, to generate these inputs we first created high-quality images and then applied down-sampling followed by up-sampling, yielding a low-resolution version of each digit image suitable for training the super-resolution model.
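The down-sample/up-sample degradation described above can be sketched in a few lines; the scale factor of 4 and the nearest-neighbour interpolation are illustrative assumptions, since the text does not specify them:

```python
import numpy as np

def degrade(img, factor=4):
    """Create a low-resolution training input by down-sampling a
    high-quality image and up-sampling it back to its original size.
    Nearest-neighbour resampling is an illustrative choice; the paper
    does not fix the interpolation method."""
    # Down-sample: keep every `factor`-th pixel in each axis.
    small = img[::factor, ::factor]
    # Up-sample: repeat each remaining pixel `factor` times per axis.
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

hq = np.random.rand(128, 128)  # synthetic stand-in for a 128x128 digit
lq = degrade(hq)
print(lq.shape)  # (128, 128): same size as the input, but detail is lost
```

The pair (lq, hq) then serves as one (input, target) training example for the super-resolution model.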
The handwritten digit training data used in this study comprises twelve datasets. The MNIST-MIX collection [29] supplied the Persian, BanglaLekha, Tibetan, Urdu, ISI Bangla, ARDIS, and Kannada digits. The Chinese handwritten number dataset, collected by researchers at Newcastle University, UK, from Chinese nationals, contains samples written by 100 people with different handwriting [22]. English digits were taken from the USPS dataset [20], and Arabic digits from MADBase [32]. The Gurmukhi and Gujarati images were collected from GitHub repositories [30, 31]. Table 1 presents representative examples from the datasets employed in this study, and Table 2 summarizes their characteristics, including the number of samples and their sizes.
As previously mentioned, to overcome the limited data availability for some languages, we employed data augmentation. Specifically, we applied horizontal shift and rotation functions to increase the number of images severalfold. Data augmentation offers several benefits. First, it addresses the issue of limited data by expanding the dataset, particularly for languages with insufficient sample sizes; the additional augmented samples give the model more diverse training examples. Second, it acts as a regularizer by introducing variations and perturbations into the training data, which prevents the model from memorizing specific instances and encourages it to learn more robust, generalizable features that transfer to unseen data. Finally, data augmentation ensures a more balanced class representation, preventing the model from being biased towards dominant classes and improving its accuracy across all classes.
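The two augmentation operations named above can be sketched as follows. The shift amounts and rotation angles are our own illustrative choices, and the hand-rolled nearest-neighbour rotation stands in for the library routine (e.g. from an image-processing toolkit) that would normally be used:

```python
import numpy as np

def shift_horizontal(img, pixels):
    """Horizontal shift: roll the columns and zero the wrapped-in strip."""
    out = np.roll(img, pixels, axis=1)
    if pixels > 0:
        out[:, :pixels] = 0
    elif pixels < 0:
        out[:, pixels:] = 0
    return out

def rotate(img, degrees):
    """Small-angle rotation about the image centre via inverse
    nearest-neighbour coordinate mapping (illustrative only)."""
    theta = np.deg2rad(degrees)
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse-rotate each output coordinate back into the source image.
    sx = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    sy = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    sx = np.clip(np.round(sx).astype(int), 0, w - 1)
    sy = np.clip(np.round(sy).astype(int), 0, h - 1)
    return img[sy, sx]

digit = np.random.rand(128, 128)
augmented = [shift_horizontal(digit, s) for s in (-4, 4)] + \
            [rotate(digit, a) for a in (-10, 10)]
```

Each source image thus yields several shifted and rotated variants, expanding the dataset severalfold.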
Table 1 Sample images of 12 different handwritten digits datasets.
Table 2 Summary of handwritten digits datasets for training.
Dataset | Size | Amount | Language
Custom | 128x128 | 120,000 | All Languages
USPS [20] | variable | 20,000 | English
MNIST-MIX [29] | 28x28 | 60,000 | Persian
MADBase [32] | 28x28 | 60,000 | Arabic
Gujarati [30] | 256x256 | 5,600 | Gujarati
Gurmukhi [31] | 32x32 | 1,000 | Gurmukhi
Chinese [22] | 64x64 | 60,000 | Chinese
MNIST-MIX [29] | 28x28 | 6,606 | Urdu
MNIST-MIX [29] | 28x28 | 14,214 | Tibetan
MNIST-MIX [29] | 28x28 | 15,798 | BanglaLekha
MNIST-MIX [29] | 28x28 | 19,392 | ISI Bangla
MNIST-MIX [29] | 28x28 | 60,000 | ARDIS
MNIST-MIX [29] | 28x28 | 60,000 | Kannada
B. Super Resolution
In our research, producing high-resolution handwritten digit images from low-quality inputs posed a significant challenge. To overcome this hurdle, we required a robust, high-capacity model with strong learning capabilities, and therefore opted for a model based on the well-known U-Net architecture. The proposed model comprises three principal components, as depicted in Figure 2.
The encoder section of our model consists of four blocks, each composed of two 3x3 convolution layers. The numbers of filters in these layers are 64, 128, 256, and 512 for the first, second, third, and fourth blocks, respectively. Each block ends with a 2x2 max-pooling operation. The output of each encoder block is split into two branches: one proceeds to the next encoder block, while the other enters the specially designed MRAE module within the skip-connection part of the U-Net.
The bottleneck section consists of four 3x3 convolution layers arranged sequentially, and its output feeds the decoder section. The decoder, the third section of our model, comprises four blocks, each beginning with a transposed convolution layer. The input to the first decoder block comes from the bottleneck; at each level, the upsampled feature maps are concatenated with the corresponding feature maps from the skip-connection part of the encoder, after which two convolution layers are applied. The filter counts mirror the encoder blocks in reverse order: 512, 256, 128, and 64, respectively. The output of each decoder block is passed to the next, higher-level decoder block, and this process continues up to the full resolution.
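The shape bookkeeping of the three sections can be verified with a few lines of arithmetic. The 128x128 input follows from the preprocessing section; the assumption that the bottleneck keeps 512 channels is ours, since the text specifies only the layer count there:

```python
# Trace (spatial size, channel count) through the MRAE U-Net,
# with padded 3x3 convolutions that preserve spatial size.
size, filters = 128, [64, 128, 256, 512]

encoder = []
for f in filters:
    encoder.append((size, f))  # two 3x3 convs keep the spatial size
    size //= 2                 # 2x2 max-pooling halves it

bottleneck = (size, filters[-1])  # four sequential 3x3 convs (512 assumed)

decoder = []
for f in reversed(filters):
    size *= 2                  # transposed convolution doubles the size
    decoder.append((size, f))  # concat with the skip path, two 3x3 convs

print(encoder)     # [(128, 64), (64, 128), (32, 256), (16, 512)]
print(bottleneck)  # (8, 512)
print(decoder)     # [(16, 512), (32, 256), (64, 128), (128, 64)]
```

The final decoder block restores the full 128x128 resolution, matching the high-quality targets.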
We introduce a novel module architecture called the MRAE module, integrated into the skip-connection part of the U-Net. This module operates on the convolutional feature maps passed from each encoder block. As illustrated in Figure 3, the input feature maps are processed in parallel by three convolutional layers with different filter sizes (1x1, 3x3, and 5x5), and the outputs of these layers are summed. Following this block, a channel-wise attention mechanism enhances the feature extraction process. The attention block consists of a 3x3 convolutional layer, a Global Average Pooling (GAP) layer, a reshape operation, two further convolutional layers with ReLU and Sigmoid activation functions, respectively, an up-sampling operation, and finally a multiplication operation.
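A much-simplified NumPy sketch of the MRAE data flow is given below. The spatial convolutions are replaced by channel-mixing matrices, and the channel-reduction ratio of 4 inside the attention gate is our own assumption; this only illustrates the multi-branch sum followed by channel-wise gating, not the exact layer implementation:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Channel-wise attention gate: GAP, two dense channel mappings
    standing in for the inner convolutions (ReLU then Sigmoid), and a
    broadcast multiply over the channels of x with shape (H, W, C)."""
    gap = x.mean(axis=(0, 1))                 # Global Average Pooling -> (C,)
    hidden = np.maximum(0, gap @ w1)          # first mapping + ReLU
    gate = 1 / (1 + np.exp(-(hidden @ w2)))   # second mapping + Sigmoid -> (C,)
    return x * gate                           # reweight each channel

def mrae(x, kernels, w1, w2):
    """Multi-scale branch: three parallel channel mixers (stand-ins for
    the 1x1/3x3/5x5 convolutions) are summed, then attention-gated."""
    mixed = sum(x @ k for k in kernels)
    return channel_attention(mixed, w1, w2)

rng = np.random.default_rng(0)
C = 64
x = rng.standard_normal((32, 32, C))
kernels = [rng.standard_normal((C, C)) * 0.01 for _ in range(3)]
w1 = rng.standard_normal((C, C // 4)) * 0.1   # reduction ratio 4 (assumed)
w2 = rng.standard_normal((C // 4, C)) * 0.1
out = mrae(x, kernels, w1, w2)
print(out.shape)  # (32, 32, 64): same shape as the incoming feature maps
```

Because the output shape matches the input, the module drops into the skip connection without altering the concatenation in the decoder.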
Incorporating the MRAE module yields a significant increase in accuracy. Its three convolutional layers with different filter sizes, coupled with the channel-wise attention block, allow the model to discern the significance of each channel and allocate attention effectively during feature extraction. By adapting dynamically to the input data, the MRAE module enhances the model's discriminative capability, improving overall performance on the handwritten digit recognition task.
To train this model, we utilized transfer learning. To create a balanced dataset, we collected 10,000 paired low-quality and high-quality digit images per language, for a total of 120,000 images across all languages. Once trained on this curated dataset, the model was used to enhance handwritten digit images at prediction time without any additional retraining, and it produced excellent results.
C. Language Recognition
Given the multilingual nature of our datasets, which encompass 12 languages with 10 numerical characters each (except the Chinese dataset, which has 15 classes), we face a total of 125 distinct digit classes (11 x 10 + 15). Handling such a large number of classes with a single model presents significant difficulties. Additionally, treating similar numerals across languages as the same class can hinder learning, as the same number may be written very differently in different languages. Moreover, adding a new language would require retraining the entire model.
To address these challenges, we propose an approach that involves using a language detection model to determine the language of the input image. Based on the detected language, a corresponding classifier model is selected to recognize the specific type of number. By decoupling the language detection process from the digit recognition model, we achieve a more streamlined approach for incorporating new languages into our system. This separation allows us to develop new number recognition models for additional languages and seamlessly integrate them into the existing system, without disrupting the language detection and other digit detection components. This flexibility ensures scalability as the number of supported languages continues to grow.
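The two-stage routing described above can be sketched as a small dispatcher. The model objects here are hypothetical stand-ins; the point is that registering a new language touches only one dictionary entry:

```python
def recognize(image, language_model, digit_models):
    """Route `image` through language detection, then through the
    digit classifier registered for the detected language."""
    language = language_model(image)
    return language, digit_models[language](image)

# Stand-in models for illustration only.
language_model = lambda img: "Persian"
digit_models = {"Persian": lambda img: 7, "English": lambda img: 3}

lang, digit = recognize(None, language_model, digit_models)
print(lang, digit)  # Persian 7

# A new language plugs in without retraining the existing models:
digit_models["Gujarati"] = lambda img: 5
```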
To address the language detection task, we sought a model capable of robustly extracting and learning essential features. In Figure 4, we introduce our novel MRAE-based language classifier, which comprises three blocks. Each block is composed of two convolutional layers with 3x3 filters, a max-pooling layer, and an MRAE module. The convolutional layers in these blocks use 64, 128, and 256 filters, respectively.
Following the three feature extraction blocks, we add three fully connected layers of 512, 256, and 128 units, respectively. The final layer consists of 12 neurons, one per language class, and employs the Softmax activation function to generate a probability distribution over the language classes. The model is trained on a balanced dataset of 10,000 samples per language and achieves high accuracy in language detection.
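The Softmax output layer can be illustrated directly; the random logits stand in for the activations of the final 12-neuron layer:

```python
import numpy as np

def softmax(z):
    """Numerically stable Softmax, as applied by the 12-neuron output layer."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.random.randn(12)           # hypothetical last-layer activations
probs = softmax(logits)                # probability distribution, sums to ~1
predicted_language = int(probs.argmax())  # index of the detected language
```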
One of the significant advantages of this proposed language recognition model lies in its simplicity and efficient structure, which contributes to a small number of parameters, rendering it computationally lightweight. Furthermore, due to its streamlined architecture, this model can also be seamlessly integrated into the digit recognition step, thereby providing a cohesive and unified framework for the entire handwritten digit recognition system.
D. Digit Recognition
In the digit recognition stage, we capitalize on the language-specific classifier models to achieve accurate number recognition, resulting in 12 distinct models, each corresponding to a particular language. To automate this process and streamline the recognition pipeline, we have developed a sophisticated module that seamlessly integrates the language detection model with the corresponding number recognition model. This integration enables us to efficiently direct the input image to the appropriate number recognition model based on the identified language, thereby enhancing the overall efficiency of the system's output.
The structure of the digit recognition model closely resembles that of the language detection model, leveraging transfer learning techniques and fine-tuning to optimize performance. Within the last fully connected layer, we utilize 10 neurons to facilitate the classification of the ten classes, representing the numbers from 0 to 9. It is important to note that the final model structure is individualized for each language.
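The transfer-and-fine-tune scheme can be sketched as a frozen feature extractor with a fresh 10-way head. The random projection below is a toy stand-in for the pre-trained convolutional blocks of the language model; only the new head's parameters would be trained per language:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen weights transferred from the language recognition model
# (toy stand-in for the pre-trained feature extraction blocks).
W_frozen = rng.standard_normal((128, 64))

def features(x):
    """Frozen feature extractor: fixed projection plus ReLU."""
    return np.maximum(0, x @ W_frozen)

# Only the new 10-neuron head (digits 0-9) is initialised and fine-tuned
# separately for each language.
W_head = rng.standard_normal((64, 10)) * 0.01

def predict_digit(x):
    """Class probabilities over the ten digits for one input vector."""
    logits = features(x) @ W_head
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = predict_digit(rng.standard_normal(128))
print(p.shape)  # (10,)
```

Since `W_frozen` is shared and fixed, only the head's 64 x 10 weights are updated during fine-tuning, which is what keeps the per-language training cost low.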
The incorporation of transfer learning in the digit recognition process provides several significant advantages during model training. As illustrated in Figure 4, this approach substantially reduces the required training time and computational resources compared to training a model from scratch. It also enables effective training with limited data, since it capitalizes on the pre-trained knowledge and salient features extracted by the language detection model. By exploiting transfer learning, we improve the accuracy and generalization of the digit recognition model and the overall efficiency of the system. Notably, reusing the pre-trained language recognition model also reduces the number of newly trained parameters, lowering the computation cost of recognition.