As shown in Figure 1, the proposed method consists of four parts:
First, the input data undergoes a preprocessing procedure. The original datasets of handwritten digit images are typically provided at a small size, commonly 28x28 pixels, and simply up-sampling these images degrades image quality, which adversely affects classifier performance. To overcome this challenge, as our second step, we developed a robust, novel U-Net-based model named MRAE U-Net, which uses transfer learning to enhance image quality. Third, we incorporate a language recognition model to identify the language of each processed image. Finally, we employ a digit recognition model that classifies the individual digits of the identified language; it transfers weights from the language recognition model and is fine-tuned for digit recognition. Detailed implementation aspects of this system are provided in the subsequent sections.
A. Preprocessing
In this research, our focus was on classifying digits in 12 different languages. However, high-quality handwritten images at a size of 128x128 pixels were not readily available for training the MRAE U-Net model. To address this, we curated a diverse collection of more than 1,000 fonts covering all the languages, then applied data augmentation to ensure a balanced and varied dataset, particularly for languages with limited font availability. The super-resolution model also requires low-quality images as input. As shown in Figure 1, to generate these inputs we first created high-quality images and then applied down-sampling followed by up-sampling, yielding a low-resolution version of each digit image suitable for training the super-resolution model.
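The down-sample/up-sample degradation described above can be sketched in a few lines; the scale factor of 4 and the nearest-neighbour interpolation are illustrative assumptions, since the text does not specify them:

```python
import numpy as np

def degrade(img, factor=4):
    """Create a low-resolution training input by down-sampling a
    high-quality image and up-sampling it back to its original size.
    Nearest-neighbour resampling is an illustrative choice; the paper
    does not fix the interpolation method."""
    # Down-sample: keep every `factor`-th pixel in each axis.
    small = img[::factor, ::factor]
    # Up-sample: repeat each remaining pixel `factor` times per axis.
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

hq = np.random.rand(128, 128)  # synthetic stand-in for a 128x128 digit
lq = degrade(hq)
print(lq.shape)  # (128, 128): same size as the input, but detail is lost
```

The pair (lq, hq) then serves as one (input, target) training example for the super-resolution model.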
The handwritten digit training data used in this study comprises twelve datasets. The MNIST-MIX collection [29] supplied the Persian, BanglaLekha, Tibetan, Urdu, ISI Bangla, ARDIS, and Kannada digits. The Chinese handwritten number dataset, collected by researchers at Newcastle University, UK, from Chinese nationals, contains samples written by 100 people with different handwriting [22]. English digits were taken from the USPS dataset [20], and Arabic digits from MADBase [32]. The Gurmukhi and Gujarati images were collected from GitHub repositories [30, 31]. Table 1 presents representative examples from the datasets employed in this study, and Table 2 summarizes their characteristics, including the number of samples and their sizes.
As previously mentioned, to overcome the limited data availability for some languages, we employed data augmentation. Specifically, we applied horizontal shift and rotation functions to increase the number of images severalfold. Data augmentation offers several benefits. First, it addresses the issue of limited data by expanding the dataset, particularly for languages with insufficient sample sizes; the additional augmented samples give the model more diverse training examples. Second, it acts as a regularizer by introducing variations and perturbations into the training data, which prevents the model from memorizing specific instances and encourages it to learn more robust, generalizable features that transfer to unseen data. Finally, data augmentation ensures a more balanced class representation, preventing the model from being biased towards dominant classes and improving its accuracy across all classes.
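The two augmentation operations named above can be sketched as follows. The shift amounts and rotation angles are our own illustrative choices, and the hand-rolled nearest-neighbour rotation stands in for the library routine (e.g. from an image-processing toolkit) that would normally be used:

```python
import numpy as np

def shift_horizontal(img, pixels):
    """Horizontal shift: roll the columns and zero the wrapped-in strip."""
    out = np.roll(img, pixels, axis=1)
    if pixels > 0:
        out[:, :pixels] = 0
    elif pixels < 0:
        out[:, pixels:] = 0
    return out

def rotate(img, degrees):
    """Small-angle rotation about the image centre via inverse
    nearest-neighbour coordinate mapping (illustrative only)."""
    theta = np.deg2rad(degrees)
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse-rotate each output coordinate back into the source image.
    sx = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    sy = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    sx = np.clip(np.round(sx).astype(int), 0, w - 1)
    sy = np.clip(np.round(sy).astype(int), 0, h - 1)
    return img[sy, sx]

digit = np.random.rand(128, 128)
augmented = [shift_horizontal(digit, s) for s in (-4, 4)] + \
            [rotate(digit, a) for a in (-10, 10)]
```

Each source image thus yields several shifted and rotated variants, expanding the dataset severalfold.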
Table 1 Sample images of 12 different handwritten digits datasets.
Table 2 Summary of handwritten digits datasets for training.
Dataset | Size | Amount | Language
Custom | 128x128 | 120,000 | All Languages
USPS [20] | variable | 20,000 | English
MNIST-MIX [29] | 28x28 | 60,000 | Persian
MADBase [32] | 28x28 | 60,000 | Arabic
Gujarati [30] | 256x256 | 5,600 | Gujarati
Gurmukhi [31] | 32x32 | 1,000 | Gurmukhi
Chinese [22] | 64x64 | 60,000 | Chinese
MNIST-MIX [29] | 28x28 | 6,606 | Urdu
MNIST-MIX [29] | 28x28 | 14,214 | Tibetan
MNIST-MIX [29] | 28x28 | 15,798 | BanglaLekha
MNIST-MIX [29] | 28x28 | 19,392 | ISI Bangla
MNIST-MIX [29] | 28x28 | 60,000 | ARDIS
MNIST-MIX [29] | 28x28 | 60,000 | Kannada
B. Super Resolution
In our research, producing high-resolution handwritten digit images from low-quality inputs posed a significant challenge. To overcome this hurdle, we required a robust, high-capacity model with strong learning capabilities, and therefore opted for a model based on the well-known U-Net architecture. The proposed model comprises three principal components, as depicted in Figure 2.
The encoder section of our model consists of four blocks, each composed of two 3x3 convolution layers. The numbers of filters in these layers are 64, 128, 256, and 512 for the first, second, third, and fourth blocks, respectively. Each block ends with a 2x2 max-pooling operation. The output of each encoder block is split into two branches: one proceeds to the next encoder block, while the other enters the specially designed MRAE module within the skip-connection part of the U-Net.
The bottleneck section consists of four 3x3 convolution layers arranged sequentially, and its output feeds the decoder section. The decoder, the third section of our model, comprises four blocks, each beginning with a transposed convolution layer. The input to the first decoder block comes from the bottleneck; at each level, the upsampled feature maps are concatenated with the corresponding feature maps from the skip-connection part of the encoder, after which two convolution layers are applied. The filter counts mirror the encoder blocks in reverse order: 512, 256, 128, and 64, respectively. The output of each decoder block is passed to the next, higher-level decoder block, and this process continues up to the full resolution.
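The shape bookkeeping of the three sections can be verified with a few lines of arithmetic. The 128x128 input follows from the preprocessing section; the assumption that the bottleneck keeps 512 channels is ours, since the text specifies only the layer count there:

```python
# Trace (spatial size, channel count) through the MRAE U-Net,
# with padded 3x3 convolutions that preserve spatial size.
size, filters = 128, [64, 128, 256, 512]

encoder = []
for f in filters:
    encoder.append((size, f))  # two 3x3 convs keep the spatial size
    size //= 2                 # 2x2 max-pooling halves it

bottleneck = (size, filters[-1])  # four sequential 3x3 convs (512 assumed)

decoder = []
for f in reversed(filters):
    size *= 2                  # transposed convolution doubles the size
    decoder.append((size, f))  # concat with the skip path, two 3x3 convs

print(encoder)     # [(128, 64), (64, 128), (32, 256), (16, 512)]
print(bottleneck)  # (8, 512)
print(decoder)     # [(16, 512), (32, 256), (64, 128), (128, 64)]
```

The final decoder block restores the full 128x128 resolution, matching the high-quality targets.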
We introduce a novel module architecture called the MRAE module, integrated into the skip-connection part of the U-Net. This module operates on the convolutional feature maps passed from each encoder block. As illustrated in Figure 3, the input feature maps are processed in parallel by three convolutional layers with different filter sizes (1x1, 3x3, and 5x5), and the outputs of these layers are summed. Following this block, a channel-wise attention mechanism enhances the feature extraction process. The attention block consists of a 3x3 convolutional layer, a Global Average Pooling (GAP) layer, a reshape operation, two further convolutional layers with ReLU and Sigmoid activation functions, respectively, an up-sampling operation, and finally a multiplication operation.
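A much-simplified NumPy sketch of the MRAE data flow is given below. The spatial convolutions are replaced by channel-mixing matrices, and the channel-reduction ratio of 4 inside the attention gate is our own assumption; this only illustrates the multi-branch sum followed by channel-wise gating, not the exact layer implementation:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Channel-wise attention gate: GAP, two dense channel mappings
    standing in for the inner convolutions (ReLU then Sigmoid), and a
    broadcast multiply over the channels of x with shape (H, W, C)."""
    gap = x.mean(axis=(0, 1))                 # Global Average Pooling -> (C,)
    hidden = np.maximum(0, gap @ w1)          # first mapping + ReLU
    gate = 1 / (1 + np.exp(-(hidden @ w2)))   # second mapping + Sigmoid -> (C,)
    return x * gate                           # reweight each channel

def mrae(x, kernels, w1, w2):
    """Multi-scale branch: three parallel channel mixers (stand-ins for
    the 1x1/3x3/5x5 convolutions) are summed, then attention-gated."""
    mixed = sum(x @ k for k in kernels)
    return channel_attention(mixed, w1, w2)

rng = np.random.default_rng(0)
C = 64
x = rng.standard_normal((32, 32, C))
kernels = [rng.standard_normal((C, C)) * 0.01 for _ in range(3)]
w1 = rng.standard_normal((C, C // 4)) * 0.1   # reduction ratio 4 (assumed)
w2 = rng.standard_normal((C // 4, C)) * 0.1
out = mrae(x, kernels, w1, w2)
print(out.shape)  # (32, 32, 64): same shape as the incoming feature maps
```

Because the output shape matches the input, the module drops into the skip connection without altering the concatenation in the decoder.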
Incorporating the MRAE module yields a significant increase in accuracy. Its three convolutional layers with different filter sizes, coupled with the channel-wise attention block, allow the model to discern the significance of each channel and allocate attention effectively during feature extraction. By adapting dynamically to the input data, the MRAE module enhances the model's discriminative capability, improving overall performance on the handwritten digit recognition task.
To train this model, we utilized transfer learning. To create a balanced dataset, we collected 10,000 paired low-quality and high-quality digit images per language, for a total of 120,000 images across all languages. Once trained on this curated dataset, the model was used to enhance handwritten digit images at prediction time without any additional retraining, and it produced excellent results.
C. Language Recognition
Given the multilingual nature of our datasets, which encompass 12 languages with 10 numerical characters each (except the Chinese dataset, which has 15 classes), we face a total of 125 distinct digit classes (11 x 10 + 15). Handling such a large number of classes with a single model presents significant difficulties. Additionally, treating similar numerals across languages as the same class can hinder learning, as the same number may be written very differently in different languages. Moreover, adding a new language would require retraining the entire model.
To address these challenges, we propose an approach that involves using a language detection model to determine the language of the input image. Based on the detected language, a corresponding classifier model is selected to recognize the specific type of number. By decoupling the language detection process from the digit recognition model, we achieve a more streamlined approach for incorporating new languages into our system. This separation allows us to develop new number recognition models for additional languages and seamlessly integrate them into the existing system, without disrupting the language detection and other digit detection components. This flexibility ensures scalability as the number of supported languages continues to grow.
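The two-stage routing described above can be sketched as a small dispatcher. The model objects here are hypothetical stand-ins; the point is that registering a new language touches only one dictionary entry:

```python
def recognize(image, language_model, digit_models):
    """Route `image` through language detection, then through the
    digit classifier registered for the detected language."""
    language = language_model(image)
    return language, digit_models[language](image)

# Stand-in models for illustration only.
language_model = lambda img: "Persian"
digit_models = {"Persian": lambda img: 7, "English": lambda img: 3}

lang, digit = recognize(None, language_model, digit_models)
print(lang, digit)  # Persian 7

# A new language plugs in without retraining the existing models:
digit_models["Gujarati"] = lambda img: 5
```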
To address the language detection task, we sought a model capable of robustly extracting and learning essential features. In Figure 4, we introduce our novel MRAE-based language classifier, which comprises three blocks. Each block is composed of two convolutional layers with 3x3 filters, a max-pooling layer, and an MRAE module. The convolutional layers in these blocks use 64, 128, and 256 filters, respectively.
Following the three feature extraction blocks, we add three fully connected layers of 512, 256, and 128 units, respectively. The final layer consists of 12 neurons, one per language class, and employs the Softmax activation function to generate a probability distribution over the language classes. The model is trained on a balanced dataset of 10,000 samples per language and achieves high accuracy in language detection.
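The Softmax output layer can be illustrated directly; the random logits stand in for the activations of the final 12-neuron layer:

```python
import numpy as np

def softmax(z):
    """Numerically stable Softmax, as applied by the 12-neuron output layer."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.random.randn(12)           # hypothetical last-layer activations
probs = softmax(logits)                # probability distribution, sums to ~1
predicted_language = int(probs.argmax())  # index of the detected language
```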
One of the significant advantages of this proposed language recognition model lies in its simplicity and efficient structure, which contributes to a small number of parameters, rendering it computationally lightweight. Furthermore, due to its streamlined architecture, this model can also be seamlessly integrated into the digit recognition step, thereby providing a cohesive and unified framework for the entire handwritten digit recognition system.
D. Digit Recognition
In the digit recognition stage, we capitalize on the language-specific classifier models to achieve accurate number recognition, resulting in 12 distinct models, each corresponding to a particular language. To automate this process and streamline the recognition pipeline, we have developed a sophisticated module that seamlessly integrates the language detection model with the corresponding number recognition model. This integration enables us to efficiently direct the input image to the appropriate number recognition model based on the identified language, thereby enhancing the overall efficiency of the system's output.
The structure of the digit recognition model closely resembles that of the language detection model, leveraging transfer learning techniques and fine-tuning to optimize performance. Within the last fully connected layer, we utilize 10 neurons to facilitate the classification of the ten classes, representing the numbers from 0 to 9. It is important to note that the final model structure is individualized for each language.
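The transfer-and-fine-tune scheme can be sketched as a frozen feature extractor with a fresh 10-way head. The random projection below is a toy stand-in for the pre-trained convolutional blocks of the language model; only the new head's parameters would be trained per language:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen weights transferred from the language recognition model
# (toy stand-in for the pre-trained feature extraction blocks).
W_frozen = rng.standard_normal((128, 64))

def features(x):
    """Frozen feature extractor: fixed projection plus ReLU."""
    return np.maximum(0, x @ W_frozen)

# Only the new 10-neuron head (digits 0-9) is initialised and fine-tuned
# separately for each language.
W_head = rng.standard_normal((64, 10)) * 0.01

def predict_digit(x):
    """Class probabilities over the ten digits for one input vector."""
    logits = features(x) @ W_head
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = predict_digit(rng.standard_normal(128))
print(p.shape)  # (10,)
```

Since `W_frozen` is shared and fixed, only the head's 64 x 10 weights are updated during fine-tuning, which is what keeps the per-language training cost low.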
The incorporation of transfer learning in the digit recognition process provides several significant advantages during model training. As illustrated in Figure 4, this approach substantially reduces the required training time and computational resources compared to training a model from scratch. It also enables effective training with limited data, since it capitalizes on the pre-trained knowledge and salient features extracted by the language detection model. By exploiting transfer learning, we improve the accuracy and generalization of the digit recognition model and the overall efficiency of the system. Notably, reusing the pre-trained language recognition model also reduces the number of newly trained parameters, lowering the computation cost of recognition.