Ancient Tamil inscriptions differ significantly from modern Tamil, making them difficult for non-experts to understand. Tasks such as transliteration and translation therefore require the intervention of human experts. Various AI-based systems have been proposed to solve transliteration sub-tasks such as character recognition, character reconstruction, and word and sentence segmentation. The shadow photometric stereo method [1] and bidirectional Long Short-Term Memory (Bi-LSTM) networks [2] are two prominent methods used for character recognition and character reconstruction.
The Tamil script is derived from the Brahmi script, which is known for its distinctive "scriptio continua" characteristic. Modern Tamil and English use the whitespace character ' ' to demarcate word boundaries, whereas scriptio continua languages, including ancient Tamil and many Asian languages such as Tibetan, Thai, Chinese, and Japanese, lack explicit word boundaries. The absence of clearly defined word boundaries in these languages makes the task of word segmentation challenging and non-trivial.
Addressing the issue of word boundary identification in "scriptio continua" scripts is an active and ongoing area of research. Word segmentation is considered a crucial preprocessing step in various Natural Language Processing (NLP) tasks. The development of accurate word segmentation techniques for scriptio continua languages will ultimately result in robust NLP systems for downstream tasks like machine translation, information retrieval, and text analysis, contributing to a deeper understanding of these diverse and culturally rich linguistic traditions. Several statistical and machine learning-based approaches are being explored to achieve efficient and accurate word segmentation. The Tamil language poses additional complexities. In English, a letter of the alphabet typically represents the minimal unit of a word, but the Tamil script operates under a different system: a single letter can symbolize a vowel, a consonant, or a combination of both, making it a complex and versatile script. This characteristic is particularly pronounced when considering the structural and usage differences between ancient and modern Tamil.
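To make this concrete, in Unicode a single written Tamil letter is often encoded as a consonant plus a dependent vowel sign, so character-level processing cannot assume the one-code-point-per-letter model that holds for English. The following minimal Python sketch (the example letter is ours, chosen for illustration) shows this:

```python
# A single written Tamil letter can combine a consonant and a vowel:
# the grapheme "கி" (ki) renders as one letter but is two code points.
letter = "\u0b95\u0bbf"               # TAMIL LETTER KA + TAMIL VOWEL SIGN I
print(letter)                          # கி  -- displayed as one letter
print(len(letter))                     # 2   -- two Unicode code points
print([hex(ord(c)) for c in letter])   # ['0xb95', '0xbbf']
```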
Related Work
The methods proposed for word segmentation can be broadly classified into dictionary-based and machine learning-based approaches.
Dyer et al. [4] compare token-level and character-level approaches to restoring spaces, punctuation, and capitalization in unformatted sequences of input characters. Their model is a novel character-level end-to-end Bi-LSTM (overall F-score 0.90) that can restore mid-token capitalization and punctuation and does not require space characters to be present in input strings. Its small number of trainable parameters also makes it viable for building word segmentation models for low-resource languages, but on large, extensive corpora the model may falter due to its inability to generalize over a huge number of cases and edge cases, and it may suffer latency issues during inference.
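As an illustration of this kind of architecture, the sketch below (in PyTorch) tags each input character with whether a word boundary should follow it. The layer sizes and the two-tag scheme are our assumptions for illustration, not the settings of [4]:

```python
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    """Minimal character-level Bi-LSTM that tags each character with
    whether a word boundary (space) follows it, in the spirit of the
    model of Dyer et al. [4]. Sizes are illustrative only."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, n_tags=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, char_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)                  # (batch, seq_len, n_tags) logits

# Toy usage: 5 characters, tag 1 = "insert a space after this character".
model = CharBiLSTMTagger(vocab_size=100)
logits = model(torch.randint(0, 100, (1, 5)))
print(logits.argmax(-1))                    # predicted boundary tags
```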
Haruechaiyasak and Kongthon [6] proposed LexToPlus, a tokenization and normalization tool for Thai, a language with the scriptio continua property. They modified the dictionary-based segmentation method to handle insertion, transformation, transliteration, and onomatopoeia in Thai social media texts. The proposed method performed better than machine learning-based approaches such as Conditional Random Fields and Support Vector Machines. However, this approach requires a well-developed dictionary, which is hard to obtain for low-resource languages such as ancient Tamil.
Paripremkul and Sornil [7] proposed a novel technique for Thai word segmentation. They divided the task into three phases: Minimum Text Unit (MTU) extraction (an MTU is the smallest unit of a word in Thai), syllable identification, and word construction. MTU extraction and syllable identification used Conditional Random Fields with features engineered from language characteristics, while word construction combined dictionary-dependent longest word matching with a rule-based approach. The proposed method handled ambiguous word boundaries and out-of-vocabulary (OOV) words better than a baseline CNN model.
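A sketch of what such engineered per-character features might look like is given below. The concrete features are illustrative assumptions rather than the feature set of [7]; dictionaries of this shape are the standard input format for CRF toolkits such as sklearn-crfsuite:

```python
def char_features(text, i):
    """Per-character feature dictionary for a CRF sequence tagger, in
    the spirit of the engineered features of Paripremkul and Sornil [7].
    The specific features here are illustrative assumptions."""
    c = text[i]
    return {
        "char": c,
        "is_digit": c.isdigit(),
        "prev_char": text[i - 1] if i > 0 else "<BOS>",
        "next_char": text[i + 1] if i < len(text) - 1 else "<EOS>",
        "bigram": text[i : i + 2],
    }

# One feature dict per character; a CRF then predicts a boundary tag
# (e.g., B = begins a unit, I = inside a unit) for every position.
sentence = "ABCD"
X = [char_features(sentence, i) for i in range(len(sentence))]
print(X[0])
```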
Peng, Feng, and McCallum [8] present a way to extract words from Chinese text, which does not delimit words with spaces. They use linear-chain conditional random fields to segment text into individual words and to identify new words that are not in the existing vocabulary, which is important because Chinese is a morphologically rich language in which new words are frequently formed.
Widiarti and Pulungan [10] proposed brute-force and greedy algorithms for word segmentation, an essential step in transliterating ancient Javanese manuscripts into modern Indonesian. Given a string of continuous syllables as input, the algorithms compose syllables extracted from manuscripts (the Hamong Tani Book) into meaningful words found in the Bausastra dictionary. Experimental results show that the greedy algorithm was more efficient than the brute-force algorithm, though the authors reported inefficiencies in processing deformed characters and out-of-vocabulary (OOV) words. Greedy algorithms solve simple incremental optimization problems well, allowing models to be built even in low-resource settings, but they falter when more complex distributions must be learned to reach the final solution.
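The following minimal sketch shows the greedy composition strategy: at each position, take the longest run of syllables that forms a dictionary word. The syllables and dictionary entries are invented for illustration and are not from [10]:

```python
def greedy_compose(syllables, dictionary, max_len=4):
    """Greedily compose a continuous syllable stream into dictionary
    words, a minimal sketch of the greedy strategy described by
    Widiarti and Pulungan [10]. Unmatched syllables are kept as-is,
    which is one simple way to handle OOV input."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i : i + n])
            if n == 1 or candidate in dictionary:  # fall back to one syllable
                words.append(candidate)
                i += n
                break
    return words

# Toy example: "pada" and "sawah" are treated as dictionary words.
print(greedy_compose(["pa", "da", "sa", "wah"], {"pada", "sawah"}))
# ['pada', 'sawah']
```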
Raman and Sahu [9] cover a method for reading Devanagari script with OCR by preprocessing and segmenting each character into three constituent parts. By reading the upper, middle, and lower regions of each character individually, they achieve high accuracy in character recognition. This technique provides a degree of contextual knowledge about the input data even in low-resource conditions, but the model lacks flexibility in handling out-of-distribution cases.
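A minimal sketch of this zone-based splitting is shown below. The boundary heuristics (locating the header line by the densest ink row, and a fixed lower-zone cutoff) are our assumptions for illustration, not the exact procedure of [9]:

```python
import numpy as np

def split_devanagari_zones(char_img):
    """Split a binarized character image (1 = ink) into upper, middle,
    and lower zones, in the spirit of the region-wise reading described
    by Raman and Sahu [9]. Boundary heuristics here are assumptions."""
    profile = char_img.sum(axis=1)               # ink count per row
    header_row = int(profile.argmax())           # headline: densest row
    lower_start = int(0.75 * char_img.shape[0])  # assumed fixed cutoff
    return (char_img[:header_row],               # upper zone
            char_img[header_row:lower_start],    # middle zone
            char_img[lower_start:])              # lower zone

# Toy usage with a random binary image in place of a scanned character.
zones = split_devanagari_zones((np.random.rand(32, 32) > 0.5).astype(int))
print([z.shape for z in zones])
```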
Zia, Raza, and Athar [11] present a word segmentation system for Urdu, a language that does not always add spaces between two words and sometimes adds spaces within a single word. Their system uses linear-chain conditional random fields, which model the sequence as an undirected graph connecting each element to its immediate neighbors, allowing predictions to use context clues. Conditional Random Field approaches are hard to define and inflexible to changes, but they are very good at building simple probabilistic state models.
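The chain structure is what makes exact decoding tractable: each position's score depends only on its own tag and its immediate neighbor's tag, so the best tag sequence can be found with Viterbi decoding. The sketch below uses toy scores; it illustrates the linear-chain decoding shared by CRF segmenters such as [8] and [11], not their trained models:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain model. `emissions` is
    (seq_len, n_tags) per-position tag scores; `transitions` is
    (n_tags, n_tags) neighbor-to-neighbor tag scores."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        total = score[:, None] + transitions + emissions[t]  # (prev, cur)
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Follow back-pointers from the best final tag.
    tags = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        tags.append(int(back[t][tags[-1]]))
    return tags[::-1]

# Toy decode: 2 tags (0 = no boundary after char, 1 = boundary after char).
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(6, 2)), rng.normal(size=(2, 2))))
```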
Overview of Proposed Method and Objectives
The purpose of our method is to efficiently segment low-resource scriptio continua languages, ancient Tamil in our case, using machine learning techniques with data collected from stone inscriptions. The major contributions of this study include:
- A segmentation model trained on text extracted from the South Indian Inscriptions books published by the Archaeological Survey of India.
- An ancient Tamil corpus extracted from Sangam literature and the South Indian Inscriptions books, which can be used in NLP tasks for the Tamil language.