Ancient Tamil inscriptions differ significantly from modern Tamil, making them difficult for non-experts to understand. Tasks such as transliteration and translation therefore require the intervention of human experts. Various AI-based systems have been proposed to solve transliteration sub-tasks such as character recognition, character reconstruction, and word and sentence segmentation. The shadow photometric stereo method [1] and bidirectional Long Short-Term Memory (Bi-LSTM) networks [2] are two prominent methods used for character recognition and character reconstruction.
The Tamil script is derived from the Brahmi script, which is known for its distinctive "scriptio continua" characteristic. Modern Tamil and English use the whitespace character ' ' to demarcate word boundaries, whereas scriptio continua languages, including ancient Tamil and many Asian languages such as Tibetan, Thai, Chinese, and Japanese, lack explicit word boundaries. The absence of clearly defined word boundaries in these languages makes the task of word segmentation challenging and non-trivial.
Addressing the issue of word boundary identification in "scriptio continua" scripts is an active and ongoing area of research. Word segmentation is considered a crucial preprocessing step in various Natural Language Processing (NLP) tasks. The development of accurate word segmentation techniques for scriptio continua languages will ultimately result in robust NLP systems for downstream tasks like machine translation, information retrieval, and text analysis, contributing to a deeper understanding of these diverse and culturally rich linguistic traditions. Several statistical and machine learning-based approaches are being explored to achieve efficient and accurate word segmentation. The Tamil language poses additional complexities. In English, a letter of the alphabet typically represents the minimal unit of a word, but the Tamil script operates under a different system: a single letter can symbolize a vowel, a consonant, or a combination of both, making it a complex and versatile script. This characteristic is particularly pronounced when considering the structural and usage differences between ancient and modern Tamil.
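To make this concrete, in Unicode a single written Tamil letter is often encoded as a consonant plus a dependent vowel sign, so character-level processing cannot assume the one-code-point-per-letter model that holds for English. The following minimal Python sketch (the example letter is ours, chosen for illustration) shows this:

```python
# A single written Tamil letter can combine a consonant and a vowel:
# the grapheme "கி" (ki) renders as one letter but is two code points.
letter = "\u0b95\u0bbf"               # TAMIL LETTER KA + TAMIL VOWEL SIGN I
print(letter)                          # கி  -- displayed as one letter
print(len(letter))                     # 2   -- two Unicode code points
print([hex(ord(c)) for c in letter])   # ['0xb95', '0xbbf']
```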
Related Work
The methods proposed for word segmentation can be broadly classified into dictionary-based and machine learning-based approaches.
Dyer et al. [4] compare token-level and character-level approaches to restoring spaces, punctuation, and capitalization in unformatted sequences of input characters. Their model is a novel character-level end-to-end Bi-LSTM (overall F-score 0.90) that can restore mid-token capitalization and punctuation and does not require space characters to be present in input strings. Its small number of trainable parameters also makes it viable for building word segmentation models for low-resource languages, but on large, extensive corpora the model may falter due to its inability to generalize over a huge number of cases and edge cases, and it may suffer latency issues during inference.
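As an illustration of this kind of architecture, the sketch below (in PyTorch) tags each input character with whether a word boundary should follow it. The layer sizes and the two-tag scheme are our assumptions for illustration, not the settings of [4]:

```python
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    """Minimal character-level Bi-LSTM that tags each character with
    whether a word boundary (space) follows it, in the spirit of the
    model of Dyer et al. [4]. Sizes are illustrative only."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, n_tags=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, char_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)                  # (batch, seq_len, n_tags) logits

# Toy usage: 5 characters, tag 1 = "insert a space after this character".
model = CharBiLSTMTagger(vocab_size=100)
logits = model(torch.randint(0, 100, (1, 5)))
print(logits.argmax(-1))                    # predicted boundary tags
```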
Haruechaiyasak and Kongthon [6] proposed LexToPlus, a tokenization and normalization tool for Thai, a language with the scriptio continua property. They modified the dictionary-based segmentation method to handle insertion, transformation, transliteration, and onomatopoeia in Thai social media texts. The proposed method performed better than machine learning-based approaches such as Conditional Random Fields and Support Vector Machines. However, this approach requires a well-developed dictionary, which is hard to obtain for low-resource languages such as ancient Tamil.
Paripremkul and Sornil [7] proposed a novel technique for Thai word segmentation. They divided the task into three phases: Minimum Text Unit (MTU) extraction (an MTU is the smallest unit of a word in Thai), syllable identification, and word construction. MTU extraction and syllable identification used Conditional Random Fields with features engineered from language characteristics, while word construction combined dictionary-dependent longest word matching with a rule-based approach. The proposed method handled ambiguous word boundaries and out-of-vocabulary (OOV) words better than a baseline CNN model.
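A sketch of what such engineered per-character features might look like is given below. The concrete features are illustrative assumptions rather than the feature set of [7]; dictionaries of this shape are the standard input format for CRF toolkits such as sklearn-crfsuite:

```python
def char_features(text, i):
    """Per-character feature dictionary for a CRF sequence tagger, in
    the spirit of the engineered features of Paripremkul and Sornil [7].
    The specific features here are illustrative assumptions."""
    c = text[i]
    return {
        "char": c,
        "is_digit": c.isdigit(),
        "prev_char": text[i - 1] if i > 0 else "<BOS>",
        "next_char": text[i + 1] if i < len(text) - 1 else "<EOS>",
        "bigram": text[i : i + 2],
    }

# One feature dict per character; a CRF then predicts a boundary tag
# (e.g., B = begins a unit, I = inside a unit) for every position.
sentence = "ABCD"
X = [char_features(sentence, i) for i in range(len(sentence))]
print(X[0])
```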
Peng, Feng, and McCallum [8] present a way to extract words from Chinese text, which does not delimit words with spaces. They use linear-chain conditional random fields to segment text into individual words and to identify new words that are not in the existing vocabulary, which is important because Chinese is a morphologically rich language in which new words are frequently formed.
Widiarti and Pulungan [10] proposed brute-force and greedy algorithms for word segmentation, an essential step in transliterating ancient Javanese manuscripts into modern Indonesian. Given a string of continuous syllables as input, the algorithms compose syllables extracted from manuscripts (the Hamong Tani Book) into meaningful words found in the Bausastra dictionary. Experimental results show that the greedy algorithm was more efficient than the brute-force algorithm, though the authors reported inefficiencies in processing deformed characters and out-of-vocabulary (OOV) words. Greedy algorithms solve simple incremental optimization problems well, allowing models to be built even in low-resource settings, but they falter when more complex distributions must be learned to reach the final solution.
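The following minimal sketch shows the greedy composition strategy: at each position, take the longest run of syllables that forms a dictionary word. The syllables and dictionary entries are invented for illustration and are not from [10]:

```python
def greedy_compose(syllables, dictionary, max_len=4):
    """Greedily compose a continuous syllable stream into dictionary
    words, a minimal sketch of the greedy strategy described by
    Widiarti and Pulungan [10]. Unmatched syllables are kept as-is,
    which is one simple way to handle OOV input."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i : i + n])
            if n == 1 or candidate in dictionary:  # fall back to one syllable
                words.append(candidate)
                i += n
                break
    return words

# Toy example: "pada" and "sawah" are treated as dictionary words.
print(greedy_compose(["pa", "da", "sa", "wah"], {"pada", "sawah"}))
# ['pada', 'sawah']
```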
Raman and Sahu [9] cover a method for reading Devanagari script with OCR by preprocessing and segmenting each character into three constituent parts. By reading the upper, middle, and lower regions of each character individually, they achieve high accuracy in character recognition. This technique provides a degree of contextual knowledge about the input data even in low-resource conditions, but the model lacks flexibility in handling out-of-distribution cases.
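A minimal sketch of this zone-based splitting is shown below. The boundary heuristics (locating the header line by the densest ink row, and a fixed lower-zone cutoff) are our assumptions for illustration, not the exact procedure of [9]:

```python
import numpy as np

def split_devanagari_zones(char_img):
    """Split a binarized character image (1 = ink) into upper, middle,
    and lower zones, in the spirit of the region-wise reading described
    by Raman and Sahu [9]. Boundary heuristics here are assumptions."""
    profile = char_img.sum(axis=1)               # ink count per row
    header_row = int(profile.argmax())           # headline: densest row
    lower_start = int(0.75 * char_img.shape[0])  # assumed fixed cutoff
    return (char_img[:header_row],               # upper zone
            char_img[header_row:lower_start],    # middle zone
            char_img[lower_start:])              # lower zone

# Toy usage with a random binary image in place of a scanned character.
zones = split_devanagari_zones((np.random.rand(32, 32) > 0.5).astype(int))
print([z.shape for z in zones])
```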
Zia, Raza, and Athar [11] present a word segmentation system for Urdu, a language that does not always add spaces between two words and sometimes adds spaces within a single word. Their system uses linear-chain conditional random fields, which model the sequence as an undirected graph connecting each element to its immediate neighbors, allowing predictions to use context clues. Conditional Random Field approaches are hard to define and inflexible to changes, but they are very good at building simple probabilistic state models.
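The chain structure is what makes exact decoding tractable: each position's score depends only on its own tag and its immediate neighbor's tag, so the best tag sequence can be found with Viterbi decoding. The sketch below uses toy scores; it illustrates the linear-chain decoding shared by CRF segmenters such as [8] and [11], not their trained models:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain model. `emissions` is
    (seq_len, n_tags) per-position tag scores; `transitions` is
    (n_tags, n_tags) neighbor-to-neighbor tag scores."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        total = score[:, None] + transitions + emissions[t]  # (prev, cur)
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Follow back-pointers from the best final tag.
    tags = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        tags.append(int(back[t][tags[-1]]))
    return tags[::-1]

# Toy decode: 2 tags (0 = no boundary after char, 1 = boundary after char).
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(6, 2)), rng.normal(size=(2, 2))))
```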
Overview of Proposed Method and Objectives
The purpose of our method is to efficiently segment low-resource scriptio continua languages, ancient Tamil in our case, using machine learning techniques with data collected from stone inscriptions. The major contributions of this study include:
- A segmentation model trained on text extracted from the South Indian Inscriptions books published by the Archaeological Survey of India.
- An ancient Tamil corpus extracted from Sangam literature and the South Indian Inscriptions books, which can be used in NLP tasks for the Tamil language.