Part-of-Speech (POS) tagging involves assigning lexical tags to words and symbols in the text [1]. These tags indicate the syntactic roles of symbols and words based on their context and sentence structure. POS tagging is fundamental in numerous natural language processing (NLP) applications, including question-answering, machine translation, information retrieval, and text summarization [2]. POS tags provide insights into the structural characteristics of lexical terms within a sentence or text, enabling assumptions about semantics. They also find applications in Named Entity Recognition, Co-reference Resolution, and Speech Recognition.
For instance, Chotriat et al. utilized POS tags for question classification in Thai [3] to enhance classification accuracy. In [4], a recommender system was proposed for social networks to categorize content and provide suggestions based on overlapping interests. The authors employed the POS tag of each word to assign values and obtain information representations for identifying shared interests. In Persian sentiment analysis, most studies use POS tags to classify sentiments [5, 6]. POS tagging helps to disambiguate words; words with an adjective role in a sentence are particularly useful in determining the overall sentiment of a document and play a crucial role in selecting the best features.
As in many other languages, the colloquial register of Persian is prevalent on social networks, which complicates processing because of the language's unique features and diverse writing styles. The conversational writing style found on social networks further exacerbates the difficulties in processing textual content. Even linguistic experts may encounter challenges when dealing with everyday text from social networks and may require assistance.
Abnormal styles, such as sentence parts (verb, subject, object, etc.) being deleted or their order changed (e.g., "رفتم خونه ی دوستم" instead of "من به خانه ی دوستم رفتم," equal to "I went to my friend's house" in English), as well as the abnormal repetition of letters (e.g., "لااایک," equal to "like" in English), and the use of misspelled words (e.g., "حتا," equal to "even" in English), are often found in social network texts.
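To make the letter-repetition phenomenon concrete, the following minimal sketch flags and collapses elongated tokens. It is only an illustration, not a step taken from this study: the function names and the threshold of three repeated characters are assumptions chosen for the example.

```python
import re

# Assumption: a run of three or more identical characters signals elongation.
# Legitimate doubled letters (two in a row) are left untouched.
ELONGATION_RE = re.compile(r"(.)\1{2,}")


def has_elongation(token: str) -> bool:
    """Return True if the token contains a character repeated 3+ times in a row."""
    return ELONGATION_RE.search(token) is not None


def collapse_elongation(token: str) -> str:
    """Replace every run of 3+ identical characters with a single character."""
    return ELONGATION_RE.sub(r"\1", token)
```

Applied to the example above, `collapse_elongation("لااایک")` yields "لایک". The heuristic is deliberately crude: collapsing a run to a single character can over-correct words whose standard spelling contains a doubled letter, so a practical system would likely check candidates against a lexicon.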
While there are powerful tools for POS tagging in English, some languages, such as Persian, still lack a comprehensive tool for this purpose, particularly for colloquial text. In recent years, there have been attempts to develop integrated Persian preprocessing packages, with notable examples being "Hazm" [8] and "ParsiPardaz" [9], both of which are almost complete and open-source.
The flexibility of the Persian language presents another challenge in POS tagging, making it more difficult than in other languages [10]. Creating new terms in Persian is effortless, and prefixes and suffixes can easily combine with other words to generate new terms. Persian is considered a morphologically rich language [11], so different terms can be formed simply by changing affixes. For instance, the word "آمد" (which means "(s)he came") is a verb. When combined with "در," the new word is "درآمد," which means "income" with the noun tag. Similarly, if it is combined with the prefix "کار," the resulting word is "کارآمد," meaning "efficient" with an adjective tag [12]. However, this morphological richness also adds complexity to POS tagging, as the correct tag needs to be assigned based on the word's context.
Furthermore, distinguishing Persian text content from languages like Arabic, Urdu, and Pashto is challenging, and processing extracted colloquial text content presents difficulties. Hazm defines POS tags for formal tokens well but may be less effective for informal texts. Additionally, the words and phrases of a language evolve over time, especially with the rapid growth of social networks. New terms and expressions constantly enter the Persian language due to the expansion of the internet, such as "میچتیم" (in English: we are speaking). Hence, conventional tools like Hazm may be unreliable for POS tagging in social network analysis. This limitation stems from the fact that the underlying models of such tools are primarily trained on formal textual sources, such as official news articles and newspapers. Furthermore, the popular Persian corpus BijanKhan [13], which has historically served as a valuable resource for numerous studies in Persian text processing, predominantly comprises data derived from daily news and common textual materials. This corpus may therefore no longer align with contemporary research demands, particularly for POS tagging of the informal vernacular expressions encountered on platforms like Twitter or Instagram.
Automatic POS tagging is a crucial step in NLP pipelines, but it requires a significant amount of annotated data to train reliable models. Social media texts differ greatly from formal texts in terms of grammar, spelling variations, slang, and abbreviated expressions, especially in Persian. Existing Persian corpora are valuable for analyzing and POS tagging formal text, but they may be less suitable for processing social media texts in applications like intelligent advertising or recommender systems.
The lack of a sufficiently large reference corpus of social media texts for training POS taggers could be why automatic POS tagging for social media texts has rarely been studied in the Persian language. Building labeled corpora, including part-of-speech tagging, presents numerous challenges, such as gathering data from diverse sources like Telegram, Twitter, and Instagram, each with its own methods and techniques. Accessing large volumes of data for research purposes can be difficult, and setting up data collection tools to extract such data can be time-consuming and expensive.
Despite the significant research in this area, more work must be done on colloquial POS tagging in Persian. In this study, we aim to address this gap by introducing a comprehensive corpus called CPPOS (Colloquial Persian Part of Speech), designed to tackle the challenges associated with informal text. CPPOS contains formal and informal sentences from three social media platforms: Telegram, Twitter, and Instagram. The data collection process spans from June 22, 2019, to March 20, 2021, to ensure coverage of various topics.
After data collection, a preprocessing step is conducted to clean the text by removing links, extra symbols, etc. The cleaned text is then tokenized using a Persian tokenizer, making it ready for annotation. Before annotation, three Persian linguistic experts created an annotation guideline, including general and specific rules and examples for specific cases. This guideline is continuously updated and is a fundamental reference for current annotators and possible future expansions of labeled data. The annotation process spans six months and involves 520K tokens, and the tagset comprises more than 60 tags.
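The cleaning and tokenization steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used for CPPOS: the exact cleaning rules and the Persian tokenizer are not specified here, so the regular expressions, the function names `clean_text` and `simple_tokenize`, and the regex-based tokenizer are all hypothetical stand-ins.

```python
import re
from typing import List

# Hypothetical cleaning rules; the actual rules used for the corpus may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# A run of 3+ identical symbol characters (e.g. "!!!!"); letters are not affected.
SYMBOL_RUN_RE = re.compile(r"([^\w\s])\1{2,}")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove links, shrink repeated symbols to one, and normalize whitespace."""
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RUN_RE.sub(r"\1", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def simple_tokenize(text: str) -> List[str]:
    """Stand-in for a real Persian tokenizer.

    Splits words from punctuation. The zero-width non-joiner (U+200C) is kept
    inside words because Persian orthography uses it within a single word.
    """
    return re.findall(r"[\w\u200c]+|[^\w\s]", text)
```

For example, `simple_tokenize(clean_text("hello!!!!   http://t.co/abc world"))` returns `["hello", "!", "world"]`. A dedicated Persian tokenizer would additionally handle clitics and multi-part verbs, which this sketch ignores.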
Finally, based on this annotated dataset, we train a BiLSTM model for automated POS labeling. Our contributions to this research can be summarized as follows:
- Providing an annotation guideline for colloquial Persian POS tagging.
- Introducing a novel colloquial Persian corpus called CPPOS, collected from social media platforms (Telegram, Twitter, and Instagram), consisting of over 520K tokens of both formal and informal text, including everyday phrases, and annotated with more than 60 tags.
- Developing an automated labeling model based on this corpus.
The rest of the study is organized as follows: The next section reviews previous works on POS tagging. The details of the CPPOS corpus are then presented in the section titled "Corpus Preparation." The subsequent section describes the proposed method. The section on Evaluations and Results contains the performance analysis and experimental results. Finally, the conclusions and future work are presented.