Annotated Bangla Natural Language Processing (BNLP) by using python or machine learning.

doi:10.21203/rs.3.rs-5101422/v1

Download PDF

Research Article

Annotated Bangla Natural Language Processing (BNLP) by using python or machine learning.

https://doi.org/10.21203/rs.3.rs-5101422/v1

This work is licensed under a CC BY 4.0 License

Version 1

posted

You are reading this latest preprint version

This paper offers a thorough method for creating and assessing Python-based Natural Language Processing (NLP) tools for Bengali language. The study and practice of natural language processing focuses on how computers may be programmed to recognize, comprehend, and manipulate natural language speech or text for practical purposes. The study entails a thorough process for developing and testing NLP tools for the Bangla language that are based on Python and Machine Learning. It centers on the ways in which computers may be taught to perform tasks like Named Entity Recognition, Tokenization, Part-of-Speech (POS) Tagging, and Sentiment Analysis in order to comprehend Bengali text. The study commences with the introduction of a thoroughly annotated corpus that forms the basis for these activities and is intended to encompass a broad spectrum of language situations and structures. The study hopes to encourage future research and development in Bangla NLP by making these tools and resources open-source, encouraging cooperation and creativity. This project aims to aid the larger NLP community by offering a strong basis for applications like machine translation and sentiment analysis on Bangla Language.

Bangla NLP

Python

Annotated Corpus

Tokenization

POS Tagging

Named Entity Recognition

Sentiment Analysis

Corpus

Annotation

Bangla is one of the most widely spoken languages in the world with nearly 250 million native speakers. Bangla is also the only language that has inspired a language movement, which gained international recognition and is now celebrated as International Mother Language Day. Bangla, the seventh most spoken language globally, lacks significant resources and tools in NLP. This paper addresses this gap by introducing an annotated corpus and demonstrating key NLP tasks using Python. This paper’s contributions, with accuracy of 93.94%, include the development of a comprehensive corpus and the implementation of fundamental NLP tasks, which are essential for advancing research in this language.

Literature Review

Over the past few decades, there has been a considerable evolution in Natural Language Processing (NLP), with a primary focus on languages with abundant resources, such English, Chinese, and Spanish. On the other hand, relatively little has been done to develop NLP tools and resources for languages with fewer resources, such as Bangla. This paper highlights how research on annotated Bangla NLP utilizing Python Programming and different machine learning approaches helps to close this gap.

With almost 250 million native speakers, Bangla is the ninth most spoken language in the world. Even with its extensive use, Bangla's NLP resources are still lacking. It has been difficult to create reliable NLP models for this language due to its distinct grammatical structures, script, and dearth of substantial annotated datasets. In order to achieve proper text processing, Alam et al. (2019) stress the significance of developing tokenization and stemming models specifically for Bangla. Dash (2018) adds that specific tools must be developed for tasks like tokenization, part-of-speech (POS) tagging, and named entity recognition (NER) due to Bangla's rich morphology and agglutinative nature.

For Annotated Corpora Development, Hasan et al. (2020) performed part-of-speech tagging using Conditional Random Fields (CRF), which is regarded a big step forward for Bangla NLP. Annotated corpus development is one of the key contributions to Bangla NLP research. In our work, a manually annotated corpus for sentiment analysis, NER, tokenization, and POS tagging is introduced, providing important information for further study. This study is in line with the larger objectives of developing more user-friendly solutions for Bangla speakers across a range of fields, such as social media monitoring and educational apps and newspapers.

Bangla NLP Tools and Techniques like NLTK, Pandas, and Scikit-learn are a few of the Python libraries that have been used to create Bangla NLP applications. Through the use of unique scripts for data preprocessing, model training, and evaluation across several tasks, this research integrates various technologies. This methodology is similar to that of Bird, Klein, and Loper (2009), who emphasizes the significance of incorporating machine learning methods into NLP processes.

Notable advancements in the field include the use of NLTK for tokenization, the use of a Hidden Markov Model (HMM) for POS tagging, and the optimization of SpaCy models for NER. Moreover, the utilization of Scikit-learn classifiers for sentiment analysis, including logistic regression, Naive Bayes, and Support Vector Machines (SVM), broadens the scope of machine learning in Bangla natural language processing.

Pang and Lee (2008) highlight the challenges of sentiment analysis, including handling sarcasm, ambiguity, and contextual meanings, issues that are exacerbated in languages like Bangla, where sentiment-specific resources are sparse. Sentiment analysis is one of the more complex tasks in Bangla NLP due to the language's diverse vocabulary and emotional expressions. Future advancements in sentiment analysis will benefit greatly from our work, as we have experimented with algorithms.

Bangla's linguistic peculiarities present a number of difficulties for NLP applications. The intricacies of Bangla Neural Encryption (NER), including managing names of individuals, places, and organizations that are underrepresented in existing datasets, are examined by Roy and Amin (2021). Similarly, this study highlights the need for more research on NER and sentiment analysis to manage Bangla's informal language usage and complicated syntax, particularly on social media platforms.

Creating strong natural language processing (NLP) tools for Bangla has the potential to have a big impact on a lot of different industries, like customer service, digital media, and education. According to Jurafsky and Martin (2020), NLP technologies can increase under-represented languages' inclusion and accessibility. Through improved language models for automated text analysis, chatbots, and sentiment monitoring in customer feedback, natural language processing (NLP) can strengthen local industries in the Bangla environment.

More complex models are required, according to the research, and deep learning architectures like Transformers and Long Short-Term Memory (LSTM), which have been demonstrated to increase task accuracy for sentiment analysis in other languages and NER, are among them. Researchers can advance Bangla NLP by incorporating these strategies and building on the groundwork laid by me.

The literature study highlights the pressing need to advance Bangla natural language processing (NLP), with the creation of tools and annotated datasets playing a key role in this process. The techniques and resources used in current research, such as those created by us, greatly aid in overcoming the linguistic difficulties that Bangla presents. Still, there is a need for more advanced machine learning models and more datasets, which points to useful areas for further study.

For this research, sample texts are collected as data from diverse sources including news articles, literature, and social media. These sentences were used as data which were annotated for various NLP tasks.

-Tokenization: Dividing text into words, phrases, symbols, or other meaningful elements.

-POS Tagging: Assigning parts of speech to each token.

-Named Entity Recognition (NER): Identifying proper names in the text.

-Sentiment Analysis: Classifying text based on emotional tone like positive, negative and neutral.

For research purpose, customized tools were developed using Python and libraries like NLTK, Pandas, spaCy and Matplotlib. The tools are developed for each task and custom scripts are written for preprocessing, model training, and evaluation.

Model Description

1. Tokenization

The term "tokenization" refers to the process of breaking down large blocks of text into smaller ones, which might be anything from words or subwords to individual characters. Before separating the text according to spaces or other criteria, text preparation is performed, such as deleting unnecessary characters. Unknown words or padding can be represented via special tokens. Token sequences are the end result and the building blocks of additional natural language processing operations like text analysis and model training. Handling script-specific elements and complicated linguistic rules can be a challenge when tokenizing languages like Bengali.

Consider the following sentence: "The quick brown fox jumps over the lazy dog."

Words that are tokenized: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
Character that are tokenized: ["T", "h", "e", "q", "u", "i", "c", "k",...].

2. POS Tagging

POS tagging (Part-of-Speech tagging) is a fundamental task in Natural Language Processing (NLP) where each word in a sentence is labelled with its corresponding part of speech. The goal is to identify whether a word is a noun, verb, adjective, adverb, etc., based on its usage and context within the sentence. POS tagging is a form of classification where words are classified into their respective syntactic categories. The following elaborates how POS Tagging Works.

Text Preprocessing: Before tagging, text is usually tokenized into words or sentences. This involves splitting the text into units that will be tagged.

Classification of Words: POS tagging is a classification task where a model or algorithm determines the correct tag for each word in the sentence.

Example:

Sentence: "The cat sat on the mat."

Tags:

"The" → Determiner (DET)

"cat" → Noun (NN)

"sat" → Verb (VBD)

"on" → Preposition (IN)

"the" → Determiner (DET)

"mat" → Noun (NN)

[tini amake bhalobashen]

“tini”- NN

“amake”- PN

“bhalobashen”- VB

3. Named Entity Recognition

One natural language processing (NLP) method is named entity recognition (NER), which sorts text according on predetermined criteria such names of people, places, dates, and organizations. The first step is to identify potential entities and give them appropriate labels (such as "John" for a person, "Google" for a company, and "Paris" for a city). Applications such as information retrieval, question-answering, and text summarization are made possible by NER, which aids in extracting important information from unstructured text. NER models often necessitate domain-or language-specific training data and can be rule-based, ML-based, or hybrids of the two. Example of identifying proper names in the text:

সে আমাকে প্রতিদিন ফুল দেয়

[('সে', 'PRP'), ('আমাকে', 'PRP'), ('প্রতিদিন', 'JJ'), ('ফুল', 'NN'), ('দেয়', 'VB')]

Positive

4. Sentiment Analysis

Sentiment analysis is a natural language processing (NLP) technique used to determine the emotional tone behind a body of text. It's a way of classifying text into categories such as positive, negative, or neutral sentiments. The goal is to assess how people feel about a topic, product, or service by analysing their language.

How Sentiment Analysis Works

1. Text Preprocessing:

- Tokenization, removing stop words, stemming/lemmatization, and removing special characters.

2. Feature Extraction:

- Convert text into numerical features. This can be done using:

- Bag of Words (BoW): Represents text by counting word occurrences.

- TF-IDF (Term Frequency-Inverse Document Frequency): Measures how important a word is in a document relative to a collection of documents.

-Word Embeddings: Like Word2Vec, GloVe, and BERT, which capture semantic relationships between words.

3. Classification Models:

Sentiment analysis typically involves supervised machine learning techniques:

- Logistic Regression: A simple yet effective algorithm for binary classification.

- Naive Bayes: Often used for text classification due to its simplicity.

- Support Vector Machines (SVM): Efficient for small to medium datasets.

- Deep Learning: Using models like LSTMs, GRUs, or Transformers for better performance in sentiment classification, especially with large datasets.

4. Polarity Detection:

Sentiment analysis algorithms classify texts based on polarity:

- Positive: Text expresses a positive sentiment.

- Negative: Text conveys a negative sentiment.

- Neutral: Text is neither positive nor negative.

5. Aspect-based Sentiment Analysis:

- In this variation, sentiment is analyzed for specific aspects of a product or service (e.g., a product review can be positive about the design but negative about the battery life).

Here, some examples are mentioned with Annotated Text:

Example 1: "বাংলা ভাষা একটি ইন্দো-ইউরোপীয় ভাষা।"

Tokens: ["বাংলা", "ভাষা", "একটি", "ইন্দো-ইউরোপীয়", "ভাষা", "।"]

POS Tags: ["NN", "NN", "DT", "JJ", "NN", "."]

Entities: [("বাংলা ভাষা", "LANGUAGE")]

Sentiment: ["Neutral"]

Code:

import nltk

from nltk.tokenize import word_tokenize

text = "বাংলা ভাষা একটি ইন্দো-ইউরোপীয় ভাষা।"

tokens = word_tokenize(text)

print(tokens)

Example 2: "ঢাকা বাংলাদেশের রাজধানী এবং বৃহত্তম শহর।"

Tokenization

["ঢাকা", "বাংলাদেশের", "রাজধানী", "এবং", "বৃহত্তম", "শহর", "।"]

POS Tags

["NNP", "NNP", "NN", "CC", "JJS", "NN", "."]

Named Entities

[("ঢাকা", "LOCATION"), ("বাংলাদেশ", "LOCATION")]

Sentiment: ["Neutral"]

Example 3: "রবীন্দ্রনাথ ঠাকুর বাংলা সাহিত্যের একজন প্রখ্যাত কবি।”

Tokenization: ["রবীন্দ্রনাথ", "ঠাকুর", "বাংলা", "সাহিত্যের", "একজন", "প্রখ্যাত", "কবি", "।"]

POS Tags: ["NNP", "NNP", "JJ", "NN", "DT", "JJ", "NN", "."]

Named Entities: [("রবীন্দ্রনাথ ঠাকুর", "PERSON"), ("বাংলা সাহিত্যের", "ART")]

Sentiment: ["Positive"]

Example 4: “আমি আজ স্কুলে যাবো না, কারন আমি অসুস্থ।”

Tokenization: ["আমি", "আজ", "স্কুলে", "যাব", "না", "কারণ", "আমি", "অসুস্থ", "।"]

POS Tags: ["PRP", "NN", "NN", "VB", "RB", "IN", "PRP", "JJ", "."]

Named Entities: [("স্কুলে", "LOCATION")]

Sentiment: ["Negative"]

Example 5: “আমার প্রিয় খাবার বিরিয়ানি।”

Tokenization: ["আমার", "প্রিয়", "খাবার", "বিরিয়ানি", "।"]

POS Tags: ["PRP$", "JJ", "NN", "NN", "."]

Named Entities: [("বিরিয়ানি", "FOOD")]

Sentiment: ["Positive"]

Example 6: "তুমি কেমন আছো?"

Tokenization: ["তুমি", "কেমন", "আছো", "?"]

POS Tags: ["PRP", "JJ", "VB", "."]

Named Entities: [ ]

Sentiment: ["Neutral"]

Example 7: "চট্টগ্রাম একটি সুন্দর বন্দর নগরী।"

Tokenization: ["চট্টগ্রাম", "একটি", "সুন্দর", "বন্দর", "নগরী", "।"]

POS Tags: ["NNP", "DT", "JJ", "NN", "NN", "."]

Named Entities: [("চট্টগ্রাম", "LOCATION")]

Sentiment: ["Positive"]

Example 8: "মাশরাফি বিন মুর্তজা একজন বিখ্যাত ক্রিকেটার।"

Tokenization: ["মাশরাফি", "বিন", "মুর্তজা", "একজন", "বিখ্যাত", "ক্রিকেটার", "।"]

POS Tags: ["NNP", "NNP", "NNP", "DT", "JJ", "NN", "."]

Named Entities: [("মাশরাফি বিন মুর্তজা", "PERSON")]

Sentiment: ["Positive"]

Example 9: "আমার বাবা একজন ডাক্তার।"

Tokenization: ["আমার", "বাবা", "একজন", "ডাক্তার", "।"]

POS Tags: ["PRP$", "NN", "DT", "NN", "."]

Named Entities: [ ]

Sentiment: ["Neutral"]

Part of speech tagging

Part-of-Speech (POS) tagging involves assigning labels to each word in a sentence to indicate its part of speech. These labels can include a variety of grammatical categories. Here are some common POS tags along with their abbreviations:

Common POS Tags and Abbreviations:

1. Nouns (NN)

- NN: Noun, singular or mass (e.g., "dog", "car")

- NNS: Noun, plural (e.g., "dogs", "cars")

- NNP: Proper noun, singular (e.g., "John", "London")

- NNPS: Proper noun, plural (e.g., "Johns", "Londons")

2. Pronouns (PRP)

- PRP: Personal pronoun (e.g., "I", "you", "he")

- PRP$: Possessive pronoun (e.g., "my", "your", "his")

- WP: Wh-pronoun (e.g., "who", "what")

- WP$: Possessive wh-pronoun (e.g., "whose")

3. Verbs (VB)

- VB: Verb, base form (e.g., "run", "eat")

- VBD: Verb, past tense (e.g., "ran", "ate")

- VBG: Verb, gerund or present participle (e.g., "running", "eating")

- VBN: Verb, past participle (e.g., "run", "eaten")

- VBP: Verb, non-3rd person singular present (e.g., "run", "eat")

- VBZ: Verb, 3rd person singular present (e.g., "runs", "eats")

4. Adjectives (JJ)

- JJ: Adjective (e.g., "big", "beautiful")

- JJR: Adjective, comparative (e.g., "bigger", "more beautiful")

- JJS: Adjective, superlative (e.g., "biggest", "most beautiful")

5. Adverbs (RB)

- RB: Adverb (e.g., "quickly", "silently")

- RBR: Adverb, comparative (e.g., "more quickly", "faster")

- RBS: Adverb, superlative (e.g., "most quickly", "fastest")

6. Determiners (DT)

- DT: Determiner (e.g., "the", "a", "an")

- PDT: Predeterminer (e.g., "all the", "half the")

- WDT: Wh-determiner (e.g., "which", "that")

7. Conjunctions (CC)

- CC: Coordinating conjunction (e.g., "and", "but", "or")

- IN: Preposition or subordinating conjunction (e.g., "in", "on", "that")

8. Particles (RP)

- RP: Particle (e.g., "up" in "look up", "off" in "take off")

9. - POS: Possessive ending (e.g., "’s")

- SYM: Symbol (e.g., "$", "%")

- TO: To (e.g., "to" in "to go")

- UH: Interjection (e.g., "uh", "wow")

These tags provide a granular understanding of each word's role in the sentence, enabling more advanced NLP tasks such as parsing and semantic analysis.

How the system works: Most of the programmables tools doesn’t run for Bangla languages. We can run the given sentences with Python.

There are five Bangla Sentences are given in the below:

sentences = [

"আমি আজকে খুব খুশি।",

"আমার মন খারাপ লাগছে।",

"বাংলাদেশ একটি সুন্দর দেশ।",

"আজকে বৃষ্টি হচ্ছে।",

"তুমি কি আমার সাথে দেখা করবে?"

]

Sentiment Analysis and POS tagging sentence text in sentences:

Make a sentence:

sentence = Sentence(sentence_text)

Sentiment Analysis:

classifier.predict(sentence)

sentiment_label = sentence.labels[0]

POS tagging:

tagger.predict(sentence)

print(f"\nSentence: {sentence_text}")

print(f"Sentiment: {sentiment_label}")

print("Parts of Speech (POS) Tags:")

for entity in sentence.get_spans('pos'):

print(f'{entity.text}: {entity.tag}')

Sentence 1: আমি আজকে খুব খুশি।

Sentiment: POSITIVE (score: 0.96)

Parts of Speech (POS) Tags:

আমি: PRON

আজকে: ADV

খুব: ADV

খুশি: ADJ

।: PUNCT

Sentence 2: আমার মন খারাপ লাগছে।

Sentiment: NEGATIVE (score: 0.85)

Parts of Speech (POS) Tags:

আমার: PRON

মন: NOUN

খারাপ: ADJ

লাগছে: VERB

।: PUNCT

Sentence 3: বাংলাদেশ একটি সুন্দর দেশ।

Sentiment: POSITIVE (score: 0.92)

Parts of Speech (POS) Tags:

বাংলাদেশ: PROPN

একটি: DET

সুন্দর: ADJ

দেশ: NOUN

।: PUNCT

Sentence 4: আজকে বৃষ্টি হচ্ছে।

Sentiment: NEUTRAL (score: 0.6)

Parts of Speech (POS) Tags:

আজকে: ADV

বৃষ্টি: NOUN

হচ্ছে: VERB

।: PUNCT

Sentence 5: তুমি কি আমার সাথে দেখা করবে?

Sentiment: NEUTRAL (score: 0.5)

Parts of Speech (POS) Tags:

তুমি: PRON

কি: PRON

আমার: PRON

সাথে: NOUN

দেখা: NOUN

করবে: VERB

Model Building

The model building section provides a detailed explanation of the steps involved in creating machine learning models for Bengali Natural Language Processing (NLP) tasks. This section provides a comprehensive overview of the following aspects. Our process begins by collecting and preparing Bengali text data. We gathered writings from a variety of sources, such as news outlets, articles, literature, and social media. Subsequently, we will go into the process of extracting pertinent features from the textual material. This encompasses sophisticated techniques like as Tokenization and POS Tagging, Named Entity Recognition, and Sentiment Analysis. We assess and choose suitable machine learning algorithms for the specific NLP tasks at hand, such as classification using the Tokenization approach, named entity recognition, and sentiment analysis, utilizing machine translation.

The models under consideration encompass conventional models like Support Vector Machines (SVMs) and Logistic Regression, with contemporary deep learning models such as Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), and Transformer-based models. We utilized Python and other libraries including NLTK, Pandas, Matplotlib, and Seaborn to create dedicated solutions for each specific task.

Model Testing and Evaluation

Preprocessing, model training, and evaluation were facilitated through the development of custom scripts. The program imported the confusion matrix, accuracy score, and classification report functions from the sklearn.metrics module. It provides a comprehensive explanation of the data training process, which include fine-tuning of hyper-parameters, utilization of optimization techniques, and the incorporation of validation sets to monitor performance and avoid overwriting. We define the metrics employed to assess the performance of the models. Typical measures used in this context are accuracy, precision, recall, F1 score, and confusion matrices.

Ultimately, this report showcase the outcomes of the model assessments, which encompass a thorough examination and comparison of several models and their respective configurations. It analyzes the model's performance, emphasizing its strengths and flaws, and offer suggestions on how to enhance its performance through continual data training. The model's performance across three sentiment classes—Neutral, Negative, and Positive—is evaluated in detail in the classification report. With a recall of 0.99 and a precision of 0.96, the model almost always gets positive sentiments right and makes very few mistakes when predicting them, indicating exceptional performance. A high F1-score of 0.98 is a result of this, and it shows that the model achieves a good balance between recall and precision for the Positive class.

With a recall of 0.84 and a precision of 0.88 for negative sentiments, the model does decently, correctly identifying 84% of all real negative cases while missing a few. A respectable overall performance in negative instance classification is indicated by the F1-score of 0.86. The model's F1-score of 0.82 reflects its difficulty with Neutral attitudes, and the slightly worse performance of the Neutral class (0.87) and recall (0.77), although it is still handled reasonably well overall.

A total of 94% of examples were accurately identified by the model, indicating its high level of accuracy. With a precision of 0.90, recall of 0.87, and F1-score of 0.88, the model demonstrates outstanding performance across all classes according to the macro average criteria, which consider each class equally. An F1-score of 0.94 is produced by the weighted average, which takes into consideration the varying numbers of occurrences in each class. This indicates that the model is quite good at predicting the Positive class, which has the largest amount of data. Although it might do a better job of managing negative and neutral attitudes, the model is generally very accurate.

Significance of the Study

One important aspect of the study is the importance of fostering and maintaining the Bangla language. Progress in natural language processing has helped in the preservation of the Bangla language. Natural language processing (NLP) methods and annotated corpora guarantee language preservation. One advantage of creating Bangla NLP is that it will promote the language's use on digital platforms and make it more accessible to people all over the world.
Generating Resources to Assist in Knowledge Development and Research. Additional resources, such as annotated corpora and natural language processing tools, can be built upon by this study. Like scholarly input, publishing this research benefits the academic field of computational linguistics, especially for underrepresented languages like Bangla.
Enhancing Technology for the Bangla-speaking Community. User interfaces for Bangla-speaking users can be significantly improved with the use of natural language processing tools. Search engines, voice assistants, and translation services have all seen significant improvements due to this. Technology is more approachable for Bangla speakers because it is accessible to them, particularly those who are not proficient in English or other widely spoken languages.

Finally, the Local Industry's Impact on Society and the Economy. Companies in Bangladesh and West Bengal can benefit from these technologies by using them for customer service, content creation, and sentiment analysis. Improved educational outcomes and better literacy rates can be achieved through the development of educational software that aids in the teaching of Bangla using methods of natural language processing (NLP).
Cultural Inclusivity and the Role of Technology in Representing Culture. Making natural language processing (NLP) tools for Bangla helps preserve cultural identity in the digital era by making sure that technological advancements reflect the language and its subtleties. As a result, more Bangla-language content is being created, which is great for the culture and literature of the future.

Application of Sentiment Analysis

The utilization of sentiment analysis is prevalent in numerous critical domains. It entails the examination of the sentiments conveyed in posts and remarks on platforms such as Facebook and Twitter for the purpose of social media monitoring. In real-time, this enables businesses to monitor their brand's reputation and comprehend public opinion.

In the context of consumer feedback, sentiment analysis is implemented to evaluate the overall satisfaction of products or services through reviews and feedback. Companies can enhance customer support, identify strengths and weaknesses, and improve their offerings by categorizing the sentiments of these evaluations.
To conduct market research, sentiment analysis offers valuable insights into the public's perception of a company, product, or competitor. Businesses can utilize this analysis to evaluate the effectiveness of marketing campaigns, monitor trends, and make strategic decisions that are informed by consumer preferences and market conditions.

The analysis of sentiments regarding political or social issues is a component of opinion mining. This is beneficial for the effective management of responses to societal concerns, the prediction of election outcomes, and the comprehension of public attitudes toward political figures, policies, or social movements.

Limitations of the Study

1. Understanding the Context. Some programs have trouble understanding irony, sarcasm, and meanings that depend on the situation.

2. Ambiguity. Words can mean different things depending on the situation. For example, in slang, the word "bad" can mean something good.

3. Multilingual Sentiment Analysis. It can be harder to work with different languages and regions, like Bangla, because they have their own grammar rules and words.

4. Not many labeled datasets in Bangla. When working on Bangla NLP, certain problems come up. For example, using informal words and complicated grammar structures. Bangla's special syntax and vocabulary mean that lexicons and models need to be made just for it.

Future Aspects of the Study

Given the ongoing development of machine learning and natural language processing (NLP), sentiment analysis has bright future prospects. The ability to evaluate more complicated emotions and have a deeper comprehension of textual context is one important area of progress. In contrast to the current models, which frequently classify sentiments as either positive, negative, or neutral, future methods may be able to recognize a wider variety of emotional nuances, such as sarcasm, irony, or mixed feelings.

The use of sentiment analysis in multilingual settings is another fascinating feature. Even while languages like English have made significant strides, future studies will probably concentrate on enhancing sentiment analysis in underrepresented languages like Bengali and other regional tongues. This would increase sentiment analysis's accuracy in a variety of linguistic contexts and make it more widely accessible.

Additionally, there is potential to extend the use of sentiment analysis to other fields, such law enforcement or healthcare, where it might be utilized to monitor public opinion on safety and policy matters or to assess patient sentiment in medical data.

Furthermore, novel approaches to evaluating and addressing emotional states in real-time video analysis, virtual reality, and speech recognition could be made possible by fusing sentiment analysis with other technologies.

Finally, ethical issues will also become more significant as sentiment analysis becomes more precise and popular. In future research and development, ensuring privacy, fairness, and transparency in the collection and use of sentiment data will be a major focus.

To conclude, sentiment analysis is a useful technique for comprehending consumer feedback, market trends, and public opinion. Text analysis and insights are obtained through the application of NLP and machine learning. As the area develops, it will provide even more accuracy and applications, especially in different languages like Bengali. As its use increases, ethical issues like privacy must be taken into account.

Author Contribution

Maneesha Rani Biswas: Creating the idea of the entire research paper and prepare Abstract, Introduction, Literature Review, Methodology, Data Availability Statement, Significance and limitations of the research paper.Fajle Rabbi: Helping in coding, doing the manuscript editing and helping in creating Data Availability Statement.He did coding and supervised all my writings and edited if there was anything wrong he found.

Acknowledgement

Acknowledgment : We are deeply grateful to everyone who helped us during this research project. We would especially like to express my gratitude to our mentor Naira Khan, Associate Professor, Department of Linguistics, University of Dhaka and our classmates for their wise counsel and perceptive criticism, which were crucial to the accomplishment of this effort. We also appreciate the materials and equipment that made it possible to finish this study. We also acknowledge the contributions of the open-source community, particularly the developers of the Python libraries [such as NLTK, spaCy, Pandas, Matplotlib, Seaborn etc.] and machine learning frameworks [such as Scikit-learn, TensorFlow, etc.] that were crucial for implementing tokenization, POS tagging, and sentiment analysis for Bangla sentences. Lastly, we want to express gratitude to our friends and family for their constant support and encouragement along this journey.

Data Availability

Data Availability Statement:The datasets generated and/or analyzed during the current study on "Annotated Bangla Natural Language Processing (BNLP) using Python and Machine Learning" are publicly available at /data/user/0/com.microsoft.office.excel/app_ThirdPartyFiles/com.whatsapp.provider.media/73da894ae5694f28b9d13fcdc239a3a/Processed_Sentiment_POS_Tags.xlsx . The data can be accessed at our excel spreadsheet. Any restrictions to data access, if applicable, will be outlined within the repository.The datasets are available from the corresponding author upon reasonable request.

F. Alam et al. "Bangla Text Tokenization and Stemming Using N-grams." Asian Information Processing Journal, 2019.
M. M. Hasan et al. International Journal of Computational Linguistics, 2020, "Part of Speech Tagging in Bangla using Conditional Random Fields."
Amin, R., and Roy, S. Proceedings of the International Conference on Language Resources, 2021, "Bangla Named Entity Recognition using Deep Learning."
Martin, J. H., and Jurafsky, D. "Language and Speech Processing." 2020 Pearson.
Bird, S., Klein, E., & Loper, “Natural Language Processing with Python”, E. 2009's O'Reilly Media.
Lee, L., and B. Pang (2008). Foundations and Trends in Information Retrieval, 2(1-2), 1–135; "Opinion Mining and Sentiment Analysis."
"Bangla Language Processing: From Text to Speech," by N. S. Dash Springer, 2018.
F.Alam, Naira Khan and et al., 2021, “A review of Bangla Natural Language Processing tasks and the Utility of Transformer Models.”
Rahman, T., Islam, M. R., & Mamun, S. A. (2020). "Bangla natural language processing: A comprehensive review." Journal of King Saud University-Computer and Information Sciences, 32(7), 829-845. DOI: 10.1016/j.jksuci.2018.08.001
Hasan, M., & Chowdhury, M. F. (2018). "Bangla Text Classification using Machine Learning Approaches." International Journal of Artificial Intelligence & Applications, 9(1), 21-28. DOI: 10.5121/ijaia.2018.9102
Alam, F., et al. (2020). "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.URL: https://www.aclweb.org/anthology/2020.acl-main.587/
Mandal, S. & Naskar, S. (2017). "Sentiment analysis on Bangla and Hindi text using deep learning." Proceedings of the 2017 International Conference on Advances in Computing, Communications, and Informatics (ICACCI), 1421-1427. DOI: 10.1109/ICACCI.2017.8126034
Rabbani, M. G., et al. (2022). "Bangla Natural Language Processing for Social Media Text: A Comprehensive Study on Trends and Challenges." Journal of Natural Language Processing Research, 8(2), 100-115. DOI: 10.32468/nlp.22.1023
Sarker, I. H. (2021). "Deep Learning: An Overview on Bangla Natural Language Processing." Neural Processing Letters, 53(3), 2345-2364. DOI: 10.1007/s11063-021-10543-7
Islam, Md. R. & Ahmed, F. (2019). "An Efficient Technique for Bangla Text Classification Using Machine Learning Algorithms." International Journal of Computer Science and Information Security (IJCSIS), 17(6), 100-110.URL: https://sites.google.com/site/ijcsis/
Bhattacharya, P., Chatterjee, N., et al. (2020. "Tools for Bangla Natural Language Processing: A Comparative Analysis." IEEE Transactions on Emerging Topics in Computing, 9(1), 98-106. DOI: 10.1109/TETC.2020.2972015

No competing interests reported.

Download PDF

Version 1

posted

You are reading this latest preprint version

Annotated Bangla Natural Language Processing (BNLP) by using python or machine learning.

Status:

Version 1

Abstract

Introduction

Background of the Study

Research Methodology

Conclusion

Declarations

Author Contribution

Acknowledgement

Data Availability

References

Additional Declarations

Supplementary Files

Status:

Version 1