For this research, sample texts were collected from diverse sources, including news articles, literature, and social media. Sentences from these texts were annotated for the following NLP tasks.
-Tokenization: Dividing text into words, phrases, symbols, or other meaningful elements.
-POS Tagging: Assigning parts of speech to each token.
-Named Entity Recognition (NER): Identifying proper names in the text.
-Sentiment Analysis: Classifying text by emotional tone as positive, negative, or neutral.
For this research, customized tools were developed using Python and libraries such as NLTK, Pandas, spaCy, and Matplotlib. A separate tool was developed for each task, and custom scripts were written for preprocessing, model training, and evaluation.
Model Description
1. Tokenization
Tokenization is the process of breaking a large block of text into smaller units, which may be words, subwords, or individual characters. Before the text is split on spaces or other criteria, it is preprocessed, for example by deleting unnecessary characters. Special tokens can represent unknown words or padding. The resulting token sequences are the building blocks of subsequent natural language processing operations such as text analysis and model training. Tokenizing languages like Bengali can be challenging because of script-specific elements and complicated linguistic rules.
Consider the following sentence: "The quick brown fox jumps over the lazy dog."
Word-level tokens (lowercased, punctuation removed): ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
Character-level tokens: ["T", "h", "e", "q", "u", "i", "c", "k", ...].
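As a minimal sketch of the two granularities above, the following Python snippet produces word-level and character-level tokens using only the standard library. The regular expression and the function names word_tokens and char_tokens are illustrative assumptions, not the tools developed for this study.

```python
import re

def word_tokens(text):
    """Lowercase the text and return runs of characters that are
    neither whitespace nor common punctuation (including the danda)."""
    return re.findall(r"[^\s.,!?;:()।]+", text.lower())

def char_tokens(text):
    """Return every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]

sentence = "The quick brown fox jumps over the lazy dog."
print(word_tokens(sentence))
print(char_tokens(sentence)[:8])
```

Because the pattern only excludes whitespace and a fixed set of punctuation marks, hyphenated words stay together as one token; a production tokenizer would need more careful rules.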
2. POS Tagging
POS tagging (Part-of-Speech tagging) is a fundamental task in Natural Language Processing (NLP) in which each word in a sentence is labelled with its corresponding part of speech. The goal is to identify whether a word is a noun, verb, adjective, adverb, etc., based on its usage and context within the sentence. POS tagging is a form of classification in which words are assigned to their respective syntactic categories. The following steps describe how POS tagging works.
Text Preprocessing: Before tagging, text is usually tokenized into words or sentences. This involves splitting the text into units that will be tagged.
Classification of Words: POS tagging is a classification task where a model or algorithm determines the correct tag for each word in the sentence.
Example:
Sentence: "The cat sat on the mat."
Tags:
"The" → Determiner (DT)
"cat" → Noun (NN)
"sat" → Verb (VBD)
"on" → Preposition (IN)
"the" → Determiner (DT)
"mat" → Noun (NN)
Bangla example (transliterated): [tini amake bhalobashen] ("He/She loves me")
"tini" - PRP
"amake" - PRP
"bhalobashen" - VB
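To illustrate tagging as per-word classification, the sketch below looks each token up in a small hand-written lexicon and falls back to a default tag for unknown words. The lexicon TOY_LEXICON, the function tag_tokens, and the fallback tag are hypothetical; a trained tagger would instead use the surrounding context to resolve ambiguous words.

```python
# Toy lexicon using Penn Treebank-style tags (an assumption for illustration).
TOY_LEXICON = {
    "the": "DT",
    "cat": "NN",
    "sat": "VBD",
    "on": "IN",
    "mat": "NN",
}

def tag_tokens(tokens, lexicon=TOY_LEXICON, default="NN"):
    """Pair each token with its lexicon tag, or the default tag if unknown."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]

print(tag_tokens(["The", "cat", "sat", "on", "the", "mat"]))
```

A lookup like this cannot distinguish different uses of the same word form, which is why statistical and neural taggers are preferred in practice.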
3. Named Entity Recognition
Named entity recognition (NER) is a natural language processing (NLP) method that classifies spans of text into predetermined categories such as names of people, places, dates, and organizations. The first step is to identify potential entities and give them appropriate labels (for example, "John" as a person, "Google" as a company, and "Paris" as a city). NER aids in extracting important information from unstructured text, which supports applications such as information retrieval, question answering, and text summarization. NER models often require domain- or language-specific training data and can be rule-based, ML-based, or a hybrid of the two. Example of identifying proper names in the text:
Sentence: সে আমাকে প্রতিদিন ফুল দেয়
POS Tags: [('সে', 'PRP'), ('আমাকে', 'PRP'), ('প্রতিদিন', 'JJ'), ('ফুল', 'NN'), ('দেয়', 'VB')]
Sentiment: Positive
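The rule-based approach to NER mentioned above can be sketched as a gazetteer lookup that prefers the longest matching token sequence. The gazetteer entries are taken from the annotated examples in this document; the names GAZETTEER and find_entities are hypothetical, and this is not the NER tool developed for the study.

```python
# Small gazetteer keyed by token tuples (entries chosen for illustration).
GAZETTEER = {
    ("রবীন্দ্রনাথ", "ঠাকুর"): "PERSON",
    ("ঢাকা",): "LOCATION",
    ("চট্টগ্রাম",): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def find_entities(tokens):
    """Scan left to right, matching the longest gazetteer entry at each position."""
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(tokens[i:i + n]))
            if label:
                entities.append((" ".join(tokens[i:i + n]), label))
                i += n  # skip past the matched entity
                break
        else:
            i += 1  # no entry starts here; move to the next token
    return entities

print(find_entities(["ঢাকা", "বাংলাদেশের", "রাজধানী"]))
```

Exact matching misses inflected forms (for example, a name carrying a case suffix), which is one reason ML-based or hybrid systems are often needed for Bangla.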
4. Sentiment Analysis
Sentiment analysis is a natural language processing (NLP) technique used to determine the emotional tone behind a body of text. It's a way of classifying text into categories such as positive, negative, or neutral sentiments. The goal is to assess how people feel about a topic, product, or service by analysing their language.
How Sentiment Analysis Works
1. Text Preprocessing:
- Tokenization, removing stop words, stemming/lemmatization, and removing special characters.
2. Feature Extraction:
- Convert text into numerical features. This can be done using:
- Bag of Words (BoW): Represents text by counting word occurrences.
- TF-IDF (Term Frequency-Inverse Document Frequency): Measures how important a word is in a document relative to a collection of documents.
- Word Embeddings: Representations such as Word2Vec, GloVe, and BERT, which capture semantic relationships between words.
3. Classification Models:
Sentiment analysis typically involves supervised machine learning techniques:
- Logistic Regression: A simple yet effective algorithm for binary classification.
- Naive Bayes: Often used for text classification due to its simplicity.
- Support Vector Machines (SVM): Efficient for small to medium datasets.
- Deep Learning: Using models like LSTMs, GRUs, or Transformers for better performance in sentiment classification, especially with large datasets.
4. Polarity Detection:
Sentiment analysis algorithms classify texts based on polarity:
- Positive: Text expresses a positive sentiment.
- Negative: Text conveys a negative sentiment.
- Neutral: Text is neither positive nor negative.
5. Aspect-based Sentiment Analysis:
- In this variation, sentiment is analyzed for specific aspects of a product or service (e.g., a product review can be positive about the design but negative about the battery life).
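The feature extraction step above can be sketched in a few lines of standard-library Python. The snippet computes Bag of Words counts and a plain, unsmoothed TF-IDF score for three of the sample sentences used later in this document (with the final danda removed). The variable and function names are illustrative assumptions, and libraries typically apply smoothed variants of the formula.

```python
import math
from collections import Counter

docs = [
    "আমি আজকে খুব খুশি",
    "আমার মন খারাপ লাগছে",
    "আজকে বৃষ্টি হচ্ছে",
]
tokenized = [doc.split() for doc in docs]

# Bag of Words: raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, doc_index):
    """Unsmoothed TF-IDF; only defined for terms that occur in the corpus."""
    tf = bow[doc_index][term] / len(tokenized[doc_index])
    idf = math.log(len(docs) / df[term])
    return tf * idf

print(bow[0])
print(tf_idf("খুশি", 0), tf_idf("আজকে", 0))
```

A term that appears in only one document receives a higher weight than a term of the same frequency that is shared across documents, which is the property a classifier exploits.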
The following examples show annotated text:
Example 1: "বাংলা ভাষা একটি ইন্দো-ইউরোপীয় ভাষা।"
Tokenization: ["বাংলা", "ভাষা", "একটি", "ইন্দো-ইউরোপীয়", "ভাষা", "।"]
POS Tags: ["NN", "NN", "DT", "JJ", "NN", "."]
Named Entities: [("বাংলা ভাষা", "LANGUAGE")]
Sentiment: ["Neutral"]
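Code:
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt models ("punkt_tab" in newer NLTK releases).
nltk.download("punkt")
text = "বাংলা ভাষা একটি ইন্দো-ইউরোপীয় ভাষা।"
# Separate the danda (।) first, since NLTK may not split it off as punctuation.
tokens = word_tokenize(text.replace("।", " ।"))
print(tokens)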
Example 2: "ঢাকা বাংলাদেশের রাজধানী এবং বৃহত্তম শহর।"
Tokenization: ["ঢাকা", "বাংলাদেশের", "রাজধানী", "এবং", "বৃহত্তম", "শহর", "।"]
POS Tags: ["NNP", "NNP", "NN", "CC", "JJS", "NN", "."]
Named Entities: [("ঢাকা", "LOCATION"), ("বাংলাদেশ", "LOCATION")]
Sentiment: ["Neutral"]
Example 3: "রবীন্দ্রনাথ ঠাকুর বাংলা সাহিত্যের একজন প্রখ্যাত কবি।"
Tokenization: ["রবীন্দ্রনাথ", "ঠাকুর", "বাংলা", "সাহিত্যের", "একজন", "প্রখ্যাত", "কবি", "।"]
POS Tags: ["NNP", "NNP", "JJ", "NN", "DT", "JJ", "NN", "."]
Named Entities: [("রবীন্দ্রনাথ ঠাকুর", "PERSON"), ("বাংলা সাহিত্যের", "ART")]
Sentiment: ["Positive"]
Example 4: "আমি আজ স্কুলে যাবো না, কারণ আমি অসুস্থ।"
Tokenization: ["আমি", "আজ", "স্কুলে", "যাবো", "না", ",", "কারণ", "আমি", "অসুস্থ", "।"]
POS Tags: ["PRP", "NN", "NN", "VB", "RB", ",", "IN", "PRP", "JJ", "."]
Named Entities: [("স্কুলে", "LOCATION")]
Sentiment: ["Negative"]
Example 5: "আমার প্রিয় খাবার বিরিয়ানি।"
Tokenization: ["আমার", "প্রিয়", "খাবার", "বিরিয়ানি", "।"]
POS Tags: ["PRP$", "JJ", "NN", "NN", "."]
Named Entities: [("বিরিয়ানি", "FOOD")]
Sentiment: ["Positive"]
Example 6: "তুমি কেমন আছো?"
Tokenization: ["তুমি", "কেমন", "আছো", "?"]
POS Tags: ["PRP", "JJ", "VB", "."]
Named Entities: [ ]
Sentiment: ["Neutral"]
Example 7: "চট্টগ্রাম একটি সুন্দর বন্দর নগরী।"
Tokenization: ["চট্টগ্রাম", "একটি", "সুন্দর", "বন্দর", "নগরী", "।"]
POS Tags: ["NNP", "DT", "JJ", "NN", "NN", "."]
Named Entities: [("চট্টগ্রাম", "LOCATION")]
Sentiment: ["Positive"]
Example 8: "মাশরাফি বিন মুর্তজা একজন বিখ্যাত ক্রিকেটার।"
Tokenization: ["মাশরাফি", "বিন", "মুর্তজা", "একজন", "বিখ্যাত", "ক্রিকেটার", "।"]
POS Tags: ["NNP", "NNP", "NNP", "DT", "JJ", "NN", "."]
Named Entities: [("মাশরাফি বিন মুর্তজা", "PERSON")]
Sentiment: ["Positive"]
Example 9: "আমার বাবা একজন ডাক্তার।"
Tokenization: ["আমার", "বাবা", "একজন", "ডাক্তার", "।"]
POS Tags: ["PRP$", "NN", "DT", "NN", "."]
Named Entities: [ ]
Sentiment: ["Neutral"]
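One way to store annotated examples like those above is a small record type with a consistency check that every token has exactly one POS tag. The class name AnnotatedSentence, its fields, and the is_aligned method are hypothetical and shown only as a sketch; the document does not describe the storage format actually used.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedSentence:
    text: str
    tokens: list
    pos_tags: list
    entities: list = field(default_factory=list)  # (span text, label) pairs
    sentiment: str = "Neutral"

    def is_aligned(self):
        """True when the number of tokens equals the number of POS tags."""
        return len(self.tokens) == len(self.pos_tags)

example = AnnotatedSentence(
    text="আমার বাবা একজন ডাক্তার।",
    tokens=["আমার", "বাবা", "একজন", "ডাক্তার", "।"],
    pos_tags=["PRP$", "NN", "DT", "NN", "."],
)
print(example.is_aligned())
```

Running such a check over a whole corpus is a cheap way to catch annotation slips, such as a punctuation mark that was tokenized but never tagged.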
Part of speech tagging
Part-of-Speech (POS) tagging involves assigning labels to each word in a sentence to indicate its part of speech. These labels can include a variety of grammatical categories. Here are some common POS tags along with their abbreviations:
Common POS Tags and Abbreviations:
1. Nouns (NN)
- NN: Noun, singular or mass (e.g., "dog", "car")
- NNS: Noun, plural (e.g., "dogs", "cars")
- NNP: Proper noun, singular (e.g., "John", "London")
- NNPS: Proper noun, plural (e.g., "Johns", "Londons")
2. Pronouns (PRP)
- PRP: Personal pronoun (e.g., "I", "you", "he")
- PRP$: Possessive pronoun (e.g., "my", "your", "his")
- WP: Wh-pronoun (e.g., "who", "what")
- WP$: Possessive wh-pronoun (e.g., "whose")
3. Verbs (VB)
- VB: Verb, base form (e.g., "run", "eat")
- VBD: Verb, past tense (e.g., "ran", "ate")
- VBG: Verb, gerund or present participle (e.g., "running", "eating")
- VBN: Verb, past participle (e.g., "run", "eaten")
- VBP: Verb, non-3rd person singular present (e.g., "run", "eat")
- VBZ: Verb, 3rd person singular present (e.g., "runs", "eats")
4. Adjectives (JJ)
- JJ: Adjective (e.g., "big", "beautiful")
- JJR: Adjective, comparative (e.g., "bigger", "more beautiful")
- JJS: Adjective, superlative (e.g., "biggest", "most beautiful")
5. Adverbs (RB)
- RB: Adverb (e.g., "quickly", "silently")
- RBR: Adverb, comparative (e.g., "more quickly", "faster")
- RBS: Adverb, superlative (e.g., "most quickly", "fastest")
6. Determiners (DT)
- DT: Determiner (e.g., "the", "a", "an")
- PDT: Predeterminer (e.g., "all" in "all the", "half" in "half the")
- WDT: Wh-determiner (e.g., "which", "that")
7. Conjunctions (CC)
- CC: Coordinating conjunction (e.g., "and", "but", "or")
- IN: Preposition or subordinating conjunction (e.g., "in", "on", "that")
8. Particles (RP)
- RP: Particle (e.g., "up" in "look up", "off" in "take off")
9. Other Tags (POS, SYM, TO, UH)
- POS: Possessive ending (e.g., "’s")
- SYM: Symbol (e.g., "$", "%")
- TO: To (e.g., "to" in "to go")
- UH: Interjection (e.g., "uh", "wow")
These tags provide a granular understanding of each word's role in the sentence, enabling more advanced NLP tasks such as parsing and semantic analysis.
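The fine-grained tags above can be collapsed into coarser categories similar to the PRON, NOUN, and VERB style labels that appear in the system output later in this section. The mapping below is an illustrative assumption that covers only part of the list; it is not an official conversion table, and the names COARSE_TAG and to_coarse are hypothetical.

```python
# Partial mapping from fine-grained tags to coarse categories (illustrative).
COARSE_TAG = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "PRP": "PRON", "PRP$": "PRON", "WP": "PRON", "WP$": "PRON",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "PDT": "DET", "WDT": "DET",
    ".": "PUNCT",
}

def to_coarse(tags, unknown="X"):
    """Map each fine-grained tag to its coarse category, or `unknown`."""
    return [COARSE_TAG.get(tag, unknown) for tag in tags]

print(to_coarse(["PRP$", "NN", "DT", "NN", "."]))
```

Collapsing tags this way makes it easier to compare annotations produced with different tagsets, at the cost of losing distinctions such as tense or number.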
How the system works: Most programmable tools do not support the Bangla language. However, the given sentences can be processed with Python.
Five Bangla sentences are given below:
sentences = [
"আমি আজকে খুব খুশি।",
"আমার মন খারাপ লাগছে।",
"বাংলাদেশ একটি সুন্দর দেশ।",
"আজকে বৃষ্টি হচ্ছে।",
"তুমি কি আমার সাথে দেখা করবে?"
]
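# Sentiment analysis and POS tagging for each sentence text in sentences.
# The calls appear to follow the Flair API; classifier and tagger are assumed
# to be already-loaded Flair models, which the original text does not name.
from flair.data import Sentence

for sentence_text in sentences:
    # Make a sentence
    sentence = Sentence(sentence_text)
    # Sentiment analysis
    classifier.predict(sentence)
    sentiment_label = sentence.labels[0]
    # POS tagging
    tagger.predict(sentence)
    print(f"\nSentence: {sentence_text}")
    print(f"Sentiment: {sentiment_label}")
    print("Parts of Speech (POS) Tags:")
    for entity in sentence.get_spans('pos'):
        print(f'{entity.text}: {entity.tag}')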
Sentence 1: আমি আজকে খুব খুশি।
Sentiment: POSITIVE (score: 0.96)
Parts of Speech (POS) Tags:
আমি: PRON
আজকে: ADV
খুব: ADV
খুশি: ADJ
।: PUNCT
Sentence 2: আমার মন খারাপ লাগছে।
Sentiment: NEGATIVE (score: 0.85)
Parts of Speech (POS) Tags:
আমার: PRON
মন: NOUN
খারাপ: ADJ
লাগছে: VERB
।: PUNCT
Sentence 3: বাংলাদেশ একটি সুন্দর দেশ।
Sentiment: POSITIVE (score: 0.92)
Parts of Speech (POS) Tags:
বাংলাদেশ: PROPN
একটি: DET
সুন্দর: ADJ
দেশ: NOUN
।: PUNCT
Sentence 4: আজকে বৃষ্টি হচ্ছে।
Sentiment: NEUTRAL (score: 0.6)
Parts of Speech (POS) Tags:
আজকে: ADV
বৃষ্টি: NOUN
হচ্ছে: VERB
।: PUNCT
Sentence 5: তুমি কি আমার সাথে দেখা করবে?
Sentiment: NEUTRAL (score: 0.5)
Parts of Speech (POS) Tags:
তুমি: PRON
কি: PRON
আমার: PRON
সাথে: NOUN
দেখা: NOUN
করবে: VERB
Model Building
This section explains the steps involved in creating machine learning models for Bengali Natural Language Processing (NLP) tasks. The process begins with collecting and preparing Bengali text data, gathered from a variety of sources such as news outlets, articles, literature, and social media. Next, pertinent features are extracted from the text using techniques such as tokenization, POS tagging, named entity recognition, and sentiment analysis. Finally, suitable machine learning algorithms are assessed and chosen for each NLP task at hand, such as classification based on tokenization, named entity recognition, and sentiment analysis, utilizing machine translation.
The models under consideration include conventional models such as Support Vector Machines (SVMs) and Logistic Regression, along with contemporary deep learning models such as Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformer-based models. We used Python and libraries including NLTK, Pandas, Matplotlib, and Seaborn to create a dedicated solution for each task.
Model Testing and Evaluation
Custom scripts were developed for preprocessing, model training, and evaluation. The program imported the confusion matrix, accuracy score, and classification report functions from the sklearn.metrics module. The training process includes fine-tuning of hyperparameters, the use of optimization techniques, and the incorporation of validation sets to monitor performance and avoid overfitting. The metrics used to assess the performance of the models are accuracy, precision, recall, F1 score, and confusion matrices.
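The metrics named above can be computed directly from true and predicted labels. The sketch below is a dependency-free illustration of what a confusion matrix, accuracy, precision, and recall measure. The label lists are made-up toy values, not results from this study, and the function names are hypothetical; in practice the sklearn.metrics functions mentioned above would be used instead.

```python
from collections import Counter

LABELS = ["Positive", "Negative", "Neutral"]

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of examples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class (0.0 when undefined)."""
    counts = confusion_counts(y_true, y_pred)
    tp = counts[(label, label)]
    predicted = sum(counts[(t, label)] for t in LABELS)  # column total
    actual = sum(counts[(label, p)] for p in LABELS)     # row total
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy labels for illustration only.
y_true = ["Positive", "Positive", "Negative", "Neutral", "Negative", "Neutral"]
y_pred = ["Positive", "Positive", "Negative", "Negative", "Negative", "Positive"]

print(accuracy(y_true, y_pred))
for label in LABELS:
    print(label, precision_recall(y_true, y_pred, label))
```

In this toy data the Neutral class is never predicted, so its precision and recall are both zero even though overall accuracy looks acceptable; this is why per-class metrics are reported alongside accuracy.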
This report presents the outcomes of the model assessments, which encompass a thorough examination and comparison of several models and their respective configurations. It analyzes the model's performance, emphasizing its strengths and flaws, and offers suggestions on how to enhance its performance through continual training on new data. The classification report evaluates the model's performance across three sentiment classes: Neutral, Negative, and Positive. For the Positive class, with a recall of 0.99 and a precision of 0.96, the model almost always identifies positive sentiments and makes very few mistakes when predicting them. The resulting F1-score of 0.98 shows that the model achieves a good balance between recall and precision for the Positive class.
For the Negative class, with a recall of 0.84 and a precision of 0.88, the model performs reasonably well, correctly identifying 84% of all actual negative cases while missing the rest. The F1-score of 0.86 indicates respectable overall performance on negative instances. The Neutral class is slightly weaker, with a precision of 0.87 and a recall of 0.77; its F1-score of 0.82 reflects the model's difficulty with neutral sentiments, although the class is still handled reasonably well overall.
The model classified 94% of examples correctly, indicating a high level of accuracy. The macro average, which weights each class equally, gives a precision of 0.90, a recall of 0.87, and an F1-score of 0.88, showing strong performance across all classes. The weighted average, which takes into account the number of examples in each class, gives an F1-score of 0.94. This indicates that the model is particularly good at predicting the Positive class, which has the largest amount of data. Although it could handle negative and neutral sentiments better, the model is generally very accurate.
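The relationship between the reported figures can be checked with a short calculation. The sketch below recomputes the per-class F1-scores and the macro-averaged precision and recall from the rounded values quoted above. It assumes that the 0.87 reported for the Neutral class is its precision; the names report, f1, and macro_average are hypothetical.

```python
# Rounded per-class values quoted in the text (Neutral precision is an assumption).
report = {
    "Positive": {"precision": 0.96, "recall": 0.99},
    "Negative": {"precision": 0.88, "recall": 0.84},
    "Neutral": {"precision": 0.87, "recall": 0.77},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_average(metric):
    """Unweighted mean of a metric over all classes."""
    return sum(scores[metric] for scores in report.values()) / len(report)

for label, scores in report.items():
    print(label, round(f1(scores["precision"], scores["recall"]), 2))
print(round(macro_average("precision"), 2), round(macro_average("recall"), 2))
```

Recomputing from rounded inputs reproduces the reported scores to within about 0.01. Small gaps, such as 0.97 here against the reported 0.98 for the Positive F1-score, are expected because the classification report rounds values computed from unrounded precision and recall.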
Significance of the Study
One important aspect of the study is fostering and maintaining the Bangla language. Progress in natural language processing supports the preservation of Bangla, since NLP methods and annotated corpora help keep the language documented and usable. Another advantage of developing Bangla NLP is that it will promote the language's use on digital platforms and make it more accessible to people all over the world.
Generating Resources to Assist in Knowledge Development and Research. Additional resources, such as annotated corpora and natural language processing tools, can be built upon this study. As a scholarly contribution, publishing this research benefits the academic field of computational linguistics, especially for underrepresented languages like Bangla.
Enhancing Technology for the Bangla-speaking Community. User interfaces for Bangla-speaking users can be significantly improved with the use of natural language processing tools. Search engines, voice assistants, and translation services all benefit from these tools. Technology becomes more approachable for Bangla speakers when it is available in their language, particularly for those who are not proficient in English or other widely spoken languages.
Impact on Society, the Economy, and Local Industry. Companies in Bangladesh and West Bengal can benefit from these technologies by using them for customer service, content creation, and sentiment analysis. Educational software that uses natural language processing (NLP) methods to aid the teaching of Bangla can improve educational outcomes and literacy rates.
Cultural Inclusivity and the Role of Technology in Representing Culture. Building natural language processing (NLP) tools for Bangla helps preserve cultural identity in the digital era by making sure that technological advancements reflect the language and its subtleties. As a result, more Bangla-language content can be created, which benefits the culture and literature of the future.
Application of Sentiment Analysis
Sentiment analysis is widely used in several critical domains. In social media monitoring, it examines the sentiments conveyed in posts and comments on platforms such as Facebook and Twitter. This enables businesses to monitor their brand's reputation and understand public opinion in real time.
In the context of consumer feedback, sentiment analysis is used to evaluate overall satisfaction with products or services through reviews and feedback. By categorizing the sentiments of these evaluations, companies can enhance customer support, identify strengths and weaknesses, and improve their offerings.
In market research, sentiment analysis offers valuable insights into the public's perception of a company, product, or competitor. Businesses can use this analysis to evaluate the effectiveness of marketing campaigns, monitor trends, and make strategic decisions that are informed by consumer preferences and market conditions.
Opinion mining covers the analysis of sentiments regarding political or social issues. It is useful for managing responses to societal concerns, predicting election outcomes, and understanding public attitudes toward political figures, policies, or social movements.
Limitations of the Study
1. Understanding the Context. Some programs have trouble understanding irony, sarcasm, and meanings that depend on the situation.
2. Ambiguity. Words can mean different things depending on the situation. For example, in slang, the word "bad" can mean something good.
3. Multilingual Sentiment Analysis. Working across different languages and regions is harder because each language, such as Bangla, has its own grammar rules and vocabulary.
4. Scarcity of Labeled Bangla Datasets. Few labeled datasets exist for Bangla. Bangla NLP also faces specific problems, such as informal vocabulary and complicated grammatical structures. Because of Bangla's distinctive syntax and vocabulary, lexicons and models need to be built specifically for it.
Future Aspects of the Study
Given the ongoing development of machine learning and natural language processing (NLP), sentiment analysis has bright future prospects. The ability to evaluate more complicated emotions and have a deeper comprehension of textual context is one important area of progress. In contrast to the current models, which frequently classify sentiments as either positive, negative, or neutral, future methods may be able to recognize a wider variety of emotional nuances, such as sarcasm, irony, or mixed feelings.
The use of sentiment analysis in multilingual settings is another promising direction. Although sentiment analysis for languages like English has made significant strides, future studies will probably concentrate on enhancing it for underrepresented languages like Bengali and other regional languages. This would increase sentiment analysis's accuracy in a variety of linguistic contexts and make it more widely accessible.
Additionally, there is potential to extend the use of sentiment analysis to other fields, such as law enforcement or healthcare, where it might be used to monitor public opinion on safety and policy matters or to assess patient sentiment in medical data.
Furthermore, novel approaches to evaluating and addressing emotional states in real-time video analysis, virtual reality, and speech recognition could be made possible by fusing sentiment analysis with other technologies.
Finally, ethical issues will also become more significant as sentiment analysis becomes more precise and popular. In future research and development, ensuring privacy, fairness, and transparency in the collection and use of sentiment data will be a major focus.