To ensure valid and reliable results, we systematically designed the research methodology and employed qualitative research strategies throughout the research life cycle to achieve the objectives of the study. The research design strategies listed below were used to carry out this study.
3.1. Data set preparation
In order to create an artifact capable of recognizing fake news, a large amount of Afaan Oromo text articles of fake and real news is required. We therefore extracted the data set from Twitter and Facebook, which have a relatively large number of followers. Specifically, we used the Octoparse tool to scrape real news from the official Twitter pages of Voice of America (VOA) and the British Broadcasting Corporation (BBC) Afaan Oromo, which are authorized by the government and verified by the platform, and we used Facepager to extract Afaan Oromo news from Facebook accounts and pages that post news in different languages and can easily mislead or misinform large parts of society.
To reduce the cost of identifying fake news, the fake-news articles were drawn from misinformation posted by fake accounts that Facebook has been working to stamp out, and we applied the criteria used by government media, which judge whether news is real or fake based on its source, its credibility, and the legitimacy of the poster (for example, the profile image). We also defined our own criteria for content that would be taken off the page:
- The article never follows up on the headline's premise and instead descends into speculation, opinion, and statements with no sources or links to supporting research. Check the source of the news: is it a mainstream news organization with a good reputation, or not?
- A headline that reads like the summary of an opinion is not to be trusted.
- The story does not come from a neutral source.
- Other stories published by the same source or account should be examined to see whether at least one of them has been flagged as a hoax.
- The authors give the impression of being emotionally invested in their subjects (and hence biased), for example appearing enraged by or in awe of them.
- Checking the dates is also key, because an old story may have been republished even though it is no longer current or relevant.
3.2. Data set preprocessing
Data preprocessing is a crucial phase of any text analysis. News articles contain unwanted text that can hinder model efficiency. Our dataset comprises 1838 samples, each consisting of a news headline, body, and label (0 for fake news, 1 for real news). The headline and body of each article were merged together, and the data required special preprocessing before applying deep learning algorithms to Afaan Oromo fake news detection. To reduce the size of the data, we applied generic refinement steps such as stop-word removal, tokenization, and normalization to remove irrelevant information from the dataset, as sketched below.
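The following is a minimal preprocessing sketch, assuming the merged headline and body texts are available as strings; the stop-word list AFAAN_OROMO_STOPWORDS is a small placeholder for illustration, not the actual list used in this study.

```python
import re

# Placeholder: a few illustrative Afaan Oromo stop words; the real list is much longer.
AFAAN_OROMO_STOPWORDS = {"fi", "kan", "akka", "irraa", "keessa"}

def preprocess(text):
    """Normalize, tokenize, and remove stop words from one news article."""
    text = text.lower()                             # normalization: lowercase
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # drop URLs left over from scraping
    text = re.sub(r"[^a-z']+", " ", text)           # keep alphabetic tokens (Afaan Oromo uses Latin script)
    tokens = text.split()                           # whitespace tokenization
    return [t for t in tokens if t not in AFAAN_OROMO_STOPWORDS]

# Example: merge a (hypothetical) headline and body, then clean them.
sample = {"headline": "Oduu haaraa", "body": "Odeeffannoon kun ..."}
clean_tokens = preprocess(sample["headline"] + " " + sample["body"])
```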
3.3. Development tools
The following are the development tools that were used in this study to detect fake news.
Octoparse is a web scraping tool that allows data to be extracted from Twitter and other websites without writing any code. It was used in the data collection process to build the dataset. It provides an easy-to-use interface, a 24/7 cloud service, and works with most websites.
Colaboratory, or simply Colab, is a Google Research tool that allows developers to write and execute Python code directly from their browser. It is a hosted Jupyter notebook that requires no installation and includes a free tier that offers access to Google computing resources such as GPUs and TPUs. It aids the efficient execution of deep learning algorithms that consume a lot of resources and time. Furthermore, most commonly used packages do not need to be installed explicitly.
Spyder is a cross-platform, open-source integrated development environment (IDE) for scientific Python programming.
3.4. Proposed System Architecture
In order to create an artifact capable of detecting fake news, the news text is first preprocessed to remove unwanted characteristics left over from the data acquisition phase; the cleaning phase goes beyond removing non-textual characters and also fixes spelling and syntax errors, standardizes the data, and corrects mistakes. Word embedding is then used as a feature representation that captures the semantics of the words. The dataset is split into training and test sets so that the model's accuracy and precision can be checked by training and testing it; the model is fitted on the training data. The embedded words are fed into a deep learning model consisting of an input layer, hidden layers, and an output layer. The input layer receives the feature vectors and passes them to the hidden layers, which extract higher-level features, and the output layer takes the highest-scoring class as the prediction. Training is carried out on the news article corpus, and the resulting models are used to predict the labels of unseen data. A minimal sketch of the split-and-train flow is given below. In the following sections, we describe each component of the architecture in detail.
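As a rough illustration of this pipeline (not the exact implementation used in this study), the cleaned articles can be converted to padded index sequences and divided into training and test sets; the variables texts and labels, the 80/20 split ratio, and the max_words and max_len values are assumptions.

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# texts: list of cleaned article strings; labels: list of 0 (fake) / 1 (real)
max_words, max_len = 20000, 300   # assumed vocabulary size and sequence length

tokenizer = Tokenizer(num_words=max_words, oov_token="<unk>")
tokenizer.fit_on_texts(texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)

# Hold out 20% of the articles for testing the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    sequences, labels, test_size=0.2, random_state=42, stratify=labels)
```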
3.5. Feature extraction and Word embedding
The biggest advantage of deep learning is that, unlike classical machine learning, we do not need to manually extract features from the dataset: the network learns to extract features during training. After preprocessing the Afaan Oromo fake news data set, word embedding is used as a learned representation of the text.
Word2vec generates vectors based on numerical representations of word elements, as well as attributes such as each word's context. It accomplishes this without the need for human intervention. The vector space is sometimes referred to as a "semantic" or "contextual" vector space. Word2vec gives a much better representation of words as vectors because it exploits the relationships between surrounding words, and words that are close to one another in the Euclidean-distance sense are semantically similar in this vector space. For example, the Afaan Oromo words “Konkolaata” ("vehicle") and “Motora” ("engine") are semantically close to each other. Word2vec does this in a fully unsupervised way, using only word distributions and a surrounding window of context words, as illustrated in the sketch below.
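A minimal word2vec training sketch with the gensim library is shown below, assuming tokenized_articles is the list of token lists produced during preprocessing; the vector size, window, and skip-gram settings are illustrative choices, not necessarily those used in this work.

```python
from gensim.models import Word2Vec

# tokenized_articles: list of token lists, e.g. [["konkolaataa", "motora", ...], ...]
w2v = Word2Vec(
    sentences=tokenized_articles,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window of surrounding words
    min_count=2,       # ignore very rare tokens
    sg=1,              # skip-gram variant
    workers=4,
)

# Words that occur in similar contexts end up close together in the vector space
# (the query word must appear in the training vocabulary).
print(w2v.wv.most_similar("konkolaataa", topn=5))
```

The resulting vectors can, for instance, be used to initialize the embedding layer of the downstream classifier rather than learning the embeddings from scratch.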
3.6. Deep learning Classification and Detection Technique
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. They were designed to prevent the vanishing gradient problem of RNNs (Leea & Song, 2020). Learning such dependencies is required in complex problem domains such as machine translation and speech recognition. LSTMs are a complex area of deep learning. The main reason for employing LSTMs is their ability to recognize patterns in sequential data; although training can be slow due to their complexity, a few notable variants significantly outperform the standard LSTM's accuracy. A sketch of an LSTM-based classifier is given below.
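The following Keras sketch shows the general shape of such an LSTM classifier for binary fake/real prediction, reusing max_words, max_len, X_train, and y_train from the earlier sketch; the layer sizes, dropout rate, trainable embedding layer, and training settings are assumptions for illustration, not the exact configuration used in this study.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential([
    Embedding(input_dim=max_words, output_dim=100),  # learned word vectors
    LSTM(128),                       # sequence layer capturing order dependence
    Dropout(0.3),                    # regularization against overfitting
    Dense(1, activation="sigmoid"),  # probability of the "real" class (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_split=0.1, epochs=5, batch_size=64)
```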
A Gated Recurrent Unit (GRU) is a variant of the RNN architecture that deploys a gating procedure to control the flow of information between cells in the network. The GRU has simplified gates, which makes it easier to understand and simpler in design than the LSTM. Unlike long short-term memory, it does not have a cell state, so it has only two gates, the reset gate and the update gate, as opposed to the LSTM's four gates (Cueva et al., 2020). The update gate performs the role of both the forget and input gates in long short-term memory, throwing old information away and introducing new data. The standard gate equations are sketched below.
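For reference, the commonly used GRU formulation (which may differ in notation from the cited work) defines the gates and hidden state at time t as:

\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
\]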
To overcome the limitation of the unidirectional LSTM cell, which can capture only previous context and cannot utilize future context, Schuster and Paliwal invented the Bi-LSTM, which combines two separate hidden LSTM layers of opposite directions feeding the same output (Zaman et al., 2020). A Bi-Directional Long Short-Term Memory network is a sequence processing model consisting of two LSTM layers that pass through the input sequence in both directions at the same time. It resolves the long-term dependency problem of the recurrent neural network (RNN) using both the hidden state and the cell state, which is a memory for storing past input information, together with the gates that regulate the ability to remove or add information to the cell state. The multiplicative gates and memory are defined for time t as sketched below (Kong et al., 2020).
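In the commonly used formulation (which may differ in notation from Kong et al., 2020), the LSTM gates and memory at time t are:

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
\]

In the Bi-LSTM, these equations are applied once over the sequence in the forward direction and once in the backward direction, and the two hidden states are combined at each time step (in Keras, for example, by wrapping an LSTM layer in a Bidirectional wrapper).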
3.7. Evaluation
In this research work, the performance of the described deep-learning models on the Afaan Oromo fake news dataset is evaluated using the confusion matrix, which represents the four possible outcomes of the classification (true positives, true negatives, false positives, and false negatives).
From these outcomes, we derive the following metrics, which are used to evaluate our models; a short computation sketch follows the list:
- Accuracy: the ratio of correct predictions to the total number of predictions. It is expressed as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision: the proportion of positive identifications that were actually correct. It is expressed as:
Precision = TP / (TP + FP)
- Recall: the proportion of actual positive cases that were correctly identified. It is expressed as:
Recall = TP / (TP + FN)
- F1-score: the harmonic mean of precision and recall. It is expressed as:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
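A minimal evaluation sketch using scikit-learn is shown below, assuming the model, X_test, and y_test objects from the earlier sketches; the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Convert predicted probabilities into hard labels (0 = fake, 1 = real).
y_prob = model.predict(X_test)
y_pred = (y_prob >= 0.5).astype(int).ravel()

print(confusion_matrix(y_test, y_pred))          # rows: [TN, FP], [FN, TP]
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```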