To ensure valid and reliable results, we systematically designed the research methodology and employed qualitative research strategies throughout the research life cycle to achieve the objectives of the study. The research design strategies listed below were used to carry out this study.
3.1. Data set preparation
In order to create an artifact capable of recognizing fake news, a large amount of Afaan Oromo text articles of fake and real news is required. We therefore extracted the data set from Twitter and Facebook, which have a relatively large number of followers. Specifically, we used the Octoparse tool to scrape real news from the official Twitter pages of Voice of America (VOA) and the British Broadcasting Corporation (BBC) Afaan Oromo, which are authorized by the government and verified by the platform, and we used Facepager to extract Afaan Oromo news from Facebook accounts and pages that post news in different languages and can easily mislead or misinform large parts of society.
To reduce the cost of identifying fake news, the fake-news articles were drawn from misinformation posted by fake accounts that Facebook has been working to stamp out, and we applied the criteria used by government media, which judge whether news is real or fake based on its source, its credibility, and the legitimacy of the poster (for example, the profile image). We also defined our own criteria for content that would be taken off the page:
- The article never follows up on the headline's premise and instead descends into speculation, opinion, and statements with no sources or links to supporting research. Check the source of the news: is it a mainstream news organization with a good reputation, or not?
- A headline that reads like the summary of an opinion is not to be trusted.
- The story does not come from a neutral source.
- Other stories published by the same source or account should be examined to see whether at least one of them has been flagged as a hoax.
- The authors give the impression of being emotionally invested in their subjects (and hence biased), for example appearing enraged by or in awe of them.
- Checking the dates is also key, because an old story may have been republished even though it is no longer current or relevant.
3.2. Data set preprocessing
Data preprocessing is a crucial phase of any text analysis. News articles contain unwanted text that can hinder model efficiency. Our dataset comprises 1838 samples, each consisting of a news headline, body, and label (0 for fake news, 1 for real news). The headline and body of each article were merged together, and the data required special preprocessing before applying deep learning algorithms to Afaan Oromo fake news detection. To reduce the size of the data, we applied generic refinement steps such as stop-word removal, tokenization, and normalization to remove irrelevant information from the dataset, as sketched below.
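The following is a minimal preprocessing sketch, assuming the merged headline and body texts are available as strings; the stop-word list AFAAN_OROMO_STOPWORDS is a small placeholder for illustration, not the actual list used in this study.

```python
import re

# Placeholder: a few illustrative Afaan Oromo stop words; the real list is much longer.
AFAAN_OROMO_STOPWORDS = {"fi", "kan", "akka", "irraa", "keessa"}

def preprocess(text):
    """Normalize, tokenize, and remove stop words from one news article."""
    text = text.lower()                             # normalization: lowercase
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # drop URLs left over from scraping
    text = re.sub(r"[^a-z']+", " ", text)           # keep alphabetic tokens (Afaan Oromo uses Latin script)
    tokens = text.split()                           # whitespace tokenization
    return [t for t in tokens if t not in AFAAN_OROMO_STOPWORDS]

# Example: merge a (hypothetical) headline and body, then clean them.
sample = {"headline": "Oduu haaraa", "body": "Odeeffannoon kun ..."}
clean_tokens = preprocess(sample["headline"] + " " + sample["body"])
```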
3.3. Development tools
The following are the development tools that were used in this study to detect fake news.
Octoparse is a web scraping tool that allows data to be extracted from Twitter and other websites without writing any code. It was used in the data collection process to build the dataset. It provides an easy-to-use interface, a 24/7 cloud service, and works with most websites.
Colaboratory, or simply Colab, is a Google Research tool that allows developers to write and execute Python code directly from their browser. It is a hosted Jupyter notebook that requires no installation and includes a free tier that offers access to Google computing resources such as GPUs and TPUs. It aids the efficient execution of deep learning algorithms that consume a lot of resources and time. Furthermore, most commonly used packages do not need to be installed explicitly.
Spyder is a cross-platform, open-source integrated development environment (IDE) for scientific Python programming.
3.4. Proposed System Architecture
In order to create an artifact capable of detecting fake news, the news text is first preprocessed to remove unwanted characteristics left over from the data acquisition phase; the cleaning phase goes beyond removing non-textual characters and also fixes spelling and syntax errors, standardizes the data, and corrects mistakes. Word embedding is then used as a feature representation that captures the semantics of the words. The dataset is split into training and test sets so that the model's accuracy and precision can be checked by training and testing it; the model is fitted on the training data. The embedded words are fed into a deep learning model consisting of an input layer, hidden layers, and an output layer. The input layer receives the feature vectors and passes them to the hidden layers, which extract higher-level features, and the output layer takes the highest-scoring class as the prediction. Training is carried out on the news article corpus, and the resulting models are used to predict the labels of unseen data. A minimal sketch of the split-and-train flow is given below. In the following sections, we describe each component of the architecture in detail.
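As a rough illustration of this pipeline (not the exact implementation used in this study), the cleaned articles can be converted to padded index sequences and divided into training and test sets; the variables texts and labels, the 80/20 split ratio, and the max_words and max_len values are assumptions.

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# texts: list of cleaned article strings; labels: list of 0 (fake) / 1 (real)
max_words, max_len = 20000, 300   # assumed vocabulary size and sequence length

tokenizer = Tokenizer(num_words=max_words, oov_token="<unk>")
tokenizer.fit_on_texts(texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)

# Hold out 20% of the articles for testing the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    sequences, labels, test_size=0.2, random_state=42, stratify=labels)
```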
3.5. Feature extraction and Word embedding
The biggest advantage of deep learning is that, unlike classical machine learning, we do not need to manually extract features from the dataset: the network learns to extract features during training. After preprocessing the Afaan Oromo fake news data set, word embedding is used as a learned representation of the text.
Word2vec generates vectors based on numerical representations of word elements, as well as attributes such as each word's context. It accomplishes this without the need for human intervention. The vector space is sometimes referred to as a "semantic" or "contextual" vector space. Word2vec gives a much better representation of words as vectors because it exploits the relationships between surrounding words, and words that are close to one another in the Euclidean-distance sense are semantically similar in this vector space. For example, the Afaan Oromo words “Konkolaata” ("vehicle") and “Motora” ("engine") are semantically close to each other. Word2vec does this in a fully unsupervised way, using only word distributions and a surrounding window of context words, as illustrated in the sketch below.
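A minimal word2vec training sketch with the gensim library is shown below, assuming tokenized_articles is the list of token lists produced during preprocessing; the vector size, window, and skip-gram settings are illustrative choices, not necessarily those used in this work.

```python
from gensim.models import Word2Vec

# tokenized_articles: list of token lists, e.g. [["konkolaataa", "motora", ...], ...]
w2v = Word2Vec(
    sentences=tokenized_articles,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window of surrounding words
    min_count=2,       # ignore very rare tokens
    sg=1,              # skip-gram variant
    workers=4,
)

# Words that occur in similar contexts end up close together in the vector space
# (the query word must appear in the training vocabulary).
print(w2v.wv.most_similar("konkolaataa", topn=5))
```

The resulting vectors can, for instance, be used to initialize the embedding layer of the downstream classifier rather than learning the embeddings from scratch.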
3.6. Deep learning Classification and Detection Technique
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. They were designed to prevent the vanishing gradient problem of RNNs (Leea & Song, 2020). Learning such dependencies is required in complex problem domains such as machine translation and speech recognition. LSTMs are a complex area of deep learning. The main reason for employing LSTMs is their ability to recognize patterns in sequential data; although training can be slow due to their complexity, a few notable variants significantly outperform the standard LSTM's accuracy. A sketch of an LSTM-based classifier is given below.
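The following Keras sketch shows the general shape of such an LSTM classifier for binary fake/real prediction, reusing max_words, max_len, X_train, and y_train from the earlier sketch; the layer sizes, dropout rate, trainable embedding layer, and training settings are assumptions for illustration, not the exact configuration used in this study.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential([
    Embedding(input_dim=max_words, output_dim=100),  # learned word vectors
    LSTM(128),                       # sequence layer capturing order dependence
    Dropout(0.3),                    # regularization against overfitting
    Dense(1, activation="sigmoid"),  # probability of the "real" class (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_split=0.1, epochs=5, batch_size=64)
```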
A Gated Recurrent Unit (GRU) is a variant of the RNN architecture that deploys a gating procedure to control the flow of information between cells in the network. The GRU has simplified gates, which makes it easier to understand and simpler in design than the LSTM. Unlike long short-term memory, it does not have a cell state, so it has only two gates, the reset gate and the update gate, as opposed to the LSTM's four gates (Cueva et al., 2020). The update gate performs the role of both the forget and input gates in long short-term memory, throwing old information away and introducing new data. The standard gate equations are sketched below.
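For reference, the commonly used GRU formulation (which may differ in notation from the cited work) defines the gates and hidden state at time t as:

\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
\]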
To overcome the limitation of the unidirectional LSTM cell, which can capture only previous context and cannot utilize future context, Schuster and Paliwal invented the Bi-LSTM, which combines two separate hidden LSTM layers of opposite directions feeding the same output (Zaman et al., 2020). A Bi-Directional Long Short-Term Memory network is a sequence processing model consisting of two LSTM layers that pass through the input sequence in both directions at the same time. It resolves the long-term dependency problem of the recurrent neural network (RNN) using both the hidden state and the cell state, which is a memory for storing past input information, together with the gates that regulate the ability to remove or add information to the cell state. The multiplicative gates and memory are defined for time t as sketched below (Kong et al., 2020).
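In the commonly used formulation (which may differ in notation from Kong et al., 2020), the LSTM gates and memory at time t are:

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
\]

In the Bi-LSTM, these equations are applied once over the sequence in the forward direction and once in the backward direction, and the two hidden states are combined at each time step (in Keras, for example, by wrapping an LSTM layer in a Bidirectional wrapper).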
3.7. Evaluation
In this research work, the performance of the described deep-learning models on the Afaan Oromo fake news dataset is evaluated using the confusion matrix, which represents the four possible outcomes of the classification (true positives, true negatives, false positives, and false negatives).
From these outcomes, we derive the following metrics, which are used to evaluate our models; a short computation sketch follows the list:
- Accuracy: the ratio of correct predictions to the total number of predictions. It is expressed as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision: the proportion of positive identifications that were actually correct. It is expressed as:
Precision = TP / (TP + FP)
- Recall: the proportion of actual positive cases that were correctly identified. It is expressed as:
Recall = TP / (TP + FN)
- F1-score: the harmonic mean of precision and recall. It is expressed as:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
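A minimal evaluation sketch using scikit-learn is shown below, assuming the model, X_test, and y_test objects from the earlier sketches; the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Convert predicted probabilities into hard labels (0 = fake, 1 = real).
y_prob = model.predict(X_test)
y_pred = (y_prob >= 0.5).astype(int).ravel()

print(confusion_matrix(y_test, y_pred))          # rows: [TN, FP], [FN, TP]
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```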