In this section we explain how we collected, labeled and preprocessed the data for classifying the news in our dataset using natural language processing and machine learning methods. Using the dataset prepared through these steps, the tweets we retrieved from Twitter were classified with fastText. The methodology of the study is given in Figure-1.
3.1. Data Resources
As internet access becomes more widespread, the use of social media has increased significantly, both in the number of users and in the volume of content produced [37]. This rapid increase in Internet access has led to an explosion of online forums and social media groups, transforming the information seeking practices of Turkish Internet users [38]. Towards the end of the 2000s, such groups and forums became a main source of urban legends and misinformation, creating a need for platforms that could verify the accuracy of these claims. Fact checking is often one of the main antidotes to the problems of disinformation and fake news. The purpose of the International Fact-Checking Network (IFCN) is to bring together fact-checking platforms operating in every country in the world and to bring a degree of standardization and accountability to their activities. There are 77 active fact-checking platforms worldwide that are IFCN members. In Turkey, only Doğruluk Payı and Teyit.org are IFCN members [14]. Within the scope of the research, we built the dataset from the Twitter accounts of fact-checking platforms. The platforms we used in Turkey are shown in Table-1.
Table-1. Fact-checking platforms used for the fake news dataset within the scope of the research
Twitter Account | Year of foundation | Task | IFCN Member? |
malumatfurusorg | 2009 | Fact-checking newspaper columnists | No |
dogrulukpayicom | 2014 | Verifying political statements | Yes |
teyitorg | 2015 | Refuting fake news allegations | Yes |
gununyalanlari | 2015 | Refuting fake news allegations | No |
dogrulaorg | 2017 | Refuting fake news allegations | No |
As seen in Table-1, we obtained our fake news dataset from these Twitter accounts. Two of the accounts are IFCN members and the remaining three comply with IFCN rules. Their common feature is that they are based in Turkey and post Turkish tweets about fake news spread through online channels, unconfirmed news circulating on social media, and urban legends. In total, we retrieved 16,250 tweets from these 5 Twitter users. Since these accounts also confirm true news, only the fake news tweets were manually labeled during data preprocessing, which reduced the number of usable tweets.
Within the scope of the study, real news tweets were drawn from different news sources on the Twitter platform. We built our real news dataset from verified news accounts on Twitter. Twitter accounts of local and lesser-known news sources were excluded because they would be difficult to verify. In total, we retrieved 61,750 real news tweets from 19 Twitter users. This number decreased somewhat after the data preprocessing steps, in line with the requirement, mentioned below, that tweets must consist of text and be homogeneous in length.
This section explains the news sources from which the news tweets were collected and how the data were verified and validated. Following [39] and the study we used as a guide after reviewing the literature [40], the main requirements for compiling the news, together with our reasons for choosing Twitter as the social media platform, are as follows:
- Tweets contain both fake and real news items,
- Tweets consist of text only,
- Twitter is easy for researchers like us to access via its API,
- Tweets have a verifiable truth value,
- Tweet length is homogeneous,
- Fake and real news tweets are presented in the same way and for the same purpose,
- The time zone is accessible when requested,
- The dataset can be presented as open source to the public,
- Language and cultural differences are respected.
3.2. Data gathering and labeling
Before starting the dataset creation phase, we checked the literature for an existing dataset of the kind we needed, consisting of Turkish tweets, but none was found. To be precise, the data we were looking for is Twitter data containing both fake and real news.
We collected the tweets in question, Turkish tweets containing fake and real news, from Twitter using the Twitter Standard API and Python's tweepy module. The Standard API allows users to download tweets published in the last 7 days and, under normal conditions, returns up to 200 tweets per user request. However, during our dataset improvement work, when we pulled tweets from Twitter in the ".json" format, we could retrieve 3250 tweets per user at once. In this way, we were able to obtain tweets without the last-7-days filter.
At this stage, word clouds were also created for the words in the fake and real news tweets of the dataset prepared for the study. The function used for the word cloud is as follows:
from wordcloud import WordCloud
wordcloud = WordCloud(width=1000, height=500, background_color='black', max_words=50).fit_words(freq)
Here, the 50 most frequently used words are placed in a 1000×500 pixel area. The word cloud created for real news is shown in Figure-2.
The word cloud created for fake news is shown in Figure-3.
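The `freq` argument passed to `fit_words` is a word-to-count mapping. A minimal sketch of building it from tokenized tweets with Python's standard library (the tweet list below is illustrative, not from the study's corpus):

```python
from collections import Counter

# Illustrative tokenized tweets; in the study these come from the preprocessed corpus
tweets = [
    ["asi", "randevu", "sistem"],
    ["asi", "yan", "etki"],
    ["asi", "randevu", "iptal"],
]

# Count every token across all tweets and keep the 50 most frequent
counts = Counter(token for tweet in tweets for token in tweet)
freq = dict(counts.most_common(50))

print(freq["asi"])  # → 3
```

The resulting `freq` dictionary can be passed directly to `WordCloud.fit_words`.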
3.3. Data preprocessing
For text classification to be carried out effectively and for the classification model to be created, the dataset must be preprocessed. The literature shows that classification results obtained on preprocessed datasets are better [41]. The purpose of the preprocessing steps is to achieve better performance with a smaller vector space and lower dimensionality. We used the open source Zemberek library [42] for the natural language processing steps on the dataset prepared for the study; code development with the library was done in the Java programming language. After the preprocessing steps, 3848 tweet sentences remained in our data collection, chosen to keep the sets balanced. The preprocessing steps are listed below.
- Duplicate tweets were deleted.
- Special characters, emoticons and punctuation marks attached to words were deleted, as they also affect natural language processing.
- Spelling errors and missing letters entered by users were corrected across the data corpus through normalization.
- Mentions and hashtags were deleted.
- Stop words that add no meaning to the sentence were deleted.
- Words were converted to lowercase.
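Zemberek performs the normalization and lemmatization steps in Java; the surface-level cleaning steps listed above can be sketched in Python as follows. The stop-word list here is a tiny illustrative sample, not the full list used in the study:

```python
import re

STOP_WORDS = {"ve", "bir", "bu", "da", "de"}  # illustrative sample of Turkish stop words

def clean_tweet(text: str) -> str:
    """Apply the surface-level cleaning steps to one tweet."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)       # remove mentions and hashtags
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation, emoticons, special characters
    text = text.lower()                        # convert to lowercase
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)

def preprocess(tweets):
    """Clean every tweet and drop duplicates, preserving order."""
    seen, cleaned = set(), []
    for tweet in tweets:
        c = clean_tweet(tweet)
        if c and c not in seen:
            seen.add(c)
            cleaned.append(c)
    return cleaned

print(clean_tweet("#Canlı: yoğun kar yağışı https://t.co/8sULz7qdKO"))  # → yoğun kar yağışı
```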
As a result of the preprocessing, the raw and rooted (lemmatized) versions of 3 sample tweets from the data corpus are shown in Table-2. The first two examples are fake news tweets and the third is a real news tweet.
Table-2. Raw (pre-processing) and rooted (post-processing) versions of fake and real news examples from the dataset
Sample 1 | Raw tweet | ❌Diş macunundaki florürün epifiz bezinin kireçlenmesinin sebebi oldu https://t.co/Keb745abRd |
English Version | ❌Fluoride in toothpaste caused calcification of the pineal gland https://t.co/Keb745abRd |
Rooted tweet | diş macun florür epifiz bez kireçlemek sebep olmak |
English Version | toothpaste fluoride pineal gland to cause calcification |
Sample 2 | Raw tweet | @SibeLiks Küba’da ilaç çalışmaları için denek olarak tecavüzcüler kullanıldı ❌\n\nhttps://t.co/PwhlTwbh45 |
English Version | @SibeLiks Rapists used as test subjects for drug trials in Cuba ❌\n\nhttps://t.co/PwhlTwbh45 |
Rooted tweet | Küba ilaç çalışmak denemek olmak tecavüz kullanmak |
English Version | Cuba drug work try be rape use |
Sample 3 | Raw tweet | #Canlı: Mardin'in Kızıltepe ilçesinde yoğun kar yağışı nedeniyle zincirleme trafik kazası meydana geldi https://t.co/8sULz7qdKO |
English Version | #Live: A chain traffic accident occurred due to heavy snowfall in Kızıltepe district of Mardin https://t.co/8sULz7qdKO |
Rooted tweet | Mardin Kızıltepe ilçe yoğun kar yağmak neden zincir trafik kaza meydan gelmek |
English Version | Mardin Kızıltepe district causes heavy snowfall occur chain traffic accident |
3.4. Training and test set preparation
The classification of fake and real news is a subtask of text classification. Developing language models is also an important step in facilitating detection and in generating training and test data. The most important consideration when creating training and test data is that the sets are balanced: in our dataset, tweets with different labels should appear proportionally in both the training and test sets. The test set was prepared to contain 20% as many tweets as the training set. In line with the balanced-dataset approach, fake and real news each make up half of the training and test sets. Otherwise, different results could be obtained.
As a result of the literature review, stratified random sampling was judged to be appropriate for our study. The advantages of this sampling model are that it produces better results than other sampling models, is easy to apply and for researchers to understand, and gives good results even for very small datasets [42].
The scikit-learn library was used with the Python programming language to split our dataset with stratified random sampling for 10-fold cross-validation [43].
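A minimal sketch of this split with scikit-learn's `StratifiedKFold`; the labels below are synthetic stand-ins for the balanced fake/real tweet labels:

```python
from sklearn.model_selection import StratifiedKFold

# Synthetic balanced labels standing in for the fake/real tweet labels
texts = [f"tweet {i}" for i in range(40)]
labels = ["fake", "real"] * 20

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(texts, labels):
    train_labels = [labels[i] for i in train_idx]
    test_labels = [labels[i] for i in test_idx]
    # Each fold preserves the fake/real ratio in both training and test parts
    assert train_labels.count("fake") == train_labels.count("real")
    assert test_labels.count("fake") == test_labels.count("real")
```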
3.5. fastText usage
fastText is a library developed by the Facebook research team to learn word embeddings and perform text classification efficiently [44]. The most important feature of the Turkish language is its richness in the definition, analysis and identification of word structure; in other words, the morphological richness of Turkish. When other studies in the literature are examined, one of the main contributions of the fastText word embedding model is that it takes the internal structure of words into account while learning word representations, which is useful for Turkish [25].
One well-known word embedding approach is word2vec. However, the word representation approach of the fastText word embedding model differs from word2vec and others: fastText assumes that a word is composed of character n-grams, while word2vec treats each word as the smallest unit. The length n can differ from the number of letters in the word. This method therefore stores word vectors as character n-grams, which helps find vector representations for words that do not appear directly in the dictionary. Table-3 lists the parameters of our fastText word embedding model with their explanations.
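The character n-gram idea can be illustrated with a short function. fastText internally brackets each word with "<" and ">" boundary markers before extracting n-grams; the function below follows that convention for n = 3:

```python
def char_ngrams(word: str, n: int = 3):
    """Extract character n-grams from a word, using fastText's < > boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("kar"))  # → ['<ka', 'kar', 'ar>']
```

This is why suffixed Turkish forms such as "karlı" share subword vectors ('<ka', 'kar') with "kar" even if the full form never appears in the training vocabulary.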
Table-3. Hyperparameters used in fastText model
Hyperparameter | Explanation | Value Used |
lr | Learning rate | 0.1 |
epoch | Number of epochs | 75 |
dim | Size of word vectors | 300 |
neg | Number of negatives sampled | 5 |
ws | Window size | 2 |
wordNgrams | Maximum length of word n-grams | 3 |
The learning rate is a hyperparameter that determines the step size of weight updates during training. The epoch parameter is the number of forward and backward passes over the training set. The size of the word vectors is one of the most critical parameters in the fastText classification model: making the dimension too large can reduce classification accuracy, as redundant words are included in the word similarity determination phase. The context window size indicates the number of surrounding words used to determine the context of each word.
One of the distinguishing aspects of this study is that fastText had not previously been applied to Twitter datasets for fake news classification.
Within the scope of the study, the aim is binary classification, to determine whether the news in our dataset is fake or real. In this context, the dataset was arranged to contain 1942 "real" and 1942 "fake" news items, in accordance with binary classification. To run the fastText model, we prepended "__label__fake" or "__label__real" to each item. Two methods were applied to train a model using fastText with the news corpus prepared for this study. First, the hyperparameter values presented in Table-3 were tried in different combinations to obtain the best results. Second, fastText's autotune feature was used to automatically find the best hyperparameters within a given time frame. Models were tested with test data prepared for 10-fold cross-validation with stratified sampling. Commonly used metrics for evaluating the performance of classification models are accuracy, precision, recall, and F1-score. Calculating these metrics requires 4 quantities: True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).
- TP: Positive class predicted correctly.
- TN: Negative class predicted correctly.
- FP: Positive class predicted incorrectly.
- FN: Negative class predicted incorrectly.
The formulas for the performance metrics used to evaluate the models in this study are presented in equations (1), (2), (3) and (4).
$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
1
$$Precision=\frac{TP}{TP+FP}$$
2
$$Recall=\frac{TP}{TP+FN}$$
3
$$F1\text{-}Measure=\frac{2\times TP}{2\times TP+FP+FN}$$
4
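These formulas can be checked with a small worked example; the confusion-matrix counts below are arbitrary illustrative values:

```python
TP, TN, FP, FN = 90, 80, 20, 10  # arbitrary example counts

accuracy = (TP + TN) / (TP + TN + FP + FN)  # (90+80)/200 = 0.85
precision = TP / (TP + FP)                  # 90/110 ≈ 0.818
recall = TP / (TP + FN)                     # 90/100 = 0.90
f1 = 2 * TP / (2 * TP + FP + FN)            # 180/210 ≈ 0.857

# Equation (4) is equivalent to the harmonic mean of precision and recall
assert abs(f1 - 2 * precision * recall / (precision + recall)) < 1e-12
print(round(accuracy, 2), round(recall, 2))  # → 0.85 0.9
```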
fastText does not automatically calculate the evaluation metrics described above for binary classification. Therefore, scikit-learn library functions were used to calculate them. Training and testing were carried out with 10-fold cross-validation, and the best performance results obtained in the tests are presented in Table-4.
Table-4. Best performance results with fastText
Set status | Accuracy | Label | Precision | Recall | F1-Measure | Support |
Raw form | 0.84 | Real | 0.84 | 0.84 | 0.84 | 400 |
Fake | 0.84 | 0.84 | 0.84 | 400 |
Rooted form | 0.88 | Real | 0.84 | 0.94 | 0.89 | 392 |
Fake | 0.93 | 0.82 | 0.87 | 392 |
Autotuned form | 0.83 | Real | 0.85 | 0.81 | 0.83 | 392 |
Fake | 0.81 | 0.86 | 0.84 | 392 |
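For reference, the fastText input format described above (a "__label__fake" or "__label__real" prefix on each line) can be prepared with a few lines of Python. The training call is shown only as a comment, since it requires the fasttext package and a training file on disk:

```python
def to_fasttext_line(text: str, label: str) -> str:
    """Format one example in fastText's supervised input format."""
    return f"__label__{label} {text}"

examples = [
    ("diş macun florür epifiz bez kireçlemek sebep olmak", "fake"),
    ("mardin kızıltepe ilçe yoğun kar yağmak", "real"),
]
lines = [to_fasttext_line(text, label) for text, label in examples]
print(lines[0].split()[0])  # → __label__fake

# Training would then look like (hyperparameter values from Table-3):
# import fasttext
# model = fasttext.train_supervised("train.txt", lr=0.1, epoch=75, dim=300,
#                                   neg=5, ws=2, wordNgrams=3)
```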
3.6. Machine learning models
In the study, we also used 7 classification algorithms: MultinomialNB, XGBoost, Random Forest, Logistic Regression, K-NN, SGD and SVM.
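A sketch of training several of these scikit-learn classifiers over a shared TF-IDF representation. XGBoost lives in the separate xgboost package and is omitted here, Random Forest and K-NN are left out for brevity, and the four-document corpus is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny illustrative corpus; the study uses the rooted tweet corpus
texts = ["aşı zarar iddia", "kar yağmak kaza", "uzaylı görmek iddia", "maç sonuç açıklamak"] * 5
labels = ["fake", "real", "fake", "real"] * 5

models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(random_state=42),
    "SVM": SVC(),
}

for name, clf in models.items():
    # Each classifier sees the same TF-IDF features via a pipeline
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    acc = pipe.score(texts, labels)  # training accuracy on the toy corpus
    assert 0.0 <= acc <= 1.0
```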
We evaluated the machine learning models used in the study in detail, testing them on the rooted version of our dataset. The evaluation metrics were accuracy, precision, recall and F1-score, as with our fastText model. The results are shown in Table-5.
Table-5. Results obtained with machine learning algorithms
Algorithm | Accuracy | Label | Precision | Recall | F1-Measure | Support |
MultinomialNB | 0.82 | Real | 0.83 | 0.76 | 0.80 | 362 |
Fake | 0.81 | 0.87 | 0.83 | 415 |
SGDClassifier | 0.80 | Real | 0.76 | 0.85 | 0.80 | 362 |
Fake | 0.85 | 0.76 | 0.80 | 415 |
Logistic Regression | 0.79 | Real | 0.78 | 0.75 | 0.77 | 362 |
Fake | 0.79 | 0.81 | 0.80 | 415 |
Random Forest | 0.75 | Real | 0.78 | 0.66 | 0.71 | 362 |
Fake | 0.74 | 0.84 | 0.78 | 415 |
SVM | 0.82 | Real | 0.78 | 0.85 | 0.81 | 362 |
Fake | 0.86 | 0.80 | 0.83 | 415 |
XGBoost | 0.69 | Real | 0.62 | 0.85 | 0.72 | 362 |
Fake | 0.81 | 0.55 | 0.66 | 415 |
K-NN | 0.76 | Real | 0.83 | 0.62 | 0.71 | 362 |
Fake | 0.73 | 0.89 | 0.80 | 415 |
As a result of the calculations, the SVM classifier gave the best accuracy value. However, k-fold cross-validation was performed to ensure that no mistakes were made in choosing the training and test data and that there was no duplication or memorization in the dataset. The k value was set to 10, as in our fastText model; that is, the dataset was divided into 10 equal parts and the accuracy was calculated 10 times with different splits. The results of this calculation are presented in Table-6.
Table-6. Results obtained with machine learning algorithm after 10-fold cross validation
10-fold cross validation | Algorithm | Accuracy |
SVM | 0.83 (+/- 0.04) |
Naive Bayes | 0.82 (+/- 0.04) |
SGD | 0.82 (+/- 0.04) |
Logistic Regression | 0.77 (+/- 0.06) |
Random Forest | 0.84 (+/- 0.04) |
KNN | 0.75 (+/- 0.04) |
XGBoost | 0.72 (+/- 0.05) |
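Scores in the "mean (+/- 2*std)" form shown above can be produced with scikit-learn's `cross_val_score`; a sketch on a toy balanced corpus, assuming the SVM pipeline, follows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy balanced corpus; the study uses the full rooted tweet corpus
texts = ["aşı zarar iddia", "kar yağmak kaza"] * 20
labels = ["fake", "real"] * 20

pipe = make_pipeline(TfidfVectorizer(), SVC())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")

# Same "mean (+/- 2*std)" presentation as the table above
print(f"{scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```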