In this section we explain how we collected, labeled and preprocessed the data for classifying the news in our dataset using natural language processing and machine learning methods. Using the dataset prepared through these steps, the tweets we retrieved from Twitter were classified with fastText. The methodology of the study is given in Figure-1.
3.1. Data Resources
As internet access becomes more widespread, the use of social media has increased significantly, both in the number of users and in the volume of content produced [37]. This rapid increase in Internet access has led to an explosion of online forums and social media groups, transforming the information seeking practices of Turkish Internet users [38]. Towards the end of the 2000s, such groups and forums became a main source of urban legends and misinformation, creating a need for platforms that could verify the accuracy of these claims. Fact checking is often one of the main antidotes to the problems of disinformation and fake news. The purpose of the International Fact-Checking Network (IFCN) is to bring together fact-checking platforms operating in every country in the world and to bring a degree of standardization and accountability to their activities. There are 77 active fact-checking platforms worldwide that are IFCN members. In Turkey, only Doğruluk Payı and Teyit.org are IFCN members [14]. Within the scope of the research, we built the dataset from the Twitter accounts of fact-checking platforms. The platforms we used in Turkey are shown in Table-1.
Table-1. Fact-checking platforms used for the fake news dataset within the scope of the research
Twitter Account | Year of foundation | Task | IFCN Member? |
malumatfurusorg | 2009 | Fact-checking newspaper columnists | No |
dogrulukpayicom | 2014 | Verifying political statements | Yes |
teyitorg | 2015 | Refuting fake news allegations | Yes |
gununyalanlari | 2015 | Refuting fake news allegations | No |
dogrulaorg | 2017 | Refuting fake news allegations | No |
As seen in Table-1, we obtained our fake news dataset from these Twitter accounts. Two of the accounts are IFCN members and the remaining three comply with IFCN rules. Their common feature is that they are based in Turkey and post Turkish tweets about fake news spread through online channels, unconfirmed news circulating on social media, and urban legends. In total, we retrieved 16,250 tweets from these 5 Twitter users. Since these accounts also confirm true news, only the fake news tweets were manually labeled during data preprocessing, which reduced the number of usable tweets.
Within the scope of the study, real news tweets were drawn from different news sources on the Twitter platform. We built our real news dataset from verified news accounts on Twitter. Twitter accounts of local and lesser-known news sources were excluded because they would be difficult to verify. In total, we retrieved 61,750 real news tweets from 19 Twitter users. This number decreased somewhat after the data preprocessing steps, in line with the requirement, mentioned below, that tweets must consist of text and be homogeneous in length.
This section explains the news sources from which the news tweets were collected and how the data were verified and validated. Following [39] and the study we used as a guide after reviewing the literature [40], the main requirements for compiling the news, together with our reasons for choosing Twitter as the social media platform, are as follows:
- Tweets contain both fake and real news items,
- Tweets consist of text only,
- Twitter is easy for researchers like us to access via its API,
- Tweets have a verifiable truth value,
- Tweet length is homogeneous,
- Fake and real news tweets are presented in the same way and for the same purpose,
- The time zone is accessible when requested,
- The dataset can be presented as open source to the public,
- Language and cultural differences are respected.
3.2. Data gathering and labeling
Before starting the dataset creation phase, we checked the literature for an existing dataset of the kind we needed, consisting of Turkish tweets, but none was found. To be precise, the data we were looking for is Twitter data containing both fake and real news.
We collected the tweets in question, Turkish tweets containing fake and real news, from Twitter using the Twitter Standard API and Python's tweepy module. The Standard API allows users to download tweets published in the last 7 days and, under normal conditions, returns up to 200 tweets per user request. However, during our dataset improvement work, when we pulled tweets from Twitter in the ".json" format, we could retrieve 3250 tweets per user at once. In this way, we were able to obtain tweets without the last-7-days filter.
At this stage, word clouds were also created for the words in the fake and real news tweets of the dataset prepared for the study. The function used for the word cloud is as follows:
from wordcloud import WordCloud
wordcloud = WordCloud(width=1000, height=500, background_color='black', max_words=50).fit_words(freq)
Here, the 50 most frequently used words are placed in a 1000×500 pixel area. The word cloud created for real news is shown in Figure-2.
The word cloud created for fake news is shown in Figure-3.
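The `freq` argument passed to `fit_words` is a word-to-count mapping. A minimal sketch of building it from tokenized tweets with Python's standard library (the tweet list below is illustrative, not from the study's corpus):

```python
from collections import Counter

# Illustrative tokenized tweets; in the study these come from the preprocessed corpus
tweets = [
    ["asi", "randevu", "sistem"],
    ["asi", "yan", "etki"],
    ["asi", "randevu", "iptal"],
]

# Count every token across all tweets and keep the 50 most frequent
counts = Counter(token for tweet in tweets for token in tweet)
freq = dict(counts.most_common(50))

print(freq["asi"])  # → 3
```

The resulting `freq` dictionary can be passed directly to `WordCloud.fit_words`.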
3.3. Data preprocessing
For text classification to be carried out effectively and for the classification model to be created, the dataset must be preprocessed. The literature shows that classification results obtained on preprocessed datasets are better [41]. The purpose of the preprocessing steps is to achieve better performance with a smaller vector space and lower dimensionality. We used the open source Zemberek library [42] for the natural language processing steps on the dataset prepared for the study; code development with the library was done in the Java programming language. After the preprocessing steps, 3848 tweet sentences remained in our data collection, chosen to keep the sets balanced. The preprocessing steps are listed below.
- Duplicate tweets were deleted.
- Special characters, emoticons and punctuation marks attached to words were deleted, as they also affect natural language processing.
- Spelling errors and missing letters entered by users were corrected across the data corpus through normalization.
- Mentions and hashtags were deleted.
- Stop words that add no meaning to the sentence were deleted.
- Words were converted to lowercase.
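Zemberek performs the normalization and lemmatization steps in Java; the surface-level cleaning steps listed above can be sketched in Python as follows. The stop-word list here is a tiny illustrative sample, not the full list used in the study:

```python
import re

STOP_WORDS = {"ve", "bir", "bu", "da", "de"}  # illustrative sample of Turkish stop words

def clean_tweet(text: str) -> str:
    """Apply the surface-level cleaning steps to one tweet."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)       # remove mentions and hashtags
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation, emoticons, special characters
    text = text.lower()                        # convert to lowercase
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)

def preprocess(tweets):
    """Clean every tweet and drop duplicates, preserving order."""
    seen, cleaned = set(), []
    for tweet in tweets:
        c = clean_tweet(tweet)
        if c and c not in seen:
            seen.add(c)
            cleaned.append(c)
    return cleaned

print(clean_tweet("#Canlı: yoğun kar yağışı https://t.co/8sULz7qdKO"))  # → yoğun kar yağışı
```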
As a result of the preprocessing, the raw and rooted (lemmatized) versions of 3 sample tweets from the data corpus are shown in Table-2. The first two examples are fake news tweets and the third is a real news tweet.
Table-2. Raw (pre-processing) and rooted (post-processing) versions of fake and real news examples from the dataset
Sample 1 | Raw tweet | ❌Diş macunundaki florürün epifiz bezinin kireçlenmesinin sebebi oldu https://t.co/Keb745abRd |
English Version | ❌Fluoride in toothpaste caused calcification of the pineal gland https://t.co/Keb745abRd |
Rooted tweet | diş macun florür epifiz bez kireçlemek sebep olmak |
English Version | toothpaste fluoride pineal gland to cause calcification |
Sample 2 | Raw tweet | @SibeLiks Küba’da ilaç çalışmaları için denek olarak tecavüzcüler kullanıldı ❌\n\nhttps://t.co/PwhlTwbh45 |
English Version | @SibeLiks Rapists used as test subjects for drug trials in Cuba ❌\n\nhttps://t.co/PwhlTwbh45 |
Rooted tweet | Küba ilaç çalışmak denemek olmak tecavüz kullanmak |
English Version | Cuba drug work try be rape use |
Sample 3 | Raw tweet | #Canlı: Mardin'in Kızıltepe ilçesinde yoğun kar yağışı nedeniyle zincirleme trafik kazası meydana geldi https://t.co/8sULz7qdKO |
English Version | #Live: A chain traffic accident occurred due to heavy snowfall in Kızıltepe district of Mardin https://t.co/8sULz7qdKO |
Rooted tweet | Mardin Kızıltepe ilçe yoğun kar yağmak neden zincir trafik kaza meydan gelmek |
English Version | Mardin Kızıltepe district causes heavy snowfall occur chain traffic accident |
3.4. Training and test set preparation
The classification of fake and real news is a subtask of text classification. Developing language models is also an important step in facilitating detection and in generating training and test data. The most important consideration when creating training and test data is that the sets are balanced: in our dataset, tweets with different labels should appear proportionally in both the training and test sets. The test set was prepared to contain 20% as many tweets as the training set. In line with the balanced-dataset approach, fake and real news each make up half of the training and test sets. Otherwise, different results could be obtained.
As a result of the literature review, stratified random sampling was judged to be appropriate for our study. The advantages of this sampling model are that it produces better results than other sampling models, is easy to apply and for researchers to understand, and gives good results even for very small datasets [42].
The scikit-learn library was used with the Python programming language to split our dataset with stratified random sampling for 10-fold cross-validation [43].
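A minimal sketch of this split with scikit-learn's `StratifiedKFold`; the labels below are synthetic stand-ins for the balanced fake/real tweet labels:

```python
from sklearn.model_selection import StratifiedKFold

# Synthetic balanced labels standing in for the fake/real tweet labels
texts = [f"tweet {i}" for i in range(40)]
labels = ["fake", "real"] * 20

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(texts, labels):
    train_labels = [labels[i] for i in train_idx]
    test_labels = [labels[i] for i in test_idx]
    # Each fold preserves the fake/real ratio in both training and test parts
    assert train_labels.count("fake") == train_labels.count("real")
    assert test_labels.count("fake") == test_labels.count("real")
```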
3.5. fastText usage
fastText is a library developed by the Facebook research team to learn word embeddings and perform text classification efficiently [44]. The most important feature of the Turkish language is its richness in the definition, analysis and identification of word structure; in other words, the morphological richness of Turkish. When other studies in the literature are examined, one of the main contributions of the fastText word embedding model is that it takes the internal structure of words into account while learning word representations, which is useful for Turkish [25].
One well-known word embedding approach is word2vec. However, the word representation approach of the fastText word embedding model differs from word2vec and others: fastText assumes that a word is composed of character n-grams, while word2vec treats each word as the smallest unit. The length n can differ from the number of letters in the word. This method therefore stores word vectors as character n-grams, which helps find vector representations for words that do not appear directly in the dictionary. Table-3 lists the parameters of our fastText word embedding model with their explanations.
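The character n-gram idea can be illustrated with a short function. fastText internally brackets each word with "<" and ">" boundary markers before extracting n-grams; the function below follows that convention for n = 3:

```python
def char_ngrams(word: str, n: int = 3):
    """Extract character n-grams from a word, using fastText's < > boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("kar"))  # → ['<ka', 'kar', 'ar>']
```

This is why suffixed Turkish forms such as "karlı" share subword vectors ('<ka', 'kar') with "kar" even if the full form never appears in the training vocabulary.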
Table-3. Hyperparameters used in fastText model
Hyperparameter | Explanation | Value Used |
lr | Learning rate | 0.1 |
epoch | Number of epochs | 75 |
dim | Size of word vectors | 300 |
neg | Number of negatives sampled | 5 |
ws | Window size | 2 |
wordNgrams | Maximum length of word n-grams | 3 |
The learning rate is a hyperparameter that determines the step size of weight updates during training. The epoch parameter is the number of forward and backward passes over the training set. The size of the word vectors is one of the most critical parameters in the fastText classification model: making the dimension too large can reduce classification accuracy, as redundant words are included in the word similarity determination phase. The context window size indicates the number of surrounding words used to determine the context of each word.
One of the distinguishing aspects of this study is that fastText had not previously been applied to Twitter datasets for fake news classification.
Within the scope of the study, the aim is binary classification, to determine whether the news in our dataset is fake or real. In this context, the dataset was arranged to contain 1942 "real" and 1942 "fake" news items, in accordance with binary classification. To run the fastText model, we prepended "__label__fake" or "__label__real" to each item. Two methods were applied to train a model using fastText with the news corpus prepared for this study. First, the hyperparameter values presented in Table-3 were tried in different combinations to obtain the best results. Second, fastText's autotune feature was used to automatically find the best hyperparameters within a given time frame. Models were tested with test data prepared for 10-fold cross-validation with stratified sampling. Commonly used metrics for evaluating the performance of classification models are accuracy, precision, recall, and F1-score. Calculating these metrics requires 4 quantities: True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).
- TP: Positive class predicted correctly.
- TN: Negative class predicted correctly.
- FP: Positive class predicted incorrectly.
- FN: Negative class predicted incorrectly.
The formulas for the performance metrics used to evaluate the models in this study are presented in equations (1), (2), (3) and (4).
$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
1
$$Precision=\frac{TP}{TP+FP}$$
2
$$Recall=\frac{TP}{TP+FN}$$
3
$$F1\text{-}Measure=\frac{2\times TP}{2\times TP+FP+FN}$$
4
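These formulas can be checked with a small worked example; the confusion-matrix counts below are arbitrary illustrative values:

```python
TP, TN, FP, FN = 90, 80, 20, 10  # arbitrary example counts

accuracy = (TP + TN) / (TP + TN + FP + FN)  # (90+80)/200 = 0.85
precision = TP / (TP + FP)                  # 90/110 ≈ 0.818
recall = TP / (TP + FN)                     # 90/100 = 0.90
f1 = 2 * TP / (2 * TP + FP + FN)            # 180/210 ≈ 0.857

# Equation (4) is equivalent to the harmonic mean of precision and recall
assert abs(f1 - 2 * precision * recall / (precision + recall)) < 1e-12
print(round(accuracy, 2), round(recall, 2))  # → 0.85 0.9
```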
fastText does not automatically calculate the evaluation metrics described above for binary classification. Therefore, scikit-learn library functions were used to calculate them. Training and testing were carried out with 10-fold cross-validation, and the best performance results obtained in the tests are presented in Table-4.
Table-4. Best performance results with fastText
Set status | Accuracy | Label | Precision | Recall | F1-Measure | Support |
Raw form | 0.84 | Real | 0.84 | 0.84 | 0.84 | 400 |
Fake | 0.84 | 0.84 | 0.84 | 400 |
Rooted form | 0.88 | Real | 0.84 | 0.94 | 0.89 | 392 |
Fake | 0.93 | 0.82 | 0.87 | 392 |
Autotuned form | 0.83 | Real | 0.85 | 0.81 | 0.83 | 392 |
Fake | 0.81 | 0.86 | 0.84 | 392 |
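For reference, the fastText input format described above (a "__label__fake" or "__label__real" prefix on each line) can be prepared with a few lines of Python. The training call is shown only as a comment, since it requires the fasttext package and a training file on disk:

```python
def to_fasttext_line(text: str, label: str) -> str:
    """Format one example in fastText's supervised input format."""
    return f"__label__{label} {text}"

examples = [
    ("diş macun florür epifiz bez kireçlemek sebep olmak", "fake"),
    ("mardin kızıltepe ilçe yoğun kar yağmak", "real"),
]
lines = [to_fasttext_line(text, label) for text, label in examples]
print(lines[0].split()[0])  # → __label__fake

# Training would then look like (hyperparameter values from Table-3):
# import fasttext
# model = fasttext.train_supervised("train.txt", lr=0.1, epoch=75, dim=300,
#                                   neg=5, ws=2, wordNgrams=3)
```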
3.6. Machine learning models
In the study, we also used 7 classification algorithms: MultinomialNB, XGBoost, Random Forest, Logistic Regression, K-NN, SGD and SVM.
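A sketch of training several of these scikit-learn classifiers over a shared TF-IDF representation. XGBoost lives in the separate xgboost package and is omitted here, Random Forest and K-NN are left out for brevity, and the four-document corpus is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny illustrative corpus; the study uses the rooted tweet corpus
texts = ["aşı zarar iddia", "kar yağmak kaza", "uzaylı görmek iddia", "maç sonuç açıklamak"] * 5
labels = ["fake", "real", "fake", "real"] * 5

models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(random_state=42),
    "SVM": SVC(),
}

for name, clf in models.items():
    # Each classifier sees the same TF-IDF features via a pipeline
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    acc = pipe.score(texts, labels)  # training accuracy on the toy corpus
    assert 0.0 <= acc <= 1.0
```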
We evaluated the machine learning models used in the study in detail, testing them on the rooted version of our dataset. The evaluation metrics were accuracy, precision, recall and F1-score, as with our fastText model. The results are shown in Table-5.
Table-5. Results obtained with machine learning algorithms
Algorithm | Accuracy | Label | Precision | Recall | F1-Measure | Support |
MultinomialNB | 0.82 | Real | 0.83 | 0.76 | 0.80 | 362 |
Fake | 0.81 | 0.87 | 0.83 | 415 |
SGDClassifier | 0.80 | Real | 0.76 | 0.85 | 0.80 | 362 |
Fake | 0.85 | 0.76 | 0.80 | 415 |
Logistic Regression | 0.79 | Real | 0.78 | 0.75 | 0.77 | 362 |
Fake | 0.79 | 0.81 | 0.80 | 415 |
Random Forest | 0.75 | Real | 0.78 | 0.66 | 0.71 | 362 |
Fake | 0.74 | 0.84 | 0.78 | 415 |
SVM | 0.82 | Real | 0.78 | 0.85 | 0.81 | 362 |
Fake | 0.86 | 0.80 | 0.83 | 415 |
XGBoost | 0.69 | Real | 0.62 | 0.85 | 0.72 | 362 |
Fake | 0.81 | 0.55 | 0.66 | 415 |
K-NN | 0.76 | Real | 0.83 | 0.62 | 0.71 | 362 |
Fake | 0.73 | 0.89 | 0.80 | 415 |
As a result of the calculations, the SVM classifier gave the best accuracy value. However, k-fold cross-validation was performed to ensure that no mistakes were made in choosing the training and test data and that there was no duplication or memorization in the dataset. The k value was set to 10, as in our fastText model; that is, the dataset was divided into 10 equal parts and the accuracy was calculated 10 times with different splits. The results of this calculation are presented in Table-6.
Table-6. Results obtained with machine learning algorithm after 10-fold cross validation
10-fold cross validation | Algorithm | Accuracy |
SVM | 0.83 (+/- 0.04) |
Naive Bayes | 0.82 (+/- 0.04) |
SGD | 0.82 (+/- 0.04) |
Logistic Regression | 0.77 (+/- 0.06) |
Random Forest | 0.84 (+/- 0.04) |
KNN | 0.75 (+/- 0.04) |
XGBoost | 0.72 (+/- 0.05) |
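Scores in the "mean (+/- 2*std)" form shown above can be produced with scikit-learn's `cross_val_score`; a sketch on a toy balanced corpus, assuming the SVM pipeline, follows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy balanced corpus; the study uses the full rooted tweet corpus
texts = ["aşı zarar iddia", "kar yağmak kaza"] * 20
labels = ["fake", "real"] * 20

pipe = make_pipeline(TfidfVectorizer(), SVC())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")

# Same "mean (+/- 2*std)" presentation as the table above
print(f"{scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```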