Exploratory Data Analysis
Data Description
MU3D was collected by recording 80 targets speaking honestly and dishonestly about their social relationships. The dataset is divided into two parts: a video-level dataset and a target-level dataset. The video-level dataset contains variables such as Valence, indicating whether the statement in a video is negative or positive, and VidLength_ms and VidLength_sec, giving the length of each video in milliseconds and seconds, respectively. In total, the video-level dataset has 12 variables plus one label variable called Veracity; a short description of each variable is given in Table 1.
Table 1
MU3D video-level dataset variable descriptions (Hugenberg et al., 2017).
Video-Level Variables | Mean (± SD) | Description |
---|---|---|
VideoID | / | ID associated with the video. |
Veracity | / | Indicates whether the statement in the video is a truth or a lie: value of 0 indicates a lie, value of 1 indicates a truth. |
Valence | / | Indicates whether the statement in the video is negative or positive: a value of 0 indicates a negative statement, a value of 1 indicates a positive statement. |
Sex | / | Indicates target’s sex: value of 0 indicates a female target, value of 1 indicates a male target. |
Race | / | Indicates target’s race: value of 0 indicates a Black target, value of 1 indicates a White target. |
VidLength_ms | 35728.86 ± 3491.95 | Indicates length of the video in milliseconds. |
VidLength_sec | 35.73 ± 3.49 | Indicates length of the video in seconds. |
WordCount | 106.69 ± 23.48 | Indicates the number of words contained in the full transcription of the video. |
Accuracy | 0.52 ± 0.21 | Indicates average accuracy (i.e., proportion correct) across raters who viewed the video. |
TruthProp | 0.59 ± 0.18 | Indicates average truth proportion (i.e., proportion of truth responses) across raters who viewed the video. |
Attractive | 4.08 ± 0.58 | Indicates average attractiveness ratings (measured on a scale ranging from 1 “Not at all” to 7 “Extremely”) across raters who viewed the video. |
Trustworthy | 4.16 ± 0.52 | Indicates average trustworthiness ratings (measured on a scale ranging from 1 “Not at all” to 7 “Extremely”) across raters who viewed the video. |
Anxious | 3.04 ± 0.65 | Indicates average anxiousness ratings (measured on a scale ranging from 1 “Not at all” to 7 “Extremely”) across raters who viewed the video. |
Transcription | / | Full transcription of the video. |
Statistical Data Analysis
Given the data properties, normalizing the variables so that they are on the same order of magnitude is preferable for machine learning. Excluding VideoID and the final variable, Transcription, we created a normalized boxplot of all remaining variables, grouped by Veracity. As shown in Fig. 1, only the variables Accuracy, TruthProp and Trustworthy differ noticeably across Veracity, which makes it hard to select features for training a classification prediction model.
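The normalization step described above can be sketched as a simple min–max scaling so every feature lands in [0, 1]; the toy matrix below (two columns loosely resembling VidLength_sec and WordCount) is illustrative, not MU3D data:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1] so all features share the same order of magnitude."""
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)
    rng = X.max(axis=0) - mins
    rng[rng == 0] = 1.0  # guard against division by zero for constant columns
    return (X - mins) / rng

# toy example: two features on very different scales
X = np.array([[35.7, 106.0],
              [30.1,  80.0],
              [40.2, 150.0]])
X_norm = min_max_normalize(X)
```

After scaling, each column spans [0, 1], so a boxplot of the normalized variables by Veracity compares distributions on a common axis.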
A correlation scatter plot is shown in Fig. 2. It indicates that VidLength_ms and VidLength_sec, TruthProp and Trustworthy, and Accuracy and TruthProp are highly correlated; the effect here is similar to multicollinearity in linear regression. A learned model may therefore not be particularly stable against small variations in the training set, because different weight vectors can produce similar outputs. The training-set predictions, though, will be stable, and so will test predictions if they come from the same distribution. Based on the linear correlations among the variables in Fig. 2, we can reduce the feature dimension from 11 to 9 by removing VidLength_ms and TruthProp, which are linearly correlated with VidLength_sec and Accuracy, respectively.
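The correlation-based pruning above can be sketched as a greedy filter over the correlation matrix; the data and threshold below are illustrative stand-ins (VidLength_ms is constructed as exactly 1000 × VidLength_sec to mimic the MU3D redundancy):

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Keep each feature unless its |Pearson correlation| with an already-kept feature exceeds threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep, dropped = [], []
    for j in range(X.shape[1]):
        if any(corr[j, k] > threshold for k in keep):
            dropped.append(names[j])
        else:
            keep.append(j)
    return X[:, keep], [names[j] for j in keep], dropped

# toy data mimicking the redundancy in Fig. 2
rng = np.random.default_rng(0)
sec = rng.uniform(30, 40, size=50)
words = rng.integers(60, 160, size=50).astype(float)
X = np.column_stack([sec, sec * 1000, words])
names = ["VidLength_sec", "VidLength_ms", "WordCount"]
X_reduced, kept, dropped = drop_correlated(X, names)
```

On this toy frame the filter drops VidLength_ms (perfectly correlated with VidLength_sec) and keeps the uncorrelated WordCount.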
Ensemble Learning for Deception Detection
Algorithm Selection Procedure
The next step is to fit MU3D to machine learning models and see how well a computer detects lies. Based on the data properties, we first trained three different models to predict deception: a Support Vector Machine (SVM), Binary Logistic Regression (BLR) and a Random Forest (RF). These three algorithms were selected by following the algorithm flowchart in Fig. 3. The purpose of this flowchart is to provide a tool that not only helps select candidate modeling techniques but also deepens understanding of the problem itself. As shown in Fig. 3, by answering the questions in the flowchart, we initially settled on these three modeling techniques for this prediction task. In the next subsection, we explain why each of the three machine learning algorithms was chosen, based on the data properties and the model assumptions.
Preliminaries: Machine Learning Algorithms
Support Vector Machines are based on the concept of a decision plane that defines decision boundaries; a decision plane is a separating plane between sets of objects with different class memberships. SVMs have been shown to perform well in a variety of settings and are often considered one of the best “out of the box” classifiers according to James, G. et al. (2013)4.
SVM is a supervised learning method used to perform binary classification. As the statistical data analysis section shows, our data have exactly two classes: lie or truth. In addition, SVM handles real-valued features well; in our dataset, all features except Transcription are numerical, which suits SVM. SVM also performs well with a large number of features, scaling to tens, hundreds or thousands of them; our dataset has more than 10 features, which motivates choosing SVM according to Chapelle et al. (2002)5. Another reason is that SVM yields simple decision boundaries, which reduces the risk of overfitting. SVM can be framed as a linear classifier under the following two assumptions6: 1) the distance from the SVM’s classification boundary to the nearest data point should be as large as possible; candidate distance measures include the Euclidean, Manhattan, Chebyshev and Minkowski distances, where the Euclidean and Manhattan distances are special cases of the Minkowski distance7; and 2) the support vectors are the most informative data points, because they are the ones most likely to be misclassified. The primary goal of training an SVM is therefore to find the support vectors that both separate the data and maximize the margin between classes.
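A minimal linear-SVM fit can be sketched with scikit-learn; the synthetic dataset below (9 numeric features, binary 0/1 label) stands in for the MU3D video-level features and is not the paper's actual experiment:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# synthetic stand-in: 9 numeric features, binary label (0 = lie, 1 = truth)
X, y = make_classification(n_samples=320, n_features=9, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# linear-kernel SVM; C trades margin width against training errors
svm = make_pipeline(MinMaxScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_tr, y_tr)
acc = svm.score(X_te, y_te)
```

Normalization is folded into the pipeline so the margin is computed on features of comparable scale, matching the EDA step above.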
Binary logistic regression (BLR) is a regression model whose target variable is binary, that is, it can take only two values, 0 or 1. It is the most widely used regression model in deception prediction, given that the output is modeled as truth (1) or lie (0). BLR is a statistical tool that classifies the target person in an MU3D video as lying or not. BLR has two stages: training and evaluation. In the training stage, it uses video-level data from both lies and truths to build a detection model. In the evaluation stage, data not used during training is used to evaluate the detection model.
Logistic regression is also called logit (log unit) regression. A logistic regression model can be built either by combining the generalized linear model with the logit link function, or by assuming the response follows a logistic distribution. Binary logistic regression makes the following assumptions: 1) an adequate sample size, 2) absence of multicollinearity and 3) no outliers. Note that, according to the EDA, the outliers in our dataset need to be removed before fitting this model so that its assumptions are not violated.
The mathematical expression of logistic regression is
$$g\left(y\right)=\ln\left(\frac{y}{1-y}\right)={\widehat{w}}^{T}\cdot \widehat{x}$$
where \(g\left(y\right)=\ln\left(\frac{y}{1-y}\right)\) is called the logit function and serves as the link function of the generalized linear model. Solving for \(y\), the logistic regression model can be expressed as
$$y = \frac{1}{1+{e}^{-{\widehat{w}}^{T}\cdot \widehat{x}}}$$
where \(e\) is the base of the natural logarithm; the function above is called the sigmoid function.
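The link and its inverse can be checked numerically; a minimal NumPy sketch (the function names are illustrative) shows that the sigmoid undoes the logit, mirroring the two equations above:

```python
import numpy as np

def sigmoid(z):
    """Inverse of the logit link: maps a linear score w^T x to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(y):
    """Link function g(y) = ln(y / (1 - y)) for y in (0, 1)."""
    return np.log(y / (1.0 - y))

# the two functions invert each other
z = np.array([-2.0, 0.0, 3.5])
p = sigmoid(z)
z_back = logit(p)
```

A score of 0 maps to a probability of exactly 0.5, the natural decision threshold between lie (0) and truth (1).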
According to Fernández-Delgado et al. (2014)8, “The classifiers most likely to be the best are the random forest versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets.” This quote clearly points out the power of RF in classification. The basic idea of RF is to grow numerous trees and combine their results. The random forest technique does this by applying two tricks during model development. The first is bootstrap aggregation, or bagging for short. In bagging, a single decision tree is built on a random sample of the dataset covering about two-thirds of the observations (the remaining third is called the out-of-bag (oob) sample). This is repeated dozens or hundreds of times, and the results are averaged. Each tree is grown without pruning based on any error measure, so individual trees have high variance; averaging the results, however, reduces the variance without increasing the bias according to Merentitis et al. (2014)9.
The second trick random forest adds, on top of the random sampling of rows (bagging10, per Kuncheva, Ludmila I.), is to take a random sample of the input features at each split; the original bagging algorithm is presented in Fig. 4. We use the default number of randomly sampled predictors, which for classification problems is the square root of the total number of predictors. By randomly sampling features at each split, RF prevents a single highly correlated predictor from becoming the main driver in every bootstrapped tree. Averaging trees that are less correlated with one another yields a model that is more generalizable and more robust to outliers than bagging alone.
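Both tricks described above (bagging plus per-split feature sampling, with oob evaluation) are exposed directly by scikit-learn's random forest; the synthetic data below is an illustrative stand-in for MU3D:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the 9 retained video-level features
X, y = make_classification(n_samples=400, n_features=9, n_informative=5, random_state=0)

# max_features="sqrt" samples sqrt(p) predictors at each split (the classification default);
# oob_score=True evaluates each tree on the ~1/3 of rows left out of its bootstrap sample
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
oob = rf.oob_score_
```

The oob score gives a built-in estimate of generalization error without a separate validation split.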
Thomas G. Dietterich11 pointed out that the effectiveness of ensemble learning can be attributed to both statistical and computational reasons. Therefore, in this paper we apply Random Forest-based ensemble learning to improve prediction performance. We fit the MU3D video-level dataset to Random Forest-based ensemble learning models, namely RF + SVM.Linear (SVM with a linear kernel), RF + SVM.Poly (SVM with a polynomial kernel), RF + GLM (Generalized Linear Model), RF + KNN (k-Nearest Neighbors), RF + GBM (Stochastic Gradient Boosting) and RF + WSRF (Weighted Subspace Random Forest). We keep the ensembles simple, combining just one algorithm with RF at a time, to avoid overly complicated models: the more complicated the model, the more easily it overfits. Section 5 gives a comprehensive comparison of model performance based on the experiments, and concludes that our new combinations of algorithms outperform the traditional machine learning models.
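The exact scheme for combining RF with a second learner is not specified in this section; one simple possibility, sketched here purely as an assumption with scikit-learn on synthetic stand-in data, is soft voting over the two models' predicted class probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=9, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
# probability=True is needed so the SVM can contribute class probabilities
svm = make_pipeline(MinMaxScaler(), SVC(kernel="linear", probability=True, random_state=0))

# soft voting averages the two members' predicted probabilities (one RF + one partner, as in the text)
ensemble = VotingClassifier(estimators=[("rf", rf), ("svm", svm)], voting="soft")
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
```

Keeping the ensemble to exactly two members mirrors the paper's stated preference for simple combinations that are less prone to overfitting.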