Exploratory Data Analysis
Data Description
MU3D was collected by recording 80 targets speaking honestly and dishonestly about their social relationships. The dataset is divided into two parts: a video-level dataset and a target-level dataset. The video-level dataset contains variables such as Valence, indicating whether the statement in a video is negative or positive, and VidLength_ms and VidLength_sec, giving the length of each video in milliseconds and seconds, respectively. In total, the video-level dataset has 12 variables plus one label variable called Veracity; a short description of each variable is given in Table 1.
Table 1
MU3D video-level dataset variable descriptions (Hugenberg et al., 2017).
Video-Level Variables | Mean (± SD) | Description |
---|---|---|
VideoID | / | ID associated with the video. |
Veracity | / | Indicates whether the statement in the video is a truth or a lie: value of 0 indicates a lie, value of 1 indicates a truth. |
Valence | / | Indicates whether the statement in the video is negative or positive: a value of 0 indicates a negative statement, a value of 1 indicates a positive statement. |
Sex | / | Indicates target’s sex: value of 0 indicates a female target, value of 1 indicates a male target. |
Race | / | Indicates target’s race: value of 0 indicates a Black target, value of 1 indicates a White target. |
VidLength_ms | 35728.86 ± 3491.95 | Indicates length of the video in milliseconds. |
VidLength_sec | 35.73 ± 3.49 | Indicates length of the video in seconds. |
WordCount | 106.69 ± 23.48 | Indicates the number of words contained in the full transcription of the video. |
Accuracy | 0.52 ± 0.21 | Indicates average accuracy (i.e., proportion correct) across raters who viewed the video. |
TruthProp | 0.59 ± 0.18 | Indicates average truth proportion (i.e., proportion of truth responses) across raters who viewed the video. |
Attractive | 4.08 ± 0.58 | Indicates average attractiveness ratings (measured on a scale ranging from 1 “Not at all” to 7 “Extremely”) across raters who viewed the video. |
Trustworthy | 4.16 ± 0.52 | Indicates average trustworthiness ratings (measured on a scale ranging from 1 “Not at all” to 7 “Extremely”) across raters who viewed the video. |
Anxious | 3.04 ± 0.65 | Indicates average anxiousness ratings (measured on a scale ranging from 1 “Not at all” to 7 “Extremely”) across raters who viewed the video. |
Transcription | / | Full transcription of the video. |
Statistical Data Analysis
Given the data properties, normalizing the variables so that they are on the same order of magnitude is preferable for machine learning. Excluding VideoID and the final variable, Transcription, we created a normalized boxplot of all remaining variables, grouped by Veracity. As shown in Fig. 1, only the variables Accuracy, TruthProp and Trustworthy differ noticeably across Veracity, which makes it hard to select features for training a classification prediction model.
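The normalization step described above can be sketched as a simple min–max scaling so every feature lands in [0, 1]; the toy matrix below (two columns loosely resembling VidLength_sec and WordCount) is illustrative, not MU3D data:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1] so all features share the same order of magnitude."""
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)
    rng = X.max(axis=0) - mins
    rng[rng == 0] = 1.0  # guard against division by zero for constant columns
    return (X - mins) / rng

# toy example: two features on very different scales
X = np.array([[35.7, 106.0],
              [30.1,  80.0],
              [40.2, 150.0]])
X_norm = min_max_normalize(X)
```

After scaling, each column spans [0, 1], so a boxplot of the normalized variables by Veracity compares distributions on a common axis.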
A correlation scatter plot is shown in Fig. 2. It indicates that VidLength_ms and VidLength_sec, TruthProp and Trustworthy, and Accuracy and TruthProp are highly correlated; the effect here is similar to multicollinearity in linear regression. A learned model may therefore not be particularly stable against small variations in the training set, because different weight vectors can produce similar outputs. The training-set predictions, though, will be stable, and so will test predictions if they come from the same distribution. Based on the linear correlations among the variables in Fig. 2, we can reduce the feature dimension from 11 to 9 by removing VidLength_ms and TruthProp, which are linearly correlated with VidLength_sec and Accuracy, respectively.
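The correlation-based pruning above can be sketched as a greedy filter over the correlation matrix; the data and threshold below are illustrative stand-ins (VidLength_ms is constructed as exactly 1000 × VidLength_sec to mimic the MU3D redundancy):

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Keep each feature unless its |Pearson correlation| with an already-kept feature exceeds threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep, dropped = [], []
    for j in range(X.shape[1]):
        if any(corr[j, k] > threshold for k in keep):
            dropped.append(names[j])
        else:
            keep.append(j)
    return X[:, keep], [names[j] for j in keep], dropped

# toy data mimicking the redundancy in Fig. 2
rng = np.random.default_rng(0)
sec = rng.uniform(30, 40, size=50)
words = rng.integers(60, 160, size=50).astype(float)
X = np.column_stack([sec, sec * 1000, words])
names = ["VidLength_sec", "VidLength_ms", "WordCount"]
X_reduced, kept, dropped = drop_correlated(X, names)
```

On this toy frame the filter drops VidLength_ms (perfectly correlated with VidLength_sec) and keeps the uncorrelated WordCount.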
Ensemble Learning for Deception Detection
Algorithm Selection Procedure
The next step is to fit MU3D to machine learning models and see how well a computer detects lies. Based on the data properties, we first trained three different models to predict deception: a Support Vector Machine (SVM), Binary Logistic Regression (BLR) and a Random Forest (RF). These three algorithms were selected by following the algorithm flowchart in Fig. 3. The purpose of this flowchart is to provide a tool that not only helps select candidate modeling techniques but also deepens understanding of the problem itself. As shown in Fig. 3, by answering the questions in the flowchart, we initially settled on these three modeling techniques for this prediction task. In the next subsection, we explain why each of the three machine learning algorithms was chosen, based on the data properties and the model assumptions.
Preliminaries: Machine Learning Algorithms
Support Vector Machines are based on the concept of a decision plane that defines decision boundaries; a decision plane is a separating plane between sets of objects with different class memberships. SVMs have been shown to perform well in a variety of settings and are often considered one of the best “out of the box” classifiers according to James, G. et al. (2013)4.
SVM is a supervised learning method used to perform binary classification. As the statistical data analysis section shows, our data have exactly two classes: lie or truth. In addition, SVM handles real-valued features well; in our dataset, all features except Transcription are numerical, which suits SVM. SVM also performs well with a large number of features, scaling to tens, hundreds or thousands of them; our dataset has more than 10 features, which motivates choosing SVM according to Chapelle et al. (2002)5. Another reason is that SVM yields simple decision boundaries, which reduces the risk of overfitting. SVM can be framed as a linear classifier under the following two assumptions6: 1) the distance from the SVM’s classification boundary to the nearest data point should be as large as possible; candidate distance measures include the Euclidean, Manhattan, Chebyshev and Minkowski distances, where the Euclidean and Manhattan distances are special cases of the Minkowski distance7; and 2) the support vectors are the most informative data points, because they are the ones most likely to be misclassified. The primary goal of training an SVM is therefore to find the support vectors that both separate the data and maximize the margin between classes.
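A minimal linear-SVM fit can be sketched with scikit-learn; the synthetic dataset below (9 numeric features, binary 0/1 label) stands in for the MU3D video-level features and is not the paper's actual experiment:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# synthetic stand-in: 9 numeric features, binary label (0 = lie, 1 = truth)
X, y = make_classification(n_samples=320, n_features=9, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# linear-kernel SVM; C trades margin width against training errors
svm = make_pipeline(MinMaxScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_tr, y_tr)
acc = svm.score(X_te, y_te)
```

Normalization is folded into the pipeline so the margin is computed on features of comparable scale, matching the EDA step above.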
Binary logistic regression (BLR) is a regression model whose target variable is binary, that is, it can take only two values, 0 or 1. It is the most widely used regression model in deception prediction, given that the output is modeled as truth (1) or lie (0). BLR is a statistical tool that classifies the target person in an MU3D video as lying or not. BLR has two stages: training and evaluation. In the training stage, it uses video-level data from both lies and truths to build a detection model. In the evaluation stage, data not used during training is used to evaluate the detection model.
Logistic regression is also called logit (log unit) regression. A logistic regression model can be built either by combining the generalized linear model with the logit link function, or by assuming the response follows a logistic distribution. Binary logistic regression makes the following assumptions: 1) an adequate sample size, 2) absence of multicollinearity and 3) no outliers. Note that, according to the EDA, the outliers in our dataset need to be removed before fitting this model so that its assumptions are not violated.
The mathematical expression of logistic regression is
$$g\left(y\right)=\ln\left(\frac{y}{1-y}\right)={\widehat{w}}^{T}\cdot \widehat{x}$$
where \(g\left(y\right)=\ln\left(\frac{y}{1-y}\right)\) is called the logit function and serves as the link function of the generalized linear model. Solving for \(y\), the logistic regression model can be expressed as
$$y = \frac{1}{1+{e}^{-{\widehat{w}}^{T}\cdot \widehat{x}}}$$
where \(e\) is the base of the natural logarithm; the function above is called the sigmoid function.
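The link and its inverse can be checked numerically; a minimal NumPy sketch (the function names are illustrative) shows that the sigmoid undoes the logit, mirroring the two equations above:

```python
import numpy as np

def sigmoid(z):
    """Inverse of the logit link: maps a linear score w^T x to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(y):
    """Link function g(y) = ln(y / (1 - y)) for y in (0, 1)."""
    return np.log(y / (1.0 - y))

# the two functions invert each other
z = np.array([-2.0, 0.0, 3.5])
p = sigmoid(z)
z_back = logit(p)
```

A score of 0 maps to a probability of exactly 0.5, the natural decision threshold between lie (0) and truth (1).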
According to Fernández-Delgado et al. (2014)8, “The classifiers most likely to be the best are the random forest versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets.” This quote clearly points out the power of RF in classification. The basic idea of RF is to grow numerous trees and combine their results. The random forest technique does this by applying two tricks during model development. The first is bootstrap aggregation, or bagging for short. In bagging, a single decision tree is built on a random sample of the dataset covering about two-thirds of the observations (the remaining third is called the out-of-bag (oob) sample). This is repeated dozens or hundreds of times, and the results are averaged. Each tree is grown without pruning based on any error measure, so individual trees have high variance; averaging the results, however, reduces the variance without increasing the bias according to Merentitis et al. (2014)9.
The second trick random forest adds, on top of the random sampling of rows (bagging10, per Kuncheva, Ludmila I.), is to take a random sample of the input features at each split; the original bagging algorithm is presented in Fig. 4. We use the default number of randomly sampled predictors, which for classification problems is the square root of the total number of predictors. By randomly sampling features at each split, RF prevents a single highly correlated predictor from becoming the main driver in every bootstrapped tree. Averaging trees that are less correlated with one another yields a model that is more generalizable and more robust to outliers than bagging alone.
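Both tricks described above (bagging plus per-split feature sampling, with oob evaluation) are exposed directly by scikit-learn's random forest; the synthetic data below is an illustrative stand-in for MU3D:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the 9 retained video-level features
X, y = make_classification(n_samples=400, n_features=9, n_informative=5, random_state=0)

# max_features="sqrt" samples sqrt(p) predictors at each split (the classification default);
# oob_score=True evaluates each tree on the ~1/3 of rows left out of its bootstrap sample
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
oob = rf.oob_score_
```

The oob score gives a built-in estimate of generalization error without a separate validation split.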
Thomas G. Dietterich11 pointed out that the effectiveness of ensemble learning can be attributed to both statistical and computational reasons. Therefore, in this paper we apply Random Forest-based ensemble learning to improve prediction performance. We fit the MU3D video-level dataset to Random Forest-based ensemble learning models, namely RF + SVM.Linear (SVM with a linear kernel), RF + SVM.Poly (SVM with a polynomial kernel), RF + GLM (Generalized Linear Model), RF + KNN (k-Nearest Neighbors), RF + GBM (Stochastic Gradient Boosting) and RF + WSRF (Weighted Subspace Random Forest). We keep the ensembles simple, combining just one algorithm with RF at a time, to avoid overly complicated models: the more complicated the model, the more easily it overfits. Section 5 gives a comprehensive comparison of model performance based on the experiments, and concludes that our new combinations of algorithms outperform the traditional machine learning models.
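The exact scheme for combining RF with a second learner is not specified in this section; one simple possibility, sketched here purely as an assumption with scikit-learn on synthetic stand-in data, is soft voting over the two models' predicted class probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=9, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
# probability=True is needed so the SVM can contribute class probabilities
svm = make_pipeline(MinMaxScaler(), SVC(kernel="linear", probability=True, random_state=0))

# soft voting averages the two members' predicted probabilities (one RF + one partner, as in the text)
ensemble = VotingClassifier(estimators=[("rf", rf), ("svm", svm)], voting="soft")
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
```

Keeping the ensemble to exactly two members mirrors the paper's stated preference for simple combinations that are less prone to overfitting.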