In this section, we investigate the performance of the unimodal, cross-modal, and proposed ensemble AVDFD networks and report the obtained results. First, we briefly describe the benchmark datasets employed for evaluation. Then we describe the evaluation metrics and implementation details. Next, we detail the experiments performed to analyze the effectiveness of the proposed approach in terms of audio-visual deepfake video identification accuracy, robustness, and generalization, and discuss the obtained results. Finally, we present a comparative analysis with existing approaches.
4.1 Dataset
Deepfake datasets are essential for validating the effectiveness and performance of a detection model. As our proposed approach is multimodal, we considered publicly available audio-visual deepfake datasets for evaluation. Most currently available benchmark datasets for deepfake video detection include only a visual component, with no voice signal present; moreover, audio deepfakes are uncommon in most of them. For performance analysis, we therefore utilized datasets that also contain an audio signal: FakeAVCeleb [36], the Deepfake Detection Challenge (DFDC) [9], and DeepfakeTIMIT (DF-TIMIT) [76]. FakeAVCeleb is a recently released multimodal dataset with a total of 500 real videos and 19,500 manipulated videos. It contains fake videos with manipulated faces along with synthesized cloned voices, generated using different face-swapping, face-reenactment, and audio synthesis techniques. The overall dataset contains multimodal manipulated videos belonging to five different manipulation categories with diverse ethnicity and gender, and is separated into four groups: real video and real audio (VRAR), real video and fake audio (VRAF), fake video and real audio (VFAR), and fake video and fake audio (VFAF). In this dataset, the number of samples is highly unbalanced between the real and fake classes; we supplemented it with 16,00 real video training samples from the VoxCeleb1 dataset [77]. The DFDC dataset is currently the largest available deepfake detection dataset, containing around 100,000 manipulated videos and 23,000 original samples spanning 960 different identities. DFDC-preview is the initial version of the DFDC dataset, containing 1,131 real videos and 4,119 manipulated videos, and is available on Kaggle. The details of the manipulation algorithms are not revealed. The dataset includes both manipulated audio and video; however, it labels the entire video as fake.
It does not indicate whether the visual component or the speech is manipulated in a video. Additionally, the synthetic audio is not lip-synced with the corresponding videos, and a video is labeled fake even if the original voice is merely replaced with a different person's real voice. The other dataset used is DF-TIMIT, which contains a total of 620 fake videos created from 32 subjects. The dataset is divided into two subsets of fake videos of different qualities: a low-quality (LQ) set with 320 videos at 64 × 64 resolution and a high-quality (HQ) set with a 128 × 128 frame size. However, this dataset contains facially manipulated videos only; the audio channel of none of the videos has been altered. We use all videos from the DF-TIMIT database for evaluation. Due to these limitations, we used the FakeAVCeleb dataset for training and testing, as it contains multimodal videos with manipulations in both the audio and video modalities. The selected videos were split into an 85:15 train-test split. To analyze cross-dataset generalization, we used the DFDC-preview and DF-TIMIT datasets. It is worth mentioning that the videos used for evaluation are diverse in nature, generated using varying manipulation techniques and categories.
4.2 Implementation details
The overall framework is implemented using the TensorFlow and Keras libraries. We trained all three networks in an end-to-end manner using the AdamW optimizer and a binary cross-entropy (log) loss that predicts the probability of a sample being fake. The initial learning rate is set to 1×10⁻² and steadily decreases, with a weight decay of 0.2. Training and testing are performed on an Nvidia RTX 3090 GPU. We split each video into small segments of 2 s for processing. The sampling rate is 20 FPS for the video sequence and 16 kHz for the audio sequence. We employ the face detector in [78] to detect, crop, and align faces over time. The inputs to the video and audio networks are face frames of dimension 224 × 224 × 3 and their corresponding audio mel-spectrograms, respectively.
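As a concrete illustration of the segmentation described above, the sketch below computes how many aligned 2 s segments a clip yields at 20 FPS video and 16 kHz audio. The function name and the idea of aligning the two streams by truncation are our own assumptions, not the paper's code:

```python
# Constants follow the preprocessing described in the text.
SEGMENT_SECONDS = 2
VIDEO_FPS = 20        # frames per second for the visual stream
AUDIO_SR = 16_000     # audio sampling rate in Hz

def split_into_segments(n_video_frames: int, n_audio_samples: int):
    """Return the number of aligned 2 s segments available in a clip.

    Hypothetical helper: segments are counted by integer division, so any
    trailing partial segment is dropped in both streams.
    """
    frames_per_seg = SEGMENT_SECONDS * VIDEO_FPS    # 40 face frames
    samples_per_seg = SEGMENT_SECONDS * AUDIO_SR    # 32,000 audio samples
    n_segs = min(n_video_frames // frames_per_seg,
                 n_audio_samples // samples_per_seg)
    return n_segs, frames_per_seg, samples_per_seg

# A 10 s clip with matching audio yields five aligned 2 s segments.
n_segs, fpseg, spseg = split_into_segments(10 * VIDEO_FPS, 10 * AUDIO_SR)
```

Each resulting segment would then provide 40 face crops of 224 × 224 × 3 for the video network and a mel-spectrogram of its 32,000 audio samples for the audio network.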
4.3 Evaluation parameters
We report results using various metrics: recall (R), precision (P), F1-score, accuracy (ACC), and area under the ROC curve (AUC). We chose these metrics because they are commonly used in the deepfake detection literature and are therefore suitable for comparison. R, also known as the true positive rate (TPR), estimates the proportion of deepfakes that are accurately detected as fake, i.e., R = TP/(TP + FN). P measures the proportion of samples predicted as fake that are actually fake, i.e., P = TP/(TP + FP); a low precision indicates that the model frequently misclassifies real videos as deepfakes. The F1-score is the harmonic mean of P and R. ACC measures the percentage of correctly identified observations. The AUC summarizes the classification performance of the model across decision thresholds ranging from 0 to 1. The F1-score and ACC are defined as:
$$f1=\frac{2\times R\times P}{R+P}\tag{10}$$

$$ACC=\frac{TP+TN}{TP+TN+FP+FN}\tag{11}$$
Where TP (true positives) and TN (true negatives) represent correctly identified deepfake and real samples, respectively, and FP (false positives) and FN (false negatives) denote real samples misclassified as fake and fake samples misclassified as real, respectively. Higher AUC and ACC values indicate that the model can more accurately discriminate between pristine and deepfake videos.
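For reference, all four metrics can be computed directly from the confusion-matrix counts. This minimal sketch uses the standard definitions P = TP/(TP+FP) and R = TP/(TP+FN) together with Eqs. (10) and (11):

```python
def metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute precision, recall, F1, and accuracy from raw counts."""
    p = tp / (tp + fp)                      # precision
    r = tp / (tp + fn)                      # recall / true positive rate
    f1 = 2 * r * p / (r + p)                # harmonic mean of P and R, Eq. (10)
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy, Eq. (11)
    return p, r, f1, acc

# Example: 90 fakes caught, 80 reals kept, 10 reals flagged, 20 fakes missed.
p, r, f1, acc = metrics(tp=90, tn=80, fp=10, fn=20)
```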
4.4 Evaluation of Audio-Visual deepfake Detection
A reliable video forensic analysis model must be able to accurately identify multimodal manipulations. In this section, we discuss the quantitative assessment of the proposed unimodal, cross-modal, and ensemble networks using several standard metrics to demonstrate audio-visual deepfake video recognition performance. We conducted a comprehensive analysis of the individual models to understand the extent to which each can identify manipulation in its respective modality, so that they can be effectively combined into an ensemble for improved performance. Three assessment methodologies were used in this experiment: unimodal, where the analysis is performed on the individual audio and video networks, considering a single modality at a time; cross-modal, which considers the Audio-Visual network that integrates both modalities for prediction; and, finally, the ensemble framework (AVDFDNet), which combines the predictions of the unimodal and cross-modal networks. We utilized the FakeAVCeleb dataset for evaluation, following the original training, validation, and testing split. Except for FakeAVCeleb and DFDC, all other datasets include facial alterations only, making them unsuitable for the audio-visual deepfake detection task, while the DFDC dataset does not provide separate audio and video labels and is therefore unsuitable for training a detector to identify multimodal deepfakes. For training the video network, we used videos from the VRAR set as the real class and videos from VFAF and VFAR as the fake class, since these contain a manipulated video modality. For the audio network, we used videos from the VRAR set for the real audio class and the VRAF and VFAF sets for the fake audio class. For training the Audio-Visual network, we used all four sets from the dataset, i.e., VRAR, VRAF, VFAR, and VFAF: VRAR for the real class and VRAF, VFAR, and VFAF for the fake class.
A video is considered fake if any of its modalities, i.e., audio, video, or both, is manipulated; otherwise it is real. We report the binary classification results in terms of recall, precision, F1-score, and accuracy.
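The video-level labeling rule above reduces to a one-line mapping from the FakeAVCeleb subset names to binary labels; the helper name is illustrative:

```python
def video_label(subset: str) -> str:
    """Map a FakeAVCeleb subset name to the binary video-level label.

    Only VRAR (real video, real audio) is real; any manipulated
    modality (VRAF, VFAR, VFAF) makes the whole video fake.
    """
    return "real" if subset == "VRAR" else "fake"

labels = {s: video_label(s) for s in ["VRAR", "VRAF", "VFAR", "VFAF"]}
```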
Table 2 presents the detection results for all the models mentioned above. From the results, it can be observed that the proposed ensemble AVDFDNet consistently outperforms the other models on fake video detection. In terms of accuracy, the audio-only and video-only deepfake classifiers achieved 83.40% and 90.79%, respectively, while using both audio and visual information resulted in more accurate fake video detection, i.e., 92.64%, which shows that audio-visual correspondence is an informative cue for deepfake detection. In particular, the cross-modal Audio-Visual network identifies deepfake videos by learning the intrinsic consistency between the audio and visual modalities using the speech signal and the corresponding facial movements. Overall, by combining the predictions of all networks through ensemble learning, our AVDFDNet achieved an accuracy of 94.08%, which is 5.15% higher than the average of the individual network predictions. The independently trained audio and video networks are unable to learn the correlation between the audio and visual modalities, which results in lower identification accuracy. In deepfake videos, manipulation can occur in any modality, producing disharmony between the audio and visual modalities as well as inconsistencies within modalities, which our proposed cross-modal Audio-Visual network effectively captures. The results in Table 2 also indicate that in terms of F1-score, AVDFDNet outperforms the individual networks (the unimodal audio and video networks and the cross-modal Audio-Visual network) by a large margin, demonstrating that diversity in the individual network predictions helps achieve high balanced accuracy and confirming the efficacy of our ensemble approach.
Table 2
Evaluation of audio and video deepfake detection using the proposed models on the FakeAVCeleb dataset.
Model | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
Video Network | 88.97 | 93.12 | 91.00 | 90.79 |
Audio Network | 81.59 | 86.25 | 83.86 | 83.40 |
Audio-Visual Network | 95.96 | 89.03 | 92.36 | 92.64 |
AVDFDNet (proposed) | 92.91 | 95.43 | 94.15 | 94.08 |
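The ensemble combination evaluated above can be sketched as soft voting over the three networks' fake probabilities. Equal weighting and a 0.5 threshold are illustrative assumptions, not necessarily the paper's exact fusion rule:

```python
import numpy as np

def ensemble_predict(p_audio, p_video, p_av, threshold=0.5):
    """Soft-voting ensemble: average the three fake probabilities
    and threshold. Returns 1 for fake, 0 for real."""
    probs = np.stack([p_audio, p_video, p_av], axis=0)   # shape (3, N)
    p_fake = probs.mean(axis=0)                          # equal weights
    return (p_fake >= threshold).astype(int)

# Two test videos: the first looks real to all networks, the second fake.
preds = ensemble_predict(np.array([0.2, 0.9]),
                         np.array([0.4, 0.8]),
                         np.array([0.3, 0.7]))
```

Averaging lets a confident cross-modal score outvote an uncertain unimodal one, which is one way the diversity of the three networks could translate into the higher balanced accuracy reported in Table 2.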
Although deepfake detectors may show good identification accuracy, they may nevertheless have low recall. A low recall value indicates that the model incorrectly categorized many fake videos as real, failing at the primary task of identifying fake videos; a high recall rate is therefore preferred for deepfake detection models, since even if some real videos are misidentified, it is preferable to catch more deepfakes than to miss them entirely. To better understand the deepfake recognition performance of our framework, we plotted the confusion matrices of all networks for unimodal, cross-modal, and ensemble deepfake video detection on the FakeAVCeleb dataset. Figure 5 presents the confusion matrix of each network, showing the percentage of correctly and incorrectly identified deepfake videos against the ground truth. The diagonal values of each matrix show the percentage of model predictions that correctly match the class of the test data, whereas the off-diagonal values represent wrong predictions. From Fig. 5, it can be observed that the best classification is provided by AVDFDNet, with the highest separation between the fake and real classes, i.e., 95.43%. Comparatively, the cross-modal Audio-Visual network provides moderate results using the combined audio-visual modalities, with a correct classification rate of 89.03%. Some instances of fake videos are misclassified as real and vice versa, which could be due to the absence of evident discrepancies between the video and audio channels in some manipulation cases, such as expression-manipulated videos. Similarly, the video-only and audio-only networks showed 93.12% and 86.25% correct classifications for the fake class and 88.46% and 80.54% for the real class, respectively.
In the case of unimodal classification, the test set contains fake video clips with audio alteration but no visual manipulation, or vice versa, which is the cause of the low recall.
The ROC curves shown in Fig. 6 compare the performance of the models and demonstrate how well the individual networks performed on the deepfake and real data distributions. The ROC curve depicts the trade-off between the TPR and FPR of each model at various classification thresholds. The y-axis (TPR) indicates how well a model succeeded in accurately detecting deepfakes, while the x-axis (FPR) indicates the extent to which the model mistakenly identified real videos as deepfakes. As the curve approaches the top-left corner, the model's ability to identify both real and fake data improves. To facilitate the interpretation of these curves, we compute the AUC to quantify more precisely the ability of each model to distinguish between fake and pristine videos. Figure 6 shows that the AVDFDNet model achieves the highest results, with an AUC of 0.940, whereas the Audio-Visual network and the audio-only network show lower performance, i.e., AUC < 0.90. Overall, the proposed approach shows a 6.1% improvement in AUC for the identification of audio-visual deepfake videos. This can be explained by the fact that the unimodal networks cannot consider all modalities concurrently and therefore have lower identification accuracy. The cross-modal network captures the correlated properties arising from the interaction between the audio and visual modalities and achieves better results. Our proposed AVDFDNet employs an ensemble approach that combines decisions from multiple fake detection networks utilizing single and multiple modalities, resulting in improved overall performance.
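As a reference for how the AUC values above are obtained, the AUC can be computed directly from scores as the rank (Mann-Whitney) statistic: the probability that a randomly chosen fake sample scores above a randomly chosen real one. This pure-NumPy sketch is illustrative, not the paper's evaluation code:

```python
import numpy as np

def auc_score(scores, labels):
    """AUC via the rank statistic: P(score of a fake > score of a real),
    counting ties as 0.5. labels: 1 = fake, 0 = real."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]            # fake-class scores
    neg = scores[labels == 0]            # real-class scores
    diff = pos[:, None] - neg[None, :]   # all fake/real score pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Three of the four fake/real pairs are correctly ranked -> AUC = 0.75.
auc = auc_score([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```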
4.5 Evaluation on different audio-visual manipulations types
Deepfake videos can be generated using various audio-visual manipulation methods, and a deepfake detector should be able to recognize the different types of manipulation performed. To analyze the efficacy of the proposed AVDFDNet, we evaluated it on deepfake videos generated using different visual and audio manipulation methods. The FakeAVCeleb dataset includes a number of multimodal deepfake subsets generated using various audio-visual manipulation techniques, such as identity swap (Faceswap, Faceswap-wav2lip), expression swap (Fsgan, Fsgan-wav2lip), lip-syncing (wav2lip), and cloned speech (RTVC). We performed experiments on each subset to analyze the generalization of the proposed AVDFDNet across these multimodal manipulation techniques, reporting ACC, R, P, and F1-score. The results obtained on the different manipulations using the AVDFDNet model are presented in Table 3.
Table 3
Evaluation of the proposed approach on different audio-visual deepfake generation techniques.
Generative Methods | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
Faceswap | 94.45 | 98.12 | 96.25 | 97.51 |
Faceswap-wav2lip | 97.41 | 98.91 | 98.15 | 98.11 |
Fsgan | 88.01 | 86.17 | 87.08 | 86.94 |
Fsgan-wav2lip | 96.34 | 95.12 | 95.73 | 95.33 |
RTVC | 95.38 | 94.36 | 94.87 | 94.48 |
wav2lip | 89.56 | 87.98 | 88.76 | 88.42 |
It can be observed from the results that the proposed ensemble network performs best on videos from the Faceswap, Faceswap-wav2lip, and Fsgan-wav2lip sets, with the highest accuracies of 97.51%, 98.11%, and 95.33%, respectively. This is likely because videos in these sets contain manipulations in both modalities, which results in inconsistencies both across and within modalities compared to authentic audio-visual pairs. In comparison, the proposed approach attained its lowest accuracies on the Fsgan and wav2lip sets, with values of 86.94% and 88.42%, respectively. The wav2lip and Fsgan sets contain manipulations in the visual modality only, modifying the pose, expression, or mouth movement of the target speaker. We assume that the lower performance on these sets results from the predominance of the audio modality in the final outcome, which weakens the capability of fake video discrimination. Overall, the proposed AVDFDNet provides balanced results for the detection of fake audio-visual videos due to its ensemble prediction using multiple networks.
4.6 Generalization Evaluation
To be utilized in the real world, a deepfake detection system must have strong inference performance on deepfake videos created by unseen methods. Detection algorithms that focus on specific artifacts fail to generalize to unknown forgery classes, since artifact patterns change across manipulation methods. Deepfake detection algorithms trained on specific datasets may overfit to manipulation-specific artifacts while failing to identify a common trait transferable to new methods, which causes low performance in cross-dataset evaluation. As a result, domain generalization is an important measure for deepfake detection algorithms. In this experiment, we examine the generalizability of the proposed AVDFDNet in recognizing deepfakes that were not included in the training. For this purpose, we evaluated the model trained on the FakeAVCeleb dataset using the DFDC and DF-TIMIT datasets. The generalization results in terms of accuracy are presented as a boxplot in Fig. 7, where the x-axis represents the databases and the y-axis shows the attained accuracy. The boxplot shows the distribution of the model's classification output across the databases in terms of quartiles, median, maximum, minimum, and outlier values. The proposed approach achieved an accuracy greater than 75% on both datasets, with especially strong results on DF-TIMIT (HQ), i.e., 88.44%. This is because the DF-TIMIT dataset contains videos manipulated through the face-swap technique with frontal face poses and without occlusion or rotation, which are accurately recognized by the detector. However, the performance drops slightly to 85.03% on the DF-TIMIT (LQ) set due to the low quality of the videos. In addition, our method achieves relatively low accuracy, i.e., 76.62%, on the DFDC dataset. The reason could be that videos in DFDC are partially forged and contain varied facial/head poses, lighting conditions, and backgrounds.
Moreover, DFDC contains videos labeled as fake even when only the voice is replaced with a different person's real voice. Overall, the proposed approach detects fake content with improved accuracy and reliability. Our method employs an ensemble that combines decisions from multiple models using attention-based unimodal and cross-modal learning, providing better performance for fake video detection.
4.7 Robustness Evaluation
In real-world situations, a video may undergo a variety of disturbances that might lead a model to make a false prediction. For effective deepfake detection in real-world scenarios, a model should be robust to the various kinds of noise inserted into a video, which commonly happens to deepfake videos due to compression artifacts. We conducted an experiment to evaluate the robustness of the proposed technique, randomly inserting different types of perturbations, such as Gaussian noise, blur, and video compression (each with 0.5 probability), into the FakeAVCeleb test set videos. Figure 8 shows the accuracy of AVDFDNet on seen and unseen perturbed videos. The accuracy of the proposed network drops from the original 94.08% to 88.24%, i.e., a 5.84% drop, which indicates the resilience of the proposed approach in dealing with real-world settings, including compressed videos on the Internet. The features learned by the ensemble AVDFDNet, comprising unimodal and cross-modal networks, generalize extremely well; the integration of the audio-visual signal with individual-modality processing strengthens robustness to noise. It is worth noting that these results were obtained without retraining the model on an augmented dataset containing the mentioned distortions. As video distortions indicate a cause of weakness for detection approaches, we also analyzed the model after retraining with data augmentation applied. The findings indicate that after retraining on the augmented dataset the test accuracy increased by 3.57%, to 91.81%, which shows the robustness of the proposed approach against various video distortions.
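The perturbation protocol described above can be sketched as follows; the noise scale, the simple three-tap box blur, and the omission of video compression (which requires a codec) are our own simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for a reproducible sketch

def perturb(frames: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Apply each perturbation independently with probability p to a
    stack of video frames shaped (T, H, W, C) with pixel values 0-255."""
    out = frames.astype(float)
    if rng.random() < p:                 # additive Gaussian noise
        out = out + rng.normal(0.0, 5.0, size=out.shape)
    if rng.random() < p:                 # crude horizontal box blur
        out = (np.roll(out, 1, axis=2) + out + np.roll(out, -1, axis=2)) / 3.0
    return np.clip(out, 0, 255)

# One 2 s segment at 20 FPS: 40 frames of 224 x 224 x 3.
frames = np.full((40, 224, 224, 3), 128.0)
noisy = perturb(frames)
```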
4.8 Comparison to existing approaches
In this section, we perform a comparative analysis of the proposed approach with existing methods. We considered several state-of-the-art methods utilizing a single modality (visual information only) and both audio-visual modalities to analyze their effectiveness for deepfake video detection. Table 4 summarizes the outcomes of our approach as well as the other methods on the FakeAVCeleb benchmark dataset; we used the results reported by these approaches on this dataset for comparison. From the results, it can be noticed that using both the audio and video modalities for deepfake video detection generalizes far better than utilizing only a single (visual) modality [27]. The proposed approach outperforms the existing methods, attaining an accuracy of 94.08% on audio-visual deepfakes. In comparison, the multimodal approach in [75] achieved the second-best results for the audio-visual deepfake detection task. In [75], an ensemble-based audio and visual network is suggested for multimodal deepfake detection; however, this approach relies on a late fusion of audio-video features for prediction and fails to consider the intricacies of the inherent correlation pattern between modalities. The work in [12] also employed ensemble and multimodal models for manipulation identification; however, these methods yield poor results. The method in [27] employed high-level characteristics such as lip movements to deal with temporal inconsistency and attained an accuracy of 76%. Other approaches such as [42] utilized both audio-visual modalities for deepfake identification but are unable to identify manipulation in the audio modality. These approaches [27, 42] employ authentic audio as a reference; the audio in their videos is not synthesized using any fake audio generation technique. Chung et al. [18] proposed an audio-visual dissonance-based model for deepfake video detection.
However, that approach showed an accuracy of 69% on the FakeAVCeleb dataset [75], demonstrating that using only spatio-temporal audio and visual information is insufficient for the task of multimodal forgery detection. Overall, the existing work achieves accuracies ranging between 69% and 89%, which is low compared to the proposed work. In comparison, our method employs unimodal and cross-modal networks that consider both the audio and visual modalities of a video and can identify fake videos with both facial and speech manipulations. Our approach effectively learns the inconsistencies within the audio and visual modalities as well as across modalities and thus yields improved performance compared to the existing state of the art. The cross-modality attention contributes toward learning the underlying semantic patterns between speech and facial movements to distinguish between original and manipulated videos. Moreover, we combine the predictions of the individual networks using ensemble learning, which results in robust classification. The obtained results indicate the effectiveness of our method, which combines spatio-temporal audio and visual features with high-level synchronization patterns between modalities, providing strong clues for identifying forged videos.
Table 4
Comparison with existing unimodal and multimodal deepfake detection methods.
Method | Modality | Class | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
VGG-16 [12] | V | Real | 69.35 | 89.66 | 78.21 | 81.03 |
 | | Fake | 87.24 | 77.50 | 82.08 | |
LipForensics [27] | V | Real | 70 | 91 | 80 | 76 |
 | | Fake | 88 | 61 | 72 | |
Xception-c23 [40] | V | Real | 66.54 | 77.08 | 71.43 | 73.06 |
 | | Fake | 79.73 | 69.93 | 74.51 | |
VGG-16 [12] | AV | Real | 69.35 | 89.66 | 78.21 | 78.04 |
 | | Fake | 89.48 | 68.94 | 77.88 | |
MesoInception [40] | AV | Real | 64.51 | 50.70 | 73.37 | 72.87 |
 | | Fake | 84.40 | 63.32 | 72.35 | |
AV-Dissonance [18] | AV | Real | 62 | 99 | 76 | 69 |
 | | Fake | 94 | 40 | 57 | |
Multimodal (MesoNet18 and 2DCNN) [75] | AV | Real | 83 | 99 | 90 | 89 |
 | | Fake | 98 | 80 | 88 | |
AVDFDNet (proposed) | AV | Real | 89.95 | 92.34 | 91.13 | 94.08 |
 | | Fake | 95.83 | 96.14 | 95.98 | |