In this section, we investigate the performance of the unimodal, cross-modal, and proposed ensemble AVDFD networks and report the obtained results. First, we briefly describe the benchmark datasets employed for evaluation. Then we describe the evaluation metrics and implementation details. Next, we detail the experiments performed to analyze the effectiveness of the proposed approach in terms of audio-visual deepfake video identification accuracy, robustness, and generalization, and discuss the obtained results. Finally, we present a comparative analysis with existing approaches.
4.1 Dataset
Deepfake datasets are essential for validating the effectiveness and performance of a detection model. As our proposed approach is multimodal, we considered publicly available audio-visual deepfake datasets for evaluation. Most currently available benchmark datasets for deepfake video detection include only a visual component, with no voice signal present; moreover, audio deepfakes are uncommon in most of them. For performance analysis, we therefore utilized datasets that also contain an audio signal: FakeAVCeleb [36], the Deepfake Detection Challenge (DFDC) [9], and DeepfakeTIMIT (DF-TIMIT) [76]. FakeAVCeleb is a recently released multimodal dataset with a total of 500 real videos and 19,500 manipulated videos. It contains fake videos with manipulated faces along with synthesized cloned voices, generated using different face-swapping, face-reenactment, and audio synthesis techniques. The overall dataset contains multimodal manipulated videos belonging to five different manipulation categories with diverse ethnicity and gender, and is separated into four groups: real video and real audio (VRAR), real video and fake audio (VRAF), fake video and real audio (VFAR), and fake video and fake audio (VFAF). In this dataset, the number of samples is highly unbalanced between the real and fake classes; we supplemented it with 16,00 real video training samples from the VoxCeleb1 dataset [77]. The DFDC dataset is currently the largest available deepfake detection dataset, containing around 100,000 manipulated videos and 23,000 original samples spanning 960 different identities. DFDC-preview is the initial version of the DFDC dataset, containing 1,131 real videos and 4,119 manipulated videos, and is available on Kaggle. The details of the manipulation algorithms are not revealed. The dataset includes both manipulated audio and video; however, it labels the entire video as fake.
It does not indicate whether the visual component or the speech is manipulated in a video. Additionally, the synthetic audio is not lip-synced with the corresponding videos, and a video is labeled fake even if the original voice is merely replaced with a different person's real voice. The other dataset used is DF-TIMIT, which contains a total of 620 fake videos created from 32 subjects. The dataset is divided into two subsets of fake videos of different qualities: a low-quality (LQ) set with 320 videos at 64 × 64 resolution and a high-quality (HQ) set with a 128 × 128 frame size. However, this dataset contains facially manipulated videos only; the audio channel of none of the videos has been altered. We use all videos from the DF-TIMIT database for evaluation. Due to these limitations, we used the FakeAVCeleb dataset for training and testing, as it contains multimodal videos with manipulations in both the audio and video modalities. The selected videos were split into an 85:15 train-test split. To analyze cross-dataset generalization, we used the DFDC-preview and DF-TIMIT datasets. It is worth mentioning that the videos used for evaluation are diverse in nature, generated using varying manipulation techniques and categories.
4.2 Implementation details
The overall framework is implemented using the TensorFlow and Keras libraries. We trained all three networks in an end-to-end manner using the AdamW optimizer and a binary cross-entropy (log) loss that predicts the probability of a sample being fake. The initial learning rate is set to 1×10⁻² and steadily decreases, with a weight decay of 0.2. Training and testing are performed on an Nvidia RTX 3090 GPU. We split each video into small segments of 2 s for processing. The sampling rate is 20 FPS for the video sequence and 16 kHz for the audio sequence. We employ the face detector in [78] to detect, crop, and align faces over time. The inputs to the video and audio networks are face frames of dimension 224 × 224 × 3 and their corresponding audio mel-spectrograms, respectively.
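As a concrete illustration of the segmentation described above, the sketch below computes how many aligned 2 s segments a clip yields at 20 FPS video and 16 kHz audio. The function name and the idea of aligning the two streams by truncation are our own assumptions, not the paper's code:

```python
# Constants follow the preprocessing described in the text.
SEGMENT_SECONDS = 2
VIDEO_FPS = 20        # frames per second for the visual stream
AUDIO_SR = 16_000     # audio sampling rate in Hz

def split_into_segments(n_video_frames: int, n_audio_samples: int):
    """Return the number of aligned 2 s segments available in a clip.

    Hypothetical helper: segments are counted by integer division, so any
    trailing partial segment is dropped in both streams.
    """
    frames_per_seg = SEGMENT_SECONDS * VIDEO_FPS    # 40 face frames
    samples_per_seg = SEGMENT_SECONDS * AUDIO_SR    # 32,000 audio samples
    n_segs = min(n_video_frames // frames_per_seg,
                 n_audio_samples // samples_per_seg)
    return n_segs, frames_per_seg, samples_per_seg

# A 10 s clip with matching audio yields five aligned 2 s segments.
n_segs, fpseg, spseg = split_into_segments(10 * VIDEO_FPS, 10 * AUDIO_SR)
```

Each resulting segment would then provide 40 face crops of 224 × 224 × 3 for the video network and a mel-spectrogram of its 32,000 audio samples for the audio network.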
4.3 Evaluation parameters
We report results using various metrics: recall (R), precision (P), F1-score, accuracy (ACC), and area under the ROC curve (AUC). We chose these metrics because they are commonly used in the deepfake detection literature and are therefore suitable for comparison. R, also known as the true positive rate (TPR), estimates the proportion of deepfakes that are accurately detected as fake, i.e., R = TP/(TP + FN). P measures the proportion of samples predicted as fake that are actually fake, i.e., P = TP/(TP + FP); a low precision indicates that the model frequently misclassifies real videos as deepfakes. The F1-score is the harmonic mean of P and R. ACC measures the percentage of correctly identified observations. The AUC summarizes the classification performance of the model across decision thresholds ranging from 0 to 1. The F1-score and ACC are defined as:
$$f1=\frac{2\times R\times P}{R+P}\tag{10}$$

$$ACC=\frac{TP+TN}{TP+TN+FP+FN}\tag{11}$$
Where TP (true positives) and TN (true negatives) represent correctly identified deepfake and real samples, respectively, and FP (false positives) and FN (false negatives) denote real samples misclassified as fake and fake samples misclassified as real, respectively. Higher AUC and ACC values indicate that the model can more accurately discriminate between pristine and deepfake videos.
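For reference, all four metrics can be computed directly from the confusion-matrix counts. This minimal sketch uses the standard definitions P = TP/(TP+FP) and R = TP/(TP+FN) together with Eqs. (10) and (11):

```python
def metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute precision, recall, F1, and accuracy from raw counts."""
    p = tp / (tp + fp)                      # precision
    r = tp / (tp + fn)                      # recall / true positive rate
    f1 = 2 * r * p / (r + p)                # harmonic mean of P and R, Eq. (10)
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy, Eq. (11)
    return p, r, f1, acc

# Example: 90 fakes caught, 80 reals kept, 10 reals flagged, 20 fakes missed.
p, r, f1, acc = metrics(tp=90, tn=80, fp=10, fn=20)
```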
4.4 Evaluation of Audio-Visual deepfake Detection
A reliable video forensic analysis model must be able to accurately identify multimodal manipulations. In this section, we discuss the quantitative assessment of the proposed unimodal, cross-modal, and ensemble networks using several standard metrics to demonstrate audio-visual deepfake video recognition performance. We conducted a comprehensive analysis of the individual models to understand the extent to which each can identify manipulation in its respective modality, so that they can be effectively combined into an ensemble for improved performance. Three assessment methodologies were used in this experiment: unimodal, where the analysis is performed on the individual audio and video networks, considering a single modality at a time; cross-modal, which considers the Audio-Visual network that integrates both modalities for prediction; and, finally, the ensemble framework (AVDFDNet), which combines the predictions of the unimodal and cross-modal networks. We utilized the FakeAVCeleb dataset for evaluation, following the original training, validation, and testing split. Except for FakeAVCeleb and DFDC, all other datasets include facial alterations only, making them unsuitable for the audio-visual deepfake detection task, while the DFDC dataset does not provide separate audio and video labels and is therefore unsuitable for training a detector to identify multimodal deepfakes. For training the video network, we used videos from the VRAR set as the real class and videos from VFAF and VFAR as the fake class, since these contain a manipulated video modality. For the audio network, we used videos from the VRAR set for the real audio class and the VRAF and VFAF sets for the fake audio class. For training the Audio-Visual network, we used all four sets from the dataset, i.e., VRAR, VRAF, VFAR, and VFAF: VRAR for the real class and VRAF, VFAR, and VFAF for the fake class.
A video is considered fake if any of its modalities, i.e., audio, video, or both, is manipulated; otherwise it is real. We report the binary classification results in terms of recall, precision, F1-score, and accuracy.
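The video-level labeling rule above reduces to a one-line mapping from the FakeAVCeleb subset names to binary labels; the helper name is illustrative:

```python
def video_label(subset: str) -> str:
    """Map a FakeAVCeleb subset name to the binary video-level label.

    Only VRAR (real video, real audio) is real; any manipulated
    modality (VRAF, VFAR, VFAF) makes the whole video fake.
    """
    return "real" if subset == "VRAR" else "fake"

labels = {s: video_label(s) for s in ["VRAR", "VRAF", "VFAR", "VFAF"]}
```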
Table 2 presents the detection results for all the models mentioned above. From the results, it can be observed that the proposed ensemble AVDFDNet consistently outperforms the other models on fake video detection. In terms of accuracy, the audio-only and video-only deepfake classifiers achieved 83.40% and 90.79%, respectively, while using both audio and visual information resulted in more accurate fake video detection, i.e., 92.64%, which shows that audio-visual correspondence is an informative cue for deepfake detection. In particular, the cross-modal Audio-Visual network identifies deepfake videos by learning the intrinsic consistency between the audio and visual modalities using the speech signal and the corresponding facial movements. Overall, by combining the predictions of all networks through ensemble learning, our AVDFDNet achieved an accuracy of 94.08%, which is 5.15% higher than the average of the individual network predictions. The independently trained audio and video networks are unable to learn the correlation between the audio and visual modalities, which results in lower identification accuracy. In deepfake videos, manipulation can occur in any modality, producing disharmony between the audio and visual modalities as well as inconsistencies within modalities, which our proposed cross-modal Audio-Visual network effectively captures. The results in Table 2 also indicate that in terms of F1-score, AVDFDNet outperforms the individual networks (the unimodal audio and video networks and the cross-modal Audio-Visual network) by a large margin, demonstrating that diversity in the individual network predictions helps achieve high balanced accuracy and confirming the efficacy of our ensemble approach.
Table 2
Evaluation of audio and video deepfake detection using the proposed models on the FakeAVCeleb dataset.
Model | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
Video Network | 88.97 | 93.12 | 91.00 | 90.79 |
Audio Network | 81.59 | 86.25 | 83.86 | 83.40 |
Audio-Visual Network | 95.96 | 89.03 | 92.36 | 92.64 |
AVDFDNet (proposed) | 92.91 | 95.43 | 94.15 | 94.08 |
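The ensemble combination evaluated above can be sketched as soft voting over the three networks' fake probabilities. Equal weighting and a 0.5 threshold are illustrative assumptions, not necessarily the paper's exact fusion rule:

```python
import numpy as np

def ensemble_predict(p_audio, p_video, p_av, threshold=0.5):
    """Soft-voting ensemble: average the three fake probabilities
    and threshold. Returns 1 for fake, 0 for real."""
    probs = np.stack([p_audio, p_video, p_av], axis=0)   # shape (3, N)
    p_fake = probs.mean(axis=0)                          # equal weights
    return (p_fake >= threshold).astype(int)

# Two test videos: the first looks real to all networks, the second fake.
preds = ensemble_predict(np.array([0.2, 0.9]),
                         np.array([0.4, 0.8]),
                         np.array([0.3, 0.7]))
```

Averaging lets a confident cross-modal score outvote an uncertain unimodal one, which is one way the diversity of the three networks could translate into the higher balanced accuracy reported in Table 2.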
Although deepfake detectors may show good identification accuracy, they may nevertheless have low recall. A low recall value indicates that the model incorrectly categorized many fake videos as real, failing at the primary task of identifying fake videos; a high recall rate is therefore preferred for deepfake detection models, since even if some real videos are misidentified, it is preferable to catch more deepfakes than to miss them entirely. To better understand the deepfake recognition performance of our framework, we plotted the confusion matrices of all networks for unimodal, cross-modal, and ensemble deepfake video detection on the FakeAVCeleb dataset. Figure 5 presents the confusion matrix of each network, showing the percentage of correctly and incorrectly identified deepfake videos against the ground truth. The diagonal values of each matrix show the percentage of model predictions that correctly match the class of the test data, whereas the off-diagonal values represent wrong predictions. From Fig. 5, it can be observed that the best classification is provided by AVDFDNet, with the highest separation between the fake and real classes, i.e., 95.43%. Comparatively, the cross-modal Audio-Visual network provides moderate results using the combined audio-visual modalities, with a correct classification rate of 89.03%. Some instances of fake videos are misclassified as real and vice versa, which could be due to the absence of evident discrepancies between the video and audio channels in some manipulation cases, such as expression-manipulated videos. Similarly, the video-only and audio-only networks showed 93.12% and 86.25% correct classifications for the fake class and 88.46% and 80.54% for the real class, respectively.
In the case of unimodal classification, the test set contains fake video clips with audio alteration but no visual manipulation, or vice versa, which is the cause of the low recall.
The ROC curves shown in Fig. 6 compare the performance of the models and demonstrate how well the individual networks performed on the deepfake and real data distributions. The ROC curve depicts the trade-off between the TPR and FPR of each model at various classification thresholds. The y-axis (TPR) indicates how well a model succeeded in accurately detecting deepfakes, while the x-axis (FPR) indicates the extent to which the model mistakenly identified real videos as deepfakes. As the curve approaches the top-left corner, the model's ability to identify both real and fake data improves. To facilitate the interpretation of these curves, we compute the AUC to quantify more precisely the ability of each model to distinguish between fake and pristine videos. Figure 6 shows that the AVDFDNet model achieves the highest results, with an AUC of 0.940, whereas the Audio-Visual network and the audio-only network show lower performance, i.e., AUC < 0.90. Overall, the proposed approach shows a 6.1% improvement in AUC for the identification of audio-visual deepfake videos. This can be explained by the fact that the unimodal networks cannot consider all modalities concurrently and therefore have lower identification accuracy. The cross-modal network captures the correlated properties arising from the interaction between the audio and visual modalities and achieves better results. Our proposed AVDFDNet employs an ensemble approach that combines decisions from multiple fake detection networks utilizing single and multiple modalities, resulting in improved overall performance.
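As a reference for how the AUC values above are obtained, the AUC can be computed directly from scores as the rank (Mann-Whitney) statistic: the probability that a randomly chosen fake sample scores above a randomly chosen real one. This pure-NumPy sketch is illustrative, not the paper's evaluation code:

```python
import numpy as np

def auc_score(scores, labels):
    """AUC via the rank statistic: P(score of a fake > score of a real),
    counting ties as 0.5. labels: 1 = fake, 0 = real."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]            # fake-class scores
    neg = scores[labels == 0]            # real-class scores
    diff = pos[:, None] - neg[None, :]   # all fake/real score pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Three of the four fake/real pairs are correctly ranked -> AUC = 0.75.
auc = auc_score([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```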
4.5 Evaluation on different audio-visual manipulations types
Deepfake videos can be generated using various audio-visual manipulation methods, and a deepfake detector should be able to recognize the different types of manipulation performed. To analyze the efficacy of the proposed AVDFDNet, we evaluated it on deepfake videos generated using different visual and audio manipulation methods. The FakeAVCeleb dataset includes a number of multimodal deepfake subsets generated using various audio-visual manipulation techniques, such as identity swap (Faceswap, Faceswap-wav2lip), expression swap (Fsgan, Fsgan-wav2lip), lip-syncing (wav2lip), and cloned speech (RTVC). We performed experiments on each subset to analyze the generalization of the proposed AVDFDNet across these multimodal manipulation techniques, reporting ACC, R, P, and F1-score. The results obtained on the different manipulations using the AVDFDNet model are presented in Table 3.
Table 3
Evaluation of the proposed approach on different audio-visual deepfake generation techniques.
Generative Methods | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
Faceswap | 94.45 | 98.12 | 96.25 | 97.51 |
Faceswap-wav2lip | 97.41 | 98.91 | 98.15 | 98.11 |
Fsgan | 88.01 | 86.17 | 87.08 | 86.94 |
Fsgan-wav2lip | 96.34 | 95.12 | 95.73 | 95.33 |
RTVC | 95.38 | 94.36 | 94.87 | 94.48 |
wav2lip | 89.56 | 87.98 | 88.76 | 88.42 |
It can be observed from the results that the proposed ensemble network performs best on videos from the Faceswap, Faceswap-wav2lip, and Fsgan-wav2lip sets, with the highest accuracies of 97.51%, 98.11%, and 95.33%, respectively. This is likely because videos in these sets contain manipulations in both modalities, which results in inconsistencies both across and within modalities compared to authentic audio-visual pairs. In comparison, the proposed approach attained its lowest accuracies on the Fsgan and wav2lip sets, with values of 86.94% and 88.42%, respectively. The wav2lip and Fsgan sets contain manipulations in the visual modality only, modifying the pose, expression, or mouth movement of the target speaker. We assume that the lower performance on these sets results from the predominance of the audio modality in the final outcome, which weakens the capability of fake video discrimination. Overall, the proposed AVDFDNet provides balanced results for the detection of fake audio-visual videos due to its ensemble prediction using multiple networks.
4.6 Generalization Evaluation
To be utilized in the real world, a deepfake detection system must have strong inference performance on deepfake videos created by unseen methods. Detection algorithms that focus on specific artifacts fail to generalize to unknown forgery classes, since artifact patterns change across manipulation methods. Deepfake detection algorithms trained on specific datasets may overfit to manipulation-specific artifacts while failing to identify a common trait transferable to new methods, which causes low performance in cross-dataset evaluation. As a result, domain generalization is an important measure for deepfake detection algorithms. In this experiment, we examine the generalizability of the proposed AVDFDNet in recognizing deepfakes that were not included in the training. For this purpose, we evaluated the model trained on the FakeAVCeleb dataset using the DFDC and DF-TIMIT datasets. The generalization results in terms of accuracy are presented as a boxplot in Fig. 7, where the x-axis represents the databases and the y-axis shows the attained accuracy. The boxplot shows the distribution of the model's classification output across the databases in terms of quartiles, median, maximum, minimum, and outlier values. The proposed approach achieved an accuracy greater than 75% on both datasets, with especially strong results on DF-TIMIT (HQ), i.e., 88.44%. This is because the DF-TIMIT dataset contains videos manipulated through the face-swap technique with frontal face poses and without occlusion or rotation, which are accurately recognized by the detector. However, the performance drops slightly to 85.03% on the DF-TIMIT (LQ) set due to the low quality of the videos. In addition, our method achieves relatively low accuracy, i.e., 76.62%, on the DFDC dataset. The reason could be that videos in DFDC are partially forged and contain varied facial/head poses, lighting conditions, and backgrounds.
Moreover, DFDC contains videos labeled as fake even when only the voice is replaced with a different person's real voice. Overall, the proposed approach detects fake content with improved accuracy and reliability. Our method employs an ensemble that combines decisions from multiple models using attention-based unimodal and cross-modal learning, providing better performance for fake video detection.
4.7 Robustness Evaluation
In real-world situations, a video may undergo a variety of disturbances that might lead a model to make a false prediction. For effective deepfake detection in real-world scenarios, a model should be robust to the various kinds of noise inserted into a video, which commonly happens to deepfake videos due to compression artifacts. We conducted an experiment to evaluate the robustness of the proposed technique, randomly inserting different types of perturbations, such as Gaussian noise, blur, and video compression (each with 0.5 probability), into the FakeAVCeleb test set videos. Figure 8 shows the accuracy of AVDFDNet on seen and unseen perturbed videos. The accuracy of the proposed network drops from the original 94.08% to 88.24%, i.e., a 5.84% drop, which indicates the resilience of the proposed approach in dealing with real-world settings, including compressed videos on the Internet. The features learned by the ensemble AVDFDNet, comprising unimodal and cross-modal networks, generalize extremely well; the integration of the audio-visual signal with individual-modality processing strengthens robustness to noise. It is worth noting that these results were obtained without retraining the model on an augmented dataset containing the mentioned distortions. As video distortions indicate a cause of weakness for detection approaches, we also analyzed the model after retraining with data augmentation applied. The findings indicate that after retraining on the augmented dataset the test accuracy increased by 3.57%, to 91.81%, which shows the robustness of the proposed approach against various video distortions.
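The perturbation protocol described above can be sketched as follows; the noise scale, the simple three-tap box blur, and the omission of video compression (which requires a codec) are our own simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for a reproducible sketch

def perturb(frames: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Apply each perturbation independently with probability p to a
    stack of video frames shaped (T, H, W, C) with pixel values 0-255."""
    out = frames.astype(float)
    if rng.random() < p:                 # additive Gaussian noise
        out = out + rng.normal(0.0, 5.0, size=out.shape)
    if rng.random() < p:                 # crude horizontal box blur
        out = (np.roll(out, 1, axis=2) + out + np.roll(out, -1, axis=2)) / 3.0
    return np.clip(out, 0, 255)

# One 2 s segment at 20 FPS: 40 frames of 224 x 224 x 3.
frames = np.full((40, 224, 224, 3), 128.0)
noisy = perturb(frames)
```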
4.8 Comparison to existing approaches
In this section, we perform a comparative analysis of the proposed approach with existing methods. We considered several state-of-the-art methods utilizing a single modality (visual information only) and both audio-visual modalities to analyze their effectiveness for deepfake video detection. Table 4 summarizes the outcomes of our approach as well as the other methods on the FakeAVCeleb benchmark dataset; we used the results reported by these approaches on this dataset for comparison. From the results, it can be noticed that using both the audio and video modalities for deepfake video detection generalizes far better than utilizing only a single (visual) modality [27]. The proposed approach outperforms the existing methods, attaining an accuracy of 94.08% on audio-visual deepfakes. In comparison, the multimodal approach in [75] achieved the second-best results for the audio-visual deepfake detection task. In [75], an ensemble-based audio and visual network is suggested for multimodal deepfake detection; however, this approach relies on a late fusion of audio-video features for prediction and fails to consider the intricacies of the inherent correlation pattern between modalities. The work in [12] also employed ensemble and multimodal models for manipulation identification; however, these methods yield poor results. The method in [27] employed high-level characteristics such as lip movements to deal with temporal inconsistency and attained an accuracy of 76%. Other approaches such as [42] utilized both audio-visual modalities for deepfake identification but are unable to identify manipulation in the audio modality. These approaches [27, 42] employ authentic audio as a reference; the audio in their videos is not synthesized using any fake audio generation technique. Chung et al. [18] proposed an audio-visual dissonance-based model for deepfake video detection.
However, that approach showed an accuracy of 69% on the FakeAVCeleb dataset [75], demonstrating that using only spatio-temporal audio and visual information is insufficient for the task of multimodal forgery detection. Overall, the existing work achieves accuracies ranging between 69% and 89%, which is low compared to the proposed work. In comparison, our method employs unimodal and cross-modal networks that consider both the audio and visual modalities of a video and can identify fake videos with both facial and speech manipulations. Our approach effectively learns the inconsistencies within the audio and visual modalities as well as across modalities and thus yields improved performance compared to the existing state of the art. The cross-modality attention contributes toward learning the underlying semantic patterns between speech and facial movements to distinguish between original and manipulated videos. Moreover, we combine the predictions of the individual networks using ensemble learning, which results in robust classification. The obtained results indicate the effectiveness of our method, which combines spatio-temporal audio and visual features with high-level synchronization patterns between modalities, providing strong clues for identifying forged videos.
Table 4
Comparison with existing unimodal and multimodal deepfake detection methods.
Method | Modality | Class | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
VGG-16 [12] | V | Real | 69.35 | 89.66 | 78.21 | 81.03 |
 | | Fake | 87.24 | 77.50 | 82.08 | |
LipForensics [27] | V | Real | 70 | 91 | 80 | 76 |
 | | Fake | 88 | 61 | 72 | |
Xception-c23 [40] | V | Real | 66.54 | 77.08 | 71.43 | 73.06 |
 | | Fake | 79.73 | 69.93 | 74.51 | |
VGG-16 [12] | AV | Real | 69.35 | 89.66 | 78.21 | 78.04 |
 | | Fake | 89.48 | 68.94 | 77.88 | |
MesoInception [40] | AV | Real | 64.51 | 50.70 | 73.37 | 72.87 |
 | | Fake | 84.40 | 63.32 | 72.35 | |
AV-Dissonance [18] | AV | Real | 62 | 99 | 76 | 69 |
 | | Fake | 94 | 40 | 57 | |
Multimodal (MesoNet18 and 2DCNN) [75] | AV | Real | 83 | 99 | 90 | 89 |
 | | Fake | 98 | 80 | 88 | |
AVDFDNet (proposed) | AV | Real | 89.95 | 92.34 | 91.13 | 94.08 |
 | | Fake | 95.83 | 96.14 | 95.98 | |