This section surveys the mechanisms proposed to detect tampering in videos recorded by CCTV cameras. Table 3 summarises the techniques and approaches proposed so far for detecting tampering in surveillance videos.
Table 3: Surveillance Videos Forgery Detection Approaches

Approaches and Techniques for Forgery Detection in Surveillance Videos:
- Sensor Pattern Noise Technique
- Gaussian Distribution
- Residual Gradient and Optical Flow Gradient
- Residual Frames
- Optical Flow Gradient and Residual Analysis
- Feature Extraction
- WiFi Signals
- Temporal Domain
- Capsule Network
- Secure-Pose
- Similarity Analysis
- Deep Learning
- Radio-Frequency (RF) Signal
4.1 Sensor Pattern Noise Technique
Sensor Pattern Noise (SPN) [93] and resampling estimation [94] techniques have been proposed to identify forgeries in surveillance footage. Minimum Average Correlation Energy - Mellin Radial Harmonic (MACE-MRH) correlation filters can detect upscale-crop, partial manipulation, and video alteration forgeries by exploiting their invariance and scaling tolerance. This approach is also used to identify the source camera: in the first stage, the source camera of a given video is recognised; in the second stage, the scaling factor and correlation coefficient are used to identify tampering in the video. The method performed markedly better on videos of static scenes, and it produced significantly superior results compared to Chen's method [95] (i.e., 15% higher accuracy, particularly when the scaling factor for infrared video is 1.8) [96].
The previously mentioned approach has been improved by exploiting the scaling tolerance of a Minimum Average Correlation Energy - Mellin Radial Harmonic (MACE-MRH) correlation filter to reliably reveal video upscale-crop forgery and recognise partially altered regions. Since resampling introduces specific statistical correlations into the content, its presence can be determined by checking for these correlations. The Sensor Pattern Noise (SPN) [97] was used as a forensic feature, and the differences between the reference SPN and the SPN of upscaled frames were examined in terms of their correlation characteristics. The approach was evaluated on a total of 1920 fabricated sequences constructed from 120 self-recorded RGB and infrared H.264-encoded test videos. As long as the scale and quality parameters were regularly checked and adjusted, this method achieved a True Negative Rate (TNR) of 100% and a True Positive Rate (TPR) greater than 98%. For partial-manipulation detection, a detection accuracy of 100% for dynamic-scene videos and 94.2% to 100% for static-scene videos was recorded for region sizes between 100 and 150 square pixels. The technique proved reliable for compressed videos in addition to RGB and infrared videos, and it works with videos of both dynamic and static scenes captured by moving and stationary cameras [98].
4.2 Gaussian Distribution
In the optical-flow-based forgery detection approach, the probability distributions of optical-flow variations in unaltered surveillance videos were modelled with a Gaussian distribution. Any irregularity in the flow fluctuations was treated as an anomaly, and a statistical inference test (Grubbs' test) was used to assign an anomaly score to the optical-flow patterns of each test video; the score reflects the degree to which the pattern behaves anomalously. Finally, to detect inter-frame forgeries, three cut-off levels (one each for frame insertion, frame deletion, and frame duplication) were applied to the anomaly score to flag abnormalities. The technique was assessed on 160 test clips, all produced from two original MPEG-2-encoded videos taken from TRECVID's [99] surveillance event detection data set. The detection accuracies for frame deletion, insertion, and duplication were 75%, 85%, and 82.5%, respectively, and the reported forgery-localisation accuracies were 96.9%, 100%, and 86.2%, respectively [100].
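The anomaly-scoring step can be sketched as follows. The flow-variation series, the 2.0 cut-off, and the function name are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def grubbs_score(values):
    """Grubbs' test statistic: the largest absolute deviation from the
    mean, measured in units of the sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return np.max(np.abs(values - values.mean())) / values.std(ddof=1)

# Hypothetical optical-flow variation series: smooth for a genuine clip,
# with one spike where frames were deleted (values are illustrative).
genuine  = [1.00, 1.10, 0.90, 1.05, 1.00, 0.95, 1.02]
tampered = [1.00, 1.10, 0.90, 1.05, 5.00, 1.00, 0.95]

CUTOFF = 2.0  # assumed threshold; the paper uses one cut-off per forgery type
print(grubbs_score(genuine) > CUTOFF)   # False
print(grubbs_score(tampered) > CUTOFF)  # True
```

In the paper's setting, the series would come from frame-to-frame optical-flow sums, and the three cut-offs would be tuned per forgery type.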
4.3 Residual Gradient and Optical Flow Gradient
For H.264- and MPEG-2-encoded videos, a detection technique for inter-frame forgeries employing the prediction residual gradient and optical flow gradient has been presented. A hybrid technique based on motion and brightness-gradient characteristics identifies forgeries by detecting variations between neighbouring frames, notably for hand-held mobile recordings and surveillance footage. Using the spike count, regardless of the number of frames in the video, the proposed technique automatically detects video manipulation and achieved an accuracy of 83% [101].
4.4 Residual Frames
For the detection and localisation of inter-frame duplication in digital video, another approach based on residual frames has been developed. To detect and locate frame-duplication forgeries, the entropy of the DCT coefficients of each residual frame, together with its standard deviation, is computed as a feature, and the similarity between pairs of feature vectors is assessed. The efficacy of this method was tested using positive predictive value (PPV), true positive rate (TPR), and F1 score. The technique detects inter-frame duplication tampering in an extremely short time, obtaining PPV: 98%, TPR: 99%, F1: 98% on the SULFA dataset and PPV: 97%, TPR: 98%, F1: 97% on the VIRAT dataset [102].
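A minimal sketch of this residual-frame pipeline follows. It assumes the feature vector pairs the DCT-coefficient entropy with the residual's standard deviation and that similarity is cosine-based; the paper's exact feature construction may differ, and all names are illustrative:

```python
import numpy as np

def dct2(x):
    """Orthonormal 2-D DCT-II of a square block via matrix multiplication."""
    n = x.shape[0]
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] = 1.0 / np.sqrt(n)
    return c @ x @ c.T

def residual_feature(prev_frame, frame, bins=32):
    """Feature vector for one residual frame: [DCT-coefficient entropy, std]."""
    residual = frame.astype(float) - prev_frame.astype(float)
    coeffs = np.abs(dct2(residual)).ravel()
    hist, _ = np.histogram(coeffs, bins=bins)
    p = hist[hist > 0] / hist.sum()
    entropy = -np.sum(p * np.log2(p))
    return np.array([entropy, residual.std()])

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Duplicated frame pairs yield (near-)identical residuals, hence similarity ~1.
rng = np.random.default_rng(0)
f0 = rng.random((64, 64))
step = rng.random((64, 64))
f1, f2 = f0 + step, f0 + 2 * step  # f2 repeats f1's motion pattern
s = cosine_similarity(residual_feature(f0, f1), residual_feature(f1, f2))
print(round(s, 4))  # 1.0
```

A real detector would slide this comparison over all residual-frame pairs and flag runs of near-1 similarity as duplicated segments.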
4.5 Optical Flow Gradient and Residual Analysis
A forgery detection technique based on optical-flow-gradient characteristics and prediction-residual analysis has been described. The approach can detect and localise frame deletion, insertion, and duplication. When a video is altered, the temporal correlations between neighbouring frames are broken, which is what the researchers evaluate. A window-based scheme is used to locate the forgery. The method is optimised for the H.264 and MPEG-2 codecs and is 83% accurate for both slow- and fast-motion video [103].
4.6 Feature Extraction
A feature extraction and novel-point localisation technique has been proposed. In the feature-extraction phase, the 2-D phase congruency of each frame was computed, as it is a desirable image property, and the correlation between adjacent frames was then determined. In the second step, abnormal points were identified using a clustering technique (k-means), dividing the points into normal and abnormal groups. The average accuracy is 97.08% on the first dataset [104] and 93.13% on the second dataset [105] [106].
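The clustering step can be illustrated with a tiny two-cluster k-means over per-frame correlation scores; the scores and helper names below are hypothetical:

```python
import numpy as np

def kmeans_two_clusters(values, iters=20):
    """Minimal 1-D k-means with k=2: splits points into 'normal' and
    'abnormal' clusters, as the paper's second step does."""
    values = np.asarray(values, dtype=float)
    centers = np.array([values.min(), values.max()])
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iters):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    return labels

# Hypothetical inter-frame phase-congruency correlations: a sharp dip
# marks the abnormal point where frames were tampered with.
corr = [0.95, 0.96, 0.94, 0.97, 0.31, 0.95, 0.96]
labels = kmeans_two_clusters(corr)
print(labels.tolist())  # [1, 1, 1, 1, 0, 1, 1] — index 4 isolated as abnormal
```

In practice a library implementation (e.g. scikit-learn's KMeans) would be used; the hand-rolled loop is only to make the grouping explicit.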
4.7 WiFi Signals
Wi-Fi signals have been shown to be useful for revealing video-looping attacks on surveillance systems. Earlier work relies on handcrafted event-level timing and frequency features from time-series Wi-Fi and camera data, resulting in slow reaction times and an inability to perform fine-grained forgery localisation; consequently, none of the existing solutions simultaneously meet the real-time and fine-grained requirements of forgery detection and localisation in video surveillance systems. SurFi analyses event-level timing information from Wi-Fi and camera data to detect camera-looping attacks. It utilises existing Wi-Fi infrastructure (requiring no additional hardware or deployment cost) to extract channel state information (CSI), which is then analysed and correlated with the video signal to identify discrepancies. SurFi can identify attacks with up to 95.1% accuracy [107].
4.8 Temporal Domain
Another approach for detecting inter-frame forgery (i.e., frame deletion, insertion, and shuffling), in which the manipulation takes place in the temporal domain, has been presented. It uses the universal image quality index (UQI) of temporal averages (TP) of non-overlapping neighbouring frame subsequences to detect illegitimate actions in an exceptionally short amount of time. Individual frames are collected from the security camera's directly captured footage, and the TP of each subsequence is computed. Owing to the consistency and regularity of the video, the UQI of every two adjacent TP images is used to flag unusual activity as a forgery candidate: if frames have been deleted, inserted, or shuffled, the similarity decreases, and the Q values at the boundary of the doctored clip are lower than those of other clips. Finally, the minimum Q value among the frames of candidate TPs and their neighbours is used to locate the inter-frame attacks. For frame deletion, the UQI method achieves a Precision, Recall, and F1 score of 0.98, 0.99, and 0.98, respectively; for frame insertion, 0.99, 0.99, and 0.99; and for frame shuffling, 0.96, 0.97, and 0.96. On all three evaluation criteria, the method outperformed the techniques in [108]–[111] in Precision, Recall, and F1 score. Moreover, it has the shortest execution time among them because it compares the temporal averages of non-overlapping frame subsequences rather than examining each frame individually [112].
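The UQI comparison of adjacent temporal averages can be sketched as below. The frame shapes, subsequence length, and synthetic "splice" are illustrative assumptions; a real pipeline would read frames from the camera feed:

```python
import numpy as np

def temporal_average(frames):
    """TP image: pixel-wise mean over a subsequence of frames."""
    return np.mean(np.stack(frames), axis=0)

def uqi(x, y):
    """Universal image quality index Q of two images (1.0 = identical)."""
    x, y = x.astype(float).ravel(), y.astype(float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return (4 * cov * mx * my) / ((vx + vy) * (mx ** 2 + my ** 2))

rng = np.random.default_rng(1)
background = rng.random((32, 32)) + 1.0  # static scene
clip = [background + 0.01 * rng.standard_normal((32, 32)) for _ in range(8)]
tp_a = temporal_average(clip[:4])
tp_b = temporal_average(clip[4:])
tp_c = tp_b + 3.0  # hypothetical splice: content with different luminance

print(uqi(tp_a, tp_b) > 0.99)             # True: genuine adjacent TPs agree
print(uqi(tp_a, tp_b) > uqi(tp_a, tp_c))  # True: the splice lowers Q
```

The detector would scan all adjacent TP pairs and localise the attack at the subsequence whose Q value drops below its neighbours.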
To detect frame duplication, the temporal average of each shot was employed instead of all frames. Grey-level co-occurrence matrix (GLCM) features were extracted as feature vectors, and the similarity between adjacent vectors was used to detect duplication. Despite the inclusion of post-processing operations with high false positives caused by weak boundaries of duplicated frames, the technique obtained an accuracy rate of 95% to 99% with a low running time. Without post-processing, the accuracy rates for frame duplication with shuffling (FDS) and frame duplication (FD) were 94% and 99%, respectively. The technique was evaluated on the SULFA [105] and LASIESTA datasets [113].
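A hand-rolled sketch of the GLCM feature step follows, restricted to the horizontal distance-1 offset (a full implementation would use several offsets, e.g. scikit-image's graycomatrix). The quantisation level and the contrast/energy/homogeneity feature set are assumptions:

```python
import numpy as np

def glcm(image, levels=8):
    """Normalised grey-level co-occurrence matrix for the (0 deg, distance-1)
    horizontal offset of a quantised image."""
    q = np.floor(image / image.max() * (levels - 1)).astype(int)
    m = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        m[a, b] += 1
    return m / m.sum()

def glcm_features(m):
    """Contrast, energy, and homogeneity of a GLCM."""
    i, j = np.indices(m.shape)
    contrast = np.sum(m * (i - j) ** 2)
    energy = np.sum(m ** 2)
    homogeneity = np.sum(m / (1.0 + np.abs(i - j)))
    return np.array([contrast, energy, homogeneity])

rng = np.random.default_rng(2)
shot_avg = rng.random((32, 32)) * 255  # temporal average of one shot
dup_avg = shot_avg.copy()              # duplicated shot: identical average
f1, f2 = glcm_features(glcm(shot_avg)), glcm_features(glcm(dup_avg))
print(np.allclose(f1, f2))  # True: matching features expose the duplicate
```

Adjacent shots whose feature vectors are (near-)identical are flagged as duplication candidates.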
A technique for identifying video tampering based on sensor pattern noise in video frames has been presented. Noise patterns were obtained by denoising the video frames and then averaged to estimate the sensor noise pattern. The sensor noise patterns were analysed using a locally adaptive Discrete Cosine Transform (DCT). To determine whether a video was genuine or forged, the correlation of the noise residues across video frames was calculated. The method was evaluated on a dataset of noise patterns and yielded satisfactory results, although the findings depend on the physical specifications of the source device. The model's accuracy is 96.6% [114].
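The noise-residual correlation check can be sketched as follows, with a crude mean filter standing in for the paper's locally adaptive DCT-based denoising; the simulated sensor pattern and all thresholds are illustrative:

```python
import numpy as np

def denoise(frame, k=3):
    """Crude k-by-k mean filter; a stand-in for the paper's locally
    adaptive DCT-based denoising."""
    pad = k // 2
    p = np.pad(frame.astype(float), pad, mode="edge")
    out = np.empty_like(frame, dtype=float)
    for i in range(frame.shape[0]):
        for j in range(frame.shape[1]):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return out

def noise_residual(frame):
    return frame.astype(float) - denoise(frame)

def correlation(a, b):
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
spn = rng.standard_normal((24, 24))  # simulated sensor pattern noise
frames = [rng.random((24, 24)) * 10 + spn for _ in range(5)]
reference = np.mean([noise_residual(f) for f in frames], axis=0)

own = correlation(noise_residual(frames[0]), reference)      # same camera
foreign = correlation(noise_residual(rng.random((24, 24)) * 10), reference)
print(own > foreign)  # True: low correlation suggests foreign/tampered frames
```

A frame whose residual correlates poorly with the camera's reference pattern is treated as potentially inserted from another source.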
4.9 Capsule Network
Based on capsule networks, a new digital forensic method for identifying object-based forgery in surveillance recordings has been developed. Intra-frame and inter-frame statistical features of the video sequence, derived from the motion residual computed for each frame, serve as input to the capsule network. The experimental results demonstrate that the proposed method achieves strong performance in Video Detection Accuracy (VDA), Authentic Frame Detection Accuracy (AFDA), Forged Frame Detection Accuracy (FFDA), and Double-compressed Frame Detection Accuracy (DFDA) across different bit rates and dataset resolutions, regardless of the group-of-pictures length and degree of video compression. With a 3 Mbps bit rate and 1280×720 resolution, for example, VDA: 100%, AFDA: 99.30%, DFDA: 97.94%, and FFDA: 84.97%. For a 1.5 Mbps bit rate and 1280×720 resolution, VDA is 99.99%, AFDA is 98.64%, DFDA is 96.12%, and FFDA is 81.05%. For a 3 Mbps bit rate and 640×360 resolution, the accuracies are VDA: 100%, AFDA: 98.95%, DFDA: 97.49%, and FFDA: 84.56% [115]. The reported VDA, DFDA, and FFDA results are the best compared to [116] and [117].
4.10 Secure-Pose
Secure-Pose, a novel cross-modal system that identifies and localises forgery attacks in each frame of live surveillance video, has been implemented. The authors generated their own dataset by collecting multimodal data over half an hour. For intra-frame attacks, Faster R-CNN is used to detect and cut out a human object before replacing it with the corresponding blank background segment. On their test data, the system achieved a forgery detection accuracy of 95% [118].
4.11 Similarity Analysis
The AIFDT-SV-BAS approach identifies inter-frame manipulation through a similarity analysis that is unaffected by whether the video contains a single scene or many. The method first examines the suspicious video for scene transitions; whenever the scene changes, it splits the video into multiple shots, which are then fed into a passive-blind technique based on similarity analysis [73]. If there is no scene change, the video is not split. Primarily, the histogram difference between two consecutive frames in the HSV colour space is used to detect forgeries; in addition, H-S and S-V colour histograms can identify further variations. AIFDT-SV-BAS was assessed on the CASIA 2 and NC 16 datasets using precision, recall, and accuracy metrics. Thanks to the scene-change recognition and video segmentation performed before checking for forgery, it significantly outperformed the benchmark [73], with a precision of 98.07%, a recall of 100%, and an accuracy of 99.1% [119].
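The consecutive-frame HSV histogram check can be sketched as follows, using Python's standard-library colorsys for the RGB-to-HSV conversion. The distance measure, bin count, and threshold are assumptions, not the paper's exact choices:

```python
import colorsys
import numpy as np

def hsv_histograms(frame_rgb, bins=16):
    """Concatenated H, S, V histograms of an RGB frame with values in [0, 1]."""
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in frame_rgb.reshape(-1, 3)])
    return np.concatenate(
        [np.histogram(hsv[:, c], bins=bins, range=(0.0, 1.0))[0] for c in range(3)]
    ).astype(float)

def histogram_difference(h1, h2):
    """Normalised L1 distance between consecutive-frame histograms."""
    return np.abs(h1 - h2).sum() / h1.sum()

rng = np.random.default_rng(4)
frame_a = rng.random((16, 16, 3))
frame_b = np.clip(frame_a + 0.005 * rng.standard_normal(frame_a.shape), 0, 1)
frame_c = np.zeros((16, 16, 3))  # abrupt change: solid-colour frame
frame_c[..., 0], frame_c[..., 1] = 0.9, 0.4

THRESHOLD = 0.5  # assumed scene-change / tamper threshold
print(histogram_difference(hsv_histograms(frame_a), hsv_histograms(frame_b)) < THRESHOLD)  # True
print(histogram_difference(hsv_histograms(frame_a), hsv_histograms(frame_c)) > THRESHOLD)  # True
```

Pairs whose difference exceeds the threshold mark either a legitimate scene transition (triggering shot splitting) or a forgery candidate for the subsequent similarity analysis.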
4.12 Deep Learning
A system for identifying inter-frame forgeries has been created that segments a video into shots and fuses spatial and temporal information to generate a single image per shot. A pre-trained 2D-CNN model is used for effective extraction of spatiotemporal features, and the structural similarity index (SSIM) is then used to construct deep-learning video features. Finally, a 2D-CNN and an RBF Multiclass Support Vector Machine (RBF-MSVM) detect temporal manipulation in the video. To evaluate inter-frame forgery detection, a dataset of 13,135 videos containing three types of forged videos under different conditions was created from original videos in the VIRAT, SULFA, LASIESTA, and IVY datasets. The method achieved TPRs of 0.987, 0.999, and 0.985 for frame deletion, insertion, and duplication, respectively [120].
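The SSIM feature at the core of this pipeline can be illustrated with a single-window (global) SSIM, a simplified form of the usual windowed index; the synthetic frames and constants are illustrative:

```python
import numpy as np

def ssim_global(x, y, dynamic_range=255.0):
    """Global (single-window) structural similarity index of two images."""
    c1 = (0.01 * dynamic_range) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * dynamic_range) ** 2
    x, y = x.astype(float), y.astype(float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(5)
frame = rng.integers(0, 256, size=(32, 32)).astype(float)
next_frame = np.clip(frame + rng.normal(0, 2, frame.shape), 0, 255)  # smooth motion
unrelated = rng.integers(0, 256, size=(32, 32)).astype(float)        # inserted frame

print(round(ssim_global(frame, frame), 6))                              # 1.0
print(ssim_global(frame, next_frame) > ssim_global(frame, unrelated))   # True
```

In the surveyed system, per-frame SSIM values over a shot feed the 2D-CNN and RBF-MSVM classifiers; a sudden SSIM drop between adjacent frames is the cue for insertion or deletion.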
4.13 Radio-Frequency (RF) Signals
Learning-based algorithms have been designed to detect video forgery attacks using radio-frequency (RF) signals, as an extended version of Secure-Pose [118]. The Secure-Pose method identifies camera-looping attacks by analysing event-level timing and frequency data derived from coexisting Wi-Fi and camera data; however, it cannot provide fast identification and precise localisation of forgeries. The enhanced RF-based approach identifies anomalous objects with a detection accuracy of 98.7% and correctly localises them during playback and tampering attacks [121].