In this section, we review work related to large-scale video retrieval, including the history and importance of retrieval, the extraction of salient features, and the problem of key-information loss.
Key-frame extraction is an essential part of video analysis and management, providing a concise video summarization for video indexing, browsing, and retrieval. Existing key-frame extraction methods fall roughly into three categories. Early works [17] sampled video sequences uniformly or randomly to obtain key frames, which is easy to implement; however, this strategy ignores frame content and may produce repeated frames or miss important ones.
A second generation of works [18] reported significant gains from shot-segmentation-based key-frame extraction, which selects key frames from shot fragments. The key frames extracted this way are representative, but neglecting the correlation between different shots can introduce redundant information. In response to these problems, cluster-based key-frame extraction [19, 20] emerged. This method divides video frames into clusters based on their content and then extracts several representative frames from each cluster, so that the key frames faithfully reflect the original video content; a minimal sketch of this strategy is given below.
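For concreteness, the following sketch shows one common instantiation of the cluster-based strategy: frames are described by coarse HSV color histograms, grouped with k-means, and the frame nearest each cluster centroid is kept as a key frame. The histogram descriptor, the fixed cluster count, and the nearest-to-centroid selection rule are illustrative assumptions rather than the exact procedures of [19, 20].

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_key_frames(video_path, n_clusters=5):
    """Cluster frames by color content; keep the frame nearest each centroid."""
    cap = cv2.VideoCapture(video_path)
    frames, feats = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Coarse hue/saturation histogram as a cheap frame-content descriptor.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
        frames.append(frame)
    cap.release()

    feats = np.asarray(feats)  # assumes the video has >= n_clusters frames
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
    key_idx = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Representative frame: the cluster member closest to the centroid.
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        key_idx.append(int(members[np.argmin(dists)]))
    return [frames[i] for i in sorted(key_idx)]
```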
In the image-to-video task, frame representation plays a critical role. In the early 1990s, images were indexed by hand-crafted features such as color, texture, and spatial layout. A straightforward strategy for image representation is to extract global descriptors. However, global signatures often fail to remain invariant to image changes such as illumination, occlusion, and translation. The performance of these visual descriptors remained limited until the breakthrough of local descriptors.
In recent years, considerable research has been devoted to visual representation and search methods. The bag-of-words (BoW) model, introduced by [21], is a traditional visual representation for images and videos and has been used to index movies across a movie database. The authors of [22] presented a new method for frame-duplication detection in video: they used BoW to create visual words, building a dictionary from Scale-Invariant Feature Transform (SIFT) key points of video frames, and employed hierarchical k-means (HKM) to generate a large vocabulary tree for quantization, representing each video clip and search topic with a BoW model. The authors of [23] integrated spatial and temporal information into the BoW model to significantly improve computational efficiency. Specifically, they model pairwise spatial and temporal correlations with Gaussian distributions and develop a new similarity measure that emphasizes the visual words most discriminative with respect to the query. A simplified BoW pipeline is sketched below.
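The sketch below illustrates the core of such a BoW pipeline: SIFT descriptors are extracted from frames, clustered into a visual vocabulary, and each frame is quantized into a normalized word histogram. For brevity, a flat k-means codebook stands in for the HKM vocabulary tree used in [22], and an OpenCV build with SIFT support is assumed.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def build_vocabulary(frames, vocab_size=256):
    """Cluster SIFT descriptors from sample frames into visual words."""
    descs = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, d = sift.detectAndCompute(gray, None)
        if d is not None:
            descs.append(d)
    return KMeans(n_clusters=vocab_size, n_init=4).fit(np.vstack(descs))

def bow_histogram(frame, vocab):
    """Quantize one frame's SIFT descriptors into a normalized BoW histogram."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, d = sift.detectAndCompute(gray, None)
    hist = np.zeros(vocab.n_clusters, dtype=np.float32)
    if d is not None:
        for word in vocab.predict(d):
            hist[word] += 1.0
        hist /= hist.sum()
    return hist
```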
However, several works [24, 25] reported that local features from the last convolutional layer usually yield superior accuracy compared to global features from the fully connected layer. Our work is similar to these methods in that we extract motion features from pretrained CNNs. However, to perform large-scale retrieval, the high-dimensional features must be compressed to reduce storage cost and speed up retrieval. Several works have encoded CNN features via BoW [27], VLAD [26], and FV [28], aggregation schemes originally developed for hand-crafted descriptors; a VLAD sketch follows.
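As an illustration of such encoding, the sketch below implements standard VLAD aggregation over local convolutional features: each local descriptor is hard-assigned to its nearest codebook centroid, residuals are summed per centroid, and the concatenated result is power- and L2-normalized. The codebook is assumed to be trained offline (e.g., by k-means); this is a generic VLAD sketch, not the specific formulation of [26].

```python
import numpy as np

def vlad_encode(local_feats, centroids):
    """VLAD encoding of local descriptors against a visual codebook.

    local_feats: (N, D) local descriptors, e.g. last-conv-layer activations
                 at each spatial position of a frame.
    centroids:   (K, D) codebook trained offline with k-means.
    Returns a (K * D,) power- and L2-normalized VLAD vector.
    """
    # Hard-assign each local feature to its nearest centroid.
    d2 = ((local_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)

    K, D = centroids.shape
    vlad = np.zeros((K, D), dtype=np.float32)
    for k in range(K):
        members = local_feats[assign == k]
        if len(members):
            # Accumulate residuals of assigned features to this centroid.
            vlad[k] = (members - centroids[k]).sum(axis=0)

    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))      # power normalization
    return vlad / max(np.linalg.norm(vlad), 1e-12)    # L2 normalization
```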
Although these methods perform well in some visual search tasks, they require a large codebook trained offline, which is difficult to obtain for a large-scale database; moreover, some information is lost during the feature-encoding stage. Apart from the aggregation strategies mentioned above, the average-pooling mechanism can also generate discriminative descriptors. Lin et al. [26] explained why pooling is effective for encoding deep local convolutional features: first, the mean-pooling strategy largely prevents over-fitting; second, it sums up the spatial information, making the representation more robust to spatial transformations of the query image. Inspired by the excellent performance of average pooling, we propose a simple aggregation method to generate compact and discriminative frame representations.
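A minimal sketch of the mean-pooling aggregation described above is shown here, assuming a feature map of shape (channels, height, width) taken from the last convolutional layer; the layer choice and the final L2 normalization are illustrative assumptions.

```python
import numpy as np

def average_pool_descriptor(feature_map):
    """Average a (C, H, W) conv feature map over all spatial positions,
    yielding a compact C-dimensional frame descriptor, then L2-normalize."""
    C = feature_map.shape[0]
    desc = feature_map.reshape(C, -1).mean(axis=1)
    return desc / max(np.linalg.norm(desc), 1e-12)
```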