In this section, we review work related to large-scale video retrieval, including the history and importance of retrieval, the extraction of salient features, and the problem of key-information loss.
Key-frame extraction is an essential part of video analysis and management, providing a concise video summarization for video indexing, browsing, and retrieval. Existing key-frame extraction methods fall roughly into three categories. Early works [17] sampled video sequences uniformly or randomly to obtain key frames, which is easy to implement; however, this strategy ignores frame content and may produce repeated frames or miss important ones.
A second generation of works [18] reported significant gains from shot-segmentation-based key-frame extraction, which selects key frames from shot fragments. The key frames extracted this way are representative, but neglecting the correlation between different shots can introduce redundant information. In response to these problems, cluster-based key-frame extraction [19, 20] emerged. This method divides video frames into clusters based on their content and then extracts several representative frames from each cluster, so that the key frames faithfully reflect the original video content; a minimal sketch of this strategy is given below.
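For concreteness, the following sketch shows one common instantiation of the cluster-based strategy: frames are described by coarse HSV color histograms, grouped with k-means, and the frame nearest each cluster centroid is kept as a key frame. The histogram descriptor, the fixed cluster count, and the nearest-to-centroid selection rule are illustrative assumptions rather than the exact procedures of [19, 20].

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_key_frames(video_path, n_clusters=5):
    """Cluster frames by color content; keep the frame nearest each centroid."""
    cap = cv2.VideoCapture(video_path)
    frames, feats = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Coarse hue/saturation histogram as a cheap frame-content descriptor.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
        frames.append(frame)
    cap.release()

    feats = np.asarray(feats)  # assumes the video has >= n_clusters frames
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
    key_idx = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Representative frame: the cluster member closest to the centroid.
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        key_idx.append(int(members[np.argmin(dists)]))
    return [frames[i] for i in sorted(key_idx)]
```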
In the image-to-video task, frame representation plays a critical role. In the early 1990s, images were indexed by hand-crafted features such as color, texture, and spatial layout. A straightforward strategy for image representation is to extract global descriptors. However, global signatures often fail to remain invariant to image changes such as illumination, occlusion, and translation. The performance of these visual descriptors remained limited until the breakthrough of local descriptors.
In recent years, considerable research has been devoted to visual representation and search methods. The bag-of-words (BoW) model, introduced by [21], is a traditional visual representation for images and videos and has been used to index movies across a movie database. The authors of [22] presented a new method for frame-duplication detection in video: they used BoW to create visual words, building a dictionary from Scale-Invariant Feature Transform (SIFT) key points of video frames, and employed hierarchical k-means (HKM) to generate a large vocabulary tree for quantization, representing each video clip and search topic with a BoW model. The authors of [23] integrated spatial and temporal information into the BoW model to significantly improve computational efficiency. Specifically, they model pairwise spatial and temporal correlations with Gaussian distributions and develop a new similarity measure that emphasizes the visual words most discriminative with respect to the query. A simplified BoW pipeline is sketched below.
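The sketch below illustrates the core of such a BoW pipeline: SIFT descriptors are extracted from frames, clustered into a visual vocabulary, and each frame is quantized into a normalized word histogram. For brevity, a flat k-means codebook stands in for the HKM vocabulary tree used in [22], and an OpenCV build with SIFT support is assumed.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def build_vocabulary(frames, vocab_size=256):
    """Cluster SIFT descriptors from sample frames into visual words."""
    descs = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, d = sift.detectAndCompute(gray, None)
        if d is not None:
            descs.append(d)
    return KMeans(n_clusters=vocab_size, n_init=4).fit(np.vstack(descs))

def bow_histogram(frame, vocab):
    """Quantize one frame's SIFT descriptors into a normalized BoW histogram."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, d = sift.detectAndCompute(gray, None)
    hist = np.zeros(vocab.n_clusters, dtype=np.float32)
    if d is not None:
        for word in vocab.predict(d):
            hist[word] += 1.0
        hist /= hist.sum()
    return hist
```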
However, several works [24, 25] reported that local features from the last convolutional layer usually yield superior accuracy compared to global features from the fully connected layer. Our work is similar to these methods in that we extract motion features from pretrained CNNs. However, to perform large-scale retrieval, the high-dimensional features must be compressed to reduce storage cost and speed up retrieval. Several works have encoded CNN features via BoW [27], VLAD [26], and FV [28], aggregation schemes originally developed for hand-crafted descriptors; a VLAD sketch follows.
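As an illustration of such encoding, the sketch below implements standard VLAD aggregation over local convolutional features: each local descriptor is hard-assigned to its nearest codebook centroid, residuals are summed per centroid, and the concatenated result is power- and L2-normalized. The codebook is assumed to be trained offline (e.g., by k-means); this is a generic VLAD sketch, not the specific formulation of [26].

```python
import numpy as np

def vlad_encode(local_feats, centroids):
    """VLAD encoding of local descriptors against a visual codebook.

    local_feats: (N, D) local descriptors, e.g. last-conv-layer activations
                 at each spatial position of a frame.
    centroids:   (K, D) codebook trained offline with k-means.
    Returns a (K * D,) power- and L2-normalized VLAD vector.
    """
    # Hard-assign each local feature to its nearest centroid.
    d2 = ((local_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)

    K, D = centroids.shape
    vlad = np.zeros((K, D), dtype=np.float32)
    for k in range(K):
        members = local_feats[assign == k]
        if len(members):
            # Accumulate residuals of assigned features to this centroid.
            vlad[k] = (members - centroids[k]).sum(axis=0)

    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))      # power normalization
    return vlad / max(np.linalg.norm(vlad), 1e-12)    # L2 normalization
```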
Although these methods perform well in some visual search tasks, they require a large codebook trained offline, which is difficult to obtain for a large-scale database; moreover, some information is lost during the feature-encoding stage. Apart from the aggregation strategies mentioned above, the average-pooling mechanism can also generate discriminative descriptors. Lin et al. [26] explained why pooling is effective for encoding deep local convolutional features: first, the mean-pooling strategy largely prevents over-fitting; second, it sums up the spatial information, making the representation more robust to spatial transformations of the query image. Inspired by the excellent performance of average pooling, we propose a simple aggregation method to generate compact and discriminative frame representations.
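A minimal sketch of the mean-pooling aggregation described above is shown here, assuming a feature map of shape (channels, height, width) taken from the last convolutional layer; the layer choice and the final L2 normalization are illustrative assumptions.

```python
import numpy as np

def average_pool_descriptor(feature_map):
    """Average a (C, H, W) conv feature map over all spatial positions,
    yielding a compact C-dimensional frame descriptor, then L2-normalize."""
    C = feature_map.shape[0]
    desc = feature_map.reshape(C, -1).mean(axis=1)
    return desc / max(np.linalg.norm(desc), 1e-12)
```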