Spatial-Aware and Load Balancing Distributed Data Partitioning Strategies for Content-Based Multimedia Retrieval

doi:10.21203/rs.3.rs-4973077/v1

Research Article

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Content-Based Multimedia Retrieval (CBMR) has become popular in many applications, driven by the growing routine use of multimedia data. Because the datasets used in real-world applications are very large and descriptor dimensionality is high, querying is an expensive, albeit important, functionality. Exact search is prohibitive in most cases, which motivates the use of Approximate Nearest Neighbour Search (ANNS) algorithms that trade accuracy for performance. These algorithms have mainly been developed for sequential execution on a single node. However, the large and growing datasets and the high query loads submitted to these systems typically exceed the memory and computing resources available on a single node. This has motivated the development of parallel, distributed-memory ANNS solutions that can meet the computing demands of those applications. A common problem in distributed-memory systems is data partitioning and its impact on load imbalance. Several data partitioning approaches have been proposed, including elaborate spatial-aware strategies, but little effort has been put into carefully analyzing their performance at scale. Here, we evaluate the data partitioning strategies commonly used in ANNS and identify their limitations. Based on this analysis, we propose a novel class of partitioning algorithms that minimize load imbalance while improving data locality, in order to attain high performance in distributed-memory search. Experimentally, our proposed algorithms (SABBS and SABBSR) improved search performance by up to 1.64× compared to the best previous solution. In a distributed-memory weak-scaling evaluation with up to 12 billion 128-dimensional descriptors and 60 compute nodes, these gains were maintained as the system scaled. These results demonstrate the efficiency of our new algorithms for billion-scale ANNS and the importance of considering not only data locality but also data and load imbalance in data partitioning.
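
The tension between data locality and load balance described above can be illustrated with a toy sketch. This is not the SABBS or SABBSR algorithm; it is a hypothetical example that assumes 1-D descriptors, four nodes, and an imbalance factor defined as the maximum node load divided by the mean node load. All function names and the synthetic data are assumptions made for illustration.

```python
import random
from collections import Counter

def imbalance(loads):
    """Max node load over mean node load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

def partition_loads(assignments, num_nodes):
    """Count how many items each node received."""
    counts = Counter(assignments)
    return [counts.get(node, 0) for node in range(num_nodes)]

def round_robin(num_items, num_nodes):
    """Locality-oblivious partitioning: item i goes to node i mod num_nodes."""
    return [i % num_nodes for i in range(num_items)]

def nearest_centroid(points, centroids):
    """Spatial-aware partitioning: each point goes to the node whose
    (hypothetical) region centroid is closest."""
    return [
        min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
        for p in points
    ]

# Synthetic, skewed 1-D data: 900 points clustered near 0, 100 near 10.
random.seed(0)
points = [random.gauss(0.0, 1.0) for _ in range(900)]
points += [random.gauss(10.0, 1.0) for _ in range(100)]

num_nodes = 4
centroids = [0.0, 5.0, 10.0, 15.0]  # one assumed region centroid per node

rr_loads = partition_loads(round_robin(len(points), num_nodes), num_nodes)
sp_loads = partition_loads(nearest_centroid(points, centroids), num_nodes)

print("round-robin loads:", rr_loads, "imbalance:", imbalance(rr_loads))
print("spatial loads:    ", sp_loads, "imbalance:", imbalance(sp_loads))
```

In this sketch, round-robin spreads items evenly but scatters neighbouring descriptors across all nodes, while the nearest-centroid assignment keeps neighbours together but concentrates most of the skewed data on one node.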

Content-Based Multimedia Retrieval

Data Partitioning

Load Imbalance

Approximate Nearest Neighbors

Product Quantization

Distributed Computing

Similarity Search

No competing interests reported.

Editorial decision: Revision requested (22 Oct, 2024)
Reviews received at journal (04 Oct, 2024)
Reviews received at journal (01 Oct, 2024)
Reviewers agreed at journal (07 Sep, 2024)
Reviewers agreed at journal (06 Sep, 2024)
Reviewers agreed at journal (05 Sep, 2024)
Reviewers invited by journal (05 Sep, 2024)
Editor assigned by journal (26 Aug, 2024)
Submission checks completed at journal (25 Aug, 2024)
First submitted to journal (25 Aug, 2024)

Status:

Version 1
