Distributed Out-of-Memory NMF on CPU/GPU Architectures

doi:10.21203/rs.3.rs-2782712/v1

Research Article

This work is licensed under a CC BY 4.0 License

We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The implementation builds on prior work on NMFk, which performs automatic model selection and extracts latent variables and patterns from data. In this work, we extend NMFk with support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems, where the memory required to factorize a given matrix exceeds the available GPU memory. Batching/tiling strategies reduce memory complexity, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/Output (I/O) latency from batch copies between host and device is hidden by using CUDA streams to overlap data transfers and computation asynchronously, and latency from collective communications (both intra-node and inter-node) is reduced with optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show a 32x to 76x speedup for the new GPU implementation over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10^{-6}.
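The batching/tiling idea mentioned in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: it assumes the standard Frobenius-norm multiplicative updates (the abstract does not state which update rule is used), runs in NumPy on the host, and omits the CUDA-stream overlap and NCCL communication entirely. The function name `batched_nmf` and its parameters are hypothetical. The point it shows is that only one row tile of the data matrix is needed at a time, while the quantities carried between tiles are small (k x n and k x k).

```python
import numpy as np

def batched_nmf(X, k, n_batches=4, n_iter=100, eps=1e-9, seed=0):
    """Row-tiled NMF sketch: X (m x n) ~ W (m x k) @ H (k x n).

    Hypothetical helper for illustration; assumes Frobenius-norm
    multiplicative updates, which the source does not confirm.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    # Row boundaries of the tiles; each tile stands in for one batch
    # that would be copied from host to device.
    bounds = np.linspace(0, m, n_batches + 1, dtype=int)
    tiles = [slice(bounds[i], bounds[i + 1]) for i in range(n_batches)]

    for _ in range(n_iter):
        # H update: accumulate W^T X and W^T W one tile at a time.
        WtX = np.zeros((k, n))
        WtW = np.zeros((k, k))
        for rows in tiles:
            WtX += W[rows].T @ X[rows]
            WtW += W[rows].T @ W[rows]
        H *= WtX / (WtW @ H + eps)

        # W update: each tile of W depends only on its own tile of X.
        HHt = H @ H.T
        for rows in tiles:
            W[rows] *= (X[rows] @ H.T) / (W[rows] @ HHt + eps)
    return W, H

def relative_error(X, W, H):
    """Relative Frobenius reconstruction error ||X - WH|| / ||X||."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((40, 3)) @ rng.random((3, 30))  # exactly rank-3, non-negative
    W, H = batched_nmf(X, k=3)
    print("relative error:", relative_error(X, W, H))
```

In a GPU setting, each pass over `tiles` would correspond to a host-to-device batch copy; overlapping those copies with computation is what the abstract attributes to CUDA streams.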

NMF

out-of-memory

latent features

model selection

distributed processing

parallel programming

big data

heterogeneous computing

GPU

CUDA

NCCL

cupy

No competing interests reported.

Editorial decision: Major revision
06 Jul, 2023
Reviewers agreed at journal
25 Jun, 2023
Reviews received at journal
19 Jun, 2023
Reviewers agreed at journal
11 Jun, 2023
Reviews received at journal
25 May, 2023
Reviewers agreed at journal
16 May, 2023
Reviewers agreed at journal
15 May, 2023
Reviewers invited by journal
12 Apr, 2023
Editor assigned by journal
09 Apr, 2023
Submission checks completed at journal
07 Apr, 2023
First submitted to journal
05 Apr, 2023
