Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Distributed, Parallel, and Cluster Computing 2026-03-13 v3

Abstract

Graph embeddings map graph nodes to continuous vectors and are foundational to community detection, recommendation, and many scientific applications. At billion scale, however, existing graph embedding systems face a trade-off: they either keep large in-memory footprints across many GPUs, which limits scalability, or repeatedly stream data from disk, which incurs severe I/O overhead and low GPU utilization. In this paper, we propose Legend, a lightweight heterogeneous system for graph embedding that systematically redesigns data management across CPU, GPU, and NVMe SSD resources. Legend combines three practical ideas: (1) a prefetch-friendly embedding-loading order that lets GPUs efficiently prefetch necessary embeddings directly from NVMe SSD with low I/O amplification; (2) a high-throughput GPU-SSD direct-access driver tuned for the access patterns of embedding training; and (3) a customized parallel execution strategy that maximizes GPU utilization. Together, these components let Legend store and stream vast embedding data without overprovisioning GPU memory or suffering I/O stalls. Extensive experiments on billion-scale graphs demonstrate that Legend speeds up end-to-end workloads by up to 4.8x versus state-of-the-art systems, and matches their performance on the largest workloads while using only one quarter of the GPUs.

Cite

@article{arxiv.2505.09258,
  title  = {Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration},
  author = {Zhonggen Li and Xiangyu Ke and Yifan Zhu and Yunjun Gao and Feifei Li},
  journal= {arXiv preprint arXiv:2505.09258},
  year   = {2026}
}

Comments

Accepted by The VLDB Journal 2026