Scalable and Performant Data Loading

Moto Hira; Christian Puhrsch; Valentin Andrei; Roman Malinovskyy; Gael Le Lan; Abhinandan Krishnan; Joseph Cummings; Victor Bourgin; Olga Gerasimova; Miguel Martin; Gokul Gunasekaran; Yuta Inoue; Alex J Turner; Raghuraman Krishnamoorthi

Distributed, Parallel, and Cluster Computing 2026-03-11 v2

Abstract

We present SPDL (Scalable and Performant Data Loading), an open-source, framework-agnostic library designed for efficiently loading array data to the GPU. Data loading is often a bottleneck in AI applications, and it is challenging to optimize because it requires coordinating network calls, CPU-bound tasks, and GPU device transfers. On top of that, Python's GIL (Global Interpreter Lock) makes it difficult to improve performance through multi-threading. We found that when data preprocessing functions release the GIL entirely, they can be executed concurrently in a thread pool, thereby improving the performance of the workflow. Our benchmark shows that, compared to the PyTorch DataLoader, SPDL can iterate through the ImageNet dataset 74% faster while using 38% less CPU and 50 GB less memory. When training a ViT-B/16 model, SPDL can send data to the GPU at a speed that does not starve the training. Additionally, when using SPDL on Python 3.13t, without changing any code, the throughput is further improved by 33%, thanks to the disabled GIL. SPDL can improve the performance of current AI model training, and it gains further performance improvements when Free-Threaded Python is adopted in production systems. SPDL is available at https://github.com/facebookresearch/spdl.
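
The thread-pool idea described in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It is not SPDL's API or implementation. The helper names (`make_samples`, `decode`, `load_concurrently`) are hypothetical, and `zlib.decompress` is an assumed stand-in for a preprocessing function that releases the GIL while it works (as it does in CPython), in place of a real image decoder.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_samples(num_samples=8, size=1 << 16):
    # Hypothetical stand-in for encoded samples read from storage:
    # each sample is a compressed buffer of `size` identical bytes.
    return [zlib.compress(bytes([i % 256]) * size) for i in range(num_samples)]


def decode(blob):
    # Assumed stand-in for a GIL-releasing preprocessing step.
    # In CPython, zlib releases the GIL while inflating the buffer,
    # so several calls can make progress in different threads.
    return zlib.decompress(blob)


def load_concurrently(blobs, num_threads=4):
    # Run the decode step in a thread pool; map() preserves input order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(decode, blobs))


samples = make_samples()
decoded = load_concurrently(samples)
```

Whether threads actually speed up such a loop depends on how much of each call runs with the GIL released; a pure-Python `decode` would gain little on a GIL-enabled interpreter, which is the limitation the abstract says Free-Threaded Python removes.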

Keywords

large language model inference gpu computing software library

Cite

@article{arxiv.2504.20067,
  title  = {Scalable and Performant Data Loading},
  author = {Moto Hira and Christian Puhrsch and Valentin Andrei and Roman Malinovskyy and Gael Le Lan and Abhinandan Krishnan and Joseph Cummings and Victor Bourgin and Olga Gerasimova and Miguel Martin and Gokul Gunasekaran and Yuta Inoue and Alex J Turner and Raghuraman Krishnamoorthi},
  journal= {arXiv preprint arXiv:2504.20067},
  year   = {2026}
}

Comments

For the latest version of the software, please visit https://facebookresearch.github.io/spdl/main/
