Related papers: PyTorch Distributed: Experiences on Accelerating D…

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-13 Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, Shen Li

PyTorch Tabular: A Framework for Deep Learning with Tabular Data

In spite of showing unreasonable effectiveness in modalities like Text and Image, Deep Learning has always lagged Gradient Boosting in tabular data - both in popularity and performance. But recently there have been newer models created…

Machine Learning · Computer Science 2021-04-29 Manu Joseph

A Data and Model-Parallel, Distributed and Scalable Framework for Training of Deep Networks in Apache Spark

Training deep networks is expensive and time-consuming with the training period increasing with data size and growth in model parameters. In this paper, we provide a framework for distributed training of deep networks over a cluster of CPUs…

Machine Learning · Statistics 2017-08-22 Disha Shrivastava, Santanu Chaudhury, Dr. Jayadeva

Modern Distributed Data-Parallel Large-Scale Pre-training Strategies For NLP models

Distributed deep learning is becoming increasingly popular due to the expanding demand for computing resources for deep learning models with a larger amount of parameters. Different from traditional training approaches, data-parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-14 Hao Bai

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style…

Machine Learning · Computer Science 2019-12-05 Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala

Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism

An appropriate choice of batch sizes in large-scale model training is crucial, yet it involves an intrinsic yet inevitable dilemma: large-batch training improves training efficiency in terms of memory utilization, while generalization…

Machine Learning · Computer Science 2025-03-18 Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar

FlexModel: A Framework for Interpretability of Distributed Large Language Models

With the growth of large language models, now incorporating billions of parameters, the hardware prerequisites for their training and deployment have seen a corresponding increase. Although existing tools facilitate model parallelization…

Machine Learning · Computer Science 2023-12-07 Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson

Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey

Deep reinforcement learning has led to dramatic breakthroughs in the field of artificial intelligence for the past few years. As the amount of rollout experience data and the size of neural networks for deep reinforcement learning have…

Machine Learning · Computer Science 2024-11-11 Zhihong Liu, Xin Xu, Peng Qiao, Dongsheng Li

Fast Graph Representation Learning with PyTorch Geometric

We introduce PyTorch Geometric, a library for deep learning on irregularly structured input data such as graphs, point clouds and manifolds, built upon PyTorch. In addition to general graph data structures and processing methods, it…

Machine Learning · Computer Science 2019-04-26 Matthias Fey, Jan Eric Lenssen

PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning

We present PyTorch Frame, a PyTorch-based framework for deep learning over multi-modal tabular data. PyTorch Frame makes tabular deep learning easy by providing a PyTorch-based data structure to handle complex tabular data, introducing a…

Machine Learning · Computer Science 2024-12-17 Weihua Hu, Yiwen Yuan, Zecheng Zhang, Akihiro Nitta, Kaidi Cao, Vid Kocijan, Jinu Sunil, Jure Leskovec, Matthias Fey

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science 2022-07-04 Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele

PyTorch Metric Learning

Deep metric learning algorithms have a wide variety of applications, but implementing these algorithms can be tedious and time consuming. PyTorch Metric Learning is an open source library that aims to remove this barrier for both…

Computer Vision and Pattern Recognition · Computer Science 2020-08-24 Kevin Musgrave, Serge Belongie, Ser-Nam Lim

TorchBench: Benchmarking PyTorch with High API Surface Coverage

Deep learning (DL) has been a revolutionary technique in various domains. To facilitate the model development and deployment, many deep learning frameworks are proposed, among which PyTorch is one of the most popular solutions. The…

Machine Learning · Computer Science 2023-06-27 Yueming Hao, Xu Zhao, Bin Bao, David Berard, Will Constable, Adnan Aziz, Xu Liu

FL_PyTorch: optimization research simulator for federated learning

Federated Learning (FL) has emerged as a promising technique for edge devices to collaboratively learn a shared machine learning model while keeping training data locally on the device, thereby removing the need to store and access the full…

Machine Learning · Computer Science 2022-07-19 Konstantin Burlachenko, Samuel Horváth, Peter Richtárik

PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backed by high-bandwidth NVMe storage enable…

Databases · Computer Science 2026-05-21 Jigao Luo, Nils Boeschen, Muhammad El-Hindi, Carsten Binnig

DeepSpark: A Spark-Based Distributed Deep Learning Framework for Commodity Clusters

The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data processing pipelines for handling massive data and parameters involved in DNN training. Distributed computing platforms…

Machine Learning · Computer Science 2016-10-04 Hanjoo Kim, Jaehong Park, Jaehee Jang, Sungroh Yoon

A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale

Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product…

Machine Learning · Computer Science 2023-09-14 Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong

Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

Although recent scaling up approaches to training deep neural networks have proven to be effective, the computational intensity of large and complex models, as well as the availability of large-scale datasets, require deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-21 Bita Hasheminezhad, Shahrzad Shirzad, Nanmiao Wu, Patrick Diehl, Hannes Schulz, Hartmut Kaiser

Torch-Choice: A PyTorch Package for Large-Scale Choice Modeling with Python

The $\texttt{torch-choice}$ is an open-source library for flexible, fast choice modeling with Python and PyTorch. $\texttt{torch-choice}$ provides a $\texttt{ChoiceDataset}$ data structure to manage databases flexibly and…

Machine Learning · Computer Science 2025-06-05 Tianyu Du, Ayush Kanodia, Susan Athey