Related papers: DiPaCo: Distributed Path Composition

DiLoCo: Distributed Low-Communication Training of Language Models

Large language models (LLMs) have become a critical component in many applications of machine learning. However, standard approaches to training LLMs require a large number of tightly interconnected accelerators, with devices exchanging…

Machine Learning · Computer Science 2024-09-24 Arthur Douillard , Qixuan Feng , Andrei A. Rusu , Rachita Chhaparia , Yani Donchev , Adhiguna Kuncoro , Marc'Aurelio Ranzato , Arthur Szlam , Jiajun Shen

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all…

Computation and Language · Computer Science 2025-01-31 Arthur Douillard , Yanislav Donchev , Keith Rush , Satyen Kale , Zachary Charles , Zachary Garrett , Gabriel Teston , Dave Lacey , Ross McIlroy , Jiajun Shen , Alexandre Ramé , Arthur Szlam , Marc'Aurelio Ranzato , Paul Barham

Decoupled DiLoCo for Resilient Distributed Pre-training

Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and…

Computation and Language · Computer Science 2026-04-24 Arthur Douillard , Keith Rush , Yani Donchev , Zachary Charles , Nova Fallen , Ayush Dubey , Ionel Gog , Josef Dean , Blake Woodworth , Zachary Garrett , Nate Keating , Jenny Bishop , Henry Prior , Edouard Yvinec , Arthur Szlam , Marc'Aurelio Ranzato , Jeff Dean

DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we…

Machine Learning · Computer Science 2025-06-27 Ji Qi , WenPeng Zhu , Li Li , Ming Wu , YingJun Wu , Wu He , Xun Gao , Jason Zeng , Michael Heinrich

DiPPeR: Diffusion-based 2D Path Planner applied on Legged Robots

In this work, we present DiPPeR, a novel and fast 2D path planning framework for quadrupedal locomotion, leveraging diffusion-driven techniques. Our contributions include a scalable dataset generator for map images and corresponding…

Robotics · Computer Science 2024-05-30 Jianwei Liu , Maria Stamatopoulou , Dimitrios Kanoulas

Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion

This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers that focus on training or inference on a single device, DisCo optimizes a DNN model for…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-27 Xiaodong Yi , Shiwei Zhang , Lansong Diao , Chuan Wu , Zhen Zheng , Shiqing Fan , Siyu Wang , Jun Yang , Wei Lin

DisCo: Distributed Contact-Rich Trajectory Optimization for Forceful Multi-Robot Collaboration

We present DisCo, a distributed algorithm for contact-rich, multi-robot tasks. DisCo is a distributed contact-implicit trajectory optimization algorithm, which allows a group of robots to optimize a time sequence of forces to objects and to…

Robotics · Computer Science 2024-10-31 Ola Shorinwa , Matthew Devlin , Elliot W. Hawkes , Mac Schwager

Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization

Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to…

Machine Learning · Computer Science 2026-02-24 Yuchen Zhu , Wei Guo , Jaemoo Choi , Petr Molodyk , Bo Yuan , Molei Tao , Yongxin Chen

Communication Efficient LLM Pre-training with SparseLoCo

Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across datacenters and over the…

Machine Learning · Computer Science 2025-11-07 Amir Sarfi , Benjamin Thérien , Joel Lidin , Eugene Belilovsky

Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation

Training large language models (LLMs) requires massive computational resources, often necessitating the aggregation of geographically distributed data centers (i.e., cross-region training). However, the high communication latency in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-25 Ying Zhu , Yang Xu , Hongli Xu , Yunming Liao , Zhiwei Yao , Liusheng Huang

Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap

As both ML training and inference are increasingly distributed, parallelization techniques that shard (divide) an ML model across the GPUs of a distributed system are often deployed. With such techniques, there is a high prevalence of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-12 Shagnik Pal , Shaizeen Aga , Suchita Pati , Mahzabeen Islam , Lizy K. John

Scaling Distributed Machine Learning with In-Network Aggregation

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-01 Amedeo Sapio , Marco Canini , Chen-Yu Ho , Jacob Nelson , Panos Kalnis , Changhoon Kim , Arvind Krishnamurthy , Masoud Moshref , Dan R. K. Ports , Peter Richtárik

Eager Updates For Overlapped Communication and Computation in DiLoCo

Distributed optimization methods such as DiLoCo have been shown to be effective in training very large models across multiple distributed workers, such as datacenters. These methods split updates into two parts: an inner optimization phase,…

Computation and Language · Computer Science 2025-02-19 Satyen Kale , Arthur Douillard , Yanislav Donchev

Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding

Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks, establishing themselves as an alternative to autoregressive large language models (LLMs). Unlike autoregressive LLMs that…

Computation and Language · Computer Science 2026-03-02 Xiangzhong Luo , Yilin An , Zhicheng Yu , Weichen Liu , Xu Yang

AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models

Scaling distributed training of Large Language Models (LLMs) requires not only algorithmic advances but also efficient utilization of heterogeneous hardware resources. While existing methods such as DiLoCo have demonstrated promising…

Machine Learning · Computer Science 2025-08-26 Nikolay Kutuzov , Makar Baderko , Stepan Kulibaba , Artem Dzhalilov , Daniel Bobrov , Maxim Mashtaler , Alexander Gasnikov

DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization

Multimodal Large Language Models (MLLMs) have achieved remarkable advances by integrating text, image, and audio understanding within a unified architecture. However, existing distributed training frameworks remain fundamentally data-blind:…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-20 Hyeonjun An , Sihyun Kim , Chaerim Lim , Hyunjoon Kim , Rathijit Sen , Sangmin Jung , Hyeonsoo Lee , Dongwook Kim , Takki Yu , Jinkyu Jeong , Youngsok Kim , Kwanghyun Park

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We…

Artificial Intelligence · Computer Science 2026-05-20 Xiaozhe Li , Yang Li , Xinyu Fang , Shengyuan Ding , Peiji Li , Yongkang Chen , Yichuan Ma , Tianyi Lyu , Linyang Li , Dahua Lin , Qipeng Guo , Qingwen Liu , Kai Chen

Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach…

Machine Learning · Computer Science 2025-03-14 Zachary Charles , Gabriel Teston , Lucio Dery , Keith Rush , Nova Fallen , Zachary Garrett , Arthur Szlam , Arthur Douillard

PaCo: Parameter-Compositional Multi-Task Reinforcement Learning

The purpose of multi-task reinforcement learning (MTRL) is to train a single policy that can be applied to a set of different tasks. Sharing parameters allows us to take advantage of the similarities among tasks. However, the gaps between…

Machine Learning · Computer Science 2022-10-24 Lingfeng Sun , Haichao Zhang , Wei Xu , Masayoshi Tomizuka

Distributed Low-Communication Training with Decoupled Momentum Optimization

The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the…

Machine Learning · Computer Science 2025-10-07 Sasho Nedelkoski , Alexander Acker , Odej Kao , Soeren Becker , Dominik Scheinert