Related papers: Seq1F1B: Efficient Sequence-Level Pipeline Paralle…

200 papers

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-10 Joel Lamy-Poirier

Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-22 Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin

Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…

Machine Learning · Computer Science 2025-04-22 Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block, and show that the lifespan of the…

Machine Learning · Computer Science 2024-11-05 Penghui Qi, Xinyi Wan, Nyamdavaa Amar, Min Lin

In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities, driving advancements in real-world applications. However, training LLMs on increasingly long input sequences imposes significant challenges due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Qiaoling Chen, Shenggui Li, Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang

As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel…

Machine Learning · Computer Science 2025-07-02 Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You

Larger model sizes and longer sequence lengths have empowered the Large Language Model (LLM) to achieve outstanding performance across various domains. However, this progress brings significant storage capacity challenges for LLM…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-26 Xinyuan Lin, Chenlu Li, Zongle Huang, Chunyu Wang, Bo Xiao, Huazhong Yang, Shishi Duan, Yongpan Liu

Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-14 Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the…

Machine Learning · Computer Science 2023-10-05 Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding…

Machine Learning · Computer Science 2024-09-25 Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo

Pipeline parallelism is widely used to train large language models (LLMs). However, increasing heterogeneity in model architectures exacerbates pipeline bubbles, thereby reducing training efficiency. Existing approaches overlook the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Jihu Guo, Tenghui Ma, Wei Gao, Peng Sun, Jiaxing Li, Xun Chen, Yuyang Jin, Dahua Lin

While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass…

Computation and Language · Computer Science 2024-08-26 Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang

Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-12 Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui

Pipeline parallelism (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices. However, despite recent progress in reducing memory consumption through activation offloading, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-08 Hongpei Li, Han Zhang, Huikang Liu, Dongdong Ge, Yinyu Ye

Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan

In the machine learning system, the hybrid model parallelism combining tensor parallelism (TP) and pipeline parallelism (PP) has become the dominant solution for distributed training of Large Language Models (LLMs) and Multimodal LLMs…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-03 Mengshi Qi, Jiaxuan Peng, Jie Zhang, Juan Zhu, Yong Li, Huadong Ma

Pipeline parallelism has been demonstrated to be a remarkable approach to improve throughput for training deep neural networks with billions of parameters over heterogeneous clusters. The 1F1B scheduling plan is a widely adopted strategy…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-06 Siyu Wang, Zongyan Cao, Chang Si, Lansong Diao, Jiamang Wang, Wei Lin

We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of…

Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity has resulted in numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and…

Performance · Computer Science 2023-12-04 Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu