Related papers: Linear Attention Sequence Parallelism

LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized…

Machine Learning · Computer Science 2025-02-12 Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the…

Machine Learning · Computer Science 2024-07-03 Jiarui Fang, Shangchun Zhao

Sequence Parallelism: Long Sequence Training from System Perspective

Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm…

Machine Learning · Computer Science 2022-05-24 Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You

ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention

Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence…

Machine Learning · Computer Science 2025-07-03 Yuhong Chou, Zehao Liu, Ruijie Zhu, Xinyi Wan, Tianjian Li, Congying Chu, Qian Liu, Jibin Wu, Zejun Ma

TASP: Topology-aware Sequence Parallelism

Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query…

Machine Learning · Computer Science 2025-10-10 Yida Wang, Ke Hong, Xiuhong Li, Yuanchao Xu, Wenxun Wang, Guohao Dai, Yu Wang

Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP)…

Computation and Language · Computer Science 2026-04-30 Vasu Shyam, Anna Golubeva, Quentin Anthony

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-31 Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang

S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance

With the increasing volumes of Large Language Models (LLMs) and the expanding context lengths, attention computation has become a key performance bottleneck in LLM serving. For fast attention computation, recent practices often parallelize…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-12 Di Liu, Yifei Liu, Chen Chen, Zhibin Yu, Xiaoyi Fan, Quan Chen, Minyi Guo

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads

Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency, however GPU communications…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari

FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism

Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-12 Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui

HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism

As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel…

Machine Learning · Computer Science 2025-07-02 Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You

SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading

In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities, driving advancements in real-world applications. However, training LLMs on increasingly long input sequences imposes significant challenges due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Qiaoling Chen, Shenggui Li, Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang

db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism

Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Siqi Chen, Ke Hong, Tianchen Zhao, Ruiqi Xie, Zhenhua Zhu, Xudong Zhang, Yu Wang

LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism

The context window of large language models (LLMs) is rapidly increasing, leading to a huge variance in resource usage between different requests as well as between different phases of the same request. Restricted by static parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-30 Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin

SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-03 Han-Byul Kim, Duc Hoang, Arnav Kundu, Mohammad Samragh, Minsik Cho

Ultra-Long Sequence Distributed Transformer

Transformer models trained on long sequences often achieve higher accuracy than short sequences. Unfortunately, conventional transformers struggle with long sequence training due to the overwhelming computation and memory requirements.…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-09 Xiao Wang, Isaac Lyngaas, Aristeidis Tsaris, Peng Chen, Sajal Dash, Mayanka Chandra Shekar, Tao Luo, Hong-Jun Yoon, Mohamed Wahib, John Gouley

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…

Machine Learning · Computer Science 2025-04-22 Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements…

Computation and Language · Computer Science 2024-06-25 Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu

Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention…

Computer Vision and Pattern Recognition · Computer Science 2026-01-30 Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu

Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate…

Machine Learning · Computer Science 2025-02-10 Nathaniel Tomczak, Sanmukh Kuppannagari