Related papers: Blockwise Parallel Transformer for Large Context M…

Ring Attention with Blockwise Transformers for Near-Infinite Context

Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability…

Computation and Language · Computer Science 2023-11-28 Hao Liu , Matei Zaharia , Pieter Abbeel

BP-Transformer: Modelling Long-Range Context via Binary Partitioning

The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limit its application on long text. In this paper, adopting a fine-to-coarse attention mechanism on…

Computation and Language · Computer Science 2019-11-12 Zihao Ye , Qipeng Guo , Quan Gan , Xipeng Qiu , Zheng Zhang

Blockwise Parallel Decoding for Deep Autoregressive Models

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make…

Machine Learning · Computer Science 2018-11-09 Mitchell Stern , Noam Shazeer , Jakob Uszkoreit

Blockwise Compression of Transformer-based Models without Retraining

Transformer-based models, exemplified by GPT-3, ChatGPT, and GPT-4, have recently garnered considerable attention in both academia and industry due to their promising performance in general language tasks. Nevertheless, these models…

Computation and Language · Computer Science 2023-09-19 Gaochen Dong , Wei Chen

BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Ao Sun , Weilin Zhao , Xu Han , Cheng Yang , Zhiyuan Liu , Chuan Shi , Maosong Sun

Block Transformer: Global-to-Local Language Modeling for Fast Inference

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of…

Computation and Language · Computer Science 2024-11-04 Namgyu Ho , Sangmin Bae , Taehyeon Kim , Hyunjik Jo , Yireun Kim , Tal Schuster , Adam Fisch , James Thorne , Se-Young Yun

Blockwise Self-Attention for Long Document Understanding

We present BlockBERT, a lightweight and efficient BERT model for better modeling long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and…

Computation and Language · Computer Science 2020-11-03 Jiezhong Qiu , Hao Ma , Omer Levy , Scott Wen-tau Yih , Sinong Wang , Jie Tang

Associative Recurrent Memory Transformer

This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is…

Computation and Language · Computer Science 2025-02-17 Ivan Rodkin , Yuri Kuratov , Aydar Bulatov , Mikhail Burtsev

ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism

Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows…

Machine Learning · Computer Science 2025-07-01 Venmugil Elango

Compact Recurrent Transformer with Persistent Memory

The Transformer architecture has shown significant success in many language processing and visual tasks. However, the method faces challenges in efficiently scaling to long sequences because the self-attention computation is quadratic with…

Machine Learning · Computer Science 2025-05-05 Edison Mucllari , Zachary Daniels , David Zhang , Qiang Ye

Recurrent Memory Transformer

Transformer-based models show their effectiveness across multiple domains and tasks. The self-attention allows to combine information from all sequence elements into context-aware representations. However, global and local information has…

Computation and Language · Computer Science 2022-12-09 Aydar Bulatov , Yuri Kuratov , Mikhail S. Burtsev

Block-State Transformers

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous…

Computation and Language · Computer Science 2023-10-31 Mahan Fathi , Jonathan Pilault , Orhan Firat , Christopher Pal , Pierre-Luc Bacon , Ross Goroshin

Long-Short Transformer: Efficient Transformers for Language and Vision

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic…

Computer Vision and Pattern Recognition · Computer Science 2021-12-08 Chen Zhu , Wei Ping , Chaowei Xiao , Mohammad Shoeybi , Tom Goldstein , Anima Anandkumar , Bryan Catanzaro

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev , Yuri Kuratov , Anton Peganov , Grigory V. Sapunov

Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory…

Machine Learning · Computer Science 2025-06-06 Danil Sivtsov , Ivan Rodkin , Gleb Kuzmin , Yuri Kuratov , Ivan Oseledets

Block-wise Bit-Compression of Transformer-based Models

With the popularity of the recent Transformer-based models represented by BERT, GPT-3 and ChatGPT, there has been state-of-the-art performance in a range of natural language processing tasks. However, the massive computations, huge memory…

Computation and Language · Computer Science 2023-04-04 Gaochen Dong , Wei Chen

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier , Gaétan Marceau Caron , Daniel Aloise

Sequence Parallelism: Long Sequence Training from System Perspective

Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm…

Machine Learning · Computer Science 2022-05-24 Shenggui Li , Fuzhao Xue , Chaitanya Baranwal , Yongbin Li , Yang You

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-10 Chong Wang , Nan Du , Tom Gunter , Tao Lei , Kulin Seth , Senyu Tong , Jianyu Wang , Guoli Yin , Xiyou Zhou , Kelvin Zou , Ruoming Pang

PoNet: Pooling Network for Efficient Token Mixing in Long Sequences

Transformer-based models have achieved great success in various NLP, vision, and speech tasks. However, the core of Transformer, the self-attention mechanism, has a quadratic time and memory complexity with respect to the sequence length,…

Computation and Language · Computer Science 2023-05-23 Chao-Hong Tan , Qian Chen , Wen Wang , Qinglin Zhang , Siqi Zheng , Zhen-Hua Ling