Related papers: Blockwise Parallel Transformer for Large Context M…

200 papers

Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability…

Computation and Language · Computer Science 2023-11-28 Hao Liu, Matei Zaharia, Pieter Abbeel

The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, adopting a fine-to-coarse attention mechanism on…

Computation and Language · Computer Science 2019-11-12 Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make…

Machine Learning · Computer Science 2018-11-09 Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

Transformer-based models, exemplified by GPT-3, ChatGPT, and GPT-4, have recently garnered considerable attention in both academia and industry due to their promising performance in general language tasks. Nevertheless, these models…

Computation and Language · Computer Science 2023-09-19 Gaochen Dong, Wei Chen

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of…

Computation and Language · Computer Science 2024-11-04 Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

We present BlockBERT, a lightweight and efficient BERT model for better modeling long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and…

Computation and Language · Computer Science 2020-11-03 Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, Jie Tang

This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is…

Computation and Language · Computer Science 2025-02-17 Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, Mikhail Burtsev

Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows…

Machine Learning · Computer Science 2025-07-01 Venmugil Elango

The Transformer architecture has shown significant success in many language processing and visual tasks. However, the method faces challenges in efficiently scaling to long sequences because the self-attention computation is quadratic with…

Machine Learning · Computer Science 2025-05-05 Edison Mucllari, Zachary Daniels, David Zhang, Qiang Ye

Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has…

Computation and Language · Computer Science 2022-12-09 Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous…

Computation and Language · Computer Science 2023-10-31 Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic…

Computer Vision and Pattern Recognition · Computer Science 2021-12-08 Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the Transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov

Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory…

Machine Learning · Computer Science 2025-06-06 Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets

With the popularity of the recent Transformer-based models represented by BERT, GPT-3 and ChatGPT, there has been state-of-the-art performance in a range of natural language processing tasks. However, the massive computations, huge memory…

Computation and Language · Computer Science 2023-04-04 Gaochen Dong, Wei Chen

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm…

Machine Learning · Computer Science 2022-05-24 Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You

Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-10 Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang

Transformer-based models have achieved great success in various NLP, vision, and speech tasks. However, the core of Transformer, the self-attention mechanism, has a quadratic time and memory complexity with respect to the sequence length,…

Computation and Language · Computer Science 2023-05-23 Chao-Hong Tan, Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Zhen-Hua Ling