Related papers: Block-Recurrent Transformers

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer…

Machine Learning · Computer Science 2026-04-24 Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi, Alexandru Meterez, Mujin Kwun, Sham Kakade

RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals

Transformers have achieved great success in effectively processing sequential data such as text. Their architecture consisting of several attention and feedforward blocks can model relations between elements of a sequence in parallel…

Machine Learning · Computer Science 2025-02-20 Jaemu Heo, Eldor Fozilov, Hyunmin Song, Taehwan Kim

Recurrent Memory Transformer

Transformer-based models show their effectiveness across multiple domains and tasks. The self-attention allows to combine information from all sequence elements into context-aware representations. However, global and local information has…

Computation and Language · Computer Science 2022-12-09 Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Parallel Recursive LSTM

Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context…

Machine Learning · Computer Science 2026-05-19 Tristan Gaudreault, Yongyi Mao

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source…

Machine Learning · Computer Science 2026-05-27 Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen

StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel

Decoding in a Transformer based language model is inherently sequential as a token's embedding needs to pass through all the layers in the network before the generation of the next token can begin. In this work, we propose a new…

Machine Learning · Computer Science 2025-08-27 Dylan Cutler, Arun Kandoor, Nishanth Dikkala, Nikunj Saunshi, Xin Wang, Rina Panigrahy

Block-State Transformers

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous…

Computation and Language · Computer Science 2023-10-31 Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin

Modeling Recurrence for Transformer

Recently, the Transformer model that is based solely on attention mechanisms, has advanced the state-of-the-art on various machine translation tasks. However, recent studies reveal that the lack of recurrence hinders its further improvement…

Computation and Language · Computer Science 2019-04-08 Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, Zhaopeng Tu

Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model

Transformers have shown dominant performance across a range of domains including language and vision. However, their computational cost grows quadratically with the sequence length, making their usage prohibitive for resource-constrained…

Computation and Language · Computer Science 2023-10-24 Yinghan Long, Sayeed Shafayet Chowdhury, Kaushik Roy

Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models

Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have…

Machine Learning · Computer Science 2025-10-17 Jonas Geiping, Xinyu Yang, Guinan Su

Two-Scale Latent Dynamics for Recurrent-Depth Transformers

Recurrent-depth transformers scale test-time compute by iterating latent computations before emitting tokens. We study the geometry of these iterates and argue for a simple, two-scale operational picture: (i) within a looped block, updates…

Machine Learning · Computer Science 2025-11-14 Francesco Pappone, Donato Crisostomi, Emanuele Rodolà

Blockwise Parallel Transformer for Large Context Models

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention…

Computation and Language · Computer Science 2023-08-30 Hao Liu, Pieter Abbeel

Block Transformer: Global-to-Local Language Modeling for Fast Inference

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of…

Computation and Language · Computer Science 2024-11-04 Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling

We extend the recent latent recurrent modeling to sequential input streams. By interleaving fast, recurrent latent updates with self-organizational ability between slow observation updates, our method facilitates the learning of stable…

Machine Learning · Computer Science 2026-04-23 Shota Takashiro, Masanori Koyama, Takeru Miyato, Yusuke Iwasawa, Yutaka Matsuo, Kohei Hayashi

Convolutional Sequence to Sequence Learning

The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to…

Computation and Language · Computer Science 2017-07-26 Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin

Variable Computation in Recurrent Neural Networks

Recurrent neural networks (RNNs) have been used extensively and with increasing success to model various types of sequential data. Much of this progress has been achieved through devising recurrent units and architectures with the…

Machine Learning · Statistics 2017-03-06 Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov

Deep Neural Machine Translation with Weakly-Recurrent Units

Recurrent neural networks (RNNs) have represented for years the state of the art in neural machine translation. Recently, new architectures have been proposed, which can leverage parallel computation on GPUs better than classical RNNs.…

Computation and Language · Computer Science 2018-05-14 Mattia Antonino Di Gangi, Marcello Federico

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

Over the last two decades, language modeling has experienced a shift from the use of predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements…

Computation and Language · Computer Science 2026-05-20 Benjamin L. Badger

Going Wider: Recurrent Neural Network With Parallel Cells

Recurrent Neural Network (RNN) has been widely applied for sequence modeling. In RNN, the hidden states at current step are full connected to those at previous step, thus the influence from less related features at previous step may…

Computation and Language · Computer Science 2017-05-04 Danhao Zhu, Si Shen, Xin-Yu Dai, Jiajun Chen

Transition-based Parsing with Stack-Transformers

Modeling the parser state is key to good performance in transition-based parsing. Recurrent Neural Networks considerably improved the performance of transition-based systems by modelling the global state, e.g. stack-LSTM parsers, or local…

Computation and Language · Computer Science 2020-10-22 Ramon Fernandez Astudillo, Miguel Ballesteros, Tahira Naseem, Austin Blodgett, Radu Florian