Related papers: Blockwise Parallel Decoding for Deep Autoregressiv…

Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works…

Computer Vision and Pattern Recognition · Computer Science 2026-03-12 Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

Blockwise Parallel Transformer for Large Context Models

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention…

Computation and Language · Computer Science 2023-08-30 Hao Liu, Pieter Abbeel

Exploring and Improving Drafts in Blockwise Parallel Decoding

Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as…

Computation and Language · Computer Science 2024-06-06 Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton

Fast Decoding in Sequence Models using Discrete Latent Variables

Autoregressive sequence models based on deep neural networks, such as RNNs, Wavenet and the Transformer attain state-of-the-art results on many tasks. However, they are difficult to parallelize and are thus slow at processing long…

Machine Learning · Computer Science 2018-06-11 Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, Noam Shazeer

Parallelized Autoregressive Visual Generation

Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for…

Computer Vision and Pattern Recognition · Computer Science 2025-04-04 Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu

Continuous Speculative Decoding for Autoregressive Image Generation

Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation. However, the heavy autoregressive inference burden imposes significant overhead. In Large Language Models (LLMs), speculative decoding…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang

Parallelizing non-linear sequential models over the sequence length

Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherent sequential nature. For many years this bottleneck has persisted, as many thought…

Machine Learning · Computer Science 2024-01-17 Yi Heng Lim, Qi Zhu, Joshua Selfridge, Muhammad Firmansyah Kasim

Pipelined Decoder for Efficient Context-Aware Text Generation

As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a…

Computation and Language · Computer Science 2025-07-02 Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng

Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence

Recent advances in reasoning models have demonstrated significant improvements in accuracy by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and…

Computation and Language · Computer Science 2025-08-27 Yijiong Yu

Fast Autoregressive Video Generation with Diagonal Decoding

Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of…

Computer Vision and Pattern Recognition · Computer Science 2025-03-19 Yang Ye, Junliang Guo, Haoyu Wu, Tianyu He, Tim Pearce, Tabish Rashid, Katja Hofmann, Jiang Bian

Ring Attention with Blockwise Transformers for Near-Infinite Context

Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability…

Computation and Language · Computer Science 2023-11-28 Hao Liu, Matei Zaharia, Pieter Abbeel

From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation

Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Siyang Wang, Hanting Li, Wei Li, Jie Hu, Xinghao Chen, Feng Zhao

Accelerating Transformer Inference for Translation via Parallel Decoding

Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT). The community proposed specific network architectures and learning-based methods to solve this issue, which are expensive and require changes to the…

Computation and Language · Computer Science 2025-02-06 Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science 2025-06-05 Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng

Accelerating Inference of Discrete Autoregressive Normalizing Flows by Selective Jacobi Decoding

Discrete normalizing flows are promising generative models with advantages such as analytical log-likelihood computation and end-to-end training. However, the architectural constraints to ensure invertibility and tractable Jacobian…

Machine Learning · Computer Science 2026-05-06 Jiaru Zhang, Juanwu Lu, Xiaoyu Wu, Ziran Wang, Ruqi Zhang

End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification

Autoregressive decoding is the only part of sequence-to-sequence models that prevents them from massive parallelization at inference time. Non-autoregressive models enable the decoder to generate all output symbols independently in…

Computation and Language · Computer Science 2018-11-13 Jindřich Libovický, Jindřich Helcl

Hierarchical Attention Encoder Decoder

Recent advances in large language models have shown that autoregressive modeling can generate complex and novel sequences that have many real-world applications. However, these models must generate outputs autoregressively, which becomes…

Machine Learning · Computer Science 2023-06-05 Asier Mujika

Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation

Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where…

Computation and Language · Computer Science 2021-06-28 Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah A. Smith

Speculative Decoding and Beyond: An In-Depth Survey of Techniques

Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model…

Computation and Language · Computer Science 2025-10-09 Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, Sai Qian Zhang

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV)…

Computation and Language · Computer Science 2026-03-06 Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li