Related papers: Breaking the Autoregressive Chain: Hyper-Parallel …

Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works…

Computer Vision and Pattern Recognition · Computer Science 2026-03-12 Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By…

Computation and Language · Computer Science 2025-08-15 Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun

Accelerating Diffusion LLMs via Adaptive Parallel Decoding

The generation speed of LLMs is bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in…

Computation and Language · Computer Science 2025-11-03 Daniel Israel, Guy Van den Broeck, Aditya Grover

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound, resulting in high latency and significant waste of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding…

Machine Learning · Computer Science 2024-02-06 Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these…

Machine Learning · Computer Science 2025-10-01 Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan

PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a…

Machine Learning · Computer Science 2025-12-02 Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum

Hierarchical Skip Decoding for Efficient Autoregressive Text Generation

Autoregressive decoding is a commonly used method for text generation tasks with pre-trained language models, while early-exiting is an effective approach to speed up the inference stage. In this work, we propose a novel decoding…

Computation and Language · Computer Science 2024-03-25 Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang

APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding

The massive adoption of large language models (LLMs) demands efficient deployment strategies. However, the auto-regressive decoding process, which is fundamental to how most LLMs generate text, poses challenges for achieving efficient serving.…

Computation and Language · Computer Science 2024-01-15 Mingdao Liu, Aohan Zeng, Bowen Wang, Peng Zhang, Jie Tang, Yuxiao Dong

Blockwise Parallel Decoding for Deep Autoregressive Models

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make…

Machine Learning · Computer Science 2018-11-09 Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

HSD: Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding

Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin

Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models

Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Taking advantage of their inherent modeling foundations, DLLMs have the great potential of efficient…

Machine Learning · Computer Science 2026-02-04 Shutong Wu, Jiawei Zhang

Efficient Document Parsing via Parallel Token Prediction

Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing…

Computation and Language · Computer Science 2026-03-17 Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science 2025-06-05 Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Parallel Context-of-Experts Decoding for Retrieval Augmented Generation

Retrieval Augmented Generation faces a trade-off: concatenating documents in a long prompt enables multi-document reasoning but creates prefill bottlenecks, while encoding document KV caches separately offers speed but breaks cross-document…

Artificial Intelligence · Computer Science 2026-01-14 Giulio Corallo, Paolo Papotti

Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has…

Computation and Language · Computer Science 2025-09-25 Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li

Accelerating Transformer Inference for Translation via Parallel Decoding

Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT). The community proposed specific network architectures and learning-based methods to solve this issue, which are expensive and require changes to the…

Computation and Language · Computer Science 2025-02-06 Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà

Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding. Unlike conventional strategies, PPD…

Computation and Language · Computer Science 2024-07-30 Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee

Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We…

Machine Learning · Computer Science 2026-03-16 Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No