Related papers: Generation Meets Verification: Accelerating Large …

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science · 2024-01-04 · Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the others. By analyzing dLLM denoising…

Computation and Language · Computer Science · 2026-05-27 · Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

Machine Learning · Computer Science · 2025-02-03 · Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, Jonas Kohler

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science · 2024-04-19 · Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Efficient Inference for Large Language Model-based Generative Recommendation

Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding…

Information Retrieval · Computer Science · 2025-02-27 · Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding…

Machine Learning · Computer Science · 2024-07-18 · Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi

Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding

Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in…

Machine Learning · Computer Science · 2025-05-27 · Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have…

Computation and Language · Computer Science · 2024-10-18 · Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy

As Large Language Models (LLMs) have made significant advancements across various tasks, such as question answering, translation, text summarization, and dialogue systems, the need for accuracy in information becomes crucial, especially for…

Information Retrieval · Computer Science · 2024-05-31 · Yao Zhao, Zhitian Xie, Chen Liang, Chenyi Zhuang, Jinjie Gu

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The…

Computation and Language · Computer Science · 2025-02-11 · Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science · 2024-06-05 · Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during…

Computation and Language · Computer Science · 2024-11-21 · Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui

Inference with Reference: Lossless Acceleration of Large Language Models

We propose LLMA, an LLM accelerator to losslessly speed up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the…

Computation and Language · Computer Science · 2023-04-11 · Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei

PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition

In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels…

Computation and Language · Computer Science · 2024-11-22 · Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang

Faster Speech-LLaMA Inference with Multi-token Prediction

Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-09-13 · Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of…

Computation and Language · Computer Science · 2026-02-05 · Ximing Dong, Shaowei Wang, Dayi Lin, Boyuan Chen, Ahmed E. Hassan

Efficient Document Parsing via Parallel Token Prediction

Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing…

Computation and Language · Computer Science · 2026-03-17 · Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

Parallel Prefix Verification for Speculative Generation

We introduce PARSE (PArallel pRefix Speculative Engine), a speculative generation framework that accelerates large language model (LLM) inference by parallelizing prefix verification on a semantic level. Existing speculative decoding…

Artificial Intelligence · Computer Science · 2026-05-07 · Yuncheng Yao, Yuxuan Xia, Shengjie Wang, Danyang Zhuo

A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models

As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time…

Computation and Language · Computer Science · 2026-02-11 · Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science · 2025-10-06 · Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang