Related papers: Learning Harmonized Representations for Speculativ…

200 papers

Large Language Models (LLMs) have revolutionized natural language processing by understanding and generating human-like text. However, the increasing demand for more sophisticated LLMs presents significant computational challenges due to…

Computation and Language · Computer Science · 2025-01-14 · Ze Yang, Yihong Jin, Xinhe Xu

Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making…

Machine Learning · Computer Science · 2026-05-13 · Yuning Han, Yangchenchen Jin, Dylan Zhao, Jingwei Sun

Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases…

Information Retrieval · Computer Science · 2026-04-23 · Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu

Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft…

Machine Learning · Computer Science · 2025-10-24 · Clara Mohri, Haim Kaplan, Tal Schuster, Yishay Mansour, Amir Globerson

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate…

Computation and Language · Computer Science · 2025-05-14 · Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji

Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for…

Computation and Language · Computer Science · 2025-02-11 · Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon

Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-31 · Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin

Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the…

Computation and Language · Computer Science · 2025-09-15 · Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang

Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting…

Computation and Language · Computer Science · 2025-11-04 · Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang

Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-07-29 · Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science · 2024-06-05 · Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative…

Computation and Language · Computer Science · 2025-05-30 · Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang

Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low…

Artificial Intelligence · Computer Science · 2023-08-10 · Benjamin Spector, Chris Re

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a…

Computation and Language · Computer Science · 2026-05-27 · Avinash Kumar, Sujay Sanghavi, Poulami Das

Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to…

Artificial Intelligence · Computer Science · 2026-03-03 · Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng

Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in…

Computation and Language · Computer Science · 2025-12-18 · Chendong Sun, Ali Mao, Lei Xu, Mingmin Chen

Speculative decoding has been shown as an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-21 · Fahao Chen, Peng Li, Tom H. Luan, Zhou Su, Jing Deng

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science · 2025-06-12 · Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science · 2025-02-06 · Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman