Related papers: C2T: A Classifier-Based Tree Construction Method i…

200 papers

Autoregressive language models demonstrate excellent performance in various scenarios. However, the inference efficiency is limited by its one-step-one-word generation mode, which has become a pressing problem recently as the models become…

Computation and Language · Computer Science 2025-04-25 Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and…

Computation and Language · Computer Science 2026-02-27 Yinrong Hong, Zhiquan Tan, Kai Hu

Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in…

Computation and Language · Computer Science 2025-11-06 Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time…

Computation and Language · Computer Science 2026-03-03 Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low…

Artificial Intelligence · Computer Science 2023-08-10 Benjamin Spector, Chris Re

Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly…

Computation and Language · Computer Science 2024-07-02 Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict…

Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured…

Computation and Language · Computer Science 2026-01-13 Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this…

Machine Learning · Computer Science 2024-03-06 Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott

Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which…

Computation and Language · Computer Science 2025-05-30 Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu

Massive parameters of LLMs have made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures…

Computation and Language · Computer Science 2025-05-27 Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The…

Computation and Language · Computer Science 2025-02-11 Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra

We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding…

Machine Learning · Computer Science 2024-10-04 Zhaoning Yu, Xiangyang Xu, Hongyang Gao

We propose an acceleration scheme for large language models (LLMs) through Speculative Decoding with Semantic Adaptive Tokens (SDSAT). The primary objective of this design is to enhance the LLM model's ability to generate draft tokens more…

Computation and Language · Computer Science 2024-04-02 Chengbo Liu, Yong Zhu

Competition-level code generation tasks pose significant challenges for current state-of-the-art large language models (LLMs). For example, on the LiveCodeBench-Hard dataset, models such as O1-Mini and O1-Preview achieve pass@1 rates of…

Artificial Intelligence · Computer Science 2024-12-31 Hao Wang, Boyi Liu, Yufeng Zhang, Jie Chen

Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable…

Computation and Language · Computer Science 2025-12-15 Sergey Pankratov, Dan Alistarh

Deployment of autoregressive large language models (LLMs) is costly, and as these models increase in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate the token…

Computation and Language · Computer Science 2024-07-03 Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu, Eric Kim

Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments (a…

Computation and Language · Computer Science 2024-01-15 Sen Yang, Shujian Huang, Xinyu Dai, Jiajun Chen

Recent advancements in generative large language models (LLMs) have significantly boosted the performance in natural language processing tasks. However, their efficiency is hampered by the inherent limitations in autoregressive token…

Machine Learning · Computer Science 2024-02-22 Shuzhang Zhong, Zebin Yang, Meng Li, Ruihao Gong, Runsheng Wang, Ru Huang