Related papers: PRISM: Parametrically Refactoring Inference for Sp…

200 papers

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science · 2024-11-28 · Hyun Ryu, Eric Kim

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science · 2024-06-05 · Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing…

Computation and Language · Computer Science · 2025-11-26 · Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao

Large language models (LLMs) solve complex problems by generating multi-step reasoning traces. Yet these traces are typically analyzed from only one of two perspectives: the sequence of tokens across different reasoning steps in the…

Computation and Language · Computer Science · 2026-03-25 · Ruidi Chang, Jiawei Zhou, Hanjie Chen

Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in…

Computation and Language · Computer Science · 2024-11-12 · Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate…

Computation and Language · Computer Science · 2025-05-14 · Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji

Inference acceleration of large language models (LLMs) has been put forward in many application scenarios and speculative decoding has shown its advantage in addressing inference acceleration. Speculative decoding usually introduces a draft…

Machine Learning · Computer Science · 2024-12-03 · Zhuofan Wen, Shangtong Gui, Yang Feng

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The…

Computation and Language · Computer Science · 2025-02-11 · Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra

Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events, by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making…

Machine Learning · Computer Science · 2026-01-15 · Yang Nan, Qihao Wen, Jiahao Wang, Pengfei He, Ravi Tandon, Yong Ge, Han Xu

The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference…

Computation and Language · Computer Science · 2025-06-24 · Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh

Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although…

Machine Learning · Computer Science · 2026-05-12 · Anton Plaksin, Sergei Krutikov, Sergei Skvortsov, Alexander Samarin

Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel…

Computation and Language · Computer Science · 2025-10-24 · Yunhai Hu, Tianhua Xia, Zining Liu, Rahul Raman, Xingyu Liu, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding…

Artificial Intelligence · Computer Science · 2026-02-06 · Hanyu Wei, Zunhai Su, Peng Lu, Chao Li, Spandan Tiwari, Ashish Sirasao, Yuhan Dong

Generative sequence modeling faces a fundamental tension between the expressivity of Transformers and the efficiency of linear sequence models. Existing efficient architectures are theoretically bounded by shallow, single-step linear…

Machine Learning · Computer Science · 2026-02-13 · Jie Jiang, Ke Cheng, Xin Xu, Mengyang Pang, Tianhao Lu, Jiaheng Li, Yue Liu, Yuan Wang, Jun Zhang, Huan Yu, Zhouchen Lin

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable…

Artificial Intelligence · Computer Science · 2026-03-04 · Rituraj Sharma, Weiyuan Chen, Noah Provenzano, Tu Vu

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time…

Computation and Language · Computer Science · 2026-03-03 · Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science · 2025-02-06 · Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman

Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches for obtaining draft tokens rely on lightweight draft models or additional model…

Computation and Language · Computer Science · 2025-03-06 · Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, Hao Wang

With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and…

Artificial Intelligence · Computer Science · 2026-04-07 · Chun Chet Ng, Jia Yu Lim, Wei Zeng Low