Related papers: Efficiently Aligning Draft Models via Parameter- a…

200 papers

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time…

Computation and Language · Computer Science 2026-03-03 Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment…

Computation and Language · Computer Science 2026-01-01 Tiancheng Su, Meicong Zhang, Guoxiu He

We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference that dynamically switches between different-sized models based on prediction uncertainty. By monitoring rolling entropy in model logit…

Machine Learning · Computer Science 2025-02-12 Toby Simonds

Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel…

Computation and Language · Computer Science 2025-10-24 Yunhai Hu, Tianhua Xia, Zining Liu, Rahul Raman, Xingyu Liu, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

Speculative decoding is a powerful technique that attempts to circumvent the autoregressive constraint of modern Large Language Models (LLMs). The aim of speculative decoding techniques is to improve the average inference time of a large,…

Computation and Language · Computer Science 2024-10-25 Sudhanshu Agrawal, Wonseok Jeon, Mingu Lee

Speculative Decoding (SD) is a popular lossless technique for accelerating the inference of Large Language Models (LLMs). We show that the decoding speed of SD frameworks with static draft structures can be significantly improved by…

Artificial Intelligence · Computer Science 2024-12-30 Situo Zhang, Hankun Wang, Da Ma, Zichen Zhu, Lu Chen, Kunyao Lan, Kai Yu

While the enormous parameter scale endows Large Models (LMs) with unparalleled performance, it also limits their adaptability across specific tasks. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a critical approach for effectively…

Machine Learning · Computer Science 2025-12-22 Dong Chen, Zhengqing Hu, Shixing Zhao, Yibo Guo

Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting…

Computation and Language · Computer Science 2025-11-04 Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang

Parameter-efficient fine-tuning methods, such as LoRA, reduce the number of trainable parameters. However, they often suffer from scalability issues and differences between their learning pattern and full fine-tuning. To overcome these…

Machine Learning · Computer Science 2025-01-22 Hamid Nasiri, Peter Garraghan

Large Language Models (LLMs) have demonstrated remarkable capabilities in code editing, substantially enhancing software development productivity. However, the inherent complexity of code editing tasks forces existing approaches to rely on…

Software Engineering · Computer Science 2025-10-01 Peiding Wang, Li Zhang, Fang Liu, Yinghao Zhu, Wang Xu, Lin Shi, Xiaoli Lian, Minxiao Li, Bo Shen, An Fu

Speculative Decoding has gained popularity as an effective technique for accelerating the auto-regressive inference process of Large Language Models. However, Speculative Decoding entirely relies on the availability of efficient draft…

Computation and Language · Computer Science 2025-06-06 Ofir Zafrir, Igor Margulis, Dorin Shteyman, Shira Guskin, Guy Boudoukh

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu

Text generation with Large Language Models (LLMs) is known to be memory bound due to the combination of their auto-regressive nature, huge parameter counts, and limited memory bandwidths, often resulting in low token rates. Speculative…

Machine Learning · Computer Science 2024-05-15 Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott

Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped to reach the fully supervised performance. In…

Machine Learning · Computer Science 2022-03-10 Binhui Xie, Longhui Yuan, Shuang Li, Chi Harold Liu, Xinjing Cheng, Guoren Wang

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a…

Machine Learning · Computer Science 2025-12-02 Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum

Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend…

Machine Learning · Computer Science 2026-01-29 Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim, Seunghyuk Oh, Minghao Yan, Hyung Il Koo, Kangwook Lee

Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned for a specific downstream task. The most common fine-tuning method is to update pretrained weights via low-rank adaptation (LoRA). Existing initialization…

Machine Learning · Computer Science 2025-10-21 Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter

Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this…

Computation and Language · Computer Science 2026-05-29 Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang

Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific…

Software Engineering · Computer Science 2025-10-22 Xue Jiang, Yihong Dong, Zhiyuan Fan, Zhi Jin, Wenpin Jiao, Ge Li

Dense retrieval systems increasingly need to handle complex queries. In many realistic settings, users express intent through long instructions or task-specific descriptions, while target documents remain relatively simple and static. This…

Information Retrieval · Computer Science 2026-04-07 Seiji Maekawa, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka