Related papers: Entropy Adaptive Decoding: Dynamic Model Switching…

200 papers

Large language models (LLMs) achieve remarkable generative performance, yet their output quality is dependent on the decoding strategy. While sampling-based methods (e.g., top-k, nucleus) and search-and-select based methods (e.g., beam…

Machine Learning · Computer Science 2026-05-12 Benjamin Patrick Evans, Sumitra Ganesh, Leo Ardon

Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment…

Computation and Language · Computer Science 2026-01-01 Tiancheng Su, Meicong Zhang, Guoxiu He

Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and…

Machine Learning · Computer Science 2026-03-11 Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu

Decoding strategies play a central role in shaping the reasoning ability of large language models (LLMs). Traditional methods such as greedy decoding and beam search often suffer from error propagation, while sampling-based approaches…

Designing a fast and effective entropy model is challenging but essential for practical application of neural codecs. Beyond spatial autoregressive entropy models, more efficient backward adaptation-based entropy models have been recently…

Computer Vision and Pattern Recognition · Computer Science 2024-11-12 Jun-Hyuk Kim, Seungeon Kim, Won-Hee Lee, Dokwan Oh

Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high…

Computation and Language · Computer Science 2026-05-11 Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi

Large language models (LLMs) have achieved remarkable performance across diverse domains, yet their enormous computational and memory requirements hinder deployment in resource-constrained environments. Knowledge distillation offers a…

Computation and Language · Computer Science 2026-05-05 Hao Zhang, Zhibin Zhang, Guangxin Wu, Wanyi Ning, Jiafeng Guo, Xueqi Cheng

Language models cannot be random. This paper introduces Entropic Deviation (ED), the normalised KL divergence between a model's token distribution and the uniform distribution, and measures it systematically across 31,200 generations…

Computation and Language · Computer Science 2026-04-28 Jarosław Hryszko

Most efforts to improve the reasoning capabilities of large language models (LLMs) involve either scaling the number of parameters and the size of training data, or scaling inference computation by letting models generate complex chains of…

Machine Learning · Computer Science 2025-10-10 Yeskendir Koishekenov, Aldo Lipani, Nicola Cancedda

Augmenting Large Language Models (LLMs) with retrieved external knowledge has proven effective for improving the factual accuracy of generated responses. Despite their success, retrieval-augmented LLMs still face the distractibility issue,…

Computation and Language · Computer Science 2025-02-18 Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King

Recently, Large Language Models (LLMs) have demonstrated outstanding performance across a wide range of downstream language tasks. Temperature sampling is a commonly used decoding strategy for LLMs' generation process. However, a fixed…

Computation and Language · Computer Science 2024-04-04 Shimao Zhang, Yu Bao, Shujian Huang

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft…

Computation and Language · Computer Science 2026-05-27 Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by…

Computation and Language · Computer Science 2026-03-20 Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's…

Computation and Language · Computer Science 2026-02-13 Florian Valade

Uncertainty estimation remains a key challenge when adapting pre-trained language models to downstream classification tasks, with overconfidence often observed for difficult inputs. While predictive entropy provides a strong baseline for…

Computation and Language · Computer Science 2026-04-07 Artem Zabolotnyi, Roman Makarov, Mile Mitrovic, Polina Proskura, Oleg Travkin, Roman Alferov, Alexey Zaytsev

Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Jingwei Song, Wanyi Chen, Xinyuan Song, Max, Chris Tong, Gufeng Chen, Tianyi Zhao, Eric Yang, Bill Shi, Lynn Ai

Modern large language model (LLM) inference engines optimize throughput and latency under fixed decoding rules, treating generation as a linear progression in token time. We propose a fundamentally different paradigm: entropic-time…

Computation and Language · Computer Science 2026-03-05 Andrew Kiruluta

Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy…

Computation and Language · Computer Science 2025-12-03 Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Xinyu Fu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li

Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal…

Computation and Language · Computer Science 2025-11-14 Bo Li, Tian Tian, Zhenghua Xu, Hao Cheng, Shikun Zhang, Wei Ye