Related papers: Entropy Adaptive Decoding: Dynamic Model Switching…

Entropy-informed Decoding: Adaptive Information-Driven Branching

Large language models (LLMs) achieve remarkable generative performance, yet their output quality is dependent on the decoding strategy. While sampling-based methods (e.g., top-k, nucleus) and search-and-select based methods (e.g., beam…

Machine Learning · Computer Science 2026-05-12 Benjamin Patrick Evans, Sumitra Ganesh, Leo Ardon

Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning

Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment…

Computation and Language · Computer Science 2026-01-01 Tiancheng Su, Meicong Zhang, Guoxiu He

Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and…

Machine Learning · Computer Science 2026-03-11 Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu

Think Twice Before You Write -- an Entropy-based Decoding Strategy to Enhance LLM Reasoning

Decoding strategies play a central role in shaping the reasoning ability of large language models (LLMs). Traditional methods such as greedy decoding and beam search often suffer from error propagation, while sampling-based approaches…

Computation and Language · Computer Science 2026-04-02 Jiashu He, Meizhu Liu, Olaitan P Olaleye, Amit Agarwal, M. Avendi, Yassi Abbasi, Matthew Rowe, Hitesh Laxmichand Patel, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth

Diversify, Contextualize, and Adapt: Efficient Entropy Modeling for Neural Image Codec

Designing a fast and effective entropy model is challenging but essential for practical application of neural codecs. Beyond spatial autoregressive entropy models, more efficient backward adaptation-based entropy models have been recently…

Computer Vision and Pattern Recognition · Computer Science 2024-11-12 Jun-Hyuk Kim, Seungeon Kim, Won-Hee Lee, Dokwan Oh

DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high…

Computation and Language · Computer Science 2026-05-11 Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi

EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer

Large language models (LLMs) have achieved remarkable performance across diverse domains, yet their enormous computational and memory requirements hinder deployment in resource-constrained environments. Knowledge distillation offers a…

Computation and Language · Computer Science 2026-05-05 Hao Zhang, Zhibin Zhang, Guangxin Wu, Wanyi Ning, Jiafeng Guo, Xueqi Cheng

The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions

Language models cannot be random. This paper introduces Entropic Deviation (ED), the normalised KL divergence between a model's token distribution and the uniform distribution, and measures it systematically across 31,200 generations…

Computation and Language · Computer Science 2026-04-28 Jarosław Hryszko

Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts

Most efforts to improve the reasoning capabilities of large language models (LLMs) involve either scaling the number of parameters and the size of training data, or scaling inference computation by letting models generate complex chains of…

Machine Learning · Computer Science 2025-10-10 Yeskendir Koishekenov, Aldo Lipani, Nicola Cancedda

Entropy-Based Decoding for Retrieval-Augmented Large Language Models

Augmenting Large Language Models (LLMs) with retrieved external knowledge has proven effective for improving the factual accuracy of generated responses. Despite their success, retrieval-augmented LLMs still face the distractibility issue,…

Computation and Language · Computer Science 2025-02-18 Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King

EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling

Recently, Large Language Models (LLMs) have demonstrated outstanding performance across a wide range of downstream language tasks. Temperature sampling is a commonly used decoding strategy for LLMs' generation process. However, a fixed…

Computation and Language · Computer Science 2024-04-04 Shimao Zhang, Yu Bao, Shujian Huang

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft…

Computation and Language · Computer Science 2026-05-27 Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu

EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by…

Computation and Language · Computer Science 2026-03-20 Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo

Accelerating Large Language Model Inference with Self-Supervised Early Exits

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's…

Computation and Language · Computer Science 2026-02-13 Florian Valade

ALIEN: Aligned Entropy Head for Improving Uncertainty Estimation of LLMs

Uncertainty estimation remains a key challenge when adapting pre-trained language models to downstream classification tasks, with overconfidence often observed for difficult inputs. While predictive entropy provides a strong baseline for…

Computation and Language · Computer Science 2026-04-07 Artem Zabolotnyi, Roman Makarov, Mile Mitrovic, Polina Proskura, Oleg Travkin, Roman Alferov, Alexey Zaytsev

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Jingwei Song, Wanyi Chen, Xinyuan Song, Max, Chris Tong, Gufeng Chen, Tianyi Zhao, Eric Yang, Bill Shi, Lynn Ai

Entropic-Time Inference: Self-Organizing Large Language Model Decoding Beyond Attention

Modern large language model (LLM) inference engines optimize throughput and latency under fixed decoding rules, treating generation as a linear progression in token time. We propose a fundamentally different paradigm: entropic-time…

Computation and Language · Computer Science 2026-03-05 Andrew Kiruluta

Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models

Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy…

Computation and Language · Computer Science 2025-12-03 Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Xinyu Fu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li

Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG

Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal…

Computation and Language · Computer Science 2025-11-14 Bo Li, Tian Tian, Zhenghua Xu, Hao Cheng, Shikun Zhang, Wei Ye