Related papers: Lazy-k: Decoding for Constrained Token Classificat…

Foundations of Top-$k$ Decoding For Language Models

Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token-probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling…

Artificial Intelligence · Computer Science 2026-02-24 Georgy Noarov, Soham Mallick, Tao Wang, Sunay Joshi, Yan Sun, Yangxinyu Xie, Mengxin Yu, Edgar Dobriban

Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue decoding should be understood as a principled optimisation layer: at each token, we solve a…

Machine Learning · Computer Science 2026-02-26 Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou-Ammar

Compressed code: the hidden effects of quantization and distillation on programming tokens

Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. Through systematic analysis of programming language token…

Software Engineering · Computer Science 2026-02-10 Viacheslav Siniaev, Iaroslav Chelombitko, Aleksey Komissarov

Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective

Constrained decoding enables Language Models (LMs) to produce samples that provably satisfy hard constraints. However, existing constrained-decoding approaches often distort the underlying model distribution, a limitation that is especially…

Artificial Intelligence · Computer Science 2025-06-09 Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, Loris D'Antoni

Adaptive Methods for Linear Programming Decoding

Detectability of failures of linear programming (LP) decoding and the potential for improvement by adding new constraints motivate the use of an adaptive approach in selecting the constraints for the underlying LP problem. In this paper, we…

Information Theory · Computer Science 2007-07-13 Mohammad H. Taghavi, Paul H. Siegel

L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression

Learning-based probabilistic models can be combined with an entropy coder for data compression. However, due to the high complexity of learning-based models, their practical application as text compressors has been largely overlooked. To…

Computation and Language · Computer Science 2024-12-25 Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song

Decoding-Free Sampling Strategies for LLM Marginalization

Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the…

Computation and Language · Computer Science 2025-10-24 David Pohl, Marco Cognetta, Junyoung Lee, Naoaki Okazaki

Adaptive Linear Programming Decoding

Detectability of failures of linear programming (LP) decoding and its potential for improvement by adding new constraints motivate the use of an adaptive approach in selecting the constraints for the LP problem. In this paper, we make a…

Information Theory · Computer Science 2007-07-13 Mohammad H. Taghavi N., Paul H. Siegel

Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher

How can small-scale large language models (LLMs) efficiently utilize the supervision of LLMs to improve their generative quality? This question has been well studied in scenarios where there is no restriction on the number of LLM…

Computation and Language · Computer Science 2024-10-04 Hyunjong Ok, Jegwang Ryu, Jaeho Lee

Beyond the EM Algorithm: Constrained Optimization Methods for Latent Class Model

Latent class model (LCM), which is a finite mixture of different categorical distributions, is one of the most widely used models in statistics and machine learning fields. Because of its non-continuous nature and the flexibility in shape,…

Machine Learning · Statistics 2021-03-23 Hao Chen, Lanshan Han, Alvin Lim

Online Speculative Decoding

Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive…

Artificial Intelligence · Computer Science 2024-06-11 Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu, Eric Kim

Joint Optimization of Tokenization and Downstream Model

Since traditional tokenizers are isolated from a downstream task and model, they cannot output an appropriate tokenization depending on the task and model, although recent studies imply that the appropriate tokenization improves the…

Computation and Language · Computer Science 2021-05-27 Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

Decoding Uncertainty: The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models

Decoding strategies manipulate the probability distribution underlying the output of a language model and can therefore affect both generation quality and its uncertainty. In this study, we investigate the impact of decoding strategies on…

Computation and Language · Computer Science 2025-09-23 Wataru Hashimoto, Hidetaka Kamigaito, Taro Watanabe

Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral

Several previous works concluded that the largest part of generation capabilities of large language models (LLM) are learned (early) during pre-training. However, LLMs still require further alignment to adhere to downstream task…

Computation and Language · Computer Science 2026-01-27 Ayoub Hammal, Pierre Zweigenbaum, Caio Corro

On Learning Prediction-Focused Mixtures

Probabilistic models help us encode latent structures that both model the data and are ideally also useful for specific downstream tasks. Among these, mixture models and their time-series counterparts, hidden Markov models, identify…

Machine Learning · Computer Science 2021-10-29 Abhishek Sharma, Catherine Zeng, Sanjana Narayanan, Sonali Parbhoo, Finale Doshi-Velez

A Thorough Examination of Decoding Methods in the Era of LLMs

Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers. Prior research on decoding methods, primarily focusing on task-specific models, may not extend to the current…

Computation and Language · Computer Science 2024-10-10 Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam

Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition

We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. Motivated by recent work that predicts the probabilities of subsequent tokens using multiple heads, we…

Machine Learning · Computer Science 2025-02-11 Artem Basharin, Andrei Chertkov, Ivan Oseledets

Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics

The quality of text generated by large language models depends critically on the decoding sampling strategy. While mainstream methods such as Top-$k$, Top-$p$, and Min-$p$ achieve a balance between diversity and accuracy through…

Artificial Intelligence · Computer Science 2026-04-14 Yuanhao Ding, Meimingwei Li, Esteban Garces Arias, Matthias Aßenmacher, Christian Heumann, Chongsheng Zhang

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui