Related papers: DynaMo: Accelerating Language Model Inference with…

Reasoning with Latent Tokens in Diffusion Language Models

Discrete diffusion models have recently become competitive with autoregressive models for language modeling, even outperforming them on reasoning tasks requiring planning and global coherence, but they require more computation at inference…

Machine Learning · Computer Science 2026-02-04 Andre He, Sean Welleck, Daniel Fried

Enabling Approximate Joint Sampling in Diffusion LMs

In autoregressive language models, each token is sampled by conditioning on all the past tokens; the overall string has thus been sampled from the correct underlying joint distribution represented by the model. In contrast, masked diffusion…

Computation and Language · Computer Science 2026-02-03 Parikshit Bansal, Sujay Sanghavi

Retrofitting Large Language Models with Dynamic Tokenization

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the…

Computation and Language · Computer Science 2025-06-12 Darius Feher, Ivan Vulić, Benjamin Minixhofer

Multi-Token Prediction via Self-Distillation

Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a…

Computation and Language · Computer Science 2026-04-27 John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, Tom Goldstein

Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Runpeng Yu, Xinyin Ma, Xinchao Wang

Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential

Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and…

Computation and Language · Computer Science 2025-07-17 Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, Mehrdad Farajtabar

Towards Latent Diffusion Suitable For Text

Language diffusion models aim to improve sampling speed and coherence over autoregressive LLMs. We introduce Neural Flow Diffusion Models for language generation, an extension of NFDM that enables the straightforward application of…

Computation and Language · Computer Science 2026-01-26 Nesta Midavaine, Christian A. Naesseth, Grigory Bartosh

FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream…

Computation and Language · Computer Science 2025-10-10 Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has…

Machine Learning · Computer Science 2025-09-24 Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen

Just on Time: Token-Level Early Stopping for Diffusion Language Models

Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level…

Machine Learning · Computer Science 2026-02-12 Zahar Kohut, Severyn Shykula, Dmytro Khamula, Mykola Vysotskyi, Taras Rumezhak, Volodymyr Karpiv

DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the…

Computation and Language · Computer Science 2026-01-15 Giorgio Franceschelli, Mirco Musolesi

Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, Hanwang Zhang

Self-conditioned Embedding Diffusion for Text Generation

Can continuous diffusion models bring the same performance breakthrough on natural language they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens in a continuous space of embeddings, as…

Computation and Language · Computer Science 2022-11-09 Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, Rémi Leblond

From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs

Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly,…

Computation and Language · Computer Science 2025-02-19 Kumari Nishu, Sachin Mehta, Samira Abnar, Mehrdad Farajtabar, Maxwell Horton, Mahyar Najibi, Moin Nabi, Minsik Cho, Devang Naik

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits…

Computation and Language · Computer Science 2026-05-13 Shaobin Zhuang, Yuang Ai, Jiaming Han, Xiaohui Li, Huaibo Huang, Xiangyu Yue, Xuefeng Hu, Kun Xu, Yali Wang, Hao Chen

Faster Speech-LLaMA Inference with Multi-token Prediction

Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-13 Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

Generation with Dynamic Vocabulary

We introduce a new dynamic vocabulary for language models. It can involve arbitrary text spans during generation. These text spans act as basic generation bricks, akin to tokens in the traditional static vocabularies. We show that the…

Computation and Language · Computer Science 2024-10-14 Yanting Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, Xiaoling Wang

Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles

Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for…

Computation and Language · Computer Science 2026-04-01 Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Puyu Zeng, Yuxuan Wang, Biqing Qi, Dongrui Liu, Linfeng Zhang

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored.…

Computer Vision and Pattern Recognition · Computer Science 2026-04-20 Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Linguang Zhang, Amy Zhao, Srinath Sridhar, Lingling Tao, Abhay Mittal

Soft-Masked Diffusion Language Models

Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation…

Machine Learning · Computer Science 2026-03-03 Michael Hersche, Samuel Moor-Smith, Thomas Hofmann, Abbas Rahimi