English
Related papers

Related papers: Accelerating Diffusion LLMs via Adaptive Parallel …

200 papers

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu , Bowen Lei , Ruqi Zhang , Dongkuan Xu

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science 2025-06-05 Zhepei Wei , Wei-Lin Chen , Xinyu Zhu , Yu Meng

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao , Zhiben Chen , Dan Xu , Yuzhang Shang

While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel…

Computation and Language · Computer Science 2024-07-11 Jie Ou , Yueming Chen , Wenhong Tian

The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By…

Computation and Language · Computer Science 2025-08-15 Keyu Chen , Zhifeng Shen , Daohai Yu , Haoqian Wu , Wei Wen , Jianfeng He , Ruizhi Qiao , Xing Sun

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft…

Computation and Language · Computer Science 2026-05-27 Kuan-Wei Lu , Ding-Yong Hong , Pangfeng Liu , Jan-Jan Wu

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works…

Computer Vision and Pattern Recognition · Computer Science 2026-03-12 Zhuoyang Zhang , Luke J. Huang , Chengyue Wu , Shang Yang , Kelly Peng , Yao Lu , Song Han

Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We…

Machine Learning · Computer Science 2026-03-16 Bumjun Kim , Dongjae Jeon , Moongyu Jeon , Albert No

Diffusion large language models (dLLMs) offer a promising paradigm for parallel text generation, but in practice they face an accuracy-parallelism trade-off, where increasing tokens per forward (TPF) often degrades generation quality.…

Computation and Language · Computer Science 2026-05-12 Haoyang Zhou , Li Kong , Shijie Ren , Xiting Wang , Shuang Liang , Guowei Wang , Zhenxuan Pan

Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks, establishing themselves as an alternative to autoregressive large language models (LLMs). Unlike autoregressive LLMs that…

Computation and Language · Computer Science 2026-03-02 Xiangzhong Luo , Yilin An , Zhicheng Yu , Weichen Liu , Xu Yang

Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi-path decoding can improve robustness by…

Computation and Language · Computer Science 2026-05-21 Tianyu Zheng , Hong Wu , Jiaji Zhong

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind…

Computation and Language · Computer Science 2025-07-04 Chengyue Wu , Hao Zhang , Shuchen Xue , Zhijian Liu , Shizhe Diao , Ligeng Zhu , Ping Luo , Song Han , Enze Xie

In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted…

Machine Learning · Computer Science 2025-04-30 Gabe Guo , Stefano Ermon

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao , Hongming Zhang , Tao Ge , Siru Ouyang , Vicente Ordonez , Dong Yu

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a…

Machine Learning · Computer Science 2025-12-02 Zihao An , Huajun Bai , Ziqiong Liu , Dong Li , Emad Barsoum

The massive adoption of large language models (LLMs) demands efficient deployment strategies. However, the auto-regressive decoding process, which is fundamental to how most LLMs generate text, poses challenges to achieve efficient serving.…

Computation and Language · Computer Science 2024-01-15 Mingdao Liu , Aohan Zeng , Bowen Wang , Peng Zhang , Jie Tang , Yuxiao Dong

Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering…

Computation and Language · Computer Science 2025-12-30 Aiwei Liu , Minghua He , Shaoxun Zeng , Sijun Zhang , Linhao Zhang , Chuhan Wu , Wei Jia , Yuan Liu , Xiao Zhou , Jie Zhou

Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LlaDA and Dream, mitigate this by…

Computation and Language · Computer Science 2025-10-16 Qinglin Zhu , Yizhen Yao , Runcong Zhao , Yanzheng Xiang , Amrutha Saseendran , Chen Jin , Philip Teare , Bin Liang , Yulan He , Lin Gui

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft…

Information Theory · Computer Science 2026-04-24 Yaodan Xu , Sheng Zhou , Zhisheng Niu

Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the…

Machine Learning · Computer Science 2025-08-08 Hengyuan Hu , Aniket Das , Dorsa Sadigh , Nima Anari
‹ Prev 1 2 3 10 Next ›