Related papers: Accelerate Parallelizable Reasoning via Parallel D…

200 papers

Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token…

Machine Learning · Computer Science 2025-06-25 Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang

With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final…

Computation and Language · Computer Science 2025-10-15 Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, Hua Wu, Haifeng Wang, Enhong Chen

Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step…

Artificial Intelligence · Computer Science 2025-10-08 Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang

It has been shown that a class of probabilistic domain models cannot be learned correctly by several existing algorithms which employ a single-link look ahead search. When a multi-link look ahead search is used, the computational complexity…

Artificial Intelligence · Computer Science 2013-02-08 TongSheng Chu, Yang Xiang

Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this "slow-thinking" paradigm,…

Computation and Language · Computer Science 2025-09-30 Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang

Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs,…

Artificial Intelligence · Computer Science 2025-08-19 Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make…

Machine Learning · Computer Science 2018-11-09 Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and…

Computation and Language · Computer Science 2023-11-23 Giovanni Monea, Armand Joulin, Edouard Grave

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting the next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Zhendong Zhang

Chain-of-Thought (CoT) reasoning enhances the decision-making capabilities of vision-language-action models in autonomous driving, but its autoregressive nature introduces significant inference latency, making it impractical for real-time…

Robotics · Computer Science 2026-02-04 Yi Gu, Yan Wang, Yuxiao Chen, Yurong You, Wenjie Luo, Yue Wang, Wenhao Ding, Boyi Li, Heng Yang, Boris Ivanovic, Marco Pavone

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science 2025-06-05 Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng

Transformer-based models can perform complicated reasoning by generating reasoning paths token by token. While effective, this approach often requires generating thousands of tokens to solve a single problem, which can be slow and…

Machine Learning · Computer Science 2026-05-05 Jiayu Liu, Zhenya Huang, Xuan Yang, Tianyun Ji, Anya Sims, Hao Xu, Enhong Chen, Yee Whye Teh, Ning Miao

Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as…

Computation and Language · Computer Science 2024-06-06 Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton

Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have…

Computation and Language · Computer Science 2024-10-18 Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang

Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, called Tree Attention, for parallelizing exact attention computation across multiple GPUs…

Machine Learning · Computer Science 2025-02-11 Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge

As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a…

Computation and Language · Computer Science 2025-07-02 Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive…

Machine Learning · Computer Science 2025-12-10 Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin

Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations.…
