Related papers: Accelerate Parallelizable Reasoning via Parallel D…

Scaling Speculative Decoding with Lookahead Reasoning

Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token…

Machine Learning · Computer Science 2025-06-25 Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang

A Survey on Parallel Reasoning

With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final…

Computation and Language · Computer Science 2025-10-15 Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, Hua Wu, Haifeng Wang, Enhong Chen

MixReasoning: Switching Modes to Think

Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step…

Artificial Intelligence · Computer Science 2025-10-08 Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang

Exploring Parallelism in Learning Belief Networks

It has been shown that a class of probabilistic domain models cannot be learned correctly by several existing algorithms which employ a single-link look ahead search. When a multi-link look ahead search is used, the computational complexity…

Artificial Intelligence · Computer Science 2013-02-08 TongSheng Chu, Yang Xiang

Efficient Reasoning Models: A Survey

Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this "slow-thinking" paradigm,…

Computation and Language · Computer Science 2025-09-30 Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang

Learning Adaptive Parallel Reasoning with Language Models

Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs,…

Artificial Intelligence · Computer Science 2025-08-19 Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr

Blockwise Parallel Decoding for Deep Autoregressive Models

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make…

Machine Learning · Computer Science 2018-11-09 Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

PaSS: Parallel Speculative Sampling

Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and…

Computation and Language · Computer Science 2023-11-23 Giovanni Monea, Armand Joulin, Edouard Grave

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention

Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Zhendong Zhang

Accelerating Structured Chain-of-Thought in Autonomous Vehicles

Chain-of-Thought (CoT) reasoning enhances the decision-making capabilities of vision-language-action models in autonomous driving, but its autoregressive nature introduces significant inference latency, making it impractical for real-time…

Robotics · Computer Science 2026-02-04 Yi Gu, Yan Wang, Yuxiao Chen, Yurong You, Wenjie Luo, Yue Wang, Wenhao Ding, Boyi Li, Heng Yang, Boris Ivanovic, Marco Pavone

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science 2025-06-05 Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng

Deep Thinking by Markov Chain of Continuous Thoughts

Transformer-based models can perform complicated reasoning by generating reasoning paths token by token. While effective, this approach often requires generating thousands of tokens to solve a single problem, which can be slow and…

Machine Learning · Computer Science 2026-05-05 Jiayu Liu, Zhenya Huang, Xuan Yang, Tianyun Ji, Anya Sims, Hao Xu, Enhong Chen, Yee Whye Teh, Ning Miao

Exploring and Improving Drafts in Blockwise Parallel Decoding

Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as…

Computation and Language · Computer Science 2024-06-06 Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton

Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have…

Computation and Language · Computer Science 2024-10-18 Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, called Tree Attention, for parallelizing exact attention computation across multiple GPUs…

Machine Learning · Computer Science 2025-02-11 Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge

Pipelined Decoder for Efficient Context-Aware Text Generation

As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a…

Computation and Language · Computer Science 2025-07-02 Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng

ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive…

Machine Learning · Computer Science 2025-12-10 Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations.…

Artificial Intelligence · Computer Science 2026-05-28 Gabriele Cesa, Thomas Hehn, Aleix Torres-Camps, Àlex Batlle Casellas, Jordi Ros-Giralt, Arash Behboodi, Tribhuvanesh Orekondy