Related papers: Generalized Parallel Scaling with Interdependent G…

Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an…

Computation and Language · Computer Science 2024-10-04 Rohin Manvi , Anikait Singh , Stefano Ermon

Parallel Loop Transformer for Efficient Test-Time Computation Scaling

Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or "loops." However, this…

Computation and Language · Computer Science 2025-10-30 Bohong Wu , Mengzhao Chen , Xiang Luo , Shen Yan , Qifan Yu , Fan Xia , Tianqi Zhang , Hongrui Zhan , Zheng Zhong , Xun Zhou , Siyuan Qiao , Xingyan Bin

Parallel Test-Time Scaling for Latent Reasoning Models

Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances…

Computation and Language · Computer Science 2026-04-21 Runyang You , Yongqi Li , Meng Liu , Wenjie Wang , Liqiang Nie , Wenjie Li

Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple…

Artificial Intelligence · Computer Science 2025-11-11 Jianhao Chen , Zishuo Xun , Bocheng Zhou , Han Qi , Hangfan Zhang , Qiaosheng Zhang , Yang Chen , Wei Hu , Yuzhong Qu , Wanli Ouyang , Shuyue Hu

RAG-R1: Incentivizing the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Large Language Models (LLMs), despite their remarkable capabilities, are prone to generating hallucinated or outdated content due to their static internal knowledge. While Retrieval-Augmented Generation (RAG) integrated with Reinforcement…

Computation and Language · Computer Science 2026-01-14 Zhiwen Tan , Jiaming Huang , Qintong Wu , Hongxuan Zhang , Chenyi Zhuang , Jinjie Gu

Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs

Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop…

Computation and Language · Computer Science 2025-06-26 Travis Thompson , Seung-Hwan Lim , Paul Liu , Ruoying He , Dongkuan Xu

When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel, and select one of…

Computation and Language · Computer Science 2025-06-26 Ammar Khairi , Daniel D'souza , Ye Shen , Julia Kreutzer , Sara Hooker

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during…

Computation and Language · Computer Science 2024-11-21 Sean Welleck , Amanda Bertsch , Matthew Finlayson , Hailey Schoelkopf , Alex Xie , Graham Neubig , Ilia Kulikov , Zaid Harchaoui

VerilogMonkey: Exploring Parallel Scaling for Automated Verilog Code Generation with LLMs

We present VerilogMonkey, an empirical study of parallel scaling for the under-explored task of automated Verilog generation. Parallel scaling improves LLM performance by sampling many outputs in parallel. Across multiple benchmarks and…

Programming Languages · Computer Science 2025-09-23 Juxin Niu , Yuxin Du , Dan Niu , Xi Wang , Zhe Jiang , Nan Guan

Keep Guessing? When Considering Inference Scaling, Mind the Baselines

Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is…

Computation and Language · Computer Science 2024-10-22 Gal Yona , Or Honovich , Omer Levy , Roee Aharoni

PRISM: Parallel Residual Iterative Sequence Model

Generative sequence modeling faces a fundamental tension between the expressivity of Transformers and the efficiency of linear sequence models. Existing efficient architectures are theoretically bounded by shallow, single-step linear…

Machine Learning · Computer Science 2026-02-13 Jie Jiang , Ke Cheng , Xin Xu , Mengyang Pang , Tianhao Lu , Jiaheng Li , Yue Liu , Yuan Wang , Jun Zhang , Huan Yu , Zhouchen Lin

Parallel Latent Reasoning for Sequential Recommendation

Capturing complex user preferences from sparse behavioral sequences remains a fundamental challenge in sequential recommendation. Recent latent reasoning methods have shown promise by extending test-time computation through multi-step…

Information Retrieval · Computer Science 2026-01-07 Jiakai Tang , Xu Chen , Wen Chen , Jian Wu , Yuning Jiang , Bo Zheng

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g.,…

Machine Learning · Computer Science 2018-06-12 Zhihao Jia , Sina Lin , Charles R. Qi , Alex Aiken

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper , Sehoon Kim , Hiva Mohammadzadeh , Hasan Genc , Kurt Keutzer , Amir Gholami , Sophia Shao

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari , Yuang Jiang , Leandros Tassiulas

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

Test-time scaling (TTS) has gained widespread attention for enhancing LLM reasoning. Existing approaches such as Best-of-N and majority voting are limited as their performance depends on the quality of candidate responses, making them…

Machine Learning · Computer Science 2026-04-28 Qibin Wang , Pu Zhao , Shaohan Huang , Fangkai Yang , Lu Wang , Furu Wei , Qingwei Lin , Saravan Rajmohan , Dongmei Zhang

A Survey on Parallel Reasoning

With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final…

Computation and Language · Computer Science 2025-10-15 Ziqi Wang , Boye Niu , Zipeng Gao , Zhi Zheng , Tong Xu , Linghui Meng , Zhongli Li , Jing Liu , Yilong Chen , Chen Zhu , Hua Wu , Haifeng Wang , Enhong Chen

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated during…

Computation and Language · Computer Science 2026-05-27 Xinglin Wang , Hao Lin , Shaoxiong Feng , Peiwen Yuan , Yiwei Li , Jiayi Shi , Yueqi Zhang , Chuyi Tan , Ji Zhang , Boyuan Pan , Yao Hu , Kan Li

Leveraging the true depth of LLMs

The remarkable capabilities of Large Language Models (LLMs) are overshadowed by their immense computational cost. While recent work has shown that many LLM layers can be reordered or even removed with minimal impact on accuracy, these…

Machine Learning · Computer Science 2026-01-07 Ramón Calvo González , Daniele Paliotta , Matteo Pagliardini , Martin Jaggi , François Fleuret

On the Effect of Sampling Diversity in Scaling LLM Inference

Large language model (LLM) scaling inference is key to unlocking greater performance, and leveraging diversity has proven an effective way to enhance it. Motivated by the observed relationship between solution accuracy and meaningful…

Machine Learning · Computer Science 2025-12-22 Tianchun Wang , Zichuan Liu , Yuanzhou Chen , Jonathan Light , Weiyang Liu , Haifeng Chen , Xiang Zhang , Wei Cheng