Related papers: Scaling Test-Time Compute for Agentic Coding

Scaling Test-time Compute for LLM Agents

Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents…

Artificial Intelligence · Computer Science 2025-06-17 King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, Wangchunshu Zhou

Scaling Agentic Verifier for Competitive Coding

Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy,…

Computation and Language · Computer Science 2026-02-05 Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, Binyuan Hui

Agentic Test-Time Scaling for WebAgents

Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long…

Artificial Intelligence · Computer Science 2026-02-13 Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment…

Software Engineering · Computer Science 2025-04-09 Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, Binhua Li

SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is…

Software Engineering · Computer Science 2026-02-06 Yifeng Ding, Lingming Zhang

ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation

Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with…

Computation and Language · Computer Science 2026-02-04 Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

Adaptive Rectification Sampling for Test-Time Compute Scaling

The newly released OpenAI-o1 and DeepSeek-R1 have demonstrated that test-time scaling can significantly improve model performance, especially in complex tasks such as logical reasoning. Common test-time scaling methods involve generating…

Computation and Language · Computer Science 2025-10-01 Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Yancheng Pan, Shaoxun Wang

Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration…

Machine Learning · Computer Science 2025-11-04 Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang

ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

A common failure mode in long-horizon agentic test-time scaling is error propagation, where factual errors or invalid deductions introduced at intermediate steps persist in the agent's belief state and contaminate later reasoning. Existing…

Artificial Intelligence · Computer Science 2026-05-22 Woomin Song, Beomjun Kim, Daewon Choi, Sai Muralidhar Jayanthi, Saket Dingliwal, Jinwoo Shin, Aram Galstyan

S*: Test Time Scaling for Code Generation

Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially…

Machine Learning · Computer Science 2025-02-21 Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories…

Software Engineering · Computer Science 2026-03-18 Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for…

Computation and Language · Computer Science 2026-04-20 Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, Zhicheng Liu, Haojie Pan, Dingwei Zhu, Tao Gui, Qi Zhang, Xuanjing Huang

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for…

Computation and Language · Computer Science 2026-04-14 Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen

FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task…

Artificial Intelligence · Computer Science 2025-12-15 Dongwon Jung, Peng Shi, Yi Zhang

Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in…

Machine Learning · Computer Science 2025-06-11 Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar

Position: Agentic Evolution is the Path to Evolving LLMs

As Large Language Models (LLMs) move from curated training sets into open-ended real-world environments, a fundamental limitation emerges: static training cannot keep pace with continual deployment environment change. Scaling training-time…

Artificial Intelligence · Computer Science 2026-03-17 Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, Jian Pei

s1: Simple test-time scaling

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many…

Computation and Language · Computer Science 2025-03-04 Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto

Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based…

Computation and Language · Computer Science 2026-01-26 Yichuan Ma, Linyang Li, Yongkang chen, Peiji Li, Xiaozhe Li, Qipeng Guo, Dahua Lin, Kai Chen

The Art of Scaling Test-Time Compute for Large Language Models

Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical…

Computation and Language · Computer Science 2025-12-02 Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by…

Artificial Intelligence · Computer Science 2026-05-20 George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai