Related papers: Scaling Test-Time Compute for Agentic Coding

200 papers

Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents…

Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy,…

Computation and Language · Computer Science · 2026-02-05 · Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, Binyuan Hui

Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long…

Artificial Intelligence · Computer Science · 2026-02-13 · Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment…

Software Engineering · Computer Science · 2025-04-09 · Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, Binhua Li

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is…

Software Engineering · Computer Science · 2026-02-06 · Yifeng Ding, Lingming Zhang

Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with…

Computation and Language · Computer Science · 2026-02-04 · Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

The newly released OpenAI-o1 and DeepSeek-R1 have demonstrated that test-time scaling can significantly improve model performance, especially in complex tasks such as logical reasoning. Common test-time scaling methods involve generating…

Computation and Language · Computer Science · 2025-10-01 · Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Yancheng Pan, Shaoxun Wang

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration…

Machine Learning · Computer Science · 2025-11-04 · Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang

A common failure mode in long-horizon agentic test-time scaling is error propagation, where factual errors or invalid deductions introduced at intermediate steps persist in the agent's belief state and contaminate later reasoning. Existing…

Artificial Intelligence · Computer Science · 2026-05-22 · Woomin Song, Beomjun Kim, Daewon Choi, Sai Muralidhar Jayanthi, Saket Dingliwal, Jinwoo Shin, Aram Galstyan

Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially…

Machine Learning · Computer Science · 2025-02-21 · Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica

Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories…

Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for…

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for…

Computation and Language · Computer Science · 2026-04-14 · Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen

Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task…

Artificial Intelligence · Computer Science · 2025-12-15 · Dongwon Jung, Peng Shi, Yi Zhang

The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in…

As Large Language Models (LLMs) move from curated training sets into open-ended real-world environments, a fundamental limitation emerges: static training cannot keep pace with continual deployment environment change. Scaling training-time…

Artificial Intelligence · Computer Science · 2026-03-17 · Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, Jian Pei

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many…

As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based…

Computation and Language · Computer Science · 2026-01-26 · Yichuan Ma, Linyang Li, Yongkang chen, Peiji Li, Xiaozhe Li, Qipeng Guo, Dahua Lin, Kai Chen

Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical…

Computation and Language · Computer Science · 2025-12-02 · Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by…

Artificial Intelligence · Computer Science · 2026-05-20 · George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai