Related papers: CodeMonkeys: Scaling Test-Time Compute for Softwar…

Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment…

Software Engineering · Computer Science 2025-04-09 Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, Binhua Li

SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is…

Software Engineering · Computer Science 2026-02-06 Yifeng Ding, Lingming Zhang

SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling

Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel…

Artificial Intelligence · Computer Science 2025-12-04 Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, Jinsung Yoon, Sercan Ö Arık

s1: Simple test-time scaling

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many…

Computation and Language · Computer Science 2025-03-04 Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto

S*: Test Time Scaling for Code Generation

Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially…

Machine Learning · Computer Science 2025-02-21 Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica

SWE-smith: Scaling Data for Software Engineering Agents

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub…

Software Engineering · Computer Science 2025-08-13 John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and…

Computation and Language · Computer Science 2024-11-13 Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

SWE-Tester: Training Open-Source LLMs for Issue Reproduction in Real-World Repositories

Software testing is crucial for ensuring the correctness and reliability of software systems. Automated generation of issue reproduction tests from natural language issue descriptions enhances developer productivity by simplifying root…

Software Engineering · Computer Science 2026-01-21 Aditya Bharat Soni, Rajat Ghosh, Vaishnavi Bhargava, Valerie Chen, Debojyoti Dutta

SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform…

Computation and Language · Computer Science 2025-12-02 Yang Xiao, Chunpu Xu, Ruifeng Yuan, Jiashuo Wang, Wenjie Li, Pengfei Liu

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute…

Machine Learning · Computer Science 2025-01-03 Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies…

Computation and Language · Computer Science 2025-05-06 Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, Irwin King, Xue Liu, Chen Ma

Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query…

Artificial Intelligence · Computer Science 2026-04-24 Bowen Zuo, Yinglun Zhu

VerilogMonkey: Exploring Parallel Scaling for Automated Verilog Code Generation with LLMs

We present VerilogMonkey, an empirical study of parallel scaling for the under-explored task of automated Verilog generation. Parallel scaling improves LLM performance by sampling many outputs in parallel. Across multiple benchmarks and…

Programming Languages · Computer Science 2025-09-23 Juxin Niu, Yuxin Du, Dan Niu, Xi Wang, Zhe Jiang, Nan Guan

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller…

Computation and Language · Computer Science 2025-05-30 Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on…

Software Engineering · Computer Science 2025-12-22 Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, Gabriel Maduekwe

AutoCodeRover: Autonomous Program Improvement

Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use…

Software Engineering · Computer Science 2024-07-26 Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, Abhik Roychoudhury

Exploring Test-time Scaling via Prediction Merging on Large-Scale Recommendation

Inspired by the success of language models (LM), scaling up deep learning recommendation systems (DLRS) has become a recent trend in the community. All previous methods tend to scale up the model parameters during training time. However,…

Information Retrieval · Computer Science 2025-12-09 Fuyuan Lyu, Zhentai Chen, Jingyan Jiang, Lingjie Li, Xing Tang, Xiuqiang He, Xue Liu

Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g. SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive…

Machine Learning · Computer Science 2025-06-03 Kaivalya Hariharan, Uzay Girit, Atticus Wang, Jacob Andreas

Z1: Efficient Test-time Scaling with Code

Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time computing scaling, yet this often entails longer contexts and numerous reasoning token costs. In this paper, we propose an efficient test-time…

Computation and Language · Computer Science 2025-04-02 Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang

OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique

Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both…

Computation and Language · Computer Science 2025-07-15 Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg