Related papers: CodeMonkeys: Scaling Test-Time Compute for Softwar…

200 papers

Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment…

Software Engineering · Computer Science 2025-04-09 Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, Binhua Li

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is…

Software Engineering · Computer Science 2026-02-06 Yifeng Ding, Lingming Zhang

Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel…

Artificial Intelligence · Computer Science 2025-12-04 Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, Jinsung Yoon, Sercan Ö Arık

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many…

Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially…

Machine Learning · Computer Science 2025-02-21 Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub…

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and…

Computation and Language · Computer Science 2024-11-13 Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

Software testing is crucial for ensuring the correctness and reliability of software systems. Automated generation of issue reproduction tests from natural language issue descriptions enhances developer productivity by simplifying root…

Software Engineering · Computer Science 2026-01-21 Aditya Bharat Soni, Rajat Ghosh, Vaishnavi Bhargava, Valerie Chen, Debojyoti Dutta

Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform…

Computation and Language · Computer Science 2025-12-02 Yang Xiao, Chunpu Xu, Ruifeng Yuan, Jiashuo Wang, Wenjie Li, Pengfei Liu

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute…

Machine Learning · Computer Science 2025-01-03 Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini

As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies…

Computation and Language · Computer Science 2025-05-06 Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, Irwin King, Xue Liu, Chen Ma

Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query…

Artificial Intelligence · Computer Science 2026-04-24 Bowen Zuo, Yinglun Zhu

We present VerilogMonkey, an empirical study of parallel scaling for the under-explored task of automated Verilog generation. Parallel scaling improves LLM performance by sampling many outputs in parallel. Across multiple benchmarks and…

Programming Languages · Computer Science 2025-09-23 Juxin Niu, Yuxin Du, Dan Niu, Xi Wang, Zhe Jiang, Nan Guan

Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller…

Computation and Language · Computer Science 2025-05-30 Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on…

Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use…

Software Engineering · Computer Science 2024-07-26 Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, Abhik Roychoudhury

Inspired by the success of language models (LM), scaling up deep learning recommendation systems (DLRS) has become a recent trend in the community. All previous methods tend to scale up the model parameters during training time. However,…

Information Retrieval · Computer Science 2025-12-09 Fuyuan Lyu, Zhentai Chen, Jingyan Jiang, Lingjie Li, Xing Tang, Xiuqiang He, Xue Liu

Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g. SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive…

Machine Learning · Computer Science 2025-06-03 Kaivalya Hariharan, Uzay Girit, Atticus Wang, Jacob Andreas

Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time computing scaling, yet this often entails longer contexts and numerous reasoning token costs. In this paper, we propose an efficient test-time…

Computation and Language · Computer Science 2025-04-02 Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang

Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both…
