Related papers: Breaking, Stale, or Missing? Benchmarking Coding A…

200 papers

Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, there…

Software Engineering · Computer Science 2025-03-20 Kush Jain , Gabriel Synnaeve , Baptiste Rozière

Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a…

Computation and Language · Computer Science 2026-02-19 Haorui Chen , Chengze Li , Jia Li

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou , Jiacheng Zhang , Haiyang Wang , Rui Hao , Jiahe Wang , Minghao Han , Yuxue Yang , Shuzhe Wu , Feiyang Pan , Lue Fan , Dandan Tu , Zhaoxiang Zhang

Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on…

Software Engineering · Computer Science 2026-05-29 Linxin Song , Jiefeng Chen , Yue Huang , Bhavana Dalvi Mishra , Chi Wang , Jieyu Zhao , Jinsung Yoon , Tomas Pfister

Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench.…

Software Engineering · Computer Science 2025-03-12 Konstantinos Vergopoulos , Mark Niklas Müller , Martin Vechev

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items…

Computation and Language · Computer Science 2026-03-24 Yandan Zheng , Haoran Luo , Zhenghong Lin , Wenjin Liu , Luu Anh Tuan

Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown…

Software Engineering · Computer Science 2025-12-04 Spandan Garg , Roshanak Zilouchian Moghaddam , Neel Sundaresan

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm…

Software Engineering · Computer Science 2026-03-27 Fanheng Kong , Jingyuan Zhang , Yang Yue , Chenxi Sun , Yang Tian , Shi Feng , Xiaocui Yang , Daling Wang , Yu Tian , Jun Du , Wenchong Zeng , Han Li , Kun Gai

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench,…

Software Engineering · Computer Science 2026-05-15 Hung Tran , Langston Nashold , Rayan Krishnan , Antoine Bigeard , Alex Gu

The development of large, software-intensive systems is a complex undertaking that we generally tackle by a divide and conquer strategy. Companies thereby face the challenge of coordinating individual aspects of software development, in…

Software Engineering · Computer Science 2023-08-16 Michael Unterkalmsteiner , Tony Gorschek , Robert Feldt , Eriks Klotins

Identifying vulnerabilities in source code is crucial, especially in critical software components. Existing methods such as static analysis, dynamic analysis, formal verification, and recently Large Language Models are widely used to detect…

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…

Software Engineering · Computer Science 2025-02-04 Wenhan Wang , Chenyuan Yang , Zhijie Wang , Yuheng Huang , Zhaoyang Chu , Da Song , Lingming Zhang , An Ran Chen , Lei Ma

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan , Pareesa Ameneh Golnari , Wen Wen , Xiaoyu Liu , Gabriel Ryan , Yuting Sun , Shengyu Fu , Elsie Nallipogu

Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests.…

Software Engineering · Computer Science 2026-04-03 Chenglin Li , Yisen Xu , Zehao Wang , Shin Hwei Tan , Tse-Hsun Chen

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

Logging, the practice of inserting log statements into source code, is critical for improving software reliability. Recently, language model-based techniques have been developed to automate log statement generation based on input code.…

Software Engineering · Computer Science 2025-04-03 Boyin Tan , Junjielong Xu , Zhouruixing Zhu , Pinjia He

How to evaluate Large Language Models (LLMs) in code generation remains an open question. Existing benchmarks have two limitations: data leakage and lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the…

Computation and Language · Computer Science 2024-10-31 Jia Li , Ge Li , Xuanming Zhang , Yunfei Zhao , Yihong Dong , Zhi Jin , Binhua Li , Fei Huang , Yongbin Li

Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence,…

Autonomous agents are increasingly expected to support scientific research, and recent benchmarks report progress in code repair and autonomous experimentation. However, these evaluations typically assume a pre-configured execution…

Software Engineering · Computer Science 2026-03-12 Yubang Wang , Chenxi Zhang , Bowen Chen , Zezheng Huai , Zihao Dai , Xinchi Chen , Yuxin Wang , Yining Zheng , Jingjing Gong , Xipeng Qiu

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce…

Artificial Intelligence · Computer Science 2025-03-12 Dhruv Gautam , Spandan Garg , Jinu Jang , Neel Sundaresan , Roshanak Zilouchian Moghaddam