English
Related papers

Related papers: DevEval: Evaluating Code Generation in Practical S…

200 papers

How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of…

Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focused on simplified or isolated aspects of coding, such as single-file code generation…

Computation and Language · Computer Science 2024-12-17 Bowen Li , Wenhan Wu , Ziwei Tang , Lin Shi , John Yang , Jinyang Li , Shunyu Yao , Chen Qian , Binyuan Hui , Qicheng Zhang , Zhiyin Yu , He Du , Ping Yang , Dahua Lin , Chao Peng , Kai Chen

Large Language Models (LLMs) have proved effective and efficient in generating code, leading to their utilization within the hardware design process. Prior works evaluating LLMs' abilities for register transfer level code generation solely…

Computation and Language · Computer Science 2024-04-16 Matthew DeLorenzo , Vasudev Gohil , Jeyavijayan Rajendran

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…

Software Engineering · Computer Science 2025-02-04 Wenhan Wang , Chenyuan Yang , Zhijie Wang , Yuheng Huang , Zhaoyang Chu , Da Song , Lingming Zhang , An Ran Chen , Lei Ma

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan , Pareesa Ameneh Golnari , Wen Wen , Xiaoyu Liu , Gabriel Ryan , Yuting Sun , Shengyu Fu , Elsie Nallipogu

Code large language models (LLMs) have shown remarkable advances in code understanding, completion, and generation tasks. Programming benchmarks, comprised of a selection of code challenges and corresponding test cases, serve as a standard…

Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, there…

Software Engineering · Computer Science 2025-03-20 Kush Jain , Gabriel Synnaeve , Baptiste Rozière

This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode…

Software Engineering · Computer Science 2025-12-23 Le Zhang , Suresh Kothari

Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippet and their…

In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark ClassEval of 100…

Computation and Language · Computer Science 2023-08-15 Xueying Du , Mingwei Liu , Kaixin Wang , Hanlin Wang , Junwei Liu , Yixuan Chen , Jiayi Feng , Chaofeng Sha , Xin Peng , Yiling Lou

The application of large-language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of…

Hardware Architecture · Computer Science 2025-02-05 Nathaniel Pinckney , Christopher Batten , Mingjie Liu , Haoxing Ren , Brucek Khailany

How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks demonstrate poor alignment with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper…

Computation and Language · Computer Science 2024-04-02 Jia Li , Ge Li , Xuanming Zhang , Yihong Dong , Zhi Jin

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure…

Software Engineering · Computer Science 2023-11-01 Jiawei Liu , Chunqiu Steven Xia , Yuyao Wang , Lingming Zhang

Recently, pre-trained large language models (LLMs) have shown impressive abilities in generating codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments.…

Computation and Language · Computer Science 2023-11-07 Mohammad Abdullah Matin Khan , M Saiful Bari , Xuan Long Do , Weishi Wang , Md Rizwan Parvez , Shafiq Joty

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science 2024-09-17 Jia Feng , Jiachen Liu , Cuiyun Gao , Chun Yong Chong , Chaozheng Wang , Shan Gao , Xin Xia

LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry…

Software Engineering · Computer Science 2024-03-29 Chunqiu Steven Xia , Yinlin Deng , Lingming Zhang

Recently, a number of repository-level code generation benchmarks-such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena-have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang , Yiran Hu , Nan Jiang , Lin Tan

How to evaluate Large Language Models (LLMs) in code generation remains an open question. Existing benchmarks have two limitations - data leakage and lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the…

Computation and Language · Computer Science 2024-10-31 Jia Li , Ge Li , Xuanming Zhang , Yunfei Zhao , Yihong Dong , Zhi Jin , Binhua Li , Fei Huang , Yongbin Li

Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs' end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the…

Software Engineering · Computer Science 2025-02-27 Jiarong Wu , Songqiang Chen , Jialun Cao , Hau Ching Lo , Shing-Chi Cheung

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However,…

Computation and Language · Computer Science 2025-05-19 Nishat Raihan , Antonios Anastasopoulos , Marcos Zampieri
‹ Prev 1 2 3 10 Next ›