English
Related papers

Related papers: Beyond Correctness: Benchmarking Multi-dimensional…

200 papers

Language models (LMs) have exhibited impressive abilities in generating codes from natural language requirements. In this work, we highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation…

Software Engineering · Computer Science 2024-08-28 Heejae Chon , Seonghyeon Lee , Jinyoung Yeo , Dongha Lee

Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation,…

Software Engineering · Computer Science 2025-06-24 Yanlin Wang , Tianyue Jiang , Mingwei Liu , Jiachi Chen , Mingzhi Mao , Xilin Liu , Yuchi Ma , Zibin Zheng

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen , Qi Guo , Hongrui Jia , Zhengran Zeng , Xin Wang , Yijiang Xu , Jian Wu , Yidong Wang , Qing Gao , Jindong Wang , Wei Ye , Shikun Zhang

Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods,…

Software Engineering · Computer Science 2025-11-06 Musfiqur Rahman , SayedHassan Khatoonabadi , Emad Shihab

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul , Hong Zhu , Ian Bayley

Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry based projects are…

Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three…

Computation and Language · Computer Science 2025-07-09 Taolin Zhang , Zihan Ma , Maosong Cao , Junnan Liu , Songyang Zhang , Kai Chen

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing , Wenbin Hu , Haochen Shi , Hanyu Yang , Sirui Zhang , Shaojin Chen , Haoran Li , Yangqiu Song

Despite strong performance on code generation tasks, it remains unclear whether large language models (LLMs) genuinely reason about code execution. Existing code reasoning benchmarks primarily evaluate final output correctness under a…

Software Engineering · Computer Science 2026-04-29 Jun Gao , Yun Peng , Qian Qiao , Changhai Zhou , Yuhua Zhou , Shiyang Zhang , Shichao Weng , Zhenchang Xing , Xiaoxue Ren

Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed for coding, and current benchmarks…

Software Engineering · Computer Science 2026-04-15 Yuangang Li , Justin Tian Jin Chen , Ethan Yu , David Hong , Iftekhar Ahmed

Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model.…

Software Engineering · Computer Science 2026-05-12 Joanna Szych , Anne Schwerk

Large language models (LLMs) have been widely deployed in coding tasks, drawing increasing attention to the evaluation of the quality and safety of LLMs' outputs. However, research on bias in code generation remains limited. Existing…

Computation and Language · Computer Science 2025-04-03 Yongkang Du , Jen-tse Huang , Jieyu Zhao , Lu Lin

Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks. Software engineers often rely on LLMs to verify if code implementation satisfy…

Software Engineering · Computer Science 2026-03-03 Haolin Jin , Huaming Chen

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack…

Software Engineering · Computer Science 2026-01-01 Ruida Hu , Xinchen Wang , Xin-Cheng Wen , Zhao Zhang , Bo Jiang , Pengfei Gao , Chao Peng , Cuiyun Gao

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive…

Large Language Models (LLMs) are widely used in software engineering to generate, complete, translate, and fix code, improving developer productivity. While most research focuses on the energy consumption and carbon emissions of model…

Software Engineering · Computer Science 2026-04-15 Sabiya Banu Masthan Ali , Oussema Kirmani , Aroosa Hameed , Syed Muhammad Danish , Gautam Srivastava

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic…

Cryptography and Security · Computer Science 2026-02-02 Yanlin Wang , Ziyao Zhang , Chong Wang , Xinyi Xu , Mingwei Liu , Yong Wang , Jiachi Chen , Zibin Zheng

The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether…

Computation and Language · Computer Science 2025-07-29 Yanzhu Guo , Guokan Shang , Chloé Clavel

This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models…

Software Engineering · Computer Science 2024-09-10 Yi Cui

Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper…

Software Engineering · Computer Science 2026-01-21 Felix Mächtle , Jan-Niclas Serr , Nils Loose , Thomas Eisenbarth
‹ Prev 1 2 3 10 Next ›