Related papers: CodeClash: Benchmarking Goal-Oriented Software Eng…

200 papers

Leveraging Large Language Models (LLMs) for code generation has increasingly emerged as a common practice in the domain of software engineering. Relevant benchmarks have been established to evaluate the code generation capabilities of LLMs.…

Software Engineering · Computer Science 2026-03-05 Jue Huang, Tarek Mahmud, Corina Pasareanu, Guowei Yang

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a…

Software Engineering · Computer Science 2025-04-10 Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount…

Software Engineering · Computer Science 2025-06-09 Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, Kai Shen

In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, current benchmarks primarily assess the accuracy of LLM-generated code, while neglecting…

Software Engineering · Computer Science 2024-10-10 Jiasheng Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun

Large Language Models (LLMs) have demonstrated remarkable performance on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities…

Computation and Language · Computer Science 2024-06-10 Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, Shuiguang Deng

Resolving conflicts from merging different software versions is a challenging task. To reduce the overhead of manual merging, researchers develop various program analysis-based tools which only solve specific types of conflicts and have a…

Software Engineering · Computer Science 2024-09-24 Qingyu Zhang, Liangcai Su, Kai Ye, Chenxiong Qian

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test…

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

Code review is a critical practice in modern software engineering, helping developers detect defects early, improve code quality, and facilitate knowledge sharing. With the rapid advancement of large language models (LLMs), a growing body…

Software Engineering · Computer Science 2026-02-17 Taufiqul Islam Khan, Shaowei Wang, Haoxiang Zhang, Tse-Hsun Chen

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Evaluating the performance of Code Language Models (CLMs) for software engineering tasks, especially in multilingual and low-resource programming language settings, poses significant challenges. These challenges are primarily due to the…

Software Engineering · Computer Science 2024-11-26 Rohit Dandamudi, Gema Rodríguez-Pérez

Code large language models (CodeLLMs) and agents are increasingly being integrated into complex software engineering tasks spanning the entire Software Development Life Cycle (SDLC). Benchmarking is critical for rigorously evaluating these…

Software Engineering · Computer Science 2026-03-09 Kaixin Wang, Tianlin Li, Xiaoyu Zhang, Chong Wang, Weisong Sun, Yang Liu, Aishan Liu, Xianglong Liu, Chao Shen, Bin Shi

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack…

Software Engineering · Computer Science 2026-01-01 Ruida Hu, Xinchen Wang, Xin-Cheng Wen, Zhao Zhang, Bo Jiang, Pengfei Gao, Chao Peng, Cuiyun Gao

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations…

Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering…

Software Engineering · Computer Science 2025-11-07 Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni, Daniel Maninger, Mira Mezini

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive…

With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that…

The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks…

Artificial Intelligence · Computer Science 2025-05-13 Kai Xu, YiWei Mao, XinYi Guan, ZiLong Feng

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi