Related papers: CodeArena: A Collective Evaluation Platform for LL…

ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents

Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to…

Artificial Intelligence · Computer Science 2025-09-09 Hao Kang, Chenyan Xiong

Evaluating and Aligning CodeLLMs on Human Preference

Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with the corresponding test cases, are used as a common…

Computation and Language · Computer Science 2024-12-09 Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui, Junyang Lin

CodeUpdateArena: Benchmarking Knowledge Editing on API Updates

Large language models (LLMs) are increasingly being used to synthesize and reason about source code. However, the static nature of these models' knowledge does not reflect the fact that libraries and API functions they invoke are…

Computation and Language · Computer Science 2025-04-04 Zeyu Leo Liu, Shrey Pandit, Xi Ye, Eunsol Choi, Greg Durrett

RankArena: A Unified Platform for Evaluating Retrieval, Reranking and RAG with Human and LLM Feedback

Evaluating the quality of retrieval-augmented generation (RAG) and document reranking systems remains challenging due to the lack of scalable, user-centric, and multi-perspective evaluation tools. We introduce RankArena, a unified platform…

Information Retrieval · Computer Science 2025-08-08 Abdelrahman Abdallah, Mahmoud Abdalla, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is…

Software Engineering · Computer Science 2025-12-19 Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra

A Survey on Large Language Models for Code Generation

Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This…

Computation and Language · Computer Science 2025-10-28 Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim

Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate…

Software Engineering · Computer Science 2025-04-03 Nam Huynh, Beiyu Lin

Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback

Large Language Models (LLMs) have shown remarkable progress in automated code generation. Yet, LLM-generated code may contain errors in API usage, class, data structure, or missing project-specific information. As much of this…

Computation and Language · Computer Science 2024-06-12 Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Hai Jin, Xuanhua Shi

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of…

Computation and Language · Computer Science 2025-02-25 Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang

CodeGrad: Integrating Multi-Step Verification with Gradient-Based LLM Refinement

While Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, they often produce solutions that lack guarantees of correctness, robustness, and efficiency. This limitation is particularly acute in domains…

Software Engineering · Computer Science 2025-09-04 Yueke Zhang, Yifan Zhang, Kevin Leach, Yu Huang

CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models

State-of-the-art large language models (LLMs) have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks, such as revising source code to address code reviews, hindering their practical…

Software Engineering · Computer Science 2025-06-03 Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, Christoph Treude

Copilot Arena: A Platform for Code LLM Evaluation in the Wild

Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration…

Software Engineering · Computer Science 2025-02-14 Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models

The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices nowadays face fundamental trade-offs: closed-ended…

Computation and Language · Computer Science 2025-05-20 Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu

SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving

We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of…

Computation and Language · Computer Science 2026-03-10 Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Z. Morley Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation…

Software Engineering · Computer Science 2024-06-07 Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in…

Artificial Intelligence · Computer Science 2025-02-07 Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul, Hong Zhu, Ian Bayley

Generating Unseen Code Tests In Infinitum

Large Language Models (LLMs) are used for many tasks, including those related to coding. An important aspect of being able to utilize LLMs is the ability to assess their fitness for specific usages. The common practice is to evaluate LLMs…

Artificial Intelligence · Computer Science 2024-07-30 Marcel Zalmanovici, Orna Raz, Eitan Farchi, Iftach Freund

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi