Related papers: CoCo-Bench: A Comprehensive Code Benchmark For Mul…

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive…

Software Engineering · Computer Science · 2025-09-12 · Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of…

Computation and Language · Computer Science · 2025-02-25 · Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang

KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?

Large language models (LLMs) excel at general programming but struggle with domain-specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing…

Software Engineering · Computer Science · 2026-04-28 · Xue Jiang, Ge Li, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Lingwei Wu, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Yihong Dong

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Large language models (LLMs) have demonstrated significant potential in advancing various fields of research and society. However, the current community of LLMs overly focuses on benchmarks for analyzing specific foundational skills (e.g.…

Computation and Language · Computer Science · 2025-03-03 · Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, Weiran Xu

What can Large Language Models Capture about Code Functional Equivalence?

Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using it to generate or classify code fragments. At the same time, understanding…

Software Engineering · Computer Science · 2025-02-14 · Nickil Maveli, Antonio Vergari, Shay B. Cohen

Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on single evaluation task and paradigm, such as code completion…

Computation and Language · Computer Science · 2025-05-16 · Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science · 2026-05-18 · Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing…

Software Engineering · Computer Science · 2025-04-09 · Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science · 2025-04-02 · Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like…

Software Engineering · Computer Science · 2025-11-19 · Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation

We introduce SIMCOPILOT, a benchmark that simulates the role of large language models (LLMs) as interactive, "copilot"-style coding assistants. Targeting both completion (finishing incomplete methods or code blocks) and infill tasks…

Machine Learning · Computer Science · 2025-05-29 · Mingchao Jiang, Abhinav Jain, Sophia Zorek, Chris Jermaine

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation…

Software Engineering · Computer Science · 2024-06-07 · Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a…

Software Engineering · Computer Science · 2025-04-10 · Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap…

Computation and Language · Computer Science · 2025-08-25 · Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science · 2025-09-23 · Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and…

Machine Learning · Computer Science · 2025-05-09 · Manik Sheokand, Parth Sawant

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science · 2025-08-13 · Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science · 2026-05-19 · Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the…

Software Engineering · Computer Science · 2024-11-15 · Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, Hongxia Yang

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science · 2025-06-19 · Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li