Related papers: BigCodeBench: Benchmarking Code Generation with Di…

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and…

Machine Learning · Computer Science 2025-05-09 Manik Sheokand, Parth Sawant

Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate…

Software Engineering · Computer Science 2025-04-03 Nam Huynh, Beiyu Lin

CodeMixBench: Evaluating Code-Mixing Capabilities of LLMs Across 18 Languages

Code-mixing, the practice of switching between languages within a conversation, poses unique challenges for traditional NLP. Existing benchmarks are limited by their narrow language pairs and tasks, failing to adequately assess large…

Computation and Language · Computer Science 2025-09-09 Yilun Yang, Yekun Chai

ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario

Enhancing large language models (LLMs) with real-time APIs can help generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored due to the…

Computation and Language · Computer Science 2025-01-20 Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang

Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference

Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This…

Programming Languages · Computer Science 2025-05-30 Thanh Le-Cong, Bach Le, Toby Murray

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the…

Software Engineering · Computer Science 2024-11-15 Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, Hongxia Yang

PyBench: Evaluating LLM Agent on various real-world coding tasks

The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as…

Software Engineering · Computer Science 2024-08-06 Yaolun Zhang, Yinxu Pan, Yudong Wang, Jie Cai

LIFEBench: Evaluating Length Instruction Following in Large Language Models

While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally,…

Computation and Language · Computer Science 2025-06-12 Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su

CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive…

Software Engineering · Computer Science 2025-04-30 Wenjing Yin, Tianze Sun, Yijiong Yu, Jiawei Fang, Guangyao Su, Jiancheng Wang, Zekun Wang, Wei Wang, Ran Chen, Ziyun Dai, Shuai Yuan, Menghang Dong, Peng Luo, Dong Cao, Da Lei, Yajun Zhang, Hao Chen, Xiang Ma, Yong Liu, Weifeng Liu, Yuanjian Xu, Ji Pei

BaxBench: Can LLMs Generate Correct and Secure Backends?

Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve…

Cryptography and Security · Computer Science 2025-06-02 Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition

Natural language-driven no-code development allows users to specify software functionality using natural language (NL) instead of editing source code, promising increased productivity and democratized development. Large language models…

Software Engineering · Computer Science 2025-08-19 Le Deng, Zhonghao Jiang, Jialun Cao, Michael Pradel, Zhongxin Liu

DSCodeBench: A Realistic Benchmark for Data Science Code Generation

We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. DSCodeBench consists of 1,000 carefully constructed problems sourced from realistic…

Software Engineering · Computer Science 2025-11-18 Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, Jie M. Zhang

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of…

Computation and Language · Computer Science 2025-02-25 Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang

TaskBench: Benchmarking Large Language Models for Task Automation

In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute…

Computation and Language · Computer Science 2024-11-04 Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, Yueting Zhuang

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we…

Computation and Language · Computer Science 2024-07-03 Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark

Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering…

Software Engineering · Computer Science 2025-11-07 Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni, Daniel Maninger, Mira Mezini