Related papers: Isolating Language-Coding from Problem-Solving: Be…

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on…

Artificial Intelligence · Computer Science · 2024-08-26 · Qiming Zhu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung

mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However,…

Computation and Language · Computer Science · 2025-05-19 · Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri

McEval: Massively Multilingual Code Evaluation

Code large language models (LLMs) have shown remarkable advances in code understanding, completion, and generation tasks. Programming benchmarks, comprised of a selection of code challenges and corresponding test cases, serve as a standard…

Programming Languages · Computer Science · 2024-06-12 · Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, Zekun Wang, Boyang Wang, Xianjie Wu, Bing Wang, Tongliang Li, Liqun Yang, Sufeng Duan, Zhoujun Li

CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science · 2026-01-08 · Danny Brahman, Mohammad Mahoor

Unraveling the Potential of Large Language Models in Code Translation: How Far Are We?

While large language models (LLMs) exhibit state-of-the-art performance in various tasks, recent studies have revealed their struggle for code translation. This is because they haven't been extensively pre-trained with parallel multilingual…

Software Engineering · Computer Science · 2024-10-15 · Qingxiao Tao, Tingrui Yu, Xiaodong Gu, Beijun Shen

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'

Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science · 2025-06-26 · Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan

SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation

In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it difficult to evaluate other programming…

Machine Learning · Computer Science · 2025-06-02 · Ivan Petrukha, Yana Kurliak, Nataliia Stulova

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to…

Computation and Language · Computer Science · 2024-03-26 · Qiwei Peng, Yekun Chai, Xuhong Li

MdEval: Massively Multilingual Code Debugging

Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their…

Computation and Language · Computer Science · 2025-02-25 · Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang, Hualei Zhu, Shuyue Guo, Tao Sun, Jiaheng Liu, Yunlong Duan, Yu Hao, Liqun Yang, Guanglin Niu, Ge Zhang, Zhoujun Li

Can Large Language Models Write Parallel Code?

Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-15 · Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele

CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Code benchmarks such as HumanEval are widely adopted to evaluate Large Language Models' (LLMs) coding capabilities. However, there is an unignorable programming language bias in existing code benchmarks -- over 95% code generation…

Artificial Intelligence · Computer Science · 2025-05-20 · Ruiyang Xu, Jialun Cao, Yaojie Lu, Ming Wen, Hongyu Lin, Xianpei Han, Ben He, Shing-Chi Cheung, Le Sun

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They…

Software Engineering · Computer Science · 2025-01-03 · Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark ClassEval of 100…

Computation and Language · Computer Science · 2023-08-15 · Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, Yiling Lou

xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval

Recently, pre-trained large language models (LLMs) have shown impressive abilities in generating codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments.…

Computation and Language · Computer Science · 2023-11-07 · Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty

TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models

Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single…

Software Engineering · Computer Science · 2025-10-27 · Florian Tambon, Amin Nikanjam, Cyrine Zid, Foutse Khomh, Giuliano Antoniol

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising…

Computation and Language · Computer Science · 2023-10-03 · Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, Arman Cohan

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science · 2026-04-27 · Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…

Software Engineering · Computer Science · 2025-02-04 · Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM

LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry…

Software Engineering · Computer Science · 2024-03-29 · Chunqiu Steven Xia, Yinlin Deng, Lingming Zhang

Exploring Multi-Lingual Bias of Large Code Models in Code Generation

Code generation aims to synthesize code and fulfill functional requirements based on natural language (NL) specifications, which can greatly improve development efficiency. In the era of large language models (LLMs), large code models…

Software Engineering · Computer Science · 2024-05-01 · Chaozheng Wang, Zongjie Li, Cuiyun Gao, Wenxuan Wang, Ting Peng, Hailiang Huang, Yuetang Deng, Shuai Wang, Michael R. Lyu