Related papers: IndustryCode: A Benchmark for Industry Code Genera…

MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark

With the rapid advancement of Multimodal Large Language Models (MLLMs), numerous evaluation benchmarks have emerged. However, comprehensive assessments of their performance across diverse industrial applications remain limited. In this…

Computation and Language · Computer Science 2025-01-29 Dongyi Yi, Guibo Zhu, Chenglin Ding, Zongshu Li, Dong Yi, Jinqiao Wang

OmniCode: A Benchmark for Evaluating Software Engineering Agents

LLM-powered coding agents are redefining how real-world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various…

Software Engineering · Computer Science 2026-05-19 Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, Saikat Dutta

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a…

Software Engineering · Computer Science 2025-04-10 Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

Large Language Models (LLMs) have demonstrated remarkable performance on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities…

Computation and Language · Computer Science 2024-06-10 Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, Shuiguang Deng

Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis

Recent advances in code generation have illuminated the potential of employing large language models (LLMs) for general-purpose programming languages such as Python and C++, opening new opportunities for automating software development and…

Machine Learning · Computer Science 2025-03-06 Jiahao Gai, Hao Mark Chen, Zhican Wang, Hongyu Zhou, Wanru Zhao, Nicholas Lane, Hongxiang Fan

FairCoder: Evaluating Social Bias of LLMs in Code Generation

Large language models (LLMs) have been widely deployed in coding tasks, drawing increasing attention to the evaluation of the quality and safety of LLMs' outputs. However, research on bias in code generation remains limited. Existing…

Computation and Language · Computer Science 2025-04-03 Yongkang Du, Jen-tse Huang, Jieyu Zhao, Lu Lin

AutoCode: LLMs as Problem Setters for Competitive Programming

Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and…

Software Engineering · Computer Science 2025-10-16 Shang Zhou, Zihan Zheng, Kaiyuan Liu, Zeyu Shen, Zerui Cheng, Zexing Chen, Hansen He, Jianzhu Yao, Huanzhi Mao, Qiuyang Mang, Tianfu Fu, Beichen Li, Dongruixuan Li, Wenhao Chai, Zhuang Liu, Aleksandra Korolova, Peter Henderson, Natasha Jaques, Pramod Viswanath, Saining Xie, Jingbo Shang

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure,…

Artificial Intelligence · Computer Science 2026-05-13 Haozhe Zhang, Kaichen Liu, Miaomiao Chen, Lei Li, Shaojie Yang, Cheng Peng, Hanjie Chen

Holistic Evaluation of State-of-the-Art LLMs for Code Generation

This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode…

Software Engineering · Computer Science 2025-12-23 Le Zhang, Suresh Kothari

DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes…

Computation and Language · Computer Science 2025-05-30 Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, Kaidi Xu

Evaluating LLM-Generated Code: A Benchmark and Developer Study

Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model.…

Software Engineering · Computer Science 2026-05-12 Joanna Szych, Anne Schwerk

SciCode: A Research Coding Benchmark Curated by Scientists

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities…

Artificial Intelligence · Computer Science 2024-07-19 Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng

Large Language Models for Code Generation: The Practitioners Perspective

Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry based projects are…

Software Engineering · Computer Science 2025-01-29 Zeeshan Rasheed, Muhammad Waseem, Kai Kristian Kemell, Aakash Ahmad, Malik Abdul Sami, Jussi Rasku, Kari Systä, Pekka Abrahamsson

A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era

Code review is a critical practice in modern software engineering, helping developers detect defects early, improve code quality, and facilitate knowledge sharing. With the rapid advancement of large language models (LLMs), a growing body…

Software Engineering · Computer Science 2026-02-17 Taufiqul Islam Khan, Shaowei Wang, Haoxiang Zhang, Tse-Hsun Chen

RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian

Recently, large language models (LLMs) have become increasingly powerful and have become capable of solving a plethora of tasks through proper instructions in natural language. However, the vast majority of testing suites assume that the…

Computation and Language · Computer Science 2024-02-21 Adrian Cosma, Bogdan Iordache, Paolo Rosso

A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption…

Software Engineering · Computer Science 2025-05-22 Jonathan Katzy, Yongcheng Huang, Gopal-Raj Panchu, Maksym Ziemlewski, Paris Loizides, Sander Vermeulen, Arie van Deursen, Maliheh Izadi

SIMCODE: A Benchmark for Natural Language to ns-3 Network Simulation Code Generation

Large language models (LLMs) have demonstrated remarkable capabilities in code generation across various domains. However, their effectiveness in generating simulation scripts for domain-specific environments like ns-3 remains…

Networking and Internet Architecture · Computer Science 2025-07-16 Tasnim Ahmed, Mirza Mohammad Azwad, Salimur Choudhury

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation

As large language models (LLMs) become increasingly embedded in software engineering workflows, a critical capability remains underexplored: generating correct code that enables cross-programming-language (CPL) interoperability. This skill…

Software Engineering · Computer Science 2025-07-29 Zhanhang Xiong, Dongxia Wang, Yuekang Li, Xinyuan An, Wenhai Wang