Related papers: OSS-Bench: Benchmark Generator for Coding LLMs

BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to…

Software Engineering · Computer Science 2025-10-01 Zehua Zhang, Ati Priya Bajaj, Divij Handa, Siyu Liu, Arvind S Raj, Hongkai Chen, Hulin Wang, Yibo Liu, Zion Leonahenahe Basque, Souradip Nath, Vishal Juneja, Nikhil Chapre, Yan Shoshitaishvili, Adam Doupé, Chitta Baral, Ruoyu Wang

RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic…

Cryptography and Security · Computer Science 2026-02-02 Yanlin Wang, Ziyao Zhang, Chong Wang, Xinyi Xu, Mingwei Liu, Yong Wang, Jiachi Chen, Zibin Zheng

OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification

We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) on the task of generating complete formal specifications for verifying the functional correctness of operating system kernels. This benchmark is built upon a…

Computation and Language · Computer Science 2025-12-09 Shangyu Li, Juyong Jiang, Tiancheng Zhao, Jiasi Shen

SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code

The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code.…

Cryptography and Security · Computer Science 2025-06-23 Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic…

Machine Learning · Computer Science 2025-10-23 Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference optimization tasks. These tasks were taken from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an…

Machine Learning · Computer Science 2026-02-24 Ayush Nangia, Shikhar Mishra, Aman Gokrani, Paras Chopra

BenchBench: Benchmarking Automated Benchmark Generation

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items…

Computation and Language · Computer Science 2026-03-24 Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents

Rigorous software testing is crucial for developing and maintaining high-quality code, making automated test generation a promising avenue for both improving software quality and boosting the effectiveness of code generation methods.…

Software Engineering · Computer Science 2025-02-10 Niels Mündler, Mark Niklas Müller, Jingxuan He, Martin Vechev

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks…

Software Engineering · Computer Science 2026-01-19 Jie Yang, Honglin Guo, Li Ji, Jiazheng Zhou, Rui Zheng, Zhikai Lei, Shuo Zhang, Zhiheng Xi, Shichun Liu, Yuxin Wang, Bo Wang, Yining Zheng, Tao Gui, Xipeng Qiu

Explaining Code Risk in OSS: Towards LLM-Generated Fault Prediction Interpretations

Open Source Software (OSS) has become a very important and crucial infrastructure worldwide because of the value it provides. OSS typically depends on contributions from developers across diverse backgrounds and levels of experience. Making…

Software Engineering · Computer Science 2025-10-08 Elijah Kayode Adejumo, Brittany Johnson

HardSecBench: Benchmarking the Security Awareness of LLMs for Hardware Code Generation

Large language models (LLMs) are being increasingly integrated into practical hardware and firmware development pipelines for code generation. Existing studies have primarily focused on evaluating the functional correctness of LLM-generated…

Cryptography and Security · Computer Science 2026-01-21 Qirui Chen, Jingxian Shuai, Shuangwu Chen, Shenghao Ye, Zijian Wen, Xufei Su, Jie Jin, Jiangming Li, Jun Chen, Xiaobin Tan, Jian Yang

Narrowing the Complexity Gap in the Evaluation of Large Language Models

Evaluating Large Language Models (LLMs) with respect to real-world code complexity is essential. Otherwise, there is a risk of overestimating LLMs' programming abilities based on simplistic benchmarks, only to be disappointed when using…

Software Engineering · Computer Science 2026-02-24 Yang Chen, Shuyang Liu, Reyhaneh Jabbarvand

RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs

The automatic generation of Verilog code using Large Language Models (LLMs) has garnered significant interest in hardware design automation. However, existing benchmarks for evaluating LLMs in Verilog generation fall short in replicating…

Machine Learning · Computer Science 2025-07-23 Pengwei Jin, Di Huang, Chongxiao Li, Shuyao Cheng, Yang Zhao, Xinyao Zheng, Jiaguo Zhu, Shuyi Xing, Bohan Dou, Rui Zhang, Zidong Du, Qi Guo, Xing Hu

DUALGUAGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation

Large language models (LLMs) and autonomous coding agents are increasingly used to generate software across a wide range of domains. Yet a core requirement remains unmet: ensuring that generated code is secure without compromising its…

Software Engineering · Computer Science 2025-11-27 Abhijeet Pathak, Suvadra Barua, Dinesh Gudimetla, Rupam Patir, Jiawei Guo, Hongxin Hu, Haipeng Cai

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi

CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Their advances in competition-level programming problems have made them an essential pillar of AI-assisted pair…

Cryptography and Security · Computer Science 2023-10-24 Hossein Hajipour, Keno Hassler, Thorsten Holz, Lea Schönherr, Mario Fritz

LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient

The rapid advancement of large language models (LLMs) has led to a surge in both model supply and application demands. To facilitate effective matching between them, reliable, generic and efficient benchmark generators are widely needed.…

Computation and Language · Computer Science 2025-02-05 Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li