Related papers: AutoBaxBuilder: Bootstrapping Code Security Benchm…

BaxBench: Can LLMs Generate Correct and Secure Backends?

Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve…

Cryptography and Security · Computer Science 2025-06-02 Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

Large Language Models in Code Co-generation for Safe Autonomous Vehicles

Software engineers in various industrial domains are already using Large Language Models (LLMs) to accelerate the process of implementing parts of software systems. When considering its potential use for ADAS or AD systems in the automotive…

Software Engineering · Computer Science 2025-05-27 Ali Nouri, Beatriz Cabrero-Daniel, Zhennan Fei, Krishna Ronanki, Håkan Sivencrona, Christian Berger

SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code

The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code.…

Cryptography and Security · Computer Science 2025-06-23 Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu

HardSecBench: Benchmarking the Security Awareness of LLMs for Hardware Code Generation

Large language models (LLMs) are being increasingly integrated into practical hardware and firmware development pipelines for code generation. Existing studies have primarily focused on evaluating the functional correctness of LLM-generated…

Cryptography and Security · Computer Science 2026-01-21 Qirui Chen, Jingxian Shuai, Shuangwu Chen, Shenghao Ye, Zijian Wen, Xufei Su, Jie Jin, Jiangming Li, Jun Chen, Xiaobin Tan, Jian Yang

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned…

Machine Learning · Computer Science 2024-10-16 Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient

The rapid advancement of large language models (LLMs) has led to a surge in both model supply and application demands. To facilitate effective matching between them, reliable, generic and efficient benchmark generators are widely needed.…

Computation and Language · Computer Science 2025-02-05 Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Their advances in competition-level programming problems have made them an essential pillar of AI-assisted pair…

Cryptography and Security · Computer Science 2023-10-24 Hossein Hajipour, Keno Hassler, Thorsten Holz, Lea Schönherr, Mario Fritz

Write Your Own CodeChecker: An Automated Test-Driven Checker Development Approach with LLMs

With the rising demand for code quality assurance, developers are not only utilizing existing static code checkers but also seeking custom checkers to satisfy their specific needs. Nowadays, various code-checking frameworks provide…

Software Engineering · Computer Science 2025-07-18 Jun Liu, Yuanyuan Xie, Jiwei Yan, Jinhao Huang, Jun Yan, Jian Zhang

SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking

The rapid proliferation of large language models (LLMs) has intensified the requirement for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks have been proposed. However, existing…

Computation and Language · Computer Science 2025-08-22 Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai

Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis

Artificial Intelligence (AI)-driven code generation tools are increasingly used throughout the software development lifecycle to accelerate coding tasks. However, the security of AI-generated code using Large Language Models (LLMs) remains…

Cryptography and Security · Computer Science 2026-03-10 Mohammed Kharma, Soohyeon Choi, Mohammed AlKhanafseh, David Mohaisen

DUALGUAGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation

Large language models (LLMs) and autonomous coding agents are increasingly used to generate software across a wide range of domains. Yet a core requirement remains unmet: ensuring that generated code is secure without compromising its…

Software Engineering · Computer Science 2025-11-27 Abhijeet Pathak, Suvadra Barua, Dinesh Gudimetla, Rupam Patir, Jiawei Guo, Hongxin Hu, Haipeng Cai

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science 2025-04-02 Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

The Hidden Risks of LLM-Generated Web Application Code: A Security-Centric Evaluation of Code Generation Capabilities in Large Language Models

The rapid advancement of Large Language Models (LLMs) has enhanced software development processes, minimizing the time and effort required for coding and enhancing developer productivity. However, despite their potential benefits, code…

Cryptography and Security · Computer Science 2025-04-30 Swaroop Dora, Deven Lunkad, Naziya Aslam, S. Venkatesan, Sandeep Kumar Shukla

Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs

Large vision-language models (LVLMs) exhibit remarkable capabilities in cross-modal tasks but face significant safety challenges, which undermine their reliability in real-world applications. Efforts have been made to build LVLM safety…

Computation and Language · Computer Science 2026-01-28 Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Qi Jia, Chunyi Li, Renrui Zhang, Heng Li, Zongrui Wang, Wei Sun

Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts

The proliferation of Large Language Models (LLMs) has revolutionized natural language processing and significantly impacted code generation tasks, enhancing software development efficiency and productivity. Notably, LLMs like GPT-4 have…

Software Engineering · Computer Science 2025-03-25 Sheng Ouyang, Yihao Qin, Bo Lin, Liqian Chen, Xiaoguang Mao, Shangwen Wang

AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing

Recent advancements in automatic code generation using large language models (LLMs) have brought us closer to fully automated secure software development. However, existing approaches often rely on a single agent for code generation, which…

Software Engineering · Computer Science 2024-11-06 Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, Peyman Najafirad

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi

OSS-Bench: Benchmark Generator for Coding LLMs

In light of the rapid adoption of AI coding assistants, LLM-assisted development has become increasingly prevalent, creating an urgent need for robust evaluation of generated code quality. Existing benchmarks often require extensive manual…

Software Engineering · Computer Science 2025-05-21 Yuancheng Jiang, Roland Yap, Zhenkai Liang

Generating Unseen Code Tests In Infinitum

Large Language Models (LLMs) are used for many tasks, including those related to coding. An important aspect of being able to utilize LLMs is the ability to assess their fitness for specific usages. The common practice is to evaluate LLMs…

Artificial Intelligence · Computer Science 2024-07-30 Marcel Zalmanovici, Orna Raz, Eitan Farchi, Iftach Freund