Related papers: SecCodeBench-V2 Technical Report

SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Existing benchmarks for evaluating the security risks and capabilities (e.g., vulnerability detection) of code-generating large language models (LLMs) face several key limitations: (1) limited coverage of risk and capabilities; (2) reliance…

Cryptography and Security · Computer Science 2025-09-22 Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, Dawn Song

RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic…

Cryptography and Security · Computer Science 2026-02-02 Yanlin Wang, Ziyao Zhang, Chong Wang, Xinyi Xu, Mingwei Liu, Yong Wang, Jiachi Chen, Zibin Zheng

SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code

The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code.…

Cryptography and Security · Computer Science 2025-06-23 Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu

SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity

Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations across various applications, including natural language processing and code generation. Existing benchmarks like MMLU, C-Eval, and…

Cryptography and Security · Computer Science 2025-01-07 Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo

HardSecBench: Benchmarking the Security Awareness of LLMs for Hardware Code Generation

Large language models (LLMs) are being increasingly integrated into practical hardware and firmware development pipelines for code generation. Existing studies have primarily focused on evaluating the functional correctness of LLM-generated…

Cryptography and Security · Computer Science 2026-01-21 Qirui Chen, Jingxian Shuai, Shuangwu Chen, Shenghao Ye, Zijian Wen, Xufei Su, Jie Jin, Jiangming Li, Jun Chen, Xiaobin Tan, Jian Yang

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based…

Software Engineering · Computer Science 2024-09-27 Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, Zhenyu Chen

CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Their advances in competition-level programming problems have made them an essential pillar of AI-assisted pair…

Cryptography and Security · Computer Science 2023-10-24 Hossein Hajipour, Keno Hassler, Thorsten Holz, Lea Schönherr, Mario Fritz

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

Large language models (LLMs) now support automated software security tasks, including vulnerability discovery and proof-of-concept (PoC) generation. Existing benchmarks do not faithfully evaluate LLMs in real-world bug hunting scenarios…

Cryptography and Security · Computer Science 2026-05-27 Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang, Chunqiu Steven Xia, Lingming Zhang

Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion…

Computation and Language · Computer Science 2025-05-16 Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks demonstrate poor alignment with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper…

Computation and Language · Computer Science 2024-04-02 Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin

SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories

This paper introduces SecRepoBench, a benchmark to evaluate code agents on secure code completion in real-world repositories. SecRepoBench has 318 code completion tasks in 27 C/C++ repositories, covering 15 CWEs. We evaluate 29 standalone…

Cryptography and Security · Computer Science 2026-02-17 Chihao Shen, Connor Dilgren, Purva Chiniya, Luke Griffith, Yu Ding, Yizheng Chen

LiveSecBench: A Dynamic and Event-Driven Safety Benchmark for Chinese Language Model Applications

We introduce LiveSecBench, a continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench constructs a high-quality and unique dataset through a pipeline that combines automated generation…

Computation and Language · Computer Science 2025-12-23 Yudong Li, Peiru Yang, Feng Huang, Zhongliang Yang, Kecheng Wang, Haitian Li, Baocheng Chen, Xingyu An, Ziyu Liu, Youdan Yang, Kejiang Chen, Sifang Wan, Xu Wang, Yufei Sun, Liyan Wu, Ruiqi Zhou, Wenya Wen, Xingchi Gu, Tianxin Zhang, Yue Gao, Yongfeng Huang

MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols

Large Language Models (LLMs) are increasingly integrated into real-world applications via the Model Context Protocol (MCP), a universal open standard for connecting AI agents with data sources and external tools. While MCP enhances the…

Cryptography and Security · Computer Science 2026-02-13 Yixuan Yang, Cuifeng Gao, Daoyuan Wu, Yufan Chen, Yingjiu Li, Shuai Wang

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive…

Software Engineering · Computer Science 2025-09-12 Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

SecCodePRM: A Process Reward Model for Code Security

Large Language Models are rapidly becoming core components of modern software development workflows, yet ensuring code security remains challenging. Existing vulnerability detection pipelines either rely on static analyzers or use…

Cryptography and Security · Computer Science 2026-02-12 Weichen Yu, Ravi Mangal, Yinyi Luo, Kai Hu, Jingxuan He, Corina S. Pasareanu, Matt Fredrikson

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the model's code comprehension and generation capabilities. However, sound comprehensive research on detecting program…

Cryptography and Security · Computer Science 2024-08-22 Yu Liu, Lang Gao, Mingxin Yang, Yu Xie, Ping Chen, Xiaojin Zhang, Wei Chen

$\alpha^3$-SecBench: A Large-Scale Evaluation Suite of Security, Resilience, and Trust for LLM-based UAV Agents over 6G Networks

Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large…

Cryptography and Security · Computer Science 2026-01-27 Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

Large language model-powered code agents are rapidly transforming software engineering, yet the security risks of their generated code have become a critical concern. Existing benchmarks have provided valuable insights, but they fail to…

Software Engineering · Computer Science 2026-04-27 Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, David Lo