SecCodeBench-V2 Technical Report

Longfei Chen; Ji Zhao; Lanxiao Cui; Tong Su; Xingbo Pan; Ziyang Li; Yongxing Wu; Qijiang Cao; Qiyao Cai; Jing Zhang; Yuandong Ni; Junyao He; Zeyu Zhang; Chao Ge; Xuhuai Lu; Zeyu Gao; Yuxin Cui; Weisen Chen; Yuxuan Peng; Shengping Wang; Qi Li; Yukai Huang; Yukun Liu; Tuo Zhou; Terry Yue Zhuo; Junyang Lin; Chao Zhang

Subjects: Cryptography and Security; Artificial Intelligence; Software Engineering. Date: 2026-02-19 (v2)

Abstract

We introduce SecCodeBench-V2, a publicly released benchmark for evaluating the ability of Large Language Model (LLM) copilots to generate secure code. SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial production, with underlying security issues spanning 22 common CWE (Common Weakness Enumeration) categories across five programming languages: Java, C, Python, Go, and JavaScript. SecCodeBench-V2 adopts a function-level task formulation: each scenario provides a complete project scaffold and requires the model to implement or patch a designated target function under fixed interfaces and dependencies. For each scenario, SecCodeBench-V2 provides executable proof-of-concept (PoC) test cases for both functional validation and security verification. All test cases are authored and double-reviewed by security experts to ensure high fidelity, broad coverage, and reliable ground truth. Beyond the benchmark itself, we build a unified evaluation pipeline that assesses models primarily via dynamic execution. For most scenarios, we compile and run model-generated artifacts in isolated environments and execute the PoC test cases to validate both functional correctness and security properties. For scenarios where security issues cannot be adjudicated with deterministic test cases, we additionally employ an LLM-as-a-judge oracle. To summarize performance across heterogeneous scenarios and difficulty levels, we design a Pass@K-based scoring protocol that aggregates over scenarios and severity levels, enabling holistic and comparable evaluation across models. Overall, SecCodeBench-V2 provides a rigorous and reproducible foundation for assessing the security posture of AI coding assistants, with results and artifacts released at https://alibaba.github.io/sec-code-bench. The benchmark is publicly available at https://github.com/alibaba/sec-code-bench.
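
The abstract does not spell out the scoring protocol, so the following is only a minimal sketch of what a Pass@K-based, severity-weighted aggregate could look like. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a scenario and c is the number that pass both the functional and the security tests. The severity labels, the weights in `SEVERITY_WEIGHTS`, and the function names are hypothetical illustrations, not values or APIs taken from the report.

```python
from math import comb

# Hypothetical severity weights, chosen only for illustration;
# the report's actual weighting scheme is not given in the abstract.
SEVERITY_WEIGHTS = {"medium": 1.0, "high": 2.0, "critical": 4.0}


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for one scenario
    c: samples that pass both functional and security tests
    k: sampling budget being scored
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def weighted_score(results, k: int) -> float:
    """Severity-weighted mean of per-scenario pass@k.

    results: iterable of (severity, n_samples, n_passing) tuples.
    """
    total = 0.0
    weight_sum = 0.0
    for severity, n, c in results:
        w = SEVERITY_WEIGHTS[severity]
        total += w * pass_at_k(n, c, k)
        weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("no scenarios to aggregate")
    return total / weight_sum
```

For example, `weighted_score([("medium", 10, 10), ("critical", 10, 0)], k=1)` weights the failed critical scenario four times as heavily as the solved medium one under these assumed weights, so the aggregate is pulled well below the unweighted mean.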

Keywords

code generation, vulnerability detection, benchmark evaluation

Cite

@article{arxiv.2602.15485,
  title  = {SecCodeBench-V2 Technical Report},
  author = {Longfei Chen and Ji Zhao and Lanxiao Cui and Tong Su and Xingbo Pan and Ziyang Li and Yongxing Wu and Qijiang Cao and Qiyao Cai and Jing Zhang and Yuandong Ni and Junyao He and Zeyu Zhang and Chao Ge and Xuhuai Lu and Zeyu Gao and Yuxin Cui and Weisen Chen and Yuxuan Peng and Shengping Wang and Qi Li and Yukai Huang and Yukun Liu and Tuo Zhou and Terry Yue Zhuo and Junyang Lin and Chao Zhang},
  journal= {arXiv preprint arXiv:2602.15485},
  year   = {2026}
}
