Related papers: QHackBench: Benchmarking Large Language Models for…

A PennyLane-Centric Dataset to Enhance LLM-based Quantum Code Generation using RAG

Large Language Models (LLMs) offer powerful capabilities in code generation, natural language understanding, and domain-specific reasoning. Their application to quantum software development remains limited, in part because of the lack of…

Software Engineering · Computer Science 2026-04-20 Abdul Basit, Nouhaila Innan, Muhammad Haider Asif, Minghao Shao, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

QuanBench: Benchmarking Quantum Code Generation with Large Language Models

Large language models (LLMs) have demonstrated good performance in general code generation; however, their capabilities in quantum code generation remain insufficiently studied. This paper presents QuanBench, a benchmark for evaluating LLMs…

Software Engineering · Computer Science 2025-10-21 Xiaoyu Guo, Minggu Wang, Jianjun Zhao

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We…

Machine Learning · Computer Science 2026-04-23 Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem

QuantumBench: A Benchmark for Quantum Problem Solving

Large language models are now integrated into many scientific workflows, accelerating data analysis, hypothesis generation, and design space exploration. In parallel with this growth, there is a growing need to carefully evaluate whether…

Artificial Intelligence · Computer Science 2025-11-04 Shunya Minami, Tatsuya Ishigaki, Ikko Hamamura, Taku Mikuriya, Youmi Ma, Naoaki Okazaki, Hiroya Takamura, Yohichi Suzuki, Tadashi Kadowaki

QCircuitBench: A Large-Scale Dataset for Benchmarking Quantum Algorithm Design

Quantum computing is an emerging field recognized for the significant speedup it offers over classical computing through quantum algorithms. However, designing and implementing quantum algorithms pose challenges due to the complex nature of…

Quantum Physics · Physics 2025-12-17 Rui Yang, Ziruo Wang, Yuntian Gu, Tianyi Chen, Yitao Liang, Tongyang Li

QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it…

Computation and Language · Computer Science 2025-11-04 Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, Hiroya Takamura

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device…

Computation and Language · Computer Science 2026-05-26 Minghao Shao, Nouhaila Innan, Hariharan Janardhanan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

PennyCoder: Efficient Domain-Specific LLMs for PennyLane-Based Quantum Code Generation

The growing demand for robust quantum programming frameworks has unveiled a critical limitation: current large language model (LLM) based quantum code assistants heavily rely on remote APIs, introducing challenges related to privacy,…

Quantum Physics · Physics 2025-12-05 Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

Model-Driven Quantum Code Generation Using Large Language Models and Retrieval-Augmented Generation

This paper introduces a novel research direction for model-to-text/code transformations by leveraging Large Language Models (LLMs) that can be enhanced with Retrieval-Augmented Generation (RAG) pipelines. The focus is on quantum and hybrid…

Software Engineering · Computer Science 2025-12-03 Nazanin Siavash, Armin Moin

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models

Quantum programs are typically developed using quantum Software Development Kits (SDKs). The rapid advancement of quantum computing necessitates new tools to streamline this development process, and one such tool could be Generative…

Quantum Physics · Physics 2024-06-24 Sanjay Vishwakarma, Francis Harkins, Siddharth Golecha, Vishal Sharathchandra Bajpe, Nicolas Dupuis, Luca Buratti, David Kremer, Ismael Faro, Ruchir Puri, Juan Cruz-Benito

QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate to high-quality responses for real-world medical queries. Current evaluations rely heavily on multiple-choice questions, failing to…

Computation and Language · Computer Science 2026-03-17 Yao Wu, Kangping Yin, Liang Dong, Zhenxin Ma, Shuting Xu, Xuehai Wang, Yuxuan Jiang, Tingting Yu, Yunqing Hong, Jiayi Liu, Rianzhe Huang, Shuxin Zhao, Haiping Hu, Wen Shang, Jian Xu, Guanjun Jiang

AInsteinBench: Benchmarking Coding Agents on Scientific Repositories

We introduce AInsteinBench, a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific…

Software Engineering · Computer Science 2025-12-29 Titouan Duston, Shuo Xin, Yang Sun, Daoguang Zan, Aoyan Li, Shulin Xin, Kai Shen, Yixiao Chen, Qiming Sun, Ge Zhang, Jiashuo Liu, Huan Zhou, Jingkai Liu, Zhichen Pu, Yuanheng Wang, Bo-Xuan Ge, Xin Tong, Fei Ye, Zhi-Chao Zhao, Wen-Biao Han, Zhoujian Cao, Yueran Zhao, Weiluo Ren, Qingshen Long, Yuxiao Liu, Anni Huang, Yidi Du, Yuanyuan Rong, Jiahao Peng

CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and…

Machine Learning · Computer Science 2025-05-09 Manik Sheokand, Parth Sawant

BenchBench: Benchmarking Automated Benchmark Generation

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items…

Computation and Language · Computer Science 2026-03-24 Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation

Large Language Models (LLMs) show strong capabilities in code generation, motivating their use in automated quantum solver development. However, in quantum computing, successful execution of generated code is not sufficient: correctness…

Software Engineering · Computer Science 2026-05-12 Luciano Baresi, Domenico Bianculli, Maryse Ernzer, Livia Lestingi, Fabrizio Pastore, Seung Yeob Shin

ProBench: Benchmarking Large Language Models in Competitive Programming

With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for coding evaluation are gradually inadequate to assess the…

Computation and Language · Computer Science 2025-03-03 Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong

SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code

The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code.…

Cryptography and Security · Computer Science 2025-06-23 Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu