Related papers: QuantumBench: A Benchmark for Quantum Problem Solv…

QCircuitBench: A Large-Scale Dataset for Benchmarking Quantum Algorithm Design

Quantum computing is an emerging field recognized for the significant speedup it offers over classical computing through quantum algorithms. However, designing and implementing quantum algorithms pose challenges due to the complex nature of…

Quantum Physics · Physics 2025-12-17 Rui Yang, Ziruo Wang, Yuntian Gu, Tianyi Chen, Yitao Liang, Tongyang Li

QuanBench: Benchmarking Quantum Code Generation with Large Language Models

Large language models (LLMs) have demonstrated good performance in general code generation; however, their capabilities in quantum code generation remain insufficiently studied. This paper presents QuanBench, a benchmark for evaluating LLMs…

Software Engineering · Computer Science 2025-10-21 Xiaoyu Guo, Minggu Wang, Jianjun Zhao

QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate to high-quality responses for real-world medical queries. Current evaluations rely heavily on multiple-choice questions, failing to…

Computation and Language · Computer Science 2026-03-17 Yao Wu, Kangping Yin, Liang Dong, Zhenxin Ma, Shuting Xu, Xuehai Wang, Yuxuan Jiang, Tingting Yu, Yunqing Hong, Jiayi Liu, Rianzhe Huang, Shuxin Zhao, Haiping Hu, Wen Shang, Jian Xu, Guanjun Jiang

Large Language Models in the Clinic: A Comprehensive Benchmark

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical…

Computation and Language · Computer Science 2024-10-17 Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag, Bing Yin, Yining Hua, Xuan Zhou, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

QMBench: A Research Level Benchmark for Quantum Materials Research

We introduce QMBench, a comprehensive benchmark designed to evaluate the capability of large language model agents in quantum materials research. This specialized benchmark assesses the model's ability to apply condensed matter physics…

Materials Science · Physics 2025-12-24 Yanzhen Wang, Yiyang Jiang, Diana Golovanova, Kamal Das, Hyeonhu Bae, Yufei Zhao, Huu-Thong Le, Abhinava Chatterjee, Yunzhe Liu, Chao-Xing Liu, Felipe H. da Jornada, Binghai Yan, Xiao-Liang Qi

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform…

Machine Learning · Computer Science 2025-06-03 Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh

NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain

As LLMs have become increasingly popular, they have been used in almost every field. But as the application for LLMs expands from generic fields to narrow, focused science domains, there exists an ever-increasing gap in ways to evaluate…

Computation and Language · Computer Science 2023-10-18 Anurag Acharya, Sai Munikoti, Aaron Hellinger, Sara Smith, Sridevi Wagle, Sameera Horawalavithana

QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry

Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this blank, we propose QCBench, a Quantitative…

Artificial Intelligence · Computer Science 2025-11-05 Jiaqing Xie, Weida Wang, Ben Gao, Zhuo Yang, Haiyuan Wan, Shufei Zhang, Tianfan Fu, Yuqiang Li

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Large language models (LLMs) have demonstrated significant potential in advancing various fields of research and society. However, the current community of LLMs overly focuses on benchmarks for analyzing specific foundational skills (e.g.…

Computation and Language · Computer Science 2025-03-03 Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, Weiran Xu

QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges

Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane-based quantum code…

Artificial Intelligence · Computer Science 2025-09-01 Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant…

Computation and Language · Computer Science 2025-03-19 Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, Zhoujun Li

INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems

We present INTEGRALBENCH, a focused benchmark designed to evaluate Large Language Model (LLM) performance on definite integral problems. INTEGRALBENCH provides both symbolic and numerical ground truth solutions with manual difficulty…

Artificial Intelligence · Computer Science 2025-07-30 Bintao Tang, Xin Yang, Yuhao Wang, Zixuan Qiu, Zimo Ji, Wenyuan Jiang

CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions…

Computation and Language · Computer Science 2026-05-15 Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hsing-Chi Hwang, Ruishan Liu

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, falling short in providing a holistic assessment…

Computation and Language · Computer Science 2024-05-21 Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, Kai Chen

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we…

Computation and Language · Computer Science 2024-07-03 Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark

DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields, such as dentistry which require…

Computation and Language · Computer Science 2025-08-29 Hengchuan Zhu, Yihuan Xu, Yichen Li, Zijie Meng, Zuozhu Liu

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going…

Computation and Language · Computer Science 2024-07-16 Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

With the proliferation of Large Language Models (LLMs) in diverse domains, there is a particular need for unified evaluation standards in clinical medical scenarios, where models need to be examined very thoroughly. We present CliMedBench,…

Computation and Language · Computer Science 2024-10-07 Zetian Ouyang, Yishuai Qiu, Linlin Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail…

Computation and Language · Computer Science 2026-02-16 Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

The Quantum LLM: Modeling Semantic Spaces with Quantum Principles

In the previous article, we presented a quantum-inspired framework for modeling semantic representation and processing in Large Language Models (LLMs), drawing upon mathematical tools and conceptual analogies from quantum mechanics to offer…

Artificial Intelligence · Computer Science 2025-05-26 Timo Aukusti Laine