Related papers: SemBench: A Benchmark for Semantic Query Processin…

SemBench: A Universal Semantic Framework for LLM Evaluation

Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true…

Computation and Language · Computer Science 2026-03-27 Mikel Zubillaga, Naiara Perez, Oscar Sainz, German Rigau

SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing

Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in…

Artificial Intelligence · Computer Science 2025-04-01 Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, Mani Srivastava

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs

We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation…

Artificial Intelligence · Computer Science 2025-09-23 Mohammad Ramezanali, Mo Vazifeh, Paolo Santi

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

Multimodal large language models (MLLMs) have shown great potential in perception and interpretation tasks, but their capabilities in predictive reasoning remain under-explored. To address this gap, we introduce a novel benchmark that…

Computer Vision and Pattern Recognition · Computer Science 2023-10-23 Mingwei Zhu, Leigang Sha, Yu Shu, Kangjia Zhao, Tiancheng Zhao, Jianwei Yin

SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity

Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations across various applications, including natural language processing and code generation. Existing benchmarks like MMLU, C-Eval, and…

Cryptography and Security · Computer Science 2025-01-07 Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo

Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models

We introduce LexBench, a comprehensive evaluation suite enabled to test language models (LMs) on ten semantic phrase processing tasks. Unlike prior studies, it is the first work to propose a framework from the comparative perspective to…

Computation and Language · Computer Science 2024-05-07 Yang Liu, Melissa Xiaohui Qin, Hongming Li, Chao Huang

ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert…

Computation and Language · Computer Science 2025-10-17 Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal…

Artificial Intelligence · Computer Science 2024-10-02 Xuwu Wang, Qiwen Cui, Yunzhe Tao, Yiran Wang, Ziwei Chai, Xiaotian Han, Boyi Liu, Jianbo Yuan, Jing Su, Guoyin Wang, Tingkai Liu, Liyu Chen, Tianyi Liu, Tao Sun, Yufeng Zhang, Sirui Zheng, Quanzeng You, Yang Yang, Hongxia Yang

LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises…

Computation and Language · Computer Science 2026-02-17 Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

BenchBench: Benchmarking Automated Benchmark Generation

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items…

Computation and Language · Computer Science 2026-03-24 Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

Enterprise Large Language Model Evaluation Benchmark

Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task…

Artificial Intelligence · Computer Science 2025-06-26 Liya Wang, David Yi, Damien Jose, John Passarelli, James Gao, Jordan Leventis, Kang Li

SimulBench: Evaluating Language Models with Creative Simulation Tasks

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation…

Computation and Language · Computer Science 2024-09-13 Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going…

Computation and Language · Computer Science 2024-07-16 Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

SQLBench: A Comprehensive Evaluation for Text-to-SQL Capabilities of Large Language Models

Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on the optimal prompt…

Computation and Language · Computer Science 2026-03-20 Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Chi Harold Liu, Zhiwei Xu, Guoliang Fan, Rui Zhao, Ziyue Li, Hangyu Mao

BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on.…

Computation and Language · Computer Science 2025-04-22 Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed.…

Computation and Language · Computer Science 2026-05-20 Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang

FML-bench: Benchmarking Machine Learning Agents for Scientific Research

Large language models (LLMs) have sparked growing interest in machine learning research agents that can autonomously propose ideas and conduct experiments. However, existing benchmarks predominantly adopt an engineering-oriented…

Computation and Language · Computer Science 2026-02-26 Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu

MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification

Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to…

Artificial Intelligence · Computer Science 2025-03-18 Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang