Related papers: LLMStructBench: Benchmarking Large Language Model …

StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and…

Computation and Language · Computer Science 2025-03-21 Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou

StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation

Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for…

Computation and Language · Computer Science 2025-07-30 Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz

StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs'…

Software Engineering · Computer Science 2026-04-06 Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, Wenhu Chen

PromptBench: A Unified Library for Evaluation of Large Language Models

The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components…

Artificial Intelligence · Computer Science 2024-08-21 Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie

ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction

Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and…

Machine Learning · Computer Science 2026-02-17 Nick Ferguson, Josh Pennington, Narek Beghian, Aravind Mohan, Douwe Kiela, Sheshansh Agrawal, Thien Hang Nguyen

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. However, the understanding of their capability to process structured data like tables remains an under-explored area.…

Computation and Language · Computer Science 2024-07-18 Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, Dongmei Zhang

SQLBench: A Comprehensive Evaluation for Text-to-SQL Capabilities of Large Language Models

Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on the optimal prompt…

Computation and Language · Computer Science 2026-03-20 Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Chi Harold Liu, Zhiwei Xu, Guoliang Fan, Rui Zhao, Ziyue Li, Hangyu Mao

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel fine-tuning…

Computation and Language · Computer Science 2024-04-08 Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein

StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following

Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and…

Computation and Language · Computer Science 2025-06-02 Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

Although large language models (LLMs) have demonstrated their strong intelligence ability, the high demand for computation and storage hinders their practical application. To this end, many model compression techniques are proposed to…

Computation and Language · Computer Science 2024-11-01 Ge Yang, Changyi He, Jinyang Guo, Jianyu Wu, Yifu Ding, Aishan Liu, Haotong Qin, Pengliang Ji, Xianglong Liu

StrucText-Eval: Evaluating Large Language Model's Reasoning Ability in Structure-Rich Text

The effective utilization of structured data, integral to corporate data strategies, has been challenged by the rise of large language models (LLMs) capable of processing unstructured information. This shift prompts the question: can LLMs…

Computation and Language · Computer Science 2024-10-22 Zhouhong Gu, Haoning Ye, Xingzhou Chen, Zeyang Zhou, Hongwei Feng, Yanghua Xiao

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going…

Computation and Language · Computer Science 2024-07-16 Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based…

Software Engineering · Computer Science 2024-09-27 Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, Zhenyu Chen

From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Yandi Wang, Libin Zhan, Ziwei Huang, Tiancheng Luo, Yuxuan Jiang, Wang Dong, Leilei Gan, Jun Chen

MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis

Multimodal Large Language Models (MLLMs) have made significant progress in the field of document analysis. Despite this, existing benchmarks typically focus only on extracting text and simple layout information, neglecting the complex…

Computer Vision and Pattern Recognition · Computer Science 2024-07-04 Lei Chen, Feng Yan, Yujie Zhong, Shaoxiang Chen, Zequn Jie, Lin Ma

SRBench: A Comprehensive Benchmark for Sequential Recommendation with Large Language Models

LLM development has aroused great interest in Sequential Recommendation (SR) applications. However, comprehensive evaluation of SR models remains lacking due to the limitations of the existing benchmarks: 1) an overemphasis on accuracy,…

Information Retrieval · Computer Science 2026-04-14 Jianhong Li, Zeheng Qian, Wangze Ni, Haoyang Li, Hongwei Yao, Yang Bai, Kui Ren

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

The ability to follow instructions is crucial for Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating pure response quality, rather than assessing whether the response…

Computation and Language · Computer Science 2024-06-06 Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, Wei Wang

SO-Bench: A Structural Output Evaluation of Multimodal LLMs

Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan

SemBench: A Benchmark for Semantic Query Processing Engines

We present a benchmark targeting a novel class of systems: semantic query processing engines. Those systems rely inherently on generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with…

Databases · Computer Science 2026-03-17 Jiale Lao, Andreas Zimmerer, Olga Ovcharenko, Tianji Cong, Matthew Russo, Gerardo Vitagliano, Michael Cochez, Fatma Özcan, Gautam Gupta, Thibaud Hottelier, H. V. Jagadish, Kris Kissel, Sebastian Schelter, Andreas Kipf, Immanuel Trummer

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available,…

Computation and Language · Computer Science 2024-02-27 Fahim Dalvi, Maram Hasanain, Sabri Boughorbel, Basel Mousi, Samir Abdaljalil, Nizi Nazar, Ahmed Abdelali, Shammur Absar Chowdhury, Hamdy Mubarak, Ahmed Ali, Majd Hawasly, Nadir Durrani, Firoj Alam