StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

Computation and Language 2025-03-21 v2

Abstract

The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains, including Summarization, Code, HTML, and Math, and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for top-performing models such as DeepSeek-V3/R1 and GPT-4o, establishing it as a robust proxy for measuring reasoning capabilities. We believe StructTest offers a critical and complementary approach to achieving objective and comprehensive model evaluation.
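To make the idea of a deterministic, rule-based check concrete, the following is a minimal illustrative sketch in Python. It is not StructTest's actual evaluator: the function name `check_bullet_summary`, the bullet-point format, and the parameter `num_points` are all hypothetical, chosen only to show how a structural instruction ("write exactly N bullet points") could be verified without a reference answer or a judge model.

```python
import re


def check_bullet_summary(output: str, num_points: int) -> bool:
    """Hypothetical rule-based check, not the benchmark's own code.

    Returns True only if `output` consists of exactly `num_points`
    non-empty lines, each of the form "- <text>".
    """
    # Ignore blank lines so stray spacing does not affect the count.
    lines = [line.strip() for line in output.strip().splitlines() if line.strip()]
    if len(lines) != num_points:
        return False
    # Every remaining line must be a dash, a space, then some text.
    return all(re.fullmatch(r"- \S.*", line) is not None for line in lines)
```

Because a check of this kind inspects only the structure of the output, the same input always yields the same verdict, and no target answer is needed that could leak into training data.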

Cite

@article{arxiv.2412.18011,
  title   = {StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs},
  author  = {Hailin Chen and Fangkai Jiao and Mathieu Ravaut and Nawshad Farruque and Xuan Phi Nguyen and Chengwei Qin and Manan Dey and Bosheng Ding and Caiming Xiong and Shafiq Joty and Yingbo Zhou},
  journal = {arXiv preprint arXiv:2412.18011},
  year    = {2025}
}