Related papers: SemBench: A Benchmark for Semantic Query Processin…

200 papers

Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true…

Computation and Language · Computer Science 2026-03-27 Mikel Zubillaga, Naiara Perez, Oscar Sainz, German Rigau

Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in…

Artificial Intelligence · Computer Science 2025-04-01 Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, Mani Srivastava

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation…

Artificial Intelligence · Computer Science 2025-09-23 Mohammad Ramezanali, Mo Vazifeh, Paolo Santi

Multimodal large language models (MLLMs) have shown great potential in perception and interpretation tasks, but their capabilities in predictive reasoning remain under-explored. To address this gap, we introduce a novel benchmark that…

Computer Vision and Pattern Recognition · Computer Science 2023-10-23 Mingwei Zhu, Leigang Sha, Yu Shu, Kangjia Zhao, Tiancheng Zhao, Jianwei Yin

Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations across various applications, including natural language processing and code generation. Existing benchmarks like MMLU, C-Eval, and…

Cryptography and Security · Computer Science 2025-01-07 Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo

We introduce LexBench, a comprehensive evaluation suite enabled to test language models (LMs) on ten semantic phrase processing tasks. Unlike prior studies, it is the first work to propose a framework from the comparative perspective to…

Computation and Language · Computer Science 2024-05-07 Yang Liu, Melissa Xiaohui Qin, Hongming Li, Chao Huang

Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert…

Computation and Language · Computer Science 2025-10-17 Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park

Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal…

We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises…

Computation and Language · Computer Science 2026-02-17 Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items…

Computation and Language · Computer Science 2026-03-24 Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task…

Artificial Intelligence · Computer Science 2025-06-26 Liya Wang, David Yi, Damien Jose, John Passarelli, James Gao, Jordan Leventis, Kang Li

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation…

Computation and Language · Computer Science 2024-09-13 Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin

Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going…

Computation and Language · Computer Science 2024-07-16 Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on the optimal prompt…

Computation and Language · Computer Science 2026-03-20 Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Chi Harold Liu, Zhiwei Xu, Guoliang Fan, Rui Zhao, Ziyue Li, Hangyu Mao

Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on.…

Computation and Language · Computer Science 2025-04-22 Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan

We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed.…

Computation and Language · Computer Science 2026-05-20 Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang

Large language models (LLMs) have sparked growing interest in machine learning research agents that can autonomously propose ideas and conduct experiments. However, existing benchmarks predominantly adopt an engineering-oriented…

Computation and Language · Computer Science 2026-02-26 Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu

Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to…

Artificial Intelligence · Computer Science 2025-03-18 Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang