Related papers: SimulBench: Evaluating Language Models with Creati…

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning, which refreshes human impressions of dialogue systems. The long-standing goal of dialogue systems is to be…

Computation and Language · Computer Science 2024-04-01 Jiao Ou, Junda Lu, Che Liu, Yihong Tang, Fuzheng Zhang, Di Zhang, Kun Gai

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations of simulation fidelity are…

Computation and Language · Computer Science 2026-04-14 Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger

RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines

We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS). In GC, an LLM must…

Computation and Language · Computer Science 2025-02-04 Pengfei Yu, Dongming Shen, Silin Meng, Jaewon Lee, Weisu Yin, Andrea Yaoyun Cui, Zhenlin Xu, Yi Zhu, Xingjian Shi, Mu Li, Alex Smola

Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents

Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents" (agents that operate in rich linguistic and non-linguistic contexts) through testing them in carefully constructed interactive…

Computation and Language · Computer Science 2023-11-27 Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, David Schlangen

RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a…

Computation and Language · Computer Science 2025-10-24 Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun

SQLBench: A Comprehensive Evaluation for Text-to-SQL Capabilities of Large Language Models

Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on the optimal prompt…

Computation and Language · Computer Science 2026-03-20 Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Chi Harold Liu, Zhiwei Xu, Guoliang Fan, Rui Zhao, Ziyue Li, Hangyu Mao

LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations

Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural,…

Artificial Intelligence · Computer Science 2026-03-03 Viet-Thanh Pham, Lizhen Qu, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

It has been established in recent work that Large Language Models (LLMs) can be prompted to "self-play" conversational games that probe certain capabilities (general instruction following, strategic goal orientation, language understanding…

Computation and Language · Computer Science 2024-06-03 Anne Beyer, Kranti Chalamalasetti, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, David Schlangen

GPT-Based Models Meet Simulation: How to Efficiently Use Large-Scale Pre-Trained Language Models Across Simulation Tasks

The disruptive technology provided by large-scale pre-trained language models (LLMs) such as ChatGPT or GPT-4 has received significant attention in several application domains, often with an emphasis on high-level opportunities and…

Human-Computer Interaction · Computer Science 2023-06-27 Philippe J. Giabbanelli

CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks

As large language models (LLMs) continue to advance and gain widespread use, establishing systematic and reliable evaluation methodologies for LLMs and vision-language models (VLMs) has become essential to ensure their real-world…

Artificial Intelligence · Computer Science 2025-06-03 Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li

VoiceBench: Benchmarking LLM-Based Voice Assistants

Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to…

Computation and Language · Computer Science 2024-12-12 Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li

LLMRec: Benchmarking Large Language Models on Recommendation Task

Recently, the fast development of Large Language Models (LLMs) such as ChatGPT has significantly advanced NLP tasks by enhancing the capabilities of conversational models. However, the application of LLMs in the recommendation domain has…

Information Retrieval · Computer Science 2023-08-24 Junling Liu, Chao Liu, Peilin Zhou, Qichen Ye, Dading Chong, Kang Zhou, Yueqi Xie, Yuwei Cao, Shoujin Wang, Chenyu You, Philip S. Yu

A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on…

Computation and Language · Computer Science 2026-02-27 David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler

RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models

The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art…

Computation and Language · Computer Science 2024-06-19 Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Stephen W. Huang, Jie Fu, Junran Peng

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines),…

Computation and Language · Computer Science 2024-08-22 Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein

MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer,…

Computation and Language · Computer Science 2025-09-16 Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based…

Software Engineering · Computer Science 2024-09-27 Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, Zhenyu Chen

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments…

Computation and Language · Computer Science 2024-06-11 Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing…

Computation and Language · Computer Science 2024-10-16 Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available,…

Computation and Language · Computer Science 2024-02-27 Fahim Dalvi, Maram Hasanain, Sabri Boughorbel, Basel Mousi, Samir Abdaljalil, Nizi Nazar, Ahmed Abdelali, Shammur Absar Chowdhury, Hamdy Mubarak, Ahmed Ali, Majd Hawasly, Nadir Durrani, Firoj Alam