Related papers: SimulBench: Evaluating Language Models with Creati…

200 papers

Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning, which refreshes human impressions of dialogue systems. The long-standing goal of dialogue systems is to be…

Computation and Language · Computer Science 2024-04-01 Jiao Ou, Junda Lu, Che Liu, Yihong Tang, Fuzheng Zhang, Di Zhang, Kun Gai

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations of simulation fidelity are…

Computation and Language · Computer Science 2026-04-14 Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger

We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS). In GC, an LLM must…

Computation and Language · Computer Science 2025-02-04 Pengfei Yu, Dongming Shen, Silin Meng, Jaewon Lee, Weisu Yin, Andrea Yaoyun Cui, Zhenlin Xu, Yi Zhu, Xingjian Shi, Mu Li, Alex Smola

Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents" (agents that operate in rich linguistic and non-linguistic contexts) through testing them in carefully constructed interactive…

Computation and Language · Computer Science 2023-11-27 Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, David Schlangen

Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a…

Computation and Language · Computer Science 2025-10-24 Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun

Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on the optimal prompt…

Computation and Language · Computer Science 2026-03-20 Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Chi Harold Liu, Zhiwei Xu, Guoliang Fan, Rui Zhao, Ziyue Li, Hangyu Mao

Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural,…

Artificial Intelligence · Computer Science 2026-03-03 Viet-Thanh Pham, Lizhen Qu, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung

It has been established in recent work that Large Language Models (LLMs) can be prompted to "self-play" conversational games that probe certain capabilities (general instruction following, strategic goal orientation, language understanding…

Computation and Language · Computer Science 2024-06-03 Anne Beyer, Kranti Chalamalasetti, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, David Schlangen

The disruptive technology provided by large-scale pre-trained language models (LLMs) such as ChatGPT or GPT-4 has received significant attention in several application domains, often with an emphasis on high-level opportunities and…

Human-Computer Interaction · Computer Science 2023-06-27 Philippe J. Giabbanelli

As large language models (LLMs) continue to advance and gain widespread use, establishing systematic and reliable evaluation methodologies for LLMs and vision-language models (VLMs) has become essential to ensure their real-world…

Artificial Intelligence · Computer Science 2025-06-03 Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li

Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to…

Computation and Language · Computer Science 2024-12-12 Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li

Recently, the fast development of Large Language Models (LLMs) such as ChatGPT has significantly advanced NLP tasks by enhancing the capabilities of conversational models. However, the application of LLMs in the recommendation domain has…

Information Retrieval · Computer Science 2023-08-24 Junling Liu, Chao Liu, Peilin Zhou, Qichen Ye, Dading Chong, Kang Zhou, Yueqi Xie, Yuwei Cao, Shoujin Wang, Chenyu You, Philip S. Yu

There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on…

Computation and Language · Computer Science 2026-02-27 David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler

The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art…

Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines),…

Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer,…

Computation and Language · Computer Science 2025-09-16 Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based…

Software Engineering · Computer Science 2024-09-27 Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, Zhenyu Chen

As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments…

Computation and Language · Computer Science 2024-06-11 Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu

Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing…

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available,…
