Related papers: Evaluating Multimodal Interactive Agents

Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level

Prominent large language models have exhibited human-level performance in many domains, even enabling the derived agents to simulate human and social interactions. While practical works have substantiated the practicability of grounding…

Computation and Language · Computer Science 2024-04-09 Chenxu Wang, Bin Dai, Huaping Liu, Baoyuan Wang

Beyond Static Evaluation: Rethinking the Assessment of Personalized Agent Adaptability in Information Retrieval

Personalized AI agents are becoming central to modern information retrieval, yet most evaluation methodologies remain static, relying on fixed benchmarks and one-off metrics that fail to reflect how users' needs evolve over time. These…

Information Retrieval · Computer Science 2025-10-07 Kirandeep Kaur, Preetam Prabhu Srikar Dammu, Hideo Joho, Chirag Shah

Imitating Interactive Intelligence

A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial…

Machine Learning · Computer Science 2021-01-22 Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik, Petko Georgiev, Aurelia Guy, Tim Harley, Felix Hill, Alden Hung, Zachary Kenton, Jessica Landon, Timothy Lillicrap, Kory Mathewson, Soňa Mokrá, Alistair Muldal, Adam Santoro, Nikolay Savinov, Vikrant Varma, Greg Wayne, Duncan Williams, Nathaniel Wong, Chen Yan, Rui Zhu

Design and Evaluation of Generative Agent-based Platform for Human-Assistant Interaction Research: A Tale of 10 User Studies

Designing and evaluating personalized and proactive assistant agents remains challenging due to the time, cost, and ethical concerns associated with human-in-the-loop experimentation. Existing Human-Computer Interaction (HCI) methods often…

Human-Computer Interaction · Computer Science 2025-11-25 Ziyi Xuan, Yiwen Wu, Xuhai Xu, Vinod Namboodiri, Mooi Choo Chuah, Yu Yang

Towards interactive evaluations for interaction harms in human-AI systems

Current AI evaluation methods, which rely on static, model-only tests, fail to account for harms that emerge through sustained human-AI interaction. As AI systems proliferate and are increasingly integrated into real-world applications,…

Computers and Society · Computer Science 2025-07-31 Lujain Ibrahim, Saffron Huang, Umang Bhatt, Lama Ahmad, Markus Anderljung

Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios

Testing conversational AI systems at scale across diverse domains necessitates realistic and diverse user interactions capturing a wide array of behavioral patterns. We present a novel multi-agent framework for realistic, explainable human…

Human-Computer Interaction · Computer Science 2026-01-23 Hareeshwar Karthikeyan

Automatable Evaluation Method Oriented toward Behaviour Believability for Video Games

Classic evaluation methods of believable agents are time-consuming because they involve many humans to judge agents. They are well suited to validate work on new believable behaviour models. However, during the implementation, numerous…

Artificial Intelligence · Computer Science 2010-09-03 Fabien Tencé, Cédric Buche

How can we assess human-agent interactions? Case studies in software agent design

LLM-powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly…

Artificial Intelligence · Computer Science 2025-11-05 Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig

SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving

Multi-agent interaction is a fundamental aspect of autonomous driving in the real world. Despite more than a decade of research and development, the problem of how to competently interact with diverse road users in diverse scenarios remains…

Multiagent Systems · Computer Science 2020-11-03 Ming Zhou, Jun Luo, Julian Villella, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen, Aurora Chongxi Huang, Ying Wen, Kimia Hassanzadeh, Daniel Graves, Dong Chen, Zhengbang Zhu, Nhat Nguyen, Mohamed Elsayed, Kun Shao, Sanjeevan Ahilan, Baokuan Zhang, Jiannan Wu, Zhengang Fu, Kasra Rezaee, Peyman Yadmellat, Mohsen Rohani, Nicolas Perez Nieves, Yihan Ni, Seyedershad Banijamali, Alexander Cowen Rivers, Zheng Tian, Daniel Palenicek, Haitham bou Ammar, Hongbo Zhang, Wulong Liu, Jianye Hao, Jun Wang

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Target Safety Assessment (TSA) requires systematic integration of heterogeneous evidence, including genetic, transcriptomic, target homology, pharmacological, and clinical data, to evaluate potential safety liabilities of therapeutic…

Computation and Language · Computer Science 2026-05-11 Xiaochen Zheng, Zhiwen Jiang, Melanie Guerard, Klas Hatje, Tatyana Doktorova

Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents

LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code…

Computation and Language · Computer Science 2025-08-26 Sameer Komoravolu, Khalil Mrini

User-Centered Design (VIII): A New Framework of Intelligent Sociotechnical Systems and Prospects for Future Human Factors Research

Traditional sociotechnical systems (STS) theory has been widely used, but there are many new characteristics in the STS environment as we enter the intelligence era, resulting in the limitations of traditional STS. Based on the…

Human-Computer Interaction · Computer Science 2023-03-07 Wei Xu

TestAgent: An Adaptive and Intelligent Expert for Human Assessment

Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the…

Artificial Intelligence · Computer Science 2025-06-04 Junhao Yu, Yan Zhuang, YuXuan Sun, Weibo Gao, Qi Liu, Mingyue Cheng, Zhenya Huang, Enhong Chen

Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems

Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing…

Computation and Language · Computer Science 2025-09-29 Song Jin, Juntian Zhang, Yuhan Liu, Xun Zhang, Yufei Zhang, Guojun Yin, Fei Jiang, Wei Lin, Rui Yan

Limitations of Current Evaluation Practices for Conversational Recommender Systems and the Potential of User Simulation

Research and development on conversational recommender systems (CRSs) critically depends on sound and reliable evaluation methodologies. However, the interactive nature of these systems poses significant challenges for automatic evaluation.…

Information Retrieval · Computer Science 2025-10-08 Nolwenn Bernard, Krisztian Balog

Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling

There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real-time low-latency…

Machine Learning · Computer Science 2026-05-15 Coleman Hooper, Minwoo Kang, Suhong Moon, Nicholas Lee, Eric Wen, John Wawrzynek, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Kurt Keutzer

ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation

Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with…

Computation and Language · Computer Science 2026-02-04 Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability

Large language model-based web agents have demonstrated strong performance on realistic web interaction tasks. However, existing evaluations are predominantly conducted under relatively stable and well-behaved interaction conditions, which…

Software Engineering · Computer Science 2026-04-21 Haoyue Bai, Dong Wang, Long Chen, Bingguang Hao, Pengyang Shao, Yonghui Yang, Yicheng He, Chenyi Zhuang

Synthetic End-User Testing: Modeling Realistic Agents Based on Behavioral Examples

For software interacting directly with real-world end-users, it is common practice to script scenario tests validating the system's compliance with a number of its features. However, these do not accommodate the replication of the type of…

Software Engineering · Computer Science 2022-08-26 Pasquale Salza, Marco Edoardo Palma, Harald C. Gall

Evaluating Empathy in Artificial Agents

The novel research area of computational empathy is in its infancy and moving towards developing methods and standards. One major problem is the lack of agreement on the evaluation of empathy in artificial interactive systems. Even though…

Artificial Intelligence · Computer Science 2019-08-16 Özge Nilay Yalçın