Related papers: Evaluating Multimodal Interactive Agents

200 papers

Prominent large language models have exhibited human-level performance in many domains, even enabling the derived agents to simulate human and social interactions. While practical works have substantiated the practicability of grounding…

Computation and Language · Computer Science 2024-04-09 Chenxu Wang, Bin Dai, Huaping Liu, Baoyuan Wang

Personalized AI agents are becoming central to modern information retrieval, yet most evaluation methodologies remain static, relying on fixed benchmarks and one-off metrics that fail to reflect how users' needs evolve over time. These…

Information Retrieval · Computer Science 2025-10-07 Kirandeep Kaur, Preetam Prabhu Srikar Dammu, Hideo Joho, Chirag Shah

A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial…

Designing and evaluating personalized and proactive assistant agents remains challenging due to the time, cost, and ethical concerns associated with human-in-the-loop experimentation. Existing Human-Computer Interaction (HCI) methods often…

Human-Computer Interaction · Computer Science 2025-11-25 Ziyi Xuan, Yiwen Wu, Xuhai Xu, Vinod Namboodiri, Mooi Choo Chuah, Yu Yang

Current AI evaluation methods, which rely on static, model-only tests, fail to account for harms that emerge through sustained human-AI interaction. As AI systems proliferate and are increasingly integrated into real-world applications,…

Computers and Society · Computer Science 2025-07-31 Lujain Ibrahim, Saffron Huang, Umang Bhatt, Lama Ahmad, Markus Anderljung

Testing conversational AI systems at scale across diverse domains necessitates realistic and diverse user interactions capturing a wide array of behavioral patterns. We present a novel multi-agent framework for realistic, explainable human…

Human-Computer Interaction · Computer Science 2026-01-23 Hareeshwar Karthikeyan

Classic evaluation methods for believable agents are time-consuming because they require many humans to judge the agents. They are well suited to validating work on new models of believable behaviour. However, during the implementation, numerous…

Artificial Intelligence · Computer Science 2010-09-03 Fabien Tencé, Cédric Buche

LLM-powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly…

Multi-agent interaction is a fundamental aspect of autonomous driving in the real world. Despite more than a decade of research and development, the problem of how to competently interact with diverse road users in diverse scenarios remains…

Target Safety Assessment (TSA) requires systematic integration of heterogeneous evidence, including genetic, transcriptomic, target homology, pharmacological, and clinical data, to evaluate potential safety liabilities of therapeutic…

Computation and Language · Computer Science 2026-05-11 Xiaochen Zheng, Zhiwen Jiang, Melanie Guerard, Klas Hatje, Tatyana Doktorova

LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code…

Computation and Language · Computer Science 2025-08-26 Sameer Komoravolu, Khalil Mrini

Traditional sociotechnical systems (STS) theory has been widely used, but there are many new characteristics in the STS environment as we enter the intelligence era, resulting in the limitations of traditional STS. Based on the…

Human-Computer Interaction · Computer Science 2023-03-07 Wei Xu

Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the…

Artificial Intelligence · Computer Science 2025-06-04 Junhao Yu, Yan Zhuang, YuXuan Sun, Weibo Gao, Qi Liu, Mingyue Cheng, Zhenya Huang, Enhong Chen

Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing…

Computation and Language · Computer Science 2025-09-29 Song Jin, Juntian Zhang, Yuhan Liu, Xun Zhang, Yufei Zhang, Guojun Yin, Fei Jiang, Wei Lin, Rui Yan

Research and development on conversational recommender systems (CRSs) critically depends on sound and reliable evaluation methodologies. However, the interactive nature of these systems poses significant challenges for automatic evaluation.…

Information Retrieval · Computer Science 2025-10-08 Nolwenn Bernard, Krisztian Balog

There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real-time low-latency…

Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with…

Computation and Language · Computer Science 2026-02-04 Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

Large language model-based web agents have demonstrated strong performance on realistic web interaction tasks. However, existing evaluations are predominantly conducted under relatively stable and well-behaved interaction conditions, which…

Software Engineering · Computer Science 2026-04-21 Haoyue Bai, Dong Wang, Long Chen, Bingguang Hao, Pengyang Shao, Yonghui Yang, Yicheng He, Chenyi Zhuang

For software interacting directly with real-world end-users, it is common practice to script scenario tests validating the system's compliance with a number of its features. However, these do not accommodate the replication of the type of…

Software Engineering · Computer Science 2022-08-26 Pasquale Salza, Marco Edoardo Palma, Harald C. Gall

The novel research area of computational empathy is in its infancy and moving towards developing methods and standards. One major problem is the lack of agreement on the evaluation of empathy in artificial interactive systems. Even though…

Artificial Intelligence · Computer Science 2019-08-16 Özge Nilay Yalçın