Related papers: Dynamic benchmarking framework for LLM-based conve…

Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models

We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user$\leftrightarrow$agent interaction. The interaction is a conversation between the user and…

Computation and Language · Computer Science 2024-10-14 David Castillo-Bolado, Joseph Davidson, Finlay Gray, Marek Rosa

Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to…

Computation and Language · Computer Science 2024-02-20 Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang

The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation

As Large Language Models (LLMs) transition from static tools to autonomous agents, traditional evaluation benchmarks that measure performance on downstream tasks are becoming insufficient. These methods fail to capture the emergent social…

Artificial Intelligence · Computer Science 2025-10-03 Zarreen Reza

Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks

Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that…

Artificial Intelligence · Computer Science 2025-08-27 Dimitrios Rontogiannis, Maxime Peyrard, Nicolas Baldwin, Martin Josifoski, Robert West, Dimitrios Gunopulos

Dynamic Evaluation Framework for Personalized and Trustworthy Agents: A Multi-Session Approach to Preference Adaptability

Recent advancements in generative AI have significantly increased interest in personalized agents. With increased personalization, there is also a greater need for being able to trust decision-making and action taking capabilities of these…

Information Retrieval · Computer Science 2025-04-10 Chirag Shah, Hideo Joho, Kirandeep Kaur, Preetam Prabhu Srikar Dammu

Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation

Large Language Models (LLMs) have made progress in various real-world tasks, which stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are mainly supervised signal-based which depends on static datasets and…

Computation and Language · Computer Science 2023-09-11 Jiatong Li, Rui Li, Qi Liu

AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate…

Computation and Language · Computer Science 2026-02-02 Shicheng Fang, Yuxin Wang, Xiaoran Liu, Jiahao Lu, Chuanyuan Tan, Xinchi Chen, Yining Zheng, Xuanjing Huang, Xipeng Qiu

Human vs. Agent in Task-Oriented Conversations

Task-oriented conversational systems are essential for efficiently addressing diverse user needs, yet their development requires substantial amounts of high-quality conversational data that is challenging and costly to obtain. While large…

Information Retrieval · Computer Science 2025-11-06 Zhefan Wang, Ning Geng, Zhiqiang Guo, Weizhi Ma, Min Zhang

Evaluating the Performance of Large Language Models via Debates

Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either…

Computation and Language · Computer Science 2025-02-11 Behrad Moniri, Hamed Hassani, Edgar Dobriban

Context-Aware Intelligent Chatbot Framework Leveraging Mobile Sensing

With the rapid advancement of large language models (LLMs), intelligent conversational assistants have demonstrated remarkable capabilities across various domains. However, they still mainly rely on explicit textual input and do not know…

Human-Computer Interaction · Computer Science 2025-12-29 Ziyan Zhang, Nan Gao, Zhiqiang Nie, Shantanu Pal, Haining Zhang

Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

Large Language Models (LLMs) have become a popular interface for human-AI interaction, supporting information seeking and task assistance through natural, multi-turn dialogue. To respond to users within multi-turn dialogues, the…

Computation and Language · Computer Science 2026-04-16 Fengran Mo, Yifan Gao, Sha Li, Hansi Zeng, Xin Liu, Zhaoxuan Tan, Xian Li, Jianshu Chen, Dakuo Wang, Meng Jiang

Adaptive Multi-Agent Response Refinement in Conversational Systems

Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In…

Computation and Language · Computer Science 2025-11-12 Soyeong Jeong, Aparna Elangovan, Emine Yilmaz, Oleg Rokhlenko

An LLM Feature-based Framework for Dialogue Constructiveness Assessment

Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii)…

Computation and Language · Computer Science 2024-10-03 Lexin Zhou, Youmna Farag, Andreas Vlachos

Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions

Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. Users increasingly rely on LLM-based agents to solve complex missions through iterative…

Artificial Intelligence · Computer Science 2025-04-17 Peijie Yu, Yifan Yang, Jinjian Li, Zelong Zhang, Haorui Wang, Xiao Feng, Feng Zhang

Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators

Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications. However, due to the current text-in-text-out paradigm, it remains challenging for LLMs to handle dynamic and complex…

Artificial Intelligence · Computer Science 2024-10-25 Timothy Wei, Annabelle Miin, Anastasia Miin

Simulation Agent: A Framework for Integrating Simulation and Large Language Models for Enhanced Decision-Making

Simulations, although powerful in accurately replicating real-world systems, often remain inaccessible to non-technical users due to their complexity. Conversely, large language models (LLMs) provide intuitive, language-based interactions…

Computation and Language · Computer Science 2025-05-22 Jacob Kleiman, Kevin Frank, Joseph Voyles, Sindy Campagna

Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties

Can large language model (LLM) agents reproduce the complex social dynamics that characterize human online behavior -- shaped by homophily, reciprocity, and social validation -- and what memory and learning mechanisms enable such dynamics…

Artificial Intelligence · Computer Science 2025-10-23 Philipp J. Schneider, Lin Tian, Marian-Andrei Rizoiu

LLM Harmony: Multi-Agent Communication for Problem Solving

Large Language Models (LLMs) have revolutionized Natural Language Processing but exhibit limitations, particularly in autonomously addressing novel challenges such as reasoning and problem-solving. Traditional techniques like…

Multiagent Systems · Computer Science 2024-01-03 Sumedh Rasal

FABRIC: Framework for Agent-Based Realistic Intelligence Creation

Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. Realizing these capabilities requires access to agentic data-structured interaction…

Artificial Intelligence · Computer Science 2025-10-22 Abhigya Verma, Seganrasan Subramanian, Nandhakumar Kandasamy, Naman Gupta

LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like…

Software Engineering · Computer Science 2025-11-19 Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang