Related papers: DFEE: Interactive DataFlow Execution and Evaluatio…
In \citep{andreas2020task-oriented}, a dataflow (DF) based dialogue system was introduced, showing clear advantages compared to many commonly used current systems. This was accompanied by the release of SMCalFlow, a practically relevant,…
We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for…
Developing language model-based dialogue agents requires effective data to train models that can follow specific task logic. However, most existing data simulation methods focus on increasing diversity in language, topics, or dialogue acts…
Though widely used in industry, traditional task-oriented dialogue systems suffer from three bottlenecks: (i) difficult ontology construction (e.g., intents and slots); (ii) poor controllability and interpretability; (iii)…
The use of chatbots has spread, generating great interest in the industry for the possibility of automating tasks within the execution of their processes. The implementation of chatbots, however simple, is a complex endeavor that involves…
Nowadays, open-domain dialogue models can generate acceptable responses according to the historical context based on the large-scale pre-trained language models. However, they generally concatenate the dialogue history directly as the model…
Large Language Models (LLMs) have revolutionized natural language interaction with data. The "holy grail" of data analytics is to build autonomous Data Agents that can self-drive complex data analysis workflows. However, current…
Data-flow testing (DFT) aims to detect potential data interaction anomalies by focusing on the points at which variables receive values and the points at which these values are used. Such test objectives are referred as \emph{def-use…
Despite recent progress in open-domain dialogue evaluation, how to develop automatic metrics remains an open problem. We explore the potential of dialogue evaluation featuring dialog act information, which was hardly explicitly modeled in…
Dataflow coverage, one of the white-box testing criteria, focuses on the relations between variable definitions and their uses.Several empirical studies have proved data-flow testing is more effective than control-flow testing. However,…
Conversational AI chatbots are transforming industries by streamlining customer service, automating transactions, and enhancing user engagement. However, evaluating these systems remains a challenge, particularly in financial services,…
Embodied AI research is increasingly moving beyond single-task, single-environment policy learning toward multi-task, multi-scene, and multi-model settings. This shift substantially increases the engineering overhead and development time…
Semantic Machines (SM) have introduced the use of the dataflow (DF) paradigm to dialogue modelling, using computational graphs to hierarchically represent user requests, data, and the dialogue history [Semantic Machines et al. 2020].…
Following the rapid progress in natural language processing (NLP) models, language models are applied to increasingly more complex interactive tasks such as negotiations and conversation moderations. Having human evaluators directly…
We present DataFlow, a computational framework for building, testing, and deploying high-performance machine learning systems on unbounded time-series data. Traditional data science workflows assume finite datasets and require substantial…
Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric…
Automated service agents require well-structured workflows to provide consistent and accurate responses to customer queries. However, these workflows are often undocumented, and their automatic extraction from conversations remains…
The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc…
Benchmarking AI systems in multi-turn interactive scenarios is essential for understanding their practical capabilities in real-world applications. However, existing evaluation protocols are highly heterogeneous, differing significantly in…
Evaluating and understanding the inappropriateness of chatbot behaviors can be challenging, particularly for chatbot designers without technical backgrounds. To democratize the debugging process of chatbot misbehaviors for non-technical…