Related papers: Visual Dialog

Building Multimodal AI Chatbots

This work aims to create a multimodal AI system that chats with humans and shares relevant photos. While earlier works were limited to dialogues about specific objects or scenes within images, recent works have incorporated images into…

Computation and Language · Computer Science 2023-05-08 Min Young Lee

DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue

Unlike the Visual Question Answering task, which requires answering only one question about an image, Visual Dialogue involves multiple questions which cover a broad range of visual content that could be related to any objects,…

Computer Vision and Pattern Recognition · Computer Science 2019-11-19 Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, Qi Wu

Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

We introduce Affective Visual Dialog, an emotion explanation and reasoning task as a testbed for research on understanding the formation of emotions in visually grounded conversations. The task involves three skills: (1) Dialog-based…

Computation and Language · Computer Science 2025-01-03 Kilichbek Haydarov, Xiaoqian Shen, Avinash Madasu, Mahmoud Salem, Li-Jia Li, Gamaleldin Elsayed, Mohamed Elhoseiny

DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog

Visual Dialog is a vision-language task that requires an AI agent to engage in a conversation with humans grounded in an image. It remains a challenging task since it requires the agent to fully understand a given question before making an…

Computation and Language · Computer Science 2019-12-19 Feilong Chen, Fandong Meng, Jiaming Xu, Peng Li, Bo Xu, Jie Zhou

Saying the Unseen: Video Descriptions via Dialog Agents

Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input; however, practical scenarios may often include situations where part of the visual information becomes inaccessible due to…

Computer Vision and Pattern Recognition · Computer Science 2021-06-29 Ye Zhu, Yu Wu, Yi Yang, Yan Yan

Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro

Modeling Coreference Relations in Visual Dialog

Visual dialog is a vision-language task where an agent needs to answer a series of questions grounded in an image based on the understanding of the dialog history and the image. The occurrences of coreference relations in the dialog make…

Computer Vision and Pattern Recognition · Computer Science 2022-03-08 Mingxiao Li, Marie-Francine Moens

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

When humans converse, what a speaker will say next significantly depends on what he sees. Unfortunately, existing dialogue models generate dialogue utterances only based on preceding textual contexts, and visual contexts are rarely…

Computation and Language · Computer Science 2021-06-01 Yuxian Meng, Shuhe Wang, Qinghong Han, Xiaofei Sun, Fei Wu, Rui Yan, Jiwei Li

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks…

Computer Vision and Pattern Recognition · Computer Science 2019-09-20 Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach

Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering

Human conversation is a complex mechanism with subtle nuances. It is hence an ambitious goal to develop artificial intelligence agents that can participate fluently in a conversation. While we are still far from achieving this goal, recent…

Computer Vision and Pattern Recognition · Computer Science 2018-03-30 Unnat Jain, Svetlana Lazebnik, Alexander Schwing

History for Visual Dialog: Do we really need it?

Visual Dialog involves "understanding" the dialog history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to generate the correct response. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2020-05-18 Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, Verena Rieser

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content. Different from existing datasets where the answer is…

Computer Vision and Pattern Recognition · Computer Science 2023-12-22 Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review

The intelligent dialogue system, aiming at communicating with humans harmoniously with natural language, is brilliant for promoting the advancement of human-machine interaction in the era of artificial intelligence. With the gradually…

Artificial Intelligence · Computer Science 2022-07-05 Hao Wang, Bin Guo, Yating Zeng, Yasan Ding, Chen Qiu, Ying Zhang, Lina Yao, Zhiwen Yu

Multi-View Attention Network for Visual Dialog

Visual dialog is a challenging vision-language task in which a series of questions visually grounded by a given image are answered. To resolve the visual dialog task, a high-level understanding of various multimodal inputs (e.g., question,…

Artificial Intelligence · Computer Science 2020-10-08 Sungjin Park, Taesun Whang, Yeochan Yoon, Heuiseok Lim

Conversational DNA: A New Visual Language for Understanding Dialogue Structure in Human and AI

What if the patterns hidden within dialogue reveal more about communication than the words themselves? We introduce Conversational DNA, a novel visual language that treats any dialogue -- whether between humans, between human and AI, or…

Human-Computer Interaction · Computer Science 2025-08-12 Baihan Lin

ValueNet: A New Dataset for Human Value Driven Dialogue System

Building a socially intelligent agent involves many challenges, one of which is to teach the agent to speak guided by its value like a human. However, value-driven chatbots are still understudied in the area of dialogue systems. Most…

Computation and Language · Computer Science 2022-07-25 Liang Qiu, Yizhou Zhao, Jinchao Li, Pan Lu, Baolin Peng, Jianfeng Gao, Song-Chun Zhu

Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning

The Visual Dialogue task requires an agent to engage in a conversation about an image with a human. It represents an extension of the Visual Question Answering task in that the agent needs to answer a question about an image, but it needs…

Computer Vision and Pattern Recognition · Computer Science 2017-11-22 Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, Anton van den Hengel

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

ViDA-MAN: Visual Dialog with Digital Humans

We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which offers real-time audio-visual responses to instant speech inquiries. Compared to traditional text- or voice-based systems, ViDA-MAN offers human-like…

Computer Vision and Pattern Recognition · Computer Science 2021-10-27 Tong Shen, Jiawei Zuo, Fan Shi, Jin Zhang, Liqin Jiang, Meng Chen, Zhengchen Zhang, Wei Zhang, Xiaodong He, Tao Mei

Expressing Visual Relationships via Language

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation,…

Computation and Language · Computer Science 2019-06-20 Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, Mohit Bansal