Related papers: TutorialVQA: Question Answering Dataset for Tutori…

Video Question Answering on Screencast Tutorials

This paper presents a new video question answering task on screencast tutorials. We introduce a dataset including question, answer and context triples from the tutorial videos for a software. Unlike other video question answering works, all…

Computation and Language · Computer Science 2020-08-04 Wentian Zhao, Seokhwan Kim, Ning Xu, Hailin Jin

Learning to Answer Visual Questions from Web Videos

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual…

Computer Vision and Pattern Recognition · Computer Science 2022-05-12 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual…

Computer Vision and Pattern Recognition · Computer Science 2021-08-13 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Recent Advances in Video Question Answering: A Review of Datasets and Methods

Video Question Answering (VQA) is a recent emerging challenging task in the field of Computer Vision. Several visual information retrieval techniques like Video Captioning/Description and Video-guided Machine Translation have preceded the…

Computer Vision and Pattern Recognition · Computer Science 2021-03-19 Devshree Patel, Ratnam Parikh, Yesha Shastri

Video Question Answering: Datasets, Algorithms and Challenges

Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos. It has earned increasing attention with recent research trends in joint vision and language understanding. Yet, compared with…

Computer Vision and Pattern Recognition · Computer Science 2022-11-03 Yaoyao Zhong, Junbin Xiao, Wei Ji, Yicong Li, Weihong Deng, Tat-Seng Chua

NEWSKVQA: Knowledge-Aware News Video Question Answering

Answering questions in the context of videos can be helpful in video indexing, video retrieval systems, video summarization, learning management systems and surveillance video analysis. Although there exists a large body of work on visual…

Computer Vision and Pattern Recognition · Computer Science 2022-02-09 Pranay Gupta, Manish Gupta

Watching the News: Towards VideoQA Models that can Read

Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue…

Computer Vision and Pattern Recognition · Computer Science 2023-12-08 Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

YTCommentQA: Video Question Answerability in Instructional Videos

Instructional videos provide detailed how-to guides for various tasks, with viewers often posing questions regarding the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is…

Computer Vision and Pattern Recognition · Computer Science 2024-02-01 Saelyne Yang, Sunghyun Park, Yunseok Jang, Moontae Lee

TVQA: Localized, Compositional Video Question Answering

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA…

Computation and Language · Computer Science 2019-05-09 Jie Lei, Licheng Yu, Mohit Bansal, Tamara L. Berg

Knowledge-Based Visual Question Answering in Videos

We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset…

Computer Vision and Pattern Recognition · Computer Science 2020-04-21 Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima

DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering

Remote work and online courses have become important methods of knowledge dissemination, leading to a large number of document-based instructional videos. Unlike traditional video datasets, these videos mainly feature rich-text images and…

Computer Vision and Pattern Recognition · Computer Science 2025-03-21 Haochen Wang, Kai Hu, Liangcai Gao

Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance,…

Computer Vision and Pattern Recognition · Computer Science 2023-09-12 Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

Causal Understanding For Video Question Answering

Video Question Answering is a challenging task, which requires the model to reason over multiple frames and understand the interaction between different objects to answer questions based on the context provided within the video, especially…

Artificial Intelligence · Computer Science 2024-07-31 Bhanu Prakash Reddy Guda, Tanmay Kulkarni, Adithya Sampath, Swarnashree Mysore Sathyendra

Reading Between the Lanes: Text VideoQA on the Road

Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, while textual cues typically appear for a short time span,…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas, C. V. Jawahar

AQuA: Automated Question-Answering in Software Tutorial Videos with Visual Anchors

Tutorial videos are a popular help source for learning feature-rich software. However, getting quick answers to questions about tutorial videos is difficult. We present an automated approach for responding to tutorial questions. By…

Human-Computer Interaction · Computer Science 2024-03-11 Saelyne Yang, Jo Vermeulen, George Fitzmaurice, Justin Matejka

KnowIT VQA: Answering Knowledge-Based Questions about Videos

We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset…

Computer Vision and Pattern Recognition · Computer Science 2019-12-25 Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA).…

Computer Vision and Pattern Recognition · Computer Science 2019-06-07 Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, Dacheng Tao

Data augmentation techniques for the Video Question Answering task

Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, and the interaction between them in order to produce a…

Computer Vision and Pattern Recognition · Computer Science 2020-08-25 Alex Falcon, Oswald Lanz, Giuseppe Serra

A Dataset for Medical Instructional Video Classification and Question Answering

This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions. We believe medical videos may provide the best…

Computer Vision and Pattern Recognition · Computer Science 2022-02-01 Deepak Gupta, Kush Attal, Dina Demner-Fushman

Gaining Extra Supervision via Multi-task learning for Multi-Modal Video Question Answering

This paper proposes a method to gain extra supervision via multi-task learning for multi-modal video question answering. Multi-modal video question answering is an important task that aims at the joint understanding of vision and language.…

Computer Vision and Pattern Recognition · Computer Science 2019-06-03 Junyeong Kim, Minuk Ma, Kyungsu Kim, Sungjin Kim, Chang D. Yoo