Related papers: Widget Captioning: Generating Natural Language Des…

An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation

Existing automated techniques for software documentation typically attempt to reason between two main sources of information: code and natural language. However, this reasoning process is often complicated by the lexical gap between more…

Software Engineering · Computer Science · 2023-01-04 · Kevin Moran, Ali Yachnes, George Purnell, Junayed Mahmud, Michele Tufano, Carlos Bernal-Cárdenas, Denys Poshyvanyk, Zach H'Doubler

TGIF: A New Dataset and Benchmark on Animated GIF Description

With the recent popularity of animated GIFs on social media, there is a need for ways to index them with rich metadata. To advance research on animated GIF understanding, we collected a new dataset, Tumblr GIF (TGIF), with 100K animated GIFs…

Computer Vision and Pattern Recognition · Computer Science · 2016-04-13 · Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, Jiebo Luo

Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning

Mobile User Interface Summarization generates succinct language descriptions of mobile screens for conveying important contents and functionalities of the screen, which can be useful for many language-based application scenarios. We present…

Human-Computer Interaction · Computer Science · 2021-08-10 · Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li

Visualizing Natural Language Descriptions: A Survey

A natural language interface exploits the conceptual simplicity and naturalness of the language to create a high-level user-friendly communication channel between humans and machines. One of the promising applications of such interfaces is…

Computation and Language · Computer Science · 2016-07-05 · Kaveh Hassani, Won-Sook Lee

What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most…

Computer Vision and Pattern Recognition · Computer Science · 2023-01-16 · David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny

WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics

Modeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine…

Human-Computer Interaction · Computer Science · 2023-02-01 · Jason Wu, Siyan Wang, Siman Shen, Yi-Hao Peng, Jeffrey Nichols, Jeffrey P. Bigham

Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content

Natural language descriptions sometimes accompany visualizations to better communicate and contextualize their insights, and to improve their accessibility for readers with disabilities. However, it is difficult to evaluate the usefulness…

Human-Computer Interaction · Computer Science · 2021-10-12 · Alan Lundgard, Arvind Satyanarayan

An Attempt towards Interpretable Audio-Visual Video Captioning

Automatically generating a natural language sentence to describe the content of an input video is a very challenging problem. It is an essential multimodal task in which auditory and visual contents are equally important. Although audio…

Computer Vision and Pattern Recognition · Computer Science · 2018-12-10 · Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu

Caption: Generating Informative Content Labels for Image Buttons Using Next-Screen Context

We present Caption, an LLM-powered content label generation tool for visual interactive elements on mobile devices. Content labels are essential for screen readers to provide announcements for image-based elements, but are often missing or…

Human-Computer Interaction · Computer Science · 2025-08-13 · Mingyuan Zhong, Ajit Mallavarapu, Qing Nie

Generating Diverse and Meaningful Captions

Image Captioning is a task that requires models to acquire a multi-modal understanding of the world and to express this understanding in natural language text. While the state-of-the-art for this task has rapidly improved in terms of n-gram…

Computer Vision and Pattern Recognition · Computer Science · 2018-12-20 · Annika Lindh, Robert J. Ross, Abhijit Mahalunkar, Giancarlo Salton, John D. Kelleher

Beyond Caption To Narrative: Video Captioning With Multiple Sentences

Recent advances in image captioning task have led to increasing interests in video captioning task. However, most works on video captioning are focused on generating single input of aggregated features, which hardly deviates from image…

Computer Vision and Pattern Recognition · Computer Science · 2016-05-19 · Andrew Shin, Katsunori Ohnishi, Tatsuya Harada

Face-Cap: Image Captioning using Facial Expression Analysis

Image captioning is the process of generating a natural language description of an image. Most current image captioning models, however, do not take into account the emotional aspect of an image, which is very relevant to activities and…

Computer Vision and Pattern Recognition · Computer Science · 2019-01-28 · Omid Mohamad Nezami, Mark Dras, Peter Anderson, Len Hamey

Composable Prompting Workspaces for Creative Writing: Exploration and Iteration Using Dynamic Widgets

Generative AI models offer many possibilities for text creation and transformation. Current graphical user interfaces (GUIs) for prompting them lack support for iterative exploration, as they do not represent prompts as actionable interface…

Human-Computer Interaction · Computer Science · 2025-03-28 · Rifat Mehreen Amin, Oliver Hans Kühle, Daniel Buschek, Andreas Butz

A Comprehensive Analysis of Real-World Image Captioning and Scene Identification

Image captioning is a computer vision task that involves generating natural language descriptions for images. This method has numerous applications in various domains, including image retrieval systems, medicine, and various industries.…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-08 · Sai Suprabhanu Nallapaneni, Subrahmanyam Konakanchi

From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-08 · Mihai Masala, Marius Leordeanu

Taking an Emotional Look at Video Paragraph Captioning

Translating visual data into natural language is essential for machines to understand the world and interact with humans. In this work, a comprehensive study is conducted on video paragraph captioning, with the goal to generate…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-15 · Qinyu Li, Tengpeng Li, Hanli Wang, Chang Wen Chen

Improving Multimodal Datasets with Image Captioning

Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our…

Machine Learning · Computer Science · 2023-10-27 · Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt

Discourse Analysis for Evaluating Coherence in Video Paragraph Captions

Video paragraph captioning is the task of automatically generating a coherent paragraph description of the actions in a video. Previous linguistic studies have demonstrated that coherence of a natural language text is reflected by its…

Computer Vision and Pattern Recognition · Computer Science · 2022-01-19 · Arjun R Akula, Song-Chun Zhu

Caption Generation of Robot Behaviors based on Unsupervised Learning of Action Segments

Bridging robot action sequences and their natural language captions is an important task to increase explainability of human assisting robots in their recently evolving field. In this paper, we propose a system for generating natural…

Computation and Language · Computer Science · 2020-03-24 · Koichiro Yoshino, Kohei Wakimoto, Yuta Nishimura, Satoshi Nakamura

Towards Better Semantic Understanding of Mobile Interfaces

Improving the accessibility and automation capabilities of mobile devices can have a significant positive impact on the daily lives of countless users. To stimulate research in this direction, we release a human-annotated dataset with…

Human-Computer Interaction · Computer Science · 2022-10-07 · Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Jindong Chen, Abhanshu Sharma, James Stout