Related papers: Multimedia Generative Script Learning for Task Pla…

MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks

Automatically generating scripts (i.e. sequences of key steps described in text) from video demonstrations and reasoning about the subsequent steps are crucial to the modern AI virtual assistants to guide humans to complete everyday tasks,…

Computation and Language · Computer Science 2024-01-22 Jingyuan Qi, Minqian Liu, Ying Shen, Zhiyang Xu, Lifu Huang

Goal-Oriented Script Construction

The knowledge of scripts, common chains of events in stereotypical scenarios, is a valuable asset for task-oriented natural language understanding systems. We propose the Goal-Oriented Script Construction task, where a model produces a…

Computation and Language · Computer Science 2021-09-01 Qing Lyu, Li Zhang, Chris Callison-Burch

Take a Break in the Middle: Investigating Subgoals towards Hierarchical Script Generation

Goal-oriented Script Generation is a new task of generating a list of steps that can fulfill the given goal. In this paper, we propose to extend the task from the perspective of cognitive theory. Instead of a simple flat structure, the…

Computation and Language · Computer Science 2023-05-19 Xinze Li, Yixin Cao, Muhao Chen, Aixin Sun

Generative Modeling for Multi-task Visual Learning

Generative modeling has recently shown great promise in computer vision, but it has mostly focused on synthesizing visually realistic images. In this paper, motivated by multi-task learning of shareable feature representations, we consider…

Computer Vision and Pattern Recognition · Computer Science 2021-06-28 Zhipeng Bao, Martial Hebert, Yu-Xiong Wang

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions. This is a challenging problem as it…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, Josef Sivic

Generative Timelines for Instructed Visual Assembly

The objective of this work is to manipulate visual timelines (e.g. a video) through natural language instructions, making complex timeline editing tasks accessible to non-expert or potentially even disabled users. We call this task…

Computer Vision and Pattern Recognition · Computer Science 2024-11-20 Alejandro Pardo, Jui-Hsien Wang, Bernard Ghanem, Josef Sivic, Bryan Russell, Fabian Caba Heilbron

Visual Goal-Step Inference using wikiHow

Understanding what sequence of steps are needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual…

Computer Vision and Pattern Recognition · Computer Science 2021-09-13 Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch

StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark…

Computer Vision and Pattern Recognition · Computer Science 2023-10-13 Emanuele Bugliarello, Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, Paul Voigtlaender

Non-Sequential Graph Script Induction via Multimedia Grounding

Online resources such as WikiHow compile a wide range of scripts for performing everyday tasks, which can assist models in learning to reason about procedures. However, the scripts are always presented in a linear manner, which does not…

Computation and Language · Computer Science 2023-05-30 Yu Zhou, Sha Li, Manling Li, Xudong Lin, Shih-Fu Chang, Mohit Bansal, Heng Ji

Incorporating Task-specific Concept Knowledge into Script Learning

In this paper, we present Tetris, a new task of Goal-Oriented Script Completion. Unlike previous work, it considers a more realistic and general setting, where the input includes not only the goal but also additional user context, including…

Computation and Language · Computer Science 2023-04-25 Chenkai Sun, Tie Xu, ChengXiang Zhai, Heng Ji

Steerable Scene Generation with Post Training and Inference-Time Search

Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement,…

Robotics · Computer Science 2025-08-27 Nicholas Pfaff, Hongkai Dai, Sergey Zakharov, Shun Iwase, Russ Tedrake

Reading Between the Lines: Exploring Infilling in Visual Narratives

Generating long form narratives such as stories and procedures from multiple modalities has been a long standing dream for artificial intelligence. In this regard, there is often crucial subtext that is derived from the surrounding…

Computation and Language · Computer Science 2020-10-28 Khyathi Raghavi Chandu, Ruo-Ping Dong, Alan Black

Text-Only Training for Visual Storytelling

Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data,…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Yuechen Wang, Wengang Zhou, Zhenbo Lu, Houqiang Li

Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond

The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable…

Multimedia · Computer Science 2024-02-19 Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, Tat-Seng Chua

GIT: A Generative Image-to-text Transformer for Vision and Language

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between…

Computer Vision and Pattern Recognition · Computer Science 2022-12-19 Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges

Creating engaging narratives from visual data is crucial for automated digital media consumption, assistive technologies, and interactive entertainment. This survey covers methodologies used in the generation of these narratives, focusing…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Daniel A. P. Oliveira, Eugénio Ribeiro, David Martins de Matos

Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting

Despite significant progress in the field, it is still challenging to create personalized visual representations that align closely with the desires and preferences of individual users. This process requires users to articulate their ideas…

Computer Vision and Pattern Recognition · Computer Science 2025-01-03 Zijie Chen, Lichao Zhang, Fangsheng Weng, Lili Pan, Zhenzhong Lan

Goal-Directed Story Generation: Augmenting Generative Language Models with Reinforcement Learning

The advent of large pre-trained generative language models has provided a common framework for AI story generation via sampling the model to create sequences that continue the story. However, sampling alone is insufficient for story…

Computation and Language · Computer Science 2021-12-17 Amal Alabdulkarim, Winston Li, Lara J. Martin, Mark O. Riedl

Multi-modal Generation via Cross-Modal In-Context Learning

In this work, we study the problem of generating novel images from complex multimodal prompt sequences. While existing methods achieve promising results for text-to-image generation, they often struggle to capture fine-grained details from…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Amandeep Kumar, Muzammal Naseer, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal

Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline

Video storytelling is engaging multimedia content that utilizes video and its accompanying narration to attract the audience, where a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video…

Multimedia · Computer Science 2024-12-31 Dingyi Yang, Chunru Zhan, Ziheng Wang, Biao Wang, Tiezheng Ge, Bo Zheng, Qin Jin