Related papers: Generating Illustrated Instructions

Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks

Multistep instructions, such as recipes and how-to guides, greatly benefit from visual aids, such as a series of images that accompany the instruction steps. While Large Language Models (LLMs) have become adept at generating coherent…

Computer Vision and Pattern Recognition · Computer Science 2024-05-17 João Bordalo, Vasco Ramos, Rodrigo Valério, Diogo Glória-Silva, Yonatan Bitton, Michal Yarom, Idan Szpektor, Joao Magalhaes

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it possible to generate rich kinds of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial…

Computer Vision and Pattern Recognition · Computer Science 2023-08-15 Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, Tat-Seng Chua

Coherent Zero-Shot Visual Instruction Generation

Despite the advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable…

Computer Vision and Pattern Recognition · Computer Science 2024-06-11 Quynh Phung, Songwei Ge, Jia-Bin Huang

ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing

Recently, researchers have proposed powerful systems for generating and manipulating images using natural language instructions. However, it is difficult to precisely specify many common classes of image transformations with text alone. For…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Alec Helbling, Seongmin Lee, Polo Chau

IGD: Instructional Graphic Design with Multimodal Layer Generation

Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still…

Computer Vision and Pattern Recognition · Computer Science 2025-07-15 Yadong Qu, Shancheng Fang, Yuxin Wang, Xiaorui Wang, Zhineng Chen, Hongtao Xie, Yongdong Zhang

InstanceGen: Image Generation with Instance-level Instructions

Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle in capturing the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes.…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Etai Sella, Yanir Kleiman, Hadar Averbuch-Elor

Instruct-Imagen: Image Generation with Multi-modal Instruction

This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of…

Computer Vision and Pattern Recognition · Computer Science 2024-01-05 Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia

$I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion

The effective communication of procedural knowledge remains a significant challenge in natural language processing (NLP), as purely textual instructions often fail to convey complex physical actions and spatial relationships. We address…

Computation and Language · Computer Science 2025-05-23 Jing Bi, Pinxin Liu, Ali Vosoughi, Jiarui Wu, Jinxi He, Chenliang Xu

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in…

Computer Vision and Pattern Recognition · Computer Science 2024-02-27 Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial…

Computer Vision and Pattern Recognition · Computer Science 2024-03-05 Long Lian, Boyi Li, Adam Yala, Trevor Darrell

PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation

Preference-conditioned image generation seeks to adapt generative models to individual users, producing outputs that reflect personal aesthetic choices beyond the given textual prompt. Despite recent progress, existing approaches either…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Wenyi Mo, Tianyu Zhang, Yalong Bai, Ligong Han, Ying Ba, Dimitris N. Metaxas

Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task

Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite its advantages, developing VR…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Subin Raj Peter

Guided Reality: Generating Visually-Enriched AR Task Guidance with LLMs and Vision Models

Large language models (LLMs) have enabled the automatic generation of step-by-step augmented reality (AR) instructions for a wide range of physical tasks. However, existing LLM-based AR guidance often lacks rich visual augmentations to…

Human-Computer Interaction · Computer Science 2025-09-25 Ada Yi Zhao, Aditya Gunturu, Ellen Yi-Luen Do, Ryo Suzuki

Generate, Not Recommend: Personalized Multimodal Content Generation

To address the challenge of information overload from massive web contents, recommender systems are widely applied to retrieve and present personalized results for users. However, recommendation tasks are inherently constrained to filtering…

Artificial Intelligence · Computer Science 2025-06-04 Jiongnan Liu, Zhicheng Dou, Ning Hu, Chenyan Xiong

LuciBot: Automated Robot Policy Learning from Generated Videos

Automatically generating training supervision for embodied tasks is crucial, as manual designing is tedious and not scalable. While prior works use large language models (LLMs) or vision-language models (VLMs) to generate rewards, these…

Computer Vision and Pattern Recognition · Computer Science 2025-03-14 Xiaowen Qiu, Yian Wang, Jiting Cai, Zhehuan Chen, Chunru Lin, Tsun-Hsuan Wang, Chuang Gan

Empowering Large Language Models for Textual Data Augmentation

With the capabilities of understanding and executing natural language instructions, Large language models (LLMs) can potentially act as a powerful tool for textual data augmentation. However, the quality of augmented data depends heavily on…

Computation and Language · Computer Science 2024-04-30 Yichuan Li, Kaize Ding, Jianling Wang, Kyumin Lee

Diffusion Self-Guidance for Controllable Image Generation

Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that…

Computer Vision and Pattern Recognition · Computer Science 2023-06-13 Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, Aleksander Holynski

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer…

Computer Vision and Pattern Recognition · Computer Science 2024-03-19 Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, Ahmed M. Alaa

Instruction-based Image Editing with Planning, Reasoning, and Generation

Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-27 Liya Ji, Chenyang Qi, Qifeng Chen

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Yabo Zhang, Kunchang Li, Dewei Zhou, Xinyu Huang, Xun Wang