Related papers: OBJ2TEXT: Generating Visually Descriptive Language…

Sequence to Sequence -- Video to Text

Real-world videos often have complex dynamics; and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length.…

Computer Vision and Pattern Recognition · Computer Science 2015-10-20 Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

Exploring Visual Relationship for Image Captioning

It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation.…

Computer Vision and Pattern Recognition · Computer Science 2018-09-20 Ting Yao, Yingwei Pan, Yehao Li, Tao Mei

Object Captioning and Retrieval with Natural Language

We address the problem of jointly learning vision and language to understand the object in a fine-grained manner. The key idea of our approach is the use of object descriptions to provide the detailed understanding of an object. Based on…

Computer Vision and Pattern Recognition · Computer Science 2018-03-19 Anh Nguyen, Thanh-Toan Do, Ian Reid, Darwin G. Caldwell, Nikos G. Tsagarakis

Image Captioning with Object Detection and Localization

Automatically generating a natural language description of an image is a task close to the heart of image understanding. In this paper, we present a multi-model neural network method closely related to the human visual system that…

Computer Vision and Pattern Recognition · Computer Science 2017-06-09 Zhongliang Yang, Yu-Jin Zhang, Sadaqat ur Rehman, Yongfeng Huang

Meaning guided video captioning

Current video captioning approaches often suffer from problems of missing objects in the video to be described, while generating captions semantically similar with ground truth sentences. In this paper, we propose a new approach to video…

Computer Vision and Pattern Recognition · Computer Science 2019-12-13 Rushi J. Babariya, Toru Tamaki

Pix2seq: A Language Modeling Framework for Object Detection

We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton

Learning Object Detection from Captions via Textual Scene Attributes

Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper…

Computer Vision and Pattern Recognition · Computer Science 2020-10-01 Achiya Jerbi, Roei Herzig, Jonathan Berant, Gal Chechik, Amir Globerson

Phrase-based Image Captioning with Hierarchical LSTM Model

Automatic generation of caption to describe the content of an image has been gaining a lot of research interests recently, where most of the existing works treat the image caption as pure sequential data. Natural language, however possess a…

Computer Vision and Pattern Recognition · Computer Science 2017-11-16 Ying Hua Tan, Chee Seng Chan

Image Generation from Layout

Despite significant recent progress on generative models, controlled generation of images depicting multiple and complex object layouts is still a difficult problem. Among the core challenges are the diversity of appearance a given object…

Computer Vision and Pattern Recognition · Computer Science 2019-10-16 Bo Zhao, Lili Meng, Weidong Yin, Leonid Sigal

Generating Descriptions for Sequential Images with Local-Object Attention and Global Semantic Context Modelling

In this paper, we propose an end-to-end CNN-LSTM model for generating descriptions for sequential images with a local-object attention mechanism. To generate coherent descriptions, we capture global semantic context using a multi-layer…

Computation and Language · Computer Science 2020-12-03 Jing Su, Chenghua Lin, Mian Zhou, Qingyun Dai, Haoyu Lv

Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style

Image captioning is a research hotspot where encoder-decoder models combining convolutional neural network (CNN) and long short-term memory (LSTM) achieve promising results. Despite significant progress, these models generate sentences…

Computer Vision and Pattern Recognition · Computer Science 2019-10-16 Hongwei Ge, Zehang Yan, Kai Zhang, Mingde Zhao, Liang Sun

Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks

Visual tasks vary a lot in their output formats and concerned contents, therefore it is hard to process them with an identical structure. One main obstacle lies in the high-dimensional outputs in object-level visual tasks. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2022-09-29 Zhiyang Chen, Yousong Zhu, Zhaowen Li, Fan Yang, Wei Li, Haixin Wang, Chaoyang Zhao, Liwei Wu, Rui Zhao, Jinqiao Wang, Ming Tang

phi-LSTM: A Phrase-based Hierarchical LSTM Model for Image Captioning

A picture is worth a thousand words. Not until recently, however, we noticed some success stories in understanding of visual scenes: a model that is able to detect/name objects, describe their attributes, and recognize their…

Computation and Language · Computer Science 2017-10-27 Ying Hua Tan, Chee Seng Chan

CapText: Large Language Model-based Caption Generation From Image Context and Description

While deep-learning models have been shown to perform well on image-to-text datasets, it is difficult to use them in practice for captioning images. This is because captions traditionally tend to be context-dependent and offer complementary…

Machine Learning · Computer Science 2023-06-07 Shinjini Ghosh, Sagnik Anupam

Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects

Image captioning often requires a large set of training image-sentence pairs. In practice, however, acquiring sufficient training pairs is always expensive, making the recent captioning models limited in their ability to describe objects…

Computer Vision and Pattern Recognition · Computer Science 2017-08-18 Ting Yao, Yingwei Pan, Yehao Li, Tao Mei

Encoder-Decoder Based Long Short-Term Memory (LSTM) Model for Video Captioning

This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions. The many-to-many mapping occurs via an input temporal sequence of video frames to an output…

Computer Vision and Pattern Recognition · Computer Science 2024-01-05 Sikiru Adewale, Tosin Ige, Bolanle Hafiz Matti

Image Captioning: Transforming Objects into Words

Image captioning models typically follow an encoder-decoder architecture which uses abstract image feature vectors as input to the encoder. One of the most successful algorithms uses feature vectors extracted from the region proposals…

Computer Vision and Pattern Recognition · Computer Science 2020-01-14 Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares

Image Captioning with Deep Bidirectional LSTMs

This work presents an end-to-end trainable deep bidirectional LSTM (Long-Short Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning…

Computer Vision and Pattern Recognition · Computer Science 2016-07-21 Cheng Wang, Haojin Yang, Christian Bartz, Christoph Meinel

Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Image captioning attempts to generate a sentence composed of several linguistic words, which are used to describe objects, attributes, and interactions in an image, denoted as visual semantic units in this paper. Based on this view, we…

Computer Vision and Pattern Recognition · Computer Science 2019-08-07 Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, Hanqing Lu

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in…

Computer Vision and Pattern Recognition · Computer Science 2024-02-27 Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka