Related papers: iCap: Interactive Image Captioning with Predictive…

Text Data-Centric Image Captioning with Interactive Prompts

Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 Yiyu Wang, Hao Luo, Jungang Xu, Yingfei Sun, Fan Wang

TextCaps: a Dataset for Image Captioning with Reading Comprehension

Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include…

Computer Vision and Pattern Recognition · Computer Science 2020-08-05 Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-12 Jongsuk Kim, Jiwon Shin, Junmo Kim

A-CAP: Anticipation Captioning with Commonsense Knowledge

Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time. In order to emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for…

Computer Vision and Pattern Recognition · Computer Science 2023-04-14 Duc Minh Vo, Quoc-An Luong, Akihiro Sugimoto, Hideki Nakayama

Self-Supervised Image Captioning with CLIP

Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which…

Computer Vision and Pattern Recognition · Computer Science 2023-11-03 Chuanyang Jin

MICap: A Unified Model for Identity-aware Movie Descriptions

Characters are an important aspect of any storyline and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized…

Computer Vision and Pattern Recognition · Computer Science 2024-05-21 Haran Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi

Face-Cap: Image Captioning using Facial Expression Analysis

Image captioning is the process of generating a natural language description of an image. Most current image captioning models, however, do not take into account the emotional aspect of an image, which is very relevant to activities and…

Computer Vision and Pattern Recognition · Computer Science 2019-01-28 Omid Mohamad Nezami, Mark Dras, Peter Anderson, Len Hamey

A Comprehensive Analysis of Real-World Image Captioning and Scene Identification

Image captioning is a computer vision task that involves generating natural language descriptions for images. This method has numerous applications in various domains, including image retrieval systems, medicine, and various industries.…

Computer Vision and Pattern Recognition · Computer Science 2023-08-08 Sai Suprabhanu Nallapaneni, Subrahmanyam Konakanchi

Putting Humans in the Image Captioning Loop

Image Captioning (IC) models can highly benefit from human feedback in the training process, especially in cases where data is limited. We present work-in-progress on adapting an IC system to integrate human feedback, with the goal to make…

Computation and Language · Computer Science 2023-06-07 Aliki Anagnostopoulou, Mareike Hartmann, Daniel Sonntag

Towards Accurate Text-based Image Captioning with Content Diversity Exploration

Text-based image captioning (TextCap) which aims to read and reason images with texts is crucial for a machine to understand a detailed and complex scene environment, considering that texts are omnipresent in daily life. This task, however,…

Computer Vision and Pattern Recognition · Computer Science 2021-05-10 Guanghui Xu, Shuaicheng Niu, Mingkui Tan, Yucheng Luo, Qing Du, Qi Wu

Controllable Image Captioning

State-of-the-art image captioners can generate accurate sentences to describe images in a sequence to sequence manner without considering the controllability and interpretability. This, however, is far from making image captioning widely…

Computer Vision and Pattern Recognition · Computer Science 2022-05-26 Luka Maxwell

Image Captioning based on Feature Refinement and Reflective Decoding

Image captioning is the process of automatically generating a description of an image in natural language. Image captioning is one of the significant challenges in image understanding since it requires not only recognizing salient objects…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Ghadah Alabduljabbar, Hafida Benhidour, Said Kerrache

ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora

Generating visually grounded image captions with specific linguistic styles using unpaired stylistic corpora is a challenging task, especially since we expect stylized captions with a wide variety of stylistic patterns. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2023-08-03 Kanzhi Cheng, Zheng Ma, Shi Zong, Jianbing Zhang, Xinyu Dai, Jiajun Chen

Pragmatic Issue-Sensitive Image Captioning

Image captioning systems have recently improved dramatically, but they still tend to produce captions that are insensitive to the communicative goals that captions should meet. To address this, we propose Issue-Sensitive Image Captioning…

Computation and Language · Computer Science 2020-10-07 Allen Nie, Reuben Cohn-Gordon, Christopher Potts

MeaCap: Memory-Augmented Zero-shot Image Captioning

Zero-shot image captioning (IC) without well-paired image-text data can be divided into two categories, training-free and text-only-training. Generally, these two types of methods realize zero-shot IC by integrating pretrained…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Zequn Zeng, Yan Xie, Hao Zhang, Chiyu Chen, Zhengjue Wang, Bo Chen

SentiCap: Generating Image Descriptions with Sentiments

The recent progress on image recognition and language modeling is making automatic description of image content a reality. However, stylized, non-factual aspects of the written description are missing from the current systems. One such…

Computer Vision and Pattern Recognition · Computer Science 2015-12-15 Alexander Mathews, Lexing Xie, Xuming He

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

Controllable Image Captioning (CIC) -- generating image descriptions following designated control signals -- has received unprecedented attention over the last few years. To emulate the human ability in controlling caption generation,…

Computer Vision and Pattern Recognition · Computer Science 2021-03-24 Long Chen, Zhihong Jiang, Jun Xiao, Wei Liu

Iconographic Image Captioning for Artworks

Image captioning implies automatically generating textual descriptions of images based only on the visual input. Although this has been an extensively addressed research topic in recent years, not many contributions have been made in the…

Computer Vision and Pattern Recognition · Computer Science 2021-02-09 Eva Cetinic

ClipCap: CLIP Prefix for Image Captioning

Image captioning is a fundamental task in vision-language understanding, where the model predicts a textual informative caption to a given input image. In this paper, we present a simple approach to address this task. We use CLIP encoding…

Computer Vision and Pattern Recognition · Computer Science 2021-11-19 Ron Mokady, Amir Hertz, Amit H. Bermano

RefineCap: Concept-Aware Refinement for Image Captioning

Automatically translating images to texts involves image scene understanding and language modeling. In this paper, we propose a novel model, termed RefineCap, that refines the output vocabulary of the language decoder using decoder-guided…

Computation and Language · Computer Science 2021-09-09 Yekun Chai, Shuo Jin, Junliang Xing