Related papers: Oracle performance for visual captioning

Self-Supervised Image Captioning with CLIP

Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which…

Computer Vision and Pattern Recognition · Computer Science 2023-11-03 Chuanyang Jin

Partially-Supervised Image Captioning

Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild - for example, as assistants for people with impaired vision - a…

Computer Vision and Pattern Recognition · Computer Science 2018-11-29 Peter Anderson, Stephen Gould, Mark Johnson

Improving Image Captioning with Better Use of Captions

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision community. In this paper, we present a novel image captioning architecture to better explore semantics…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Zhan Shi, Xu Zhou, Xipeng Qiu, Xiaodan Zhu

Fusion Models for Improved Visual Captioning

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This…

Computer Vision and Pattern Recognition · Computer Science 2021-03-01 Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

A Thorough Review on Recent Deep Learning Methodologies for Image Captioning

Image Captioning is a task that combines computer vision and natural language processing, where it aims to generate descriptive legends for images. It is a two-fold process relying on accurate image understanding and correct language…

Computer Vision and Pattern Recognition · Computer Science 2021-07-29 Ahmed Elhagry, Karima Kadaoui

From Captions to Visual Concepts and Back

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to…

Computer Vision and Pattern Recognition · Computer Science 2016-02-22 Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig

Deep Learning Approaches on Image Captioning: A Review

Image captioning is a research area of immense importance, aiming to generate natural language descriptions for visual content in the form of still images. The advent of deep learning and more recently vision-language pre-training…

Computer Vision and Pattern Recognition · Computer Science 2023-08-29 Taraneh Ghandi, Hamidreza Pourreza, Hamidreza Mahyar

Injecting Prior Knowledge into Image Caption Generation

Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them. The…

Computation and Language · Computer Science 2020-08-07 Arushi Goel, Basura Fernando, Thanh-Son Nguyen, Hakan Bilen

An Integrated Approach for Video Captioning and Applications

Physical computing infrastructure, data gathering, and algorithms have recently had significant advances to extract information from images and videos. The growth has been especially outstanding in image captioning and video captioning.…

Computer Vision and Pattern Recognition · Computer Science 2022-01-25 Soheyla Amirian, Thiab R. Taha, Khaled Rasheed, Hamid R. Arabnia

Accurate and Fast Compressed Video Captioning

Existing video captioning approaches typically require to first sample video frames from a decoded video and then conduct a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame…

Computer Vision and Pattern Recognition · Computer Science 2024-01-04 Yaojie Shen, Xin Gu, Kai Xu, Heng Fan, Longyin Wen, Libo Zhang

Delving Deeper into the Decoder for Video Captioning

Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence. The encoder-decoder framework is the most popular paradigm for this task in recent years. However, there exist some…

Computer Vision and Pattern Recognition · Computer Science 2021-02-15 Haoran Chen, Jianmin Li, Xiaolin Hu

Guided Open Vocabulary Image Captioning with Constrained Beam Search

Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real world applications dealing with images in the wild. We…

Computer Vision and Pattern Recognition · Computer Science 2017-07-21 Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Video detailed captioning is a key task which aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner…

Computer Vision and Pattern Recognition · Computer Science 2025-04-10 Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning

Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Video captioning is an essential technology to understand scenes and describe events in natural language. To apply it to real-time monitoring, a system needs not only to describe events accurately but also to produce the captions as soon as…

Computer Vision and Pattern Recognition · Computer Science 2021-08-05 Chiori Hori, Takaaki Hori, Jonathan Le Roux

Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Chia-Wen Kuo, Zsolt Kira

VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall

Image captioning has become an essential Vision & Language research task. It is about predicting the most accurate caption given a specific image or video. The research community has achieved impressive results by continuously proposing new…

Computer Vision and Pattern Recognition · Computer Science 2025-01-28 Guillermo Ruiz, Tania Ramírez, Daniela Moctezuma

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

Given the accelerating progress of vision and language modeling, accurate evaluation of machine-generated image captions remains critical. In order to evaluate captions more closely to human preferences, metrics need to discriminate between…

Computer Vision and Pattern Recognition · Computer Science 2024-02-29 Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki

Automated Image Captioning for Rapid Prototyping and Resource Constrained Environments

Significant performance gains in deep learning coupled with the exponential growth of image and video data on the Internet have resulted in the recent emergence of automated image captioning systems. Ensuring scalability of automated image…

Computer Vision and Pattern Recognition · Computer Science 2016-06-07 Karan Sharma, Arun CS Kumar, Suchendra Bhandarkar

From Image Captioning to Visual Storytelling

Visual Storytelling is a challenging multimodal task between Vision & Language, where the purpose is to generate a story for a stream of images. Its difficulty lies on the fact that the story should be both grounded to the image sequence…

Computation and Language · Computer Science 2025-08-21 Admitos Passadakis, Yingjin Song, Albert Gatt

Image Captioning with Clause-Focused Metrics in a Multi-Modal Setting for Marketing

Automatically generating descriptive captions for images is a well-researched area in computer vision. However, existing evaluation approaches focus on measuring the similarity between two sentences disregarding fine-grained semantics of…

Computer Vision and Pattern Recognition · Computer Science 2019-08-07 Philipp Harzig, Dan Zecha, Rainer Lienhart, Carolin Kaiser, René Schallner