Related papers: Consensus-based Sequence Training for Video Captio…

Reinforced Video Captioning with Entailment Rewards

Sequence-to-sequence models have shown promising improvements on the temporal task of video captioning, but they optimize word-level cross-entropy loss during training. First, using policy gradient and mixed-loss methods for reinforcement…

Computation and Language · Computer Science 2017-08-09 Ramakanth Pasunuru, Mohit Bansal

Self-critical Sequence Training for Image Captioning

Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of…

Machine Learning · Computer Science 2017-11-17 Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, Vaibhava Goel

Teacher-Critical Training Strategies for Image Captioning

Existing image captioning models are usually trained by cross-entropy (XE) loss and reinforcement learning (RL), which set ground-truth words as hard targets and force the captioning model to learn from them. However, the widely adopted…

Computer Vision and Pattern Recognition · Computer Science 2022-08-08 Yiqing Huang, Jiansheng Chen

Boosting Video Captioning with Dynamic Loss Network

Video captioning is one of the challenging problems at the intersection of vision and language, having many real-life applications in video retrieval, video surveillance, assisting visually challenged people, Human-machine interface, and…

Computer Vision and Pattern Recognition · Computer Science 2022-02-03 Nasib Ullah, Partha Pratim Mohanta

Self-critical Sequence Training for Automatic Speech Recognition

Although automatic speech recognition (ASR) task has gained remarkable success by sequence-to-sequence models, there are two main mismatches between its training and testing that might lead to performance degradation: 1) The typically used…

Computation and Language · Computer Science 2022-04-14 Chen Chen, Yuchen Hu, Nana Hou, Xiaofeng Qi, Heqing Zou, Eng Siong Chng

Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning

Training image captioning models using teacher forcing results in very generic samples, whereas more distinctive captions can be very useful in retrieval applications or to produce alternative texts describing images for accessibility.…

Computation and Language · Computer Science 2024-02-22 Antoine Chaffin, Ewa Kijak, Vincent Claveau

CNN+CNN: Convolutional Decoders for Image Captioning

Image captioning is a challenging task that combines the field of computer vision and natural language processing. A variety of approaches have been proposed to achieve the goal of automatically describing an image, and recurrent neural…

Computer Vision and Pattern Recognition · Computer Science 2018-05-24 Qingzhong Wang, Antoni B. Chan

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin

Actor-Critic Sequence Training for Image Captioning

Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are…

Computer Vision and Pattern Recognition · Computer Science 2017-11-29 Li Zhang, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, Timothy M. Hospedales

An Efficient Self-Supervised Cross-View Training For Sentence Embedding

Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a…

Computation and Language · Computer Science 2023-11-07 Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT)…

Computer Vision and Pattern Recognition · Computer Science 2025-10-27 Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal

Reinforcement Learning for Unsupervised Video Summarization with Reward Generator Training

This paper presents a novel approach for unsupervised video summarization using reinforcement learning (RL), addressing limitations like unstable adversarial training and reliance on heuristic-based reward functions. The method operates on…

Multimedia · Computer Science 2025-12-24 Mehryar Abbasi, Hadi Hadizadeh, Parvaneh Saeedi

Summary Level Training of Sentence Rewriting for Abstractive Summarization

As an attempt to combine extractive and abstractive summarization, Sentence Rewriting models adopt the strategy of extracting salient sentences from a document first and then paraphrasing the selected ones to generate a summary. However,…

Computation and Language · Computer Science 2019-09-27 Sanghwan Bae, Taeuk Kim, Jihoon Kim, Sang-goo Lee

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-11-14 Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling

Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely…

Computer Vision and Pattern Recognition · Computer Science 2021-02-15 Haoran Chen, Ke Lin, Alexander Maye, Jianming Li, Xiaolin Hu

B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning

Bayesian deep neural networks (DNNs) can provide a mathematically grounded framework to quantify uncertainty in predictions from image captioning models. We propose a Bayesian variant of policy-gradient based reinforcement learning training…

Machine Learning · Computer Science 2020-06-30 Shashank Bujimalla, Mahesh Subedar, Omesh Tickoo

CLIP4Caption: CLIP for Video Caption

Video captioning is a challenging task since it requires generating sentences describing various diverse and complex videos. Existing video captioning models lack adequate visual representation due to the neglect of the existence of gaps…

Computer Vision and Pattern Recognition · Computer Science 2021-10-14 Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li

Active Learning for Video Description With Cluster-Regularized Ensemble Ranking

Automatic video captioning aims to train models to generate text descriptions for all segments in a video, however, the most effective approaches require large amounts of manual annotation which is slow and expensive. Active learning is a…

Computer Vision and Pattern Recognition · Computer Science 2020-12-04 David M. Chan, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning

Research on continual learning has led to a variety of approaches to mitigating catastrophic forgetting in feed-forward classification networks. Until now surprisingly little attention has been focused on continual learning of recurrent…

Computer Vision and Pattern Recognition · Computer Science 2020-10-30 Riccardo Del Chiaro, Bartłomiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer

Multitask learning in Audio Captioning: a sentence embedding regression loss acts as a regularizer

In this work, we propose to study the performance of a model trained with a sentence embedding regression loss component for the Automated Audio Captioning task. This task aims to build systems that can describe audio content with a single…

Sound · Computer Science 2023-05-03 Etienne Labbé, Julien Pinquier, Thomas Pellegrini