Related papers: Efficient Image Captioning for Edge Devices

ClipCap: CLIP Prefix for Image Captioning

Image captioning is a fundamental task in vision-language understanding, where the model predicts a textual informative caption to a given input image. In this paper, we present a simple approach to address this task. We use CLIP encoding…

Computer Vision and Pattern Recognition · Computer Science 2021-11-19 Ron Mokady, Amir Hertz, Amit H. Bermano

Linear Alignment of Vision-language Models for Image Captioning

Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches leverage CLIP for cross-modal retrieval to condition…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Fabian Paischer, Markus Hofmarcher, Sepp Hochreiter, Thomas Adler

ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning

Recent lightweight image captioning models using retrieved data mainly focus on text prompts. However, previous works only utilize the retrieved text as text prompts, and the visual information relies only on the CLIP visual embedding.…

Computer Vision and Pattern Recognition · Computer Science 2025-01-27 Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim

SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation

Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and finetuning. As an alternative to large models, we present SmallCap, which generates a caption…

Computer Vision and Pattern Recognition · Computer Science 2023-03-30 Rita Ramos, Bruno Martins, Desmond Elliott, Yova Kementchedjhieva

Injecting Semantic Concepts into End-to-End Image Captioning

Tremendous progress has been made in recent years in developing better image captioning models, yet most of them rely on a separate object detector to extract regional features. Recent vision-language studies are shifting towards the…

Computer Vision and Pattern Recognition · Computer Science 2022-04-01 Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lin Liang, Zhe Gan, Lijuan Wang, Yezhou Yang, Zicheng Liu

Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large…

Computer Vision and Pattern Recognition · Computer Science 2025-05-26 Li Zhong, Ahmed Ghazal, Jun-Jun Wan, Frederik Zilly, Patrick Mackens, Joachim E. Vollrath, Bogdan Sorin Coseriu

CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic…

Computer Vision and Pattern Recognition · Computer Science 2024-05-16 Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Oncel Tuzel

Fine-grained Image Captioning with CLIP Reward

Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe the most salient common objects, models trained with text similarity objectives tend to…

Computation and Language · Computer Science 2023-03-31 Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal

Text Data-Centric Image Captioning with Interactive Prompts

Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 Yiyu Wang, Hao Luo, Jungang Xu, Yingfei Sun, Fan Wang

ContCap: A scalable framework for continual image captioning

While advanced image captioning systems are increasingly describing images coherently and exactly, recent progress in continual learning allows deep learning models to avoid catastrophic forgetting. However, the domain where image…

Computer Vision and Pattern Recognition · Computer Science 2020-04-22 Giang Nguyen, Tae Joon Jun, Trung Tran, Tolcha Yalew, Daeyoung Kim

BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning

This study aims to explore efficient tuning methods for the screenshot captioning task. Recently, image captioning has seen significant advancements, but research in captioning tasks for mobile screens remains relatively scarce. Current…

Machine Learning · Computer Science 2023-09-27 Ching-Yu Chiang, I-Hua Chang, Shih-Wei Liao

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-07 Wei Li, Linchao Zhu, Longyin Wen, Yi Yang

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language…

Computer Vision and Pattern Recognition · Computer Science 2024-01-05 Longtian Qiu, Shan Ning, Xuming He

IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning

Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data…

Computer Vision and Pattern Recognition · Computer Science 2024-09-27 Soeun Lee, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim

DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts

Recent lightweight retrieval-augmented image caption models often utilize retrieved data solely as text prompts, thereby creating a semantic gap by leaving the original visual features unenhanced, particularly for object details or complex…

Computer Vision and Pattern Recognition · Computer Science 2025-10-30 Binbin Li, Guimiao Yang, Zisen Qi, Haiping Wang, Yu Ding

LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation

CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Usman Naseem, Chunyu Wang, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu, Liang Hu

SuperCap: Multi-resolution Superpixel-based Image Captioning

It has been a longstanding goal within image captioning to move beyond a dependence on object detection. We investigate using superpixels coupled with Vision Language Models (VLMs) to bridge the gap between detector-based captioning…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Henry Senior, Luca Rossi, Gregory Slabaugh, Shanxin Yuan

Accurate and Fast Compressed Video Captioning

Existing video captioning approaches typically require to first sample video frames from a decoded video and then conduct a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame…

Computer Vision and Pattern Recognition · Computer Science 2024-01-04 Yaojie Shen, Xin Gu, Kai Xu, Heng Fan, Longyin Wen, Libo Zhang

COMIC: Towards A Compact Image Captioning Model with Attention

Recent works in image captioning have shown very promising raw performance. However, we realize that most of these encoder-decoder style networks with attention do not scale naturally to large vocabulary size, making them difficult to be…

Computer Vision and Pattern Recognition · Computer Science 2019-06-13 Jia Huei Tan, Chee Seng Chan, Joon Huang Chuah

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel