Related papers: DiffCap-Bench: A Comprehensive, Challenging, Robus…

OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning

Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements,…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-12 · Yuan Liu, Saihui Hou, Saijie Hou, Jiabao Du, Shibei Meng, Yongzhen Huang

Image Difference Captioning with Pre-training and Contrastive Learning

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require…

Multimedia · Computer Science · 2022-02-10 · Linli Yao, Weiying Wang, Qin Jin

ViDiC: Video Difference Captioning

Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-25 · Jiangtao Wu, Shihao Li, Zhaozhou Bian, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Yuanxing Zhang, Jiaheng Liu

ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-14 · Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yang, Guangji Ma, Xiongkuo Min, Ke Gu, Guangtao Zhai, Patrick Le Callet

CLIP4IDC: CLIP for Image Difference Captioning

Image Difference Captioning (IDC) aims at generating sentences to describe differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor.…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-19 · Zixin Guo, Tzu-Jui Julius Wang, Jorma Laaksonen

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-12 · Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan

OneDiff: A Generalist Model for Image Difference Captioning

In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-27 · Erdong Hu, Longteng Guo, Tongtian Yue, Zijia Zhao, Shuning Xue, Jing Liu

HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning

Hallucination detection in captions (HalDec) assesses a vision-language model's ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-25 · Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-24 · Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

Remote Sensing Image Change Captioning (RSICC) aims to generate natural language descriptions of surface changes between multi-temporal remote sensing images, detailing the categories, locations, and dynamics of changed objects (e.g.,…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-19 · Zhiming Wang, Mingze Wang, Sheng Xu, Yanjing Li, Baochang Zhang

Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain

In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge.…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-08 · Lianying Chao, Kai Zhang, Haoran Cai, Sijie Wu, Xubin Li, Xin Chen

MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping while…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-14 · Jihyung Kil, Zheda Mai, Justin Lee, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, Arpita Chowdhury, Wei-Lun Chao

Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation

The rise of the generative models quality during the past years enabled the generation of edited variations of images at an important scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-15 · Gautier Evennou, Antoine Chaffin, Vivien Chappelier, Ewa Kijak

IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-11 · Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan

Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap, models…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-05 · Haoyu Wang, Haonan Wang, Yuyan Chen, Jun Chen, Gang Liu, Qian Wang, Jiahong Yan, Yanghua Xiao

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-27 · Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks, yet their capabilities in fine-grained visual understanding remain insufficiently evaluated. Existing benchmarks either contain limited…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-30 · Fengbin Zhu, Ziyang Liu, Xiang Yao Ng, Haohui Wu, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, Tat Seng Chua

IF-VidCap: Can Video Caption Models Follow Instructions?

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-22 · Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-14 · Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao