Related papers: DiffCap-Bench: A Comprehensive, Challenging, Robus…

200 papers

Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Yuan Liu, Saihui Hou, Saijie Hou, Jiabao Du, Shibei Meng, Yongzhen Huang

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require…

Multimedia · Computer Science 2022-02-10 Linli Yao, Weiying Wang, Qin Jin

Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Jiangtao Wu, Shihao Li, Zhaozhou Bian, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Yuanxing Zhang, Jiaheng Liu

Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yang, Guangji Ma, Xiongkuo Min, Ke Gu, Guangtao Zhai, Patrick Le Callet

Image Difference Captioning (IDC) aims at generating sentences to describe differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor.…

Computer Vision and Pattern Recognition · Computer Science 2022-10-19 Zixin Guo, Tzu-Jui Julius Wang, Jorma Laaksonen

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan

In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Erdong Hu, Longteng Guo, Tongtian Yue, Zijia Zhao, Shuning Xue, Jing Liu

Hallucination detection in captions (HalDec) assesses a vision-language model's ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Remote Sensing Image Change Captioning (RSICC) aims to generate natural language descriptions of surface changes between multi-temporal remote sensing images, detailing the categories, locations, and dynamics of changed objects (e.g.,…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Zhiming Wang, Mingze Wang, Sheng Xu, Yanjing Li, Baochang Zhang

In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge.…

Computer Vision and Pattern Recognition · Computer Science 2026-05-08 Lianying Chao, Kai Zhang, Haoran Cai, Sijie Wu, Xubin Li, Xin Chen

The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping while…

Computer Vision and Pattern Recognition · Computer Science 2025-01-14 Jihyung Kil, Zheda Mai, Justin Lee, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, Arpita Chowdhury, Wei-Lun Chao

The rise of the generative models quality during the past years enabled the generation of edited variations of images at an important scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims…

Computer Vision and Pattern Recognition · Computer Science 2025-10-15 Gautier Evennou, Antoine Chaffin, Vivien Chappelier, Ewa Kijak

Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan

In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap, models…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Haoyu Wang, Haonan Wang, Yuyan Chen, Jun Chen, Gang Liu, Qian Wang, Jiahong Yan, Yanghua Xiao

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks, yet their capabilities in fine-grained visual understanding remain insufficiently evaluated. Existing benchmarks either contain limited…

Computer Vision and Pattern Recognition · Computer Science 2024-10-30 Fengbin Zhu, Ziyang Liu, Xiang Yao Ng, Haohui Wu, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, Tat Seng Chua

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained…

Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual…

Computer Vision and Pattern Recognition · Computer Science 2025-05-14 Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao