Related papers: ViDiC: Video Difference Captioning

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data…

Computer Vision and Pattern Recognition · Computer Science 2026-05-07 Yuancheng Wei, Haojie Zhang, Linli Yao, Lei Li, Jiali Chen, Tao Huang, Yiting Lu, Duojun Huang, Xin Li, Zhao Zhong

Image Difference Captioning with Pre-training and Contrastive Learning

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require…

Multimedia · Computer Science 2022-02-10 Linli Yao, Weiying Wang, Qin Jin

CLIP4IDC: CLIP for Image Difference Captioning

Image Difference Captioning (IDC) aims at generating sentences to describe differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor.…

Computer Vision and Pattern Recognition · Computer Science 2022-10-19 Zixin Guo, Tzu-Jui Julius Wang, Jorma Laaksonen

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Yolo Y. Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu

VisMin: Visual Minimal-Change Understanding

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar…

Computer Vision and Pattern Recognition · Computer Science 2025-01-23 Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning

Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Yuan Liu, Saihui Hou, Saijie Hou, Jiabao Du, Shibei Meng, Yongzhen Huang

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

VidCtx: Context-aware Video Question Answering with Image Models

To address computational and memory limitations of Large Multimodal Models in the Video Question-Answering task, several recent methods extract textual representations per frame (e.g., by captioning) and feed them to a Large Language Model…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Andreas Goulas, Vasileios Mezaris, Ioannis Patras

DVCFlow: Modeling Information Flow Towards Human-like Video Captioning

Dense video captioning (DVC) aims to generate multi-sentence descriptions to elucidate the multiple events in the video, which is challenging and demands visual consistency, discoursal coherence, and linguistic diversity. Existing methods…

Computer Vision and Pattern Recognition · Computer Science 2021-11-22 Xu Yan, Zhengcong Fei, Shuhui Wang, Qingming Huang, Qi Tian

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the…

Computer Vision and Pattern Recognition · Computer Science 2024-09-24 Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, Lu Hou

VideoMCC: a New Benchmark for Video Comprehension

While there is overall agreement that future technology for organizing, browsing and searching videos hinges on the development of methods for high-level semantic understanding of video, so far no consensus has been reached on the best way…

Computer Vision and Pattern Recognition · Computer Science 2017-06-20 Du Tran, Maksim Bolonkin, Manohar Paluri, Lorenzo Torresani

OneDiff: A Generalist Model for Image Difference Captioning

In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Erdong Hu, Longteng Guo, Tongtian Yue, Zijia Zhao, Shuning Xue, Jing Liu

CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models

We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal…

Computer Vision and Pattern Recognition · Computer Science 2025-12-19 Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li, Huiming Sun, Daitao Xing, Jinglin Shen, Zibo Meng

Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives

Video captioning is a critical task in the field of multimodal machine learning, aiming to generate descriptive and coherent textual narratives for video content. While large vision-language models (LVLMs) have shown significant progress,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Ji-jun Park, Soo-joon Choi

L2C: Describing Visual Differences Needs Semantic Understanding of Individuals

Recent advances in language and vision push forward the research of captioning a single image to describing visual differences between image pairs. Suppose there are two images, I_1 and I_2, and the task is to generate a description W_{1,2}…

Computer Vision and Pattern Recognition · Computer Science 2021-02-04 An Yan, Xin Eric Wang, Tsu-Jui Fu, William Yang Wang

IF-VidCap: Can Video Caption Models Follow Instructions?

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained…

Computer Vision and Pattern Recognition · Computer Science 2025-10-22 Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu

Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Bo Feng, Zhengfeng Lai, Shiyu Li, Zizhen Wang, Simon Wang, Ping Huang, Meng Cao

VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs

Recently, Large Vision-Language Models (LVLMs) have made significant strides across diverse multimodal tasks and benchmarks. This paper reveals a largely under-explored problem from existing video-involved LVLMs - language bias, where…

Computer Vision and Pattern Recognition · Computer Science 2025-02-25 Yiming Yang, Yangyang Guo, Hui Lu, Yan Wang