Related papers: Image2Struct: Benchmarking Structure Extraction fo…

ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation

State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual…

Computer Vision and Pattern Recognition · Computer Science 2023-11-23 Yangyi Chen, Xingyao Wang, Manling Li, Derek Hoiem, Heng Ji

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

Coding the Visual World: From Image to Simulation Using Vision Language Models

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Sagi Eppel

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a…

Computer Vision and Pattern Recognition · Computer Science 2026-04-09 Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on…

Computation and Language · Computer Science 2023-06-19 Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural…

Computation and Language · Computer Science 2025-07-29 Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque

Inference-Time Structural Reasoning for Compositional Vision-Language Understanding

Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, distinguishing captions that share the same words but differ in relational structure. We present, a unified evaluation and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Amartya Bhattacharya

Benchmarking and Improving Detail Image Caption

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, few large vision-language model (LVLM) research discusses model's image captioning performance because of the outdated short-caption…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

VLM@school -- Evaluation of AI image understanding on German middle school knowledge

This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to…

Artificial Intelligence · Computer Science 2025-06-30 René Peinl, Vincent Tischler

A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models

Large Vision-Language Models (LVLMs), despite their recent success, are hardly comprehensively tested for their cognitive abilities. Inspired by the prevalent use of the Cookie Theft task in human cognitive tests, we propose a novel…

Artificial Intelligence · Computer Science 2025-02-14 Xiujie Song, Mengyue Wu, Kenny Q. Zhu, Chunhao Zhang, Yanyi Chen

Vision language models are unreliable at trivial spatial cognition

Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability…

Computer Vision and Pattern Recognition · Computer Science 2025-04-23 Sangeet Khemlani, Tyler Tran, Nathaniel Gyory, Anthony M. Harrison, Wallace E. Lawson, Ravenna Thielstrom, Hunter Thompson, Taaren Singh, J. Gregory Trafton

Image Recognition with Vision and Language Embeddings of VLMs

Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Illia Volkov, Nikita Kisel, Klara Janouskova, Jiri Matas

StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation

Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for…

Computation and Language · Computer Science 2025-07-30 Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz

Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities…

Computer Vision and Pattern Recognition · Computer Science 2026-05-21 Yifan Jiang, Cong Zhang, Bofei Zhang, Qiaofeng Zheng, Yifan Yang, Bingzhang Wang, Yew-Soon Ong

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park

CogVLM2: Visual Language Models for Image and Video Understanding

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang

IRR: Image Review Ranking Framework for Evaluating Vision-Language Models

Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models excel at generating factual content, their ability to…

Computation and Language · Computer Science 2024-12-17 Kazuki Hayashi, Kazuma Onishi, Toma Suzuki, Yusuke Ide, Seiji Gobara, Shigeki Saito, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also…

Computer Vision and Pattern Recognition · Computer Science 2026-05-19 Qing'an Liu, Juntong Feng, Yuhao Wang, Xinzhe Han, Yujie Cheng, Yue Zhu, Haiwen Diao, Yunzhi Zhuge, Huchuan Lu

V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Large Vision-Language Models (LVLMs) have made significant strides in the field of video understanding in recent times. Nevertheless, existing video benchmarks predominantly rely on text prompts for evaluation, which often require complex…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Xikun Bao, Lin Chen, Zehui Chen, Qing Miao, Chenxi Liu, Jie Zhao, Feng Zhao

FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability to perform…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo