Related papers: ViStruct: Visual Structural Knowledge Extraction v…

Teaching Structured Vision&Language Concepts to Vision&Language Models

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured…

Computer Vision and Pattern Recognition · Computer Science 2023-06-01 Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

ViStruct: Simulating Expert-Like Reasoning Through Task Decomposition and Visual Attention Cues

Data visualization tasks often require multi-step reasoning, and the interpretive strategies experts use, such as decomposing complex goals into smaller subtasks and selectively attending to key chart regions, are rarely made explicit.…

Human-Computer Interaction · Computer Science 2025-06-30 Oliver Huang, Carolina Nobre

Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, Percy Liang

VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction

This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in…

Computer Vision and Pattern Recognition · Computer Science 2025-08-06 Rongxin Jiang, Robert Long, Chenghao Gu, Mingrui Yan

Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking

Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Bastian Pätzold, Jan Nogga, Sven Behnke

Vision language models are unreliable at trivial spatial cognition

Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability…

Computer Vision and Pattern Recognition · Computer Science 2025-04-23 Sangeet Khemlani, Tyler Tran, Nathaniel Gyory, Anthony M. Harrison, Wallace E. Lawson, Ravenna Thielstrom, Hunter Thompson, Taaren Singh, J. Gregory Trafton

ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) on visual commonsense reasoning (VCR) problems. We find that VLMs and LLMs-based decision pipelines are…

Computer Vision and Pattern Recognition · Computer Science 2024-05-20 Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang

Learning to Compose Dynamic Tree Structures for Visual Contexts

We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A. Our visual context tree model, dubbed VCTree, has two key…

Computer Vision and Pattern Recognition · Computer Science 2018-12-06 Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, Wei Liu

Language Model as Visual Explainer

In this paper, we present Language Model as Visual Explainer (LVX), a systematic approach for interpreting the internal workings of vision models using a tree-structured linguistic explanation, without the need for model training. Central to…

Computer Vision and Pattern Recognition · Computer Science 2024-12-12 Xingyi Yang, Xinchao Wang

Improving Fine-grained Visual Understanding in VLMs through Text-Only Training

Visual-Language Models (VLMs) have become a powerful tool for bridging the gap between visual and linguistic understanding. However, the conventional learning approaches for VLMs often suffer from limitations, such as the high resource…

Computation and Language · Computer Science 2025-04-01 Dasol Choi, Guijin Son, Soo Yong Kim, Gio Paik, Seunghyeok Hong

Beyond Embeddings: The Promise of Visual Table in Visual Reasoning

Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often…

Computer Vision and Pattern Recognition · Computer Science 2024-06-18 Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang

Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah

VicTR: Video-conditioned Text Representations for Activity Recognition

Vision-Language models (VLMs) have excelled in the image domain, especially in zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as…

Computer Vision and Pattern Recognition · Computer Science 2024-04-01 Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of…

Artificial Intelligence · Computer Science 2026-02-25 Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding

Visually-Augmented Language Modeling

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which…

Computation and Language · Computer Science 2023-02-28 Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

Visual-Semantic Embedding Model Informed by Structured Knowledge

We propose a novel approach to improve a visual-semantic embedding model by incorporating concept representations captured from an external structured knowledge base. We investigate its performance on image classification under both…

Computer Vision and Pattern Recognition · Computer Science 2020-09-22 Mirantha Jayathilaka, Tingting Mu, Uli Sattler

Vision-Language Models for Vision Tasks: A Survey

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition…

Computer Vision and Pattern Recognition · Computer Science 2024-02-19 Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu

Visual In-Context Learning for Large Vision-Language Models

In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual…

Computer Vision and Pattern Recognition · Computer Science 2024-02-20 Yucheng Zhou, Xiang Li, Qianning Wang, Jianbing Shen

Coding the Visual World: From Image to Simulation Using Vision Language Models

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Sagi Eppel