Related papers

Related papers: Image2Struct: Benchmarking Structure Extraction fo…

200 papers

State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual…

Computer Vision and Pattern Recognition · Computer Science 2023-11-23 Yangyi Chen, Xingyao Wang, Manling Li, Derek Hoiem, Heng Ji

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Sagi Eppel

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a…

Computer Vision and Pattern Recognition · Computer Science 2026-04-09 Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on…

Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural…

Computation and Language · Computer Science 2025-07-29 Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque

Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Amartya Bhattacharya

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, little large vision-language model (LVLM) research has discussed models' image captioning performance because of the outdated short-caption…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to…

Artificial Intelligence · Computer Science 2025-06-30 René Peinl, Vincent Tischler

Large Vision-Language Models (LVLMs), despite their recent success, are hardly comprehensively tested for their cognitive abilities. Inspired by the prevalent use of the Cookie Theft task in human cognitive tests, we propose a novel…

Artificial Intelligence · Computer Science 2025-02-14 Xiujie Song, Mengyue Wu, Kenny Q. Zhu, Chunhao Zhang, Yanyi Chen

Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability…

Computer Vision and Pattern Recognition · Computer Science 2025-04-23 Sangeet Khemlani, Tyler Tran, Nathaniel Gyory, Anthony M. Harrison, Wallace E. Lawson, Ravenna Thielstrom, Hunter Thompson, Taaren Singh, J. Gregory Trafton

Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Illia Volkov, Nikita Kisel, Klara Janouskova, Jiri Matas

Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for…

Computation and Language · Computer Science 2025-07-30 Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz

Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities…

Computer Vision and Pattern Recognition · Computer Science 2026-05-21 Yifan Jiang, Cong Zhang, Bofei Zhang, Qiaofeng Zheng, Yifan Yang, Bingzhang Wang, Yew-Soon Ong

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a…

Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models excel at generating factual content, their ability to…

Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also…

Computer Vision and Pattern Recognition · Computer Science 2026-05-19 Qing'an Liu, Juntong Feng, Yuhao Wang, Xinzhe Han, Yujie Cheng, Yue Zhu, Haiwen Diao, Yunzhi Zhuge, Huchuan Lu

Large Vision-Language Models (LVLMs) have made significant strides in the field of video understanding in recent times. Nevertheless, existing video benchmarks predominantly rely on text prompts for evaluation, which often require complex…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Xikun Bao, Lin Chen, Zehui Chen, Qing Miao, Chenxi Liu, Jie Zhao, Feng Zhao

Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability to perform…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo