Related papers: Language Model as Visual Explainer

Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified…

Artificial Intelligence · Computer Science 2025-08-14 Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

Vision-language models (VLMs) offer a promising paradigm for image classification by comparing the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations for class names. While…

Computer Vision and Pattern Recognition · Computer Science 2024-02-20 Songhao Han, Le Zhuo, Yue Liao, Si Liu

3VL: Using Trees to Improve Vision-Language Models' Interpretability

Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key…

Computer Vision and Pattern Recognition · Computer Science 2025-01-16 Nir Yellinek, Leonid Karlinsky, Raja Giryes

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared…

Computation and Language · Computer Science 2025-11-04 Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar

Overcoming Vision Language Model Challenges in Diagram Understanding: A Proof-of-Concept with XML-Driven Large Language Models Solutions

Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and…

Software Engineering · Computer Science 2025-02-10 Shue Shiinoki, Ryo Koshihara, Hayato Motegi, Masumi Morishige

An Introduction to Vision-Language Modeling

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models…

Machine Learning · Computer Science 2024-05-28 Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz, Vikas Chandra

Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Yueyan Li, Chenggong Zhao, Zeyuan Zang, Caixia Yuan, Xiaojie Wang

Language-Informed Visual Concept Learning

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along…

Computer Vision and Pattern Recognition · Computer Science 2024-04-04 Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu

Vision language models have difficulty recognizing virtual objects

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question…

Computer Vision and Pattern Recognition · Computer Science 2025-05-16 Tyler Tran, Sangeet Khemlani, J. G. Trafton

Visually-Augmented Language Modeling

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which…

Computation and Language · Computer Science 2023-02-28 Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

Improving Large Vision-Language Models' Understanding for Flow Field Data

Large Vision-Language Models (LVLMs) have shown impressive capabilities across a range of tasks that integrate visual and textual understanding, such as image captioning and visual question answering. These models are trained on large-scale…

Computer Vision and Pattern Recognition · Computer Science 2026-03-11 Xiaomei Zhang, Hanyu Zheng, Xiangyu Zhu, Jinghuan Wei, Junhong Zou, Zhen Lei, Zhaoxiang Zhang

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

Large language models (LLMs) have made significant advancements in natural language understanding. However, through that enormous semantic representation that the LLM has learnt, is it somehow possible for it to understand images as well?…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, Yong Jae Lee

Large Language Models Facilitate Vision Reflection in Image Classification

This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Guoyuan An, JaeYoon Kim, SungEui Yoon

Do Vision-Language Models Really Understand Visual Language?

Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of…

Computation and Language · Computer Science 2025-05-27 Yifan Hou, Buse Giledereli, Yilei Tu, Mrinmaya Sachan

Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models

Recent advancements in dialogue systems have highlighted the significance of integrating multimodal responses, which enable conveying ideas through diverse modalities rather than solely relying on text-based interactions. This enrichment…

Computation and Language · Computer Science 2024-07-08 Chang-Sheng Kao, Yun-Nung Chen

Towards Interpreting Visual Information Processing in Vision-Language Models

Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the…

Computer Vision and Pattern Recognition · Computer Science 2025-04-29 Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez

Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow.…

Computation and Language · Computer Science 2025-02-17 Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Vision-Language Models Align with Human Neural Representations in Concept Processing

Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodality of concept processing in the human brain. However, a systematic evaluation exploring different types of VLM architectures and the role…

Computation and Language · Computer Science 2026-01-23 Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández

Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework…

Machine Learning · Computer Science 2026-03-31 Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

Object Detection with Multimodal Large Vision-Language Models: An In-depth Review

The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Ranjan Sapkota, Manoj Karkee