Related papers: VoxML: A Visualization Modeling Language

An Abstract Specification of VoxML as an Annotation Language

VoxML is a modeling language used to map natural language expressions into real-time visualizations using commonsense semantic knowledge of objects and events. Its utility has been demonstrated in embodied simulation environments and in…

Computation and Language · Computer Science 2023-05-23 Kiyong Lee, Nikhil Krishnaswamy, James Pustejovsky

VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging.…

Computer Vision and Pattern Recognition · Computer Science 2025-12-03 Alan Dao, Norapat Buppodom

Vision language models have difficulty recognizing virtual objects

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question…

Computer Vision and Pattern Recognition · Computer Science 2025-05-16 Tyler Tran, Sangeet Khemlani, J. G. Trafton

Volumetric Procedural Models for Shape Representation

This article describes a volumetric approach for procedural shape modeling and a new Procedural Shape Modeling Language (PSML) that facilitates the specification of these models. PSML provides programmers the ability to describe shapes in…

Graphics · Computer Science 2021-03-23 Andrew Willis, Prashant Ganesh, Kyle Volle, Jincheng Zhang, Kevin Brink

Visually-Augmented Language Modeling

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which…

Computation and Language · Computer Science 2023-02-28 Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

An Introduction to Vision-Language Modeling

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models…

Machine Learning · Computer Science 2024-05-28 Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz, Vikas Chandra

Visual Large Language Models for Generalized and Specialized Applications

Visual-language models (VLM) have emerged as a powerful tool for learning a unified embedding space for vision and language. Inspired by large language models, which have demonstrated strong reasoning and multi-task capabilities, visual…

Computer Vision and Pattern Recognition · Computer Science 2025-01-07 Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, Yu Kong

Towards Understanding Visual Grounding in Visual Language Models

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos, Eda B. Özyiğit

Vision language models are unreliable at trivial spatial cognition

Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability…

Computer Vision and Pattern Recognition · Computer Science 2025-04-23 Sangeet Khemlani, Tyler Tran, Nathaniel Gyory, Anthony M. Harrison, Wallace E. Lawson, Ravenna Thielstrom, Hunter Thompson, Taaren Singh, J. Gregory Trafton

Towards a Formalization of the Unified Modeling Language

The Unified Modeling Language (UML) is a language for specifying, visualizing, and documenting object-oriented systems. UML combines the concepts of OOA/OOD, OMT, and OOSE and is intended as a standard in the domain of object-oriented analysis and…

Software Engineering · Computer Science 2014-09-26 Ruth Breu, Ursula Hinkel, Christoph Hofmann, Cornel Klein, Barbara Paech, Bernhard Rumpe, V. Thurner

A3VLM: Actionable Articulation-Aware Vision Language Model

Vision Language Models (VLMs) have received significant attention in recent years in the robotics community. VLMs are shown to be able to perform complex visual reasoning and scene understanding tasks, which makes them regarded as a…

Robotics · Computer Science 2024-06-14 Siyuan Huang, Haonan Chang, Yuhan Liu, Yimeng Zhu, Hao Dong, Peng Gao, Abdeslam Boularias, Hongsheng Li

Multimodal Semantic Simulations of Linguistically Underspecified Motion Events

In this paper, we describe a system for generating three-dimensional visual simulations of natural language motion expressions. We use a rich formal model of events and their participants to generate simulations that satisfy the minimal…

Computation and Language · Computer Science 2016-10-04 Nikhil Krishnaswamy, James Pustejovsky

Coding the Visual World: From Image to Simulation Using Vision Language Models

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Sagi Eppel

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu

How Can Objects Help Video-Language Understanding?

Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? To one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly…

Computer Vision and Pattern Recognition · Computer Science 2025-08-06 Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to…

Robotics · Computer Science 2023-11-03 Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-22 Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs…

Computer Vision and Pattern Recognition · Computer Science 2025-02-19 Lingxiao Luo, Bingda Tang, Xuanzhong Chen, Rong Han, Ting Chen

Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans emulating our understanding of the world. However, known as visual illusions, human's perception of reality isn't always faithful to the physical world.…

Artificial Intelligence · Computer Science 2023-11-02 Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai