Related papers: Does language help generalization in vision models…

On the Performance of Multimodal Language Models

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg, Erhan Bas

Lessons learned in multilingual grounded language learning

Recent work has shown how to learn better visual-semantic embeddings by leveraging image descriptions in more than one language. Here, we investigate in detail which conditions affect the performance of this type of grounded language…

Computation and Language · Computer Science 2018-09-21 Ákos Kádár, Desmond Elliott, Marc-Alexandre Côté, Grzegorz Chrupała, Afra Alishahi

Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to…

Computation and Language · Computer Science 2023-05-24 Sherzod Hakimov, David Schlangen

Like a bilingual baby: The advantage of visually grounding a bilingual language model

Unlike most neural language models, humans learn language in a rich, multi-sensory and, often, multi-lingual environment. Current language models typically fail to fully capture the complexities of multilingual language use. We train an…

Computation and Language · Computer Science 2023-02-15 Khai-Nguyen Nguyen, Zixin Tang, Ankur Mali, Alex Kelly

Does Vision Accelerate Hierarchical Generalization in Neural Language Learners?

Neural language models (LMs) are arguably less data-efficient than humans from a language acquisition perspective. One fundamental question is why this human-LM gap arises. This study explores the advantage of grounded language acquisition,…

Computation and Language · Computer Science 2024-12-18 Tatsuki Kuribayashi, Timothy Baldwin

What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge

There are limitations in learning language from text alone. Therefore, recent focus has been on developing multimodal models. However, few benchmarks exist that can measure what language models learn about language from multimodal training.…

Computation and Language · Computer Science 2022-05-17 Lovisa Hagström, Richard Johansson

Grounding Language Models to Images for Multimodal Inputs and Outputs

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method…

Computation and Language · Computer Science 2023-06-16 Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

Visual Grounding Helps Learn Word Meanings in Low-Data Regimes

Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to…

Computation and Language · Computer Science 2024-03-27 Chengxu Zhuang, Evelina Fedorenko, Jacob Andreas

Generalization in Multimodal Language Learning from Simulation

Neural networks can be powerful function approximators, which are able to model high-dimensional feature distributions from a subset of examples drawn from the target distribution. Naturally, they perform well at generalizing within the…

Machine Learning · Computer Science 2021-08-06 Aaron Eisermann, Jae Hee Lee, Cornelius Weber, Stefan Wermter

Towards Understanding Visual Grounding in Visual Language Models

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos, Eda B. Özyiğit

A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation

Large language models such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets. There is now a plethora of…

Computation and Language · Computer Science 2023-06-13 Jeremy Gwinnup, Kevin Duh

Context-Aware Multimodal Pretraining

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage…

Computer Vision and Pattern Recognition · Computer Science 2024-11-25 Karsten Roth, Zeynep Akata, Dima Damen, Ivana Balažević, Olivier J. Hénaff

MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models

We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the…

Computation and Language · Computer Science 2025-07-01 Jianhong Tu, Zhuohao Ni, Nicholas Crispino, Zihao Yu, Michael Bendersky, Beliz Gunel, Ruoxi Jia, Xin Liu, Lingjuan Lyu, Dawn Song, Chenguang Wang

Efficient Generalization via Multimodal Co-Training under Data Scarcity and Distribution Shift

This paper explores a multimodal co-training framework designed to enhance model generalization in situations where labeled data is limited and distribution shifts occur. We thoroughly examine the theoretical foundations of this framework,…

Machine Learning · Computer Science 2025-10-10 Tianyu Bell Pan, Damon L. Woodard

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of…

Computer Vision and Pattern Recognition · Computer Science 2024-01-31 Zhuowan Li, Cihang Xie, Benjamin Van Durme, Alan Yuille

Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

Foundational multimodal models pre-trained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is always…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Avinash Madasu, Anahita Bhiwandiwalla, Vasudev Lal

Attention-Based Keyword Localisation in Speech using Visual Grounding

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model…

Computation and Language · Computer Science 2021-06-24 Kayode Olaleye, Herman Kamper

CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in…

Computation and Language · Computer Science 2022-07-12 Gabriel Skantze, Bram Willemsen

Language Models are General-Purpose Interfaces

Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Though there is a big convergence in terms of architecture, most pretrained models are typically still developed for…

Computation and Language · Computer Science 2022-06-14 Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei

Generalization Measures for Zero-Shot Cross-Lingual Transfer

A model's capacity to generalize its knowledge to interpret unseen inputs with different characteristics is crucial to build robust and reliable machine learning systems. Language model evaluation tasks lack information metrics about model…

Computation and Language · Computer Science 2024-09-10 Saksham Bassi, Duygu Ataman, Kyunghyun Cho