Related papers: Does language help generalization in vision models…

200 papers

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg, Erhan Bas

Recent work has shown how to learn better visual-semantic embeddings by leveraging image descriptions in more than one language. Here, we investigate in detail which conditions affect the performance of this type of grounded language…

Computation and Language · Computer Science 2018-09-21 Ákos Kádár, Desmond Elliott, Marc-Alexandre Côté, Grzegorz Chrupała, Afra Alishahi

Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to…

Computation and Language · Computer Science 2023-05-24 Sherzod Hakimov, David Schlangen

Unlike most neural language models, humans learn language in a rich, multi-sensory and, often, multi-lingual environment. Current language models typically fail to fully capture the complexities of multilingual language use. We train an…

Computation and Language · Computer Science 2023-02-15 Khai-Nguyen Nguyen, Zixin Tang, Ankur Mali, Alex Kelly

Neural language models (LMs) are arguably less data-efficient than humans from a language acquisition perspective. One fundamental question is why this human-LM gap arises. This study explores the advantage of grounded language acquisition,…

Computation and Language · Computer Science 2024-12-18 Tatsuki Kuribayashi, Timothy Baldwin

There are limitations in learning language from text alone. Therefore, recent focus has been on developing multimodal models. However, few benchmarks exist that can measure what language models learn about language from multimodal training.…

Computation and Language · Computer Science 2022-05-17 Lovisa Hagström, Richard Johansson

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method…

Computation and Language · Computer Science 2023-06-16 Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to…

Computation and Language · Computer Science 2024-03-27 Chengxu Zhuang, Evelina Fedorenko, Jacob Andreas

Neural networks can be powerful function approximators, which are able to model high-dimensional feature distributions from a subset of examples drawn from the target distribution. Naturally, they perform well at generalizing within the…

Machine Learning · Computer Science 2021-08-06 Aaron Eisermann, Jae Hee Lee, Cornelius Weber, Stefan Wermter

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos, Eda B. Özyiğit

Large language models such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets. There is now a plethora of…

Computation and Language · Computer Science 2023-06-13 Jeremy Gwinnup, Kevin Duh

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage…

Computer Vision and Pattern Recognition · Computer Science 2024-11-25 Karsten Roth, Zeynep Akata, Dima Damen, Ivana Balažević, Olivier J. Hénaff

We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the…

Computation and Language · Computer Science 2025-07-01 Jianhong Tu, Zhuohao Ni, Nicholas Crispino, Zihao Yu, Michael Bendersky, Beliz Gunel, Ruoxi Jia, Xin Liu, Lingjuan Lyu, Dawn Song, Chenguang Wang

This paper explores a multimodal co-training framework designed to enhance model generalization in situations where labeled data is limited and distribution shifts occur. We thoroughly examine the theoretical foundations of this framework,…

Machine Learning · Computer Science 2025-10-10 Tianyu Bell Pan, Damon L. Woodard

Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of…

Computer Vision and Pattern Recognition · Computer Science 2024-01-31 Zhuowan Li, Cihang Xie, Benjamin Van Durme, Alan Yuille

Foundational multimodal models pre-trained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is always…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Avinash Madasu, Anahita Bhiwandiwalla, Vasudev Lal

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model…

Computation and Language · Computer Science 2021-06-24 Kayode Olaleye, Herman Kamper

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in…

Computation and Language · Computer Science 2022-07-12 Gabriel Skantze, Bram Willemsen

Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Though there is a big convergence in terms of architecture, most pretrained models are typically still developed for…

Computation and Language · Computer Science 2022-06-14 Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei

A model's capacity to generalize its knowledge to interpret unseen inputs with different characteristics is crucial to build robust and reliable machine learning systems. Language model evaluation tasks lack information metrics about model…

Computation and Language · Computer Science 2024-09-10 Saksham Bassi, Duygu Ataman, Kyunghyun Cho