Related papers: Modulating early visual processing by language

Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters

How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct…

Computer Vision and Pattern Recognition · Computer Science 2022-06-24 İlker Kesen, Ozan Arkan Can, Erkut Erdem, Aykut Erdem, Deniz Yuret

Modulation of early visual processing alleviates capacity limits in solving multiple tasks

In daily life situations, we have to perform multiple tasks given a visual stimulus, which requires task-relevant information to be transmitted through our visual system. When it is not possible to transmit all the possibly relevant…

Neurons and Cognition · Quantitative Biology 2019-09-24 Sushrut Thorat, Giacomo Aldegheri, Marcel A. J. van Gerven, Marius V. Peelen

MULE: Multimodal Universal Language Embedding

Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan A. Plummer

LanteRn: Latent Visual Structured Reasoning

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

Distilling Internet-Scale Vision-Language Models into Embodied Agents

Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To…

Artificial Intelligence · Computer Science 2023-06-16 Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita Dasgupta

Beyond Intermediate States: Explaining Visual Redundancy through Language

Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen

Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods are trying to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered…

Computer Vision and Pattern Recognition · Computer Science 2023-03-01 Ivona Najdenkoska, Xiantong Zhen, Marcel Worring

Modular Prompt Learning Improves Vision-Language Models

Pre-trained vision-language models are able to interpret visual concepts and language semantics. Prompt learning, a method of constructing prompts for text encoders or image encoders, elicits the potentials of pre-trained models and readily…

Computer Vision and Pattern Recognition · Computer Science 2025-02-21 Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, Jianxi Gao

Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction

In this paper, we introduce $\text{EVL}_{\text{Gen}}$, a streamlined framework designed for the pre-training of visually conditioned language generation models with high computational demands, utilizing frozen pre-trained large language…

Computer Vision and Pattern Recognition · Computer Science 2024-02-22 Yiren Jian, Tingkai Liu, Yunzhe Tao, Chunhui Zhang, Soroush Vosoughi, Hongxia Yang

Deep Neural Networks for Visual Reasoning

Visual perception and language understanding are fundamental components of human intelligence, enabling humans to understand and reason about objects and their interactions. It is crucial for machines to have this capacity to reason using…

Computer Vision and Pattern Recognition · Computer Science 2022-09-27 Thao Minh Le

Towards Language-guided Visual Recognition via Dynamic Convolutions

In this paper, we aim to establish a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called…

Computer Vision and Pattern Recognition · Computer Science 2023-09-15 Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yongjian Wu, Yue Gao, Rongrong Ji

Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually…

Computation and Language · Computer Science 2021-09-10 Stella Frank, Emanuele Bugliarello, Desmond Elliott

Language Guided Networks for Cross-modal Moment Retrieval

We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query. It poses great challenges over the proper semantic alignment between…

Computer Vision and Pattern Recognition · Computer Science 2022-08-22 Kun Liu, Huadong Ma, Chuang Gan

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast…

Machine Learning · Computer Science 2024-05-24 William Chen, Oier Mees, Aviral Kumar, Sergey Levine

Modulated Self-attention Convolutional Network for VQA

As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to treat visual feature extraction as an end-to-end process during training. This small contribution aims to suggest new…

Computer Vision and Pattern Recognition · Computer Science 2019-11-01 Jean-Benoit Delbrouck, Antoine Maiorca, Nathan Hubens, Stéphane Dupont

Learning language through pictures

We propose Imaginet, a model of learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task…

Computation and Language · Computer Science 2015-06-22 Grzegorz Chrupała, Ákos Kádár, Afra Alishahi

Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation

Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Guilin Li, Bo Wang, Linghe Kong, Lichao Sun, Weiran Huang

Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) inherently more challenging than cross-domain adaptation. As a result, many…

Computer Vision and Pattern Recognition · Computer Science 2026-04-06 Yaxin Luo, Zhiqiang Shen

Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models

Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first…

Computer Vision and Pattern Recognition · Computer Science 2025-07-03 Aarti Ghatkesar, Ganesh Venkatesh

Attention over learned object embeddings enables complex visual reasoning

Neural networks have achieved success in a wide array of perceptual tasks but often fail at tasks involving both perception and higher-level reasoning. On these more challenging tasks, bespoke approaches (such as modular symbolic…

Computer Vision and Pattern Recognition · Computer Science 2021-10-27 David Ding, Felix Hill, Adam Santoro, Malcolm Reynolds, Matt Botvinick