Related papers: Modulating early visual processing by language

200 papers

How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct…

Computer Vision and Pattern Recognition · Computer Science 2022-06-24 İlker Kesen, Ozan Arkan Can, Erkut Erdem, Aykut Erdem, Deniz Yuret

In daily life situations, we have to perform multiple tasks given a visual stimulus, which requires task-relevant information to be transmitted through our visual system. When it is not possible to transmit all the possibly relevant…

Neurons and Cognition · Quantitative Biology 2019-09-24 Sushrut Thorat, Giacomo Aldegheri, Marcel A. J. van Gerven, Marius V. Peelen

Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan A. Plummer

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To…

Artificial Intelligence · Computer Science 2023-06-16 Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita Dasgupta

Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen

Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods are trying to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered…

Computer Vision and Pattern Recognition · Computer Science 2023-03-01 Ivona Najdenkoska, Xiantong Zhen, Marcel Worring

Pre-trained vision-language models are able to interpret visual concepts and language semantics. Prompt learning, a method of constructing prompts for text encoders or image encoders, elicits the potentials of pre-trained models and readily…

Computer Vision and Pattern Recognition · Computer Science 2025-02-21 Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, Jianxi Gao

In this paper, we introduce $\text{EVL}_{\text{Gen}}$, a streamlined framework designed for the pre-training of visually conditioned language generation models with high computational demands, utilizing frozen pre-trained large language…

Computer Vision and Pattern Recognition · Computer Science 2024-02-22 Yiren Jian, Tingkai Liu, Yunzhe Tao, Chunhui Zhang, Soroush Vosoughi, Hongxia Yang

Visual perception and language understanding are fundamental components of human intelligence, enabling humans to understand and reason about objects and their interactions. It is crucial for machines to have this capacity to reason using…

Computer Vision and Pattern Recognition · Computer Science 2022-09-27 Thao Minh Le

In this paper, we are committed to establishing a unified and end-to-end multi-modal network by exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called…

Computer Vision and Pattern Recognition · Computer Science 2023-09-15 Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yongjian Wu, Yue Gao, Rongrong Ji

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually…

Computation and Language · Computer Science 2021-09-10 Stella Frank, Emanuele Bugliarello, Desmond Elliott

We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query. It poses great challenges over the proper semantic alignment between…

Computer Vision and Pattern Recognition · Computer Science 2022-08-22 Kun Liu, Huadong Ma, Chuang Gan

Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast…

Machine Learning · Computer Science 2024-05-24 William Chen, Oier Mees, Aviral Kumar, Sergey Levine

As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to treat visual feature extraction as an end-to-end process during training. This small contribution aims to suggest new…

Computer Vision and Pattern Recognition · Computer Science 2019-11-01 Jean-Benoit Delbrouck, Antoine Maiorca, Nathan Hubens, Stéphane Dupont

We propose Imaginet, a model of learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task…

Computation and Language · Computer Science 2015-06-22 Grzegorz Chrupała, Ákos Kádár, Afra Alishahi

Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Guilin Li, Bo Wang, Linghe Kong, Lichao Sun, Weiran Huang

The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) inherently more challenging than cross-domain adaptation. As a result, many…

Computer Vision and Pattern Recognition · Computer Science 2026-04-06 Yaxin Luo, Zhiqiang Shen

Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first…

Computer Vision and Pattern Recognition · Computer Science 2025-07-03 Aarti Ghatkesar, Ganesh Venkatesh

Neural networks have achieved success in a wide array of perceptual tasks but often fail at tasks involving both perception and higher-level reasoning. On these more challenging tasks, bespoke approaches (such as modular symbolic…

Computer Vision and Pattern Recognition · Computer Science 2021-10-27 David Ding, Felix Hill, Adam Santoro, Malcolm Reynolds, Matt Botvinick