Related papers: LXMERT: Learning Cross-Modality Encoder Representa…

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-22 · Sirnam Swetha, Jinyu Yang, Tal Neiman, Mamshad Nayeem Rizve, Son Tran, Benjamin Yao, Trishul Chilimbi, Mubarak Shah

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and…

Computer Vision and Pattern Recognition · Computer Science · 2020-09-24 · Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

Rethinking Visual Information Processing in Multimodal LLMs

Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-14 · Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-07 · Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr

XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal…

Computation and Language · Computer Science · 2022-05-04 · Chan-Jan Hsu, Hung-yi Lee, Yu Tsao

Cross-Modality Relevance for Reasoning on Language and Vision

This work deals with the challenge of learning and reasoning over language and vision data for the related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR). We design a novel…

Computation and Language · Computer Science · 2020-05-14 · Chen Zheng, Quan Guo, Parisa Kordjamshidi

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models, such as XLM and Unicoder, both visual and linguistic…

Computer Vision and Pattern Recognition · Computer Science · 2019-12-04 · Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou

Data Efficient Masked Language Modeling for Vision and Language

Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In…

Computation and Language · Computer Science · 2021-09-07 · Yonatan Bitton, Gabriel Stanovsky, Michael Elhadad, Roy Schwartz

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-07 · Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang

Brain encoding models based on multimodal transformers can transfer across language and vision

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain…

Computation and Language · Computer Science · 2023-05-23 · Jerry Tang, Meng Du, Vy A. Vo, Vasudev Lal, Alexander G. Huth

Vision-Language Models Align with Human Neural Representations in Concept Processing

Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodality of concept processing in the human brain. However, a systematic evaluation exploring different types of VLM architectures and the role…

Computation and Language · Computer Science · 2026-01-23 · Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández

Revisiting Language Encoding in Learning Multilingual Representations

Transformer has demonstrated its great power to learn contextual word representations for multiple languages in a single model. To process multilingual sentences in the model, a learnable vector is usually assigned to each language, which…

Computation and Language · Computer Science · 2021-02-17 · Shengjie Luo, Kaiyuan Gao, Shuxin Zheng, Guolin Ke, Di He, Liwei Wang, Tie-Yan Liu

VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-14 · Qi Zhi Lim, Chin Poo Lee, Kian Ming Lim, Kalaiarasi Sonai Muthu Anbananthen

Learning Multi-View Spatial Reasoning from Cross-View Relations

Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-31 · Suchae Jeong, Jaehwi Song, Haeone Lee, Hanna Kim, Jian Kim, Dongjun Lee, Dong Kyu Shin, Changyeon Kim, Dongyoon Hahm, Woogyeol Jin, Juheon Choi, Kimin Lee

Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow.…

Computation and Language · Computer Science · 2025-02-17 · Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts

Multimodal large language models (MLLMs) require a nuanced interpretation of complex image information, typically leveraging a vision encoder to perceive various visual scenarios. However, relying solely on a single vision encoder to handle…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-02 · Xin He, Xumeng Han, Longhui Wei, Lingxi Xie, Qi Tian

Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually…

Computation and Language · Computer Science · 2021-09-10 · Stella Frank, Emanuele Bugliarello, Desmond Elliott

LanteRn: Latent Visual Structured Reasoning

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-27 · André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations

Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-06 · Meng-Jiun Chiou, Roger Zimmermann, Jiashi Feng

VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to facilitate deep interactions among multi-modal information and enhance the holistic understanding of vision-language features. There are different ways…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-28 · Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang