
Related papers: LXMERT: Learning Cross-Modality Encoder Representa…

200 papers

Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-22 · Sirnam Swetha, Jinyu Yang, Tal Neiman, Mamshad Nayeem Rizve, Son Tran, Benjamin Yao, Trishul Chilimbi, Mubarak Shah

Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state of the art performance on a variety of multimodal discriminative tasks like visual question answering and…

Computer Vision and Pattern Recognition · Computer Science · 2020-09-24 · Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-14 · Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-07 · Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal…

Computation and Language · Computer Science · 2022-05-04 · Chan-Jan Hsu, Hung-yi Lee, Yu Tsao

This work deals with the challenge of learning and reasoning over language and vision data for the related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR). We design a novel…

Computation and Language · Computer Science · 2020-05-14 · Chen Zheng, Quan Guo, Parisa Kordjamshidi

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models, such as XLM and Unicoder, both visual and linguistic…

Computer Vision and Pattern Recognition · Computer Science · 2019-12-04 · Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou

Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In…

Computation and Language · Computer Science · 2021-09-07 · Yonatan Bitton, Gabriel Stanovsky, Michael Elhadad, Roy Schwartz

This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-07 · Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain…

Computation and Language · Computer Science · 2023-05-23 · Jerry Tang, Meng Du, Vy A. Vo, Vasudev Lal, Alexander G. Huth

Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodality of concept processing in the human brain. However, a systematic evaluation exploring different types of VLM architectures and the role…

Computation and Language · Computer Science · 2026-01-23 · Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández

The Transformer has demonstrated great power in learning contextual word representations for multiple languages in a single model. To process multilingual sentences in the model, a learnable vector is usually assigned to each language, which…

Computation and Language · Computer Science · 2021-02-17 · Shengjie Luo, Kaiyuan Gao, Shuxin Zheng, Guolin Ke, Di He, Liwei Wang, Tie-Yan Liu

The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-14 · Qi Zhi Lim, Chin Poo Lee, Kian Ming Lim, Kalaiarasi Sonai Muthu Anbananthen

Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-31 · Suchae Jeong, Jaehwi Song, Haeone Lee, Hanna Kim, Jian Kim, Dongjun Lee, Dong Kyu Shin, Changyeon Kim, Dongyoon Hahm, Woogyeol Jin, Juheon Choi, Kimin Lee

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow.…

Computation and Language · Computer Science · 2025-02-17 · Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Multimodal large language models (MLLMs) require a nuanced interpretation of complex image information, typically leveraging a vision encoder to perceive various visual scenarios. However, relying solely on a single vision encoder to handle…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-02 · Xin He, Xumeng Han, Longhui Wei, Lingxi Xie, Qi Tian

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually…

Computation and Language · Computer Science · 2021-09-10 · Stella Frank, Emanuele Bugliarello, Desmond Elliott

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-27 · André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-06 · Meng-Jiun Chiou, Roger Zimmermann, Jiashi Feng

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to facilitate deep interactions among multi-modal information and enhance the holistic understanding of vision-language features. There are different ways…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-28 · Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang