Related papers: MULE: Multimodal Universal Language Embedding

MuMUR: Multilingual Multimodal Universal Retrieval

Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose a framework…

Computer Vision and Pattern Recognition · Computer Science 2023-09-26 Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal

Learning to Scale Multilingual Representations for Vision-Language Tasks

Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual…

Computer Vision and Pattern Recognition · Computer Science 2020-08-31 Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, Bryan A. Plummer

Learning to Predict: A Fast Re-constructive Method to Generate Multimodal Embeddings

Integrating visual and linguistic information into a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple method…

Machine Learning · Statistics 2017-03-28 Guillem Collell, Teddy Zhang, Marie-Francine Moens

Multilingual Neural Machine Translation with Language Clustering

Multilingual neural machine translation (NMT), which translates multiple languages using a single model, is of great practical importance due to its advantages in simplifying the training process, reducing online maintenance costs, and…

Computation and Language · Computer Science 2019-08-27 Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao Qin, Tie-Yan Liu

MURAL: Multimodal, Multitask Retrieval Across Languages

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual…

Information Retrieval · Computer Science 2021-09-14 Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, Jason Baldridge

Multimodal Few-Shot Learning with Frozen Language Models

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring…

Computer Vision and Pattern Recognition · Computer Science 2021-07-06 Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skill set large language models (LLMs) learn during pretraining. Vision,…

Artificial Intelligence · Computer Science 2023-09-01 Riley Tavassoli, Mani Amani, Reza Akhavian

Towards General Continuous Memory for Vision-Language Models

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

Veagle: Advancements in Multimodal Representation Learning

Lately, researchers in artificial intelligence have been really interested in how language and vision come together, giving rise to the development of multimodal models that aim to seamlessly integrate textual and visual information.…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Rajat Chawla, Arkajit Datta, Tushar Verma, Adarsh Jha, Anmol Gautam, Ayush Vatsal, Sukrit Chaterjee, Mukunda NS, Ishaan Bhola

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-06-16 Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing Liu

MaPLe: Multi-modal Prompt Learning

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

PALO: A Polyglot Large Multimodal Model for 5B People

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi,…

Computation and Language · Computer Science 2024-03-06 Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard…

Computer Vision and Pattern Recognition · Computer Science 2025-06-19 Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si

Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models

Recent advances in vision-language models have significantly expanded the frontiers of automated image analysis. However, applying these models in safety-critical contexts remains challenging due to the complex relationships between…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Muhammad Imran, Yugyung Lee

Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval

This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in an embedding space for searching candidates from…

Information Retrieval · Computer Science 2023-02-07 Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, Zhiyuan Liu, Ge Yu

Universal Multimodal Representation for Language Understanding

Representation learning is the foundation of natural language processing (NLP). This work presents new methods to employ visual information as assistant signals to general NLP tasks. For each sentence, we first retrieve a flexible number of…

Computation and Language · Computer Science 2023-01-10 Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao

Unsupervised Multilingual Word Embeddings

Multilingual Word Embeddings (MWEs) represent words from multiple languages in a single distributional vector space. Unsupervised MWE (UMWE) methods acquire multilingual embeddings without cross-lingual supervision, which is a significant…

Computation and Language · Computer Science 2018-09-07 Xilun Chen, Claire Cardie

MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a dedicated adapter. While existing approaches commonly rely on a single pre-trained vision…

Computer Vision and Pattern Recognition · Computer Science 2025-02-24 Matvey Skripkin, Elizaveta Goncharova, Dmitrii Tarasov, Andrey Kuznetsov

Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods are trying to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered…

Computer Vision and Pattern Recognition · Computer Science 2023-03-01 Ivona Najdenkoska, Xiantong Zhen, Marcel Worring

A Simple Approach to Learning Unsupervised Multilingual Embeddings

Recent progress on unsupervised learning of cross-lingual embeddings in the bilingual setting has given impetus to learning a shared embedding space for several languages without any supervision. A popular framework to solve the latter problem…

Computation and Language · Computer Science 2020-04-21 Pratik Jawanpuria, Mayank Meghwanshi, Bamdev Mishra