Related papers: MULE: Multimodal Universal Language Embedding

200 papers

Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose a framework…

Computer Vision and Pattern Recognition · Computer Science 2023-09-26 Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal

Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual…

Computer Vision and Pattern Recognition · Computer Science 2020-08-31 Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, Bryan A. Plummer

Integrating visual and linguistic information into a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple method…

Machine Learning · Statistics 2017-03-28 Guillem Collell, Teddy Zhang, Marie-Francine Moens

Multilingual neural machine translation (NMT), which translates multiple languages using a single model, is of great practical importance due to its advantages in simplifying the training process, reducing online maintenance costs, and…

Computation and Language · Computer Science 2019-08-27 Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao Qin, Tie-Yan Liu

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual…

Information Retrieval · Computer Science 2021-09-14 Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, Jason Baldridge

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring…

Computer Vision and Pattern Recognition · Computer Science 2021-07-06 Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill

Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skill set large language models (LLMs) learn during pretraining. Vision,…

Artificial Intelligence · Computer Science 2023-09-01 Riley Tavassoli, Mani Amani, Reza Akhavian

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

Lately, researchers in artificial intelligence have been really interested in how language and vision come together, giving rise to the development of multimodal models that aim to seamlessly integrate textual and visual information.…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Rajat Chawla, Arkajit Datta, Tushar Verma, Adarsh Jha, Anmol Gautam, Ayush Vatsal, Sukrit Chaterjee, Mukunda NS, Ishaan Bhola

Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-06-16 Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing Liu

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi,…

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard…

Computer Vision and Pattern Recognition · Computer Science 2025-06-19 Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si

Recent advances in vision-language models have significantly expanded the frontiers of automated image analysis. However, applying these models in safety-critical contexts remains challenging due to the complex relationships between…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Muhammad Imran, Yugyung Lee

This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in an embedding space for searching candidates from…

Information Retrieval · Computer Science 2023-02-07 Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, Zhiyuan Liu, Ge Yu

Representation learning is the foundation of natural language processing (NLP). This work presents new methods to employ visual information as assistant signals to general NLP tasks. For each sentence, we first retrieve a flexible number of…

Computation and Language · Computer Science 2023-01-10 Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao

Multilingual Word Embeddings (MWEs) represent words from multiple languages in a single distributional vector space. Unsupervised MWE (UMWE) methods acquire multilingual embeddings without cross-lingual supervision, which is a significant…

Computation and Language · Computer Science 2018-09-07 Xilun Chen, Claire Cardie

Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a dedicated adapter. While existing approaches commonly rely on a single pre-trained vision…

Computer Vision and Pattern Recognition · Computer Science 2025-02-24 Matvey Skripkin, Elizaveta Goncharova, Dmitrii Tarasov, Andrey Kuznetsov

Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods are trying to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered…

Computer Vision and Pattern Recognition · Computer Science 2023-03-01 Ivona Najdenkoska, Xiantong Zhen, Marcel Worring

Recent progress on unsupervised learning of cross-lingual embeddings in the bilingual setting has given impetus to learning a shared embedding space for several languages without any supervision. A popular framework to solve the latter problem…

Computation and Language · Computer Science 2020-04-21 Pratik Jawanpuria, Mayank Meghwanshi, Bamdev Mishra