Related papers: Diversifying Joint Vision-Language Tokenization Le…

Compound Tokens: Channel Fusion for Vision-Language Representation Learning

We present an effective method for fusing visual-and-language representations for several question answering tasks including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal…

Computer Vision and Pattern Recognition · Computer Science 2022-12-06 Maxwell Mbabilla Aladago, AJ Piergiovanni

Learning Visual Representations via Language-Guided Sampling

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-30 Mohamed El Banani, Karan Desai, Justin Johnson

Multi-task Learning of Hierarchical Vision-Language Representation

It is still challenging to build an AI system that can perform tasks that involve vision and language at human level. So far, researchers have singled out individual tasks separately, for each of which they have designed networks and…

Computer Vision and Pattern Recognition · Computer Science 2018-12-04 Duy-Kien Nguyen, Takayuki Okatani

Joint Representation Learning of Cross-lingual Words and Entities via Attentive Distant Supervision

Joint representation learning of words and entities benefits many NLP tasks, but has not been well explored in cross-lingual settings. In this paper, we propose a novel method for joint representation learning of cross-lingual words and…

Computation and Language · Computer Science 2018-11-28 Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Chengjiang Li, Xu Chen, Tiansi Dong

Jointly Learning to Label Sentences and Tokens

Learning to construct text representations in end-to-end systems can be difficult, as natural languages are highly compositional and task-specific annotated datasets are often limited in size. Methods for directly supervising language…

Computation and Language · Computer Science 2018-11-15 Marek Rei, Anders Søgaard

Vision-Language Pre-Training for Boosting Scene Text Detectors

Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves…

Computer Vision and Pattern Recognition · Computer Science 2022-05-02 Sibo Song, Jianqiang Wan, Zhibo Yang, Jun Tang, Wenqing Cheng, Xiang Bai, Cong Yao

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual…

Computer Vision and Pattern Recognition · Computer Science 2024-03-25 Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting…

Computer Vision and Pattern Recognition · Computer Science 2022-04-05 Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

Universal Multimodal Representation for Language Understanding

Representation learning is the foundation of natural language processing (NLP). This work presents new methods to employ visual information as assistant signals to general NLP tasks. For each sentence, we first retrieve a flexible number of…

Computation and Language · Computer Science 2023-01-10 Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao

See, Hear, and Read: Deep Aligned Representations

We capitalize on large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language. By leveraging over a year of sound from video and…

Computer Vision and Pattern Recognition · Computer Science 2017-06-06 Yusuf Aytar, Carl Vondrick, Antonio Torralba

Masked Vision and Language Modeling for Multi-modal Representation Learning

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete…

Computer Vision and Pattern Recognition · Computer Science 2025-06-24 Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of…

Computer Vision and Pattern Recognition · Computer Science 2024-01-31 Zhuowan Li, Cihang Xie, Benjamin Van Durme, Alan Yuille

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the…

Computation and Language · Computer Science 2020-10-15 Hao Tan, Mohit Bansal

Multi-modal Alignment using Representation Codebook

Aligning signals from different modalities is an important step in vision-language representation learning as it affects the performance of later stages such as cross-modality fusion. Since image and text typically reside in different…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Jiali Duan, Liqun Chen, Son Tran, Jinyu Yang, Yi Xu, Belinda Zeng, Trishul Chilimbi

Learning Token-based Representation for Image Retrieval

In image retrieval, deep local features learned in a data-driven manner have been demonstrated effective to improve retrieval performance. To realize efficient retrieval on large image database, some approaches quantize deep local features…

Image and Video Processing · Electrical Eng. & Systems 2021-12-14 Hui Wu, Min Wang, Wengang Zhou, Yang Hu, Houqiang Li

Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it…

Computer Vision and Pattern Recognition · Computer Science 2017-10-17 Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Wei Song, Yuran Wang, Zijia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, Kaicheng Yu

Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Language modality within the vision language pretraining framework is innately discretized, endowing each word in the language vocabulary a semantic meaning. In contrast, visual modality is inherently continuous and high-dimensional, which…

Computer Vision and Pattern Recognition · Computer Science 2022-08-02 Xiaoyuan Guo, Jiali Duan, C.-C. Jay Kuo, Judy Wawira Gichoya, Imon Banerjee

The Surprising Effectiveness of Representation Learning for Visual Imitation

While visual imitation learning offers one of the most effective ways of learning from visual demonstrations, generalizing from them requires either hundreds of diverse demonstrations, task specific priors, or large, hard-to-train…

Robotics · Computer Science 2021-12-07 Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, Lerrel Pinto