Related papers: Augmenting Vision Language Pretraining by Learning…

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of…

Computer Vision and Pattern Recognition · Computer Science 2024-01-31 Zhuowan Li, Cihang Xie, Benjamin Van Durme, Alan Yuille

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the…

Computation and Language · Computer Science 2020-10-15 Hao Tan, Mohit Bansal

Conceptual Codebook Learning for Vision-Language Models

In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Yi Zhang, Ke Yu, Siqi Wu, Zhihai He

Cross-Modal Discrete Representation Learning

Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work we present a self-supervised…

Computer Vision and Pattern Recognition · Computer Science 2021-06-11 Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass

ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

Vision-language pre-training (VLP) methods are blossoming recently, and its crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of…

Computer Vision and Pattern Recognition · Computer Science 2023-09-01 Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual…

Computer Vision and Pattern Recognition · Computer Science 2024-03-25 Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu

Masked Vision and Language Modeling for Multi-modal Representation Learning

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto

VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches primarily rely on parameter-efficient fine-tuning of image-text pre-trained models, yet they…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Wencheng Zhu, Yuexin Wang, Hongxuan Li, Pengfei Zhu, Qinghua Hu

Diversifying Joint Vision-Language Tokenization Learning

Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that the representations must not only jointly capture features from…

Computer Vision and Pattern Recognition · Computer Science 2023-06-19 Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova

Self-supervised vision-language pretraining for Medical visual question answering

Medical image visual question answering (VQA) is a task to answer clinical questions, given a radiographic image, which is a challenging problem that requires a model to integrate both vision and language information. To solve medical VQA…

Computer Vision and Pattern Recognition · Computer Science 2022-11-28 Pengfei Li, Gang Liu, Lin Tan, Jinying Liao, Shenjun Zhong

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and…

Computer Vision and Pattern Recognition · Computer Science 2024-03-04 Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer…

Computer Vision and Pattern Recognition · Computer Science 2024-03-04 Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang

Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. This computational headlong rush does not seem…

Computer Vision and Pattern Recognition · Computer Science 2022-10-06 Mustafa Shukor, Guillaume Couairon, Matthieu Cord

Leveraging per Image-Token Consistency for Vision-Language Pre-training

Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations:…

Computer Vision and Pattern Recognition · Computer Science 2023-09-06 Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang

Visually-Augmented Language Modeling

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which…

Computation and Language · Computer Science 2023-02-28 Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

VLMAE: Vision-Language Masked Autoencoder

Image and language modeling is of crucial importance for vision-language pre-training (VLP), which aims to learn multi-modal representations from large-scale paired image-text data. However, we observe that most existing VLP methods focus…

Computer Vision and Pattern Recognition · Computer Science 2022-08-22 Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Chen Wu, Xiujun Shu, Bo Ren

VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling

Vector quantization (VQ) transforms continuous image features into discrete representations, providing compressed, tokenized inputs for generative models. However, VQ-based frameworks suffer from several issues, such as non-smooth latent…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Sicheng Yang, Xing Hu, Qiang Wu, Dawei Yang

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-27 Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-19 Minsu Kim, Jeong Hun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro

Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs

Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet, a significant gap persists in their adaptation to real…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Jiaao Yu, Shenwei Li, Mingjie Han, Yifei Yin, Wenzheng Song, Chenghao Jia, Man Lan