Related papers: Text-Guided Semantic Image Encoder

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen

Empowering Semantic-Sensitive Underwater Image Enhancement with VLM

In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream…

Computer Vision and Pattern Recognition · Computer Science 2026-03-16 Guodong Fan, Shengning Zhou, Genji Yuan, Huiyu Li, Jingchun Zhou, Jinjiang Li

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

While vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

Unveiling Encoder-Free Vision-Language Models

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting…

Computer Vision and Pattern Recognition · Computer Science 2024-10-30 Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the…

Computer Vision and Pattern Recognition · Computer Science 2025-07-14 Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao

High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by the discrete image tokenization, which poses a major challenge. Although several…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, Chong Luo, Chang D. Yoo

Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual…

Machine Learning · Computer Science 2023-02-06 Hao Liu, Wilson Yan, Pieter Abbeel

BRAVE: Broadening the visual encoding of vision-language models

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several…

Computer Vision and Pattern Recognition · Computer Science 2024-04-11 Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

VLMAE: Vision-Language Masked Autoencoder

Image and language modeling is of crucial importance for vision-language pre-training (VLP), which aims to learn multi-modal representations from large-scale paired image-text data. However, we observe that most existing VLP methods focus…

Computer Vision and Pattern Recognition · Computer Science 2022-08-22 Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Chen Wu, Xiujun Shu, Bo Ren

Learning with Unmasked Tokens Drives Stronger Vision Learners

Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIMs such as Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens for the encoder to process, with the decoder…

Computer Vision and Pattern Recognition · Computer Science 2024-08-27 Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han

Vision language models have difficulty recognizing virtual objects

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question…

Computer Vision and Pattern Recognition · Computer Science 2025-05-16 Tyler Tran, Sangeet Khemlani, J. G. Trafton

Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation

In Vision-and-Language Navigation (VLN), researchers typically take an image encoder pre-trained on ImageNet without fine-tuning on the environments that the agent will be trained or tested on. However, the distribution shift between the…

Computer Vision and Pattern Recognition · Computer Science 2022-11-22 Chia-Wen Kuo, Chih-Yao Ma, Judy Hoffman, Zsolt Kira

Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

Vision-language models (VLMs) pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with specific objects or scenarios. Current methods, which focus on modifying inputs and monitoring changes in…

Artificial Intelligence · Computer Science 2025-06-09 Zhaotian Weng, Zijun Gao, Jerone Andrews, Jieyu Zhao

Language-Informed Visual Concept Learning

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along…

Computer Vision and Pattern Recognition · Computer Science 2024-04-04 Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu

Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models

Vision-language models (VLMs) serve as foundation models for multi-modal applications such as image captioning and text-to-image generation. Recent studies have highlighted limitations in VLM text encoders, particularly in areas like…

Computer Vision and Pattern Recognition · Computer Science 2024-12-12 Sri Harsha Dumpala, David Arps, Sageev Oore, Laura Kallmeyer, Hassan Sajjad

What's in the Image? A Deep-Dive into the Vision of Vision Language Models

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2024-11-27 Omri Kaduri, Shai Bagon, Tali Dekel

Do Vision Language Models Need to Process Image Tokens?

Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational…

Computer Vision and Pattern Recognition · Computer Science 2026-04-13 Sambit Ghosh, R. Venkatesh Babu, Chirag Agarwal

Perception Encoder: The best visual embeddings are not at the output of the network

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each…

Computer Vision and Pattern Recognition · Computer Science 2025-04-30 Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the…

Computer Vision and Pattern Recognition · Computer Science 2024-09-23 Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a…

Computer Vision and Pattern Recognition · Computer Science 2024-05-07 Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang