Related papers: Efficient Universal Perception Encoder

Perception Encoder: The best visual embeddings are not at the output of the network

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-30 · Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

Large-scale Transformer models bring significant improvements for various downstream vision language tasks with a unified architecture. The performance improvements come with increasing model size, resulting in slow inference speed and…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-04 · Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu

Unifying Visual Perception by Dispersible Points Learning

We present a conceptually simple, flexible, and universal visual perception head for variant visual tasks, e.g., classification, object detection, instance segmentation and pose estimation, and different frameworks, such as one-stage or…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-13 · Jianming Liang, Guanglu Song, Biao Leng, Yu Liu

Unified Camera Positional Encoding for Controlled Video Generation

Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-27 · Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai

UNIT: Unifying Image and Text Recognition in One Vision Encoder

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-09 · Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-18 · Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-01 · Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang

Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-16 · Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

Entropy-Based Feature Extraction For Real-Time Semantic Segmentation

This paper introduces an efficient patch-based computational module, coined Entropy-based Patch Encoder (EPE) module, for resource-constrained semantic segmentation. The EPE module consists of three lightweight fully-convolutional encoders,…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-08 · Lusine Abrahamyan, Nikos Deligiannis

Universal Medical Image Representation Learning with Compositional Decoders

Visual-language models have advanced the development of universal models, yet their application in medical imaging remains constrained by specific functional requirements and the limited data. Current general-purpose models are typically…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-08 · Kaini Wang, Ling Yang, Siping Zhou, Guangquan Zhou, Wentao Zhang, Bin Cui, Shuo Li

Aligning and Prompting Everything All at Once for Universal Visual Perception

Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-05 · Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji

Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-20 · Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos G. Derpanis

UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-13 · Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu

UniEDU: A Unified Language and Vision Assistant for Education Applications

Education materials for K-12 students often consist of multiple modalities, such as text and images, posing challenges for models to fully understand nuanced information in these materials. In this paper, we propose a unified language and…

Computation and Language · Computer Science · 2025-10-10 · Zhendong Chu, Jian Xie, Shen Wang, Zichao Wang, Qingsong Wen

Tagger: Deep Unsupervised Perceptual Grouping

We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features. Rather than being trained for any specific segmentation, our framework learns the grouping process in an…

Computer Vision and Pattern Recognition · Computer Science · 2016-11-29 · Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, Jürgen Schmidhuber, Harri Valpola

Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-03 · Huy-Dung Nguyen, Anass Bairouk, Mirjana Maras, Wei Xiao, Tsun-Hsuan Wang, Patrick Chareyre, Ramin Hasani, Marc Blanchon, Daniela Rus

Visual Bridge: Universal Visual Perception Representations Generating

Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-12 · Yilin Gao, Shuguang Dou, Junzhou Li, Zhiheng Yu, Yin Li, Dongsheng Jiang, Shugong Xu

SeqPE: Transformer with Sequential Position Encoding

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position…

Machine Learning · Computer Science · 2025-06-18 · Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe

CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing

Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our…

Computation and Language · Computer Science · 2025-08-22 · Abdul Rehman, Jian-Jun Zhang, Xiaosong Yang

Selective Visual Representations Improve Convergence and Generalization for Embodied AI

Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-12 · Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ali Farhadi, Ani Kembhavi, Ranjay Krishna