Related papers: Efficient Universal Perception Encoder

200 papers

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each…

Large-scale Transformer models bring significant improvements for various downstream vision language tasks with a unified architecture. The performance improvements come with increasing model size, resulting in slow inference speed and…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-04 · Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu

We present a conceptually simple, flexible, and universal visual perception head for variant visual tasks, e.g., classification, object detection, instance segmentation and pose estimation, and different frameworks, such as one-stage or…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-13 · Jianming Liang, Guanglu Song, Biao Leng, Yu Liu

Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-27 · Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-09 · Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-18 · Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-01 · Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang

We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-16 · Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

This paper introduces an efficient patch-based computational module, coined Entropy-based Patch Encoder (EPE) module, for resource-constrained semantic segmentation. The EPE module consists of three lightweight fully-convolutional encoders,…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-08 · Lusine Abrahamyan, Nikos Deligiannis

Visual-language models have advanced the development of universal models, yet their application in medical imaging remains constrained by specific functional requirements and the limited data. Current general-purpose models are typically…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-08 · Kaini Wang, Ling Yang, Siping Zhou, Guangquan Zhou, Wentao Zhang, Bin Cui, Shuo Li

Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-05 · Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji

We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-20 · Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos G. Derpanis

Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-13 · Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu

Education materials for K-12 students often consist of multiple modalities, such as text and images, posing challenges for models to fully understand nuanced information in these materials. In this paper, we propose a unified language and…

Computation and Language · Computer Science · 2025-10-10 · Zhendong Chu, Jian Xie, Shen Wang, Zichao Wang, Qingsong Wen

We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features. Rather than being trained for any specific segmentation, our framework learns the grouping process in an…

Computer Vision and Pattern Recognition · Computer Science · 2016-11-29 · Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, Jürgen Schmidhuber, Harri Valpola

Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-03 · Huy-Dung Nguyen, Anass Bairouk, Mirjana Maras, Wei Xiao, Tsun-Hsuan Wang, Patrick Chareyre, Ramin Hasani, Marc Blanchon, Daniela Rus

Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-12 · Yilin Gao, Shuguang Dou, Junzhou Li, Zhiheng Yu, Yin Li, Dongsheng Jiang, Shugong Xu

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position…

Machine Learning · Computer Science · 2025-06-18 · Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe

Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our…

Computation and Language · Computer Science · 2025-08-22 · Abdul Rehman, Jian-Jun Zhang, Xiaosong Yang

Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-12 · Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ali Farhadi, Ani Kembhavi, Ranjay Krishna