Related papers: Single-pass Adaptive Image Tokenization for Minimu…

200 papers

Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts…

Computer Vision and Pattern Recognition · Computer Science 2025-01-07 Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou

Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that…

Information Retrieval · Computer Science 2026-01-16 Mikel Williams-Lekuona, Georgina Cosma

Previous work on action representation learning focused on global representations for short video clips. In contrast, many practical applications, such as video alignment, strongly demand learning the intensive representation of long…

Computer Vision and Pattern Recognition · Computer Science 2023-03-03 Minghao Chen, Renbo Tu, Chenxi Huang, Yuqi Lin, Boxi Wu, Deng Cai

Learned lossless image compression has achieved significant advancements in recent years. However, existing methods often rely on training amortized generative models on massive datasets, resulting in sub-optimal probability distribution…

Computer Vision and Pattern Recognition · Computer Science 2024-12-24 Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu, Wen Gao

Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational…

Computer Vision and Pattern Recognition · Computer Science 2024-11-05 Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman

We introduce Consistent Assignment for Representation Learning (CARL), an unsupervised learning method to learn visual representations by combining ideas from self-supervised contrastive learning and deep clustering. By viewing contrastive…

Machine Learning · Computer Science 2023-10-23 Thalles Silva, Adín Ramírez Rivera

Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm…

Machine Learning · Computer Science 2026-05-12 Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua

Prior works on action representation learning mainly focus on designing various architectures to extract the global representations for short video clips. In contrast, many practical applications such as video alignment have strong demand…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Minghao Chen, Fangyun Wei, Chong Li, Deng Cai

Accurate and efficient discrete video tokenization is essential for long video sequences processing. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu

The vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing. More tokens normally lead to better performance but considerably…

Computer Vision and Pattern Recognition · Computer Science 2021-12-07 Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng, Jian Tang

We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to its structural complexity and level of detail. STAT encodes an image into a…

Computer Vision and Pattern Recognition · Computer Science 2026-01-21 Zeyuan Chen, Kai Zhang, Zhuowen Tu, Yuanjun Xiong

The vision transformer is a model that breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in…

Machine Learning · Computer Science 2023-07-06 Qiqi Zhou, Yichen Zhu

Heterogeneous networks not only present a challenge of heterogeneity in the types of nodes and relations, but also the attributes and content associated with the nodes. While recent works have looked at representation learning on…

Social and Information Networks · Computer Science 2018-05-15 Chuxu Zhang, Ananthram Swami, Nitesh V. Chawla

Knowledge-Intensive Visual Grounding (KVG) requires models to localize objects using fine-grained, domain-specific entity names rather than generic referring expressions. Although Multimodal Large Language Models (MLLMs) possess rich entity…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Zhen Zhao, Xiaoyi Feng, Maosong Sun

The ability to find short representations, i.e. to compress data, is crucial for many intelligent systems. We present a theory of incremental compression showing that arbitrary data strings, that can be described by a set of features, can…

Information Theory · Computer Science 2020-09-15 Arthur Franz, Oleksandr Antonenko, Roman Soletskyi

Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their…

Computer Vision and Pattern Recognition · Computer Science 2025-11-10 Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, Steven Sloan, Kendall N. Niles, Ken Pathak

Multimodal Large Language Models (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based…

Computer Vision and Pattern Recognition · Computer Science 2025-03-14 Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen

Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo

Human-driven vehicles (HVs) exhibit complex and diverse behaviors. Accurately modeling such behavior is crucial for validating Robot Vehicles (RVs) in simulation and realizing the potential of mixed traffic control. However, existing…

Robotics · Computer Science 2024-07-10 Bibek Poudel, Weizi Li, Shuai Li

Current image tokenization methods require a large number of tokens to capture the information contained within images. Although the amount of information varies across images, most image tokenizers only support fixed-length tokenization,…

Computer Vision and Pattern Recognition · Computer Science 2025-01-20 Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, Yu Yamaguchi