Related papers: Communication-Inspired Tokenization for Structured…

CAT: Content-Adaptive Image Tokenization

Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-07 · Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou

Tokenize Image as a Set

This paper proposes a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-21 · Zigang Geng, Mengde Xu, Han Hu, Shuyang Gu

Visual Concepts Tokenization

Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-14 · Tao Yang, Yuwang Wang, Yan Lu, Nanning Zheng

Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs), where tokens serve as unified units of communication and computation, enabling…

Information Theory · Computer Science · 2026-03-04 · Jingxuan Men, Mahdi Boloursaz Mashhadi, Ning Wang, Yi Ma, Mike Nilsson, Rahim Tafazolli

TokenCompose: Text-to-Image Diffusion with Token-level Supervision

We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-25 · Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu

Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation

Recent advances in generative recommenders adopt a two-stage paradigm: items are first tokenized into semantic IDs using a pretrained tokenizer, and then large language models (LLMs) are trained to generate the next item via…

Information Retrieval · Computer Science · 2026-05-05 · Yifan Liu, Yaokun Liu, Zelin Li, Zhenrui Yue, Gyuseok Lee, Ruichen Yao, Yang Zhang, Dong Wang

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-14 · Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang

Efficient Semantic Communication Through Transformer-Aided Compression

Transformers, known for their attention mechanisms, have proven highly effective in focusing on critical elements within complex data. This feature can effectively be used to address the time-varying channels in wireless communication…

Machine Learning · Computer Science · 2024-12-03 · Matin Mortaheb, Mohammad A. Amir Khojastepour, Sennur Ulukus

Spectral Image Tokenizer

Image tokenizers map images to sequences of discrete tokens, and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial locations in the input image, arranged in raster…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-12 · Carlos Esteves, Mohammed Suhail, Ameesh Makadia

Interpreting the structure of multi-object representations in vision encoders

In this work, we interpret the representations of multi-object scenes in vision encoders through the lens of structured representations. Structured representations allow modeling of individual objects distinctly and their flexible use based…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-08 · Tarun Khajuria, Braian Olmiro Dias, Marharyta Domnich, Jaan Aru

Joint Semantic-Channel Coding and Modulation for Token Communications

In recent years, the Transformer architecture has achieved outstanding performance across a wide range of tasks and modalities. Token is the unified input and output representation in Transformer-based models, which has become a fundamental…

Signal Processing · Electrical Eng. & Systems · 2025-11-20 · Jingkai Ying, Zhijin Qin, Yulong Feng, Liejun Wang, Xiaoming Tao

CODA: Repurposing Continuous VAEs for Discrete Tokenization

Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-01 · Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang

Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-31 · Guangfu Guo, Xiaoqian Lu, Yue Feng, Mingming Sun

SWAT: Spatial Structure Within and Among Tokens

Modeling visual data as tokens (i.e., image patches) using attention mechanisms, feed-forward networks or convolutions has been highly effective in recent years. Such methods usually have a common pipeline: a tokenization method, followed…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-21 · Kumara Kahatapitiya, Michael S. Ryoo

SFTok: Bridging the Performance Gap in Discrete Tokenizers

Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-19 · Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-20 · Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi

UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-13 · Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu

On the Role of Discrete Tokenization in Visual Representation Learning

In the realm of self-supervised learning (SSL), masked image modeling (MIM) has gained popularity alongside contrastive learning methods. MIM involves reconstructing masked regions of input images using their unmasked portions. A notable…

Machine Learning · Computer Science · 2024-07-15 · Tianqi Du, Yifei Wang, Yisen Wang

TMCIR: Token Merge Benefits Composed Image Retrieval

Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information.…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-16 · Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, Shichao Kan

Morphing Tokens Draw Strong Masked Image Models

Masked image modeling (MIM) has emerged as a promising approach for pre-training Vision Transformers (ViTs). MIMs predict masked tokens token-wise to recover target signals that are tokenized from images or generated by pre-trained models…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-24 · Taekyung Kim, Byeongho Heo, Dongyoon Han