Related papers: Network-Efficient World Model Token Streaming

Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-23 · Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-07 · Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, Liang-Chieh Chen

WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-09 · Weilun Feng, Guoxin Fan, Haotong Qin, Chuanguang Yang, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu

StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-17 · Xueyi Chen, Keda Tao, Kele Shao, Huan Wang

VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs

Closed-loop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy…

Robotics · Computer Science · 2026-03-19 · Chaokang Jiang, Desen Zhou, Jiuming Liu, Kevin Li Sun

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-10 · Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu

StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding

Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with…

Sound · Computer Science · 2025-07-02 · Dake Guo, Jixun Yao, Linhan Ma, He Wang, Lei Xie

Real-Time Text Transmission via LLM-Based Entropy Coding over Fixed-Rate Channels

Learning, prediction, and compression are intimately connected: a model that accurately predicts the next symbol in a sequence can be coupled with a source coder to compress that sequence near its information-theoretic limit. When tokenized…

Information Theory · Computer Science · 2026-05-05 · Vishnu Teja Kunde, Jean-Francois Chamberland, Krishna R. Narayanan, Jamison Ebert

State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models

We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-10 · Geewook Kim, Minjoon Seo

ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

Attention mechanisms have been crucial for image diffusion models; however, their quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-09 · Ethan Smith, Nayan Saxena, Aninda Saha

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens

This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-09-25 · Pin-Jui Ku, He Huang, Jean-Marie Lemercier, Subham Sekhar Sahoo, Zhehuai Chen, Ante Jukić

Streaming Data Transmission in the Moderate Deviations and Central Limit Regimes

We consider streaming data transmission over a discrete memoryless channel. A new message is given to the encoder at the beginning of each block and the decoder decodes each message sequentially, after a delay of $T$ blocks. In this…

Information Theory · Computer Science · 2015-12-22 · Si-Hyeon Lee, Vincent Y. F. Tan, Ashish Khisti

Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio, thanks to their temporally uni-directional attention mechanism, which models correlations between the current token and previous…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-12 · Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-30 · Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang

Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation

Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-27 · Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, Zicheng Liu

TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-14 · Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer

A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies

The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation. Despite various attempts at sampler optimization, model distillation, and network quantification, these…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-18 · Jinchao Zhu, Yuxuan Wang, Siyuan Pan, Pengfei Wan, Di Zhang, Gao Huang

Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-06 · Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak

Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-02 · Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, Raquel Urtasun

On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention

Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they…

Machine Learning · Computer Science · 2025-06-18 · Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella