Related papers: Network-Efficient World Model Token Streaming

200 papers

Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value…

Computer Vision and Pattern Recognition · Computer Science 2026-02-23 Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain…

Computer Vision and Pattern Recognition · Computer Science 2026-04-07 Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, Liang-Chieh Chen

Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Weilun Feng, Guoxin Fan, Haotong Qin, Chuanguang Yang, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu

Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Xueyi Chen, Keda Tao, Kele Shao, Huan Wang

Closed-loop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy…

Robotics · Computer Science 2026-03-19 Chaokang Jiang, Desen Zhou, Jiuming Liu, Kevin Li Sun

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse…

Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with…

Sound · Computer Science 2025-07-02 Dake Guo, Jixun Yao, Linhan Ma, He Wang, Lei Xie

Learning, prediction, and compression are intimately connected: a model that accurately predicts the next symbol in a sequence can be coupled with a source coder to compress that sequence near its information-theoretic limit. When tokenized…

Information Theory · Computer Science 2026-05-05 Vishnu Teja Kunde, Jean-Francois Chamberland, Krishna R. Narayanan, Jamison Ebert

We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Geewook Kim, Minjoon Seo

Attention mechanism has been crucial for image diffusion models, however, their quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the…

Computer Vision and Pattern Recognition · Computer Science 2024-05-09 Ethan Smith, Nayan Saxena, Aninda Saha

This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-25 Pin-Jui Ku, He Huang, Jean-Marie Lemercier, Subham Sekhar Sahoo, Zhehuai Chen, Ante Jukić

We consider streaming data transmission over a discrete memoryless channel. A new message is given to the encoder at the beginning of each block and the decoder decodes each message sequentially, after a delay of $T$ blocks. In this…

Information Theory · Computer Science 2015-12-22 Si-Hyeon Lee, Vincent Y. F. Tan, Ashish Khisti

Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio, thanks to their temporally uni-directional attention mechanism, which models correlations between the current token and previous…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen

Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang

Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent…

Computer Vision and Pattern Recognition · Computer Science 2025-06-27 Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, Zicheng Liu

Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer

The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation. Despite various attempts at sampler optimization, model distillation, and network quantification, these…

Computer Vision and Pattern Recognition · Computer Science 2024-06-18 Jinchao Zhu, Yuxuan Wang, Siyuan Pan, Pengfei Wan, Di Zhang, Gao Huang

World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned…

Computer Vision and Pattern Recognition · Computer Science 2026-03-06 Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak

Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, Raquel Urtasun

Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they…

Machine Learning · Computer Science 2025-06-18 Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella