Related papers: Space Time Recurrent Memory Network

Video World Models with Long-term Spatial Memory

Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to…

Computer Vision and Pattern Recognition · Computer Science 2025-06-06 Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

Dual Temporal Memory Network for Efficient Video Object Segmentation

Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment the single or multiple objects of interest in the…

Computer Vision and Pattern Recognition · Computer Science 2020-03-16 Kaihua Zhang, Long Wang, Dong Liu, Bo Liu, Qingshan Liu, Zhu Li

Space-time Reinforcement Network for Video Object Segmentation

Recently, video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Despite these methods having superior performance, they suffer from…

Computer Vision and Pattern Recognition · Computer Science 2024-05-08 Yadang Chen, Wentao Zhu, Zhi-Xin Yang, Enhua Wu

Long-Context State-Space Video World Models

Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, Xun Huang

Robust and Efficient Memory Network for Video Object Segmentation

This paper proposes a Robust and Efficient Memory Network, referred to as REMN, for studying semi-supervised video object segmentation (VOS). Memory-based methods have recently achieved outstanding VOS performance by performing non-local…

Computer Vision and Pattern Recognition · Computer Science 2023-04-25 Yadang Chen, Dingwei Zhang, Zhi-xin Yang, Enhua Wu

Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking

Real-time video analysis remains a challenging problem in computer vision, requiring efficient processing of both spatial and temporal information while maintaining computational efficiency. Existing approaches often struggle to balance…

Computer Vision and Pattern Recognition · Computer Science 2025-07-31 Shahla John

Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation

This paper presents a simple yet effective approach to modeling space-time correspondences in the context of video object segmentation. Unlike most existing approaches, we establish correspondences directly between frames without…

Computer Vision and Pattern Recognition · Computer Science 2021-10-11 Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang

Spatiotemporal Residual Networks for Video Action Recognition

Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2016-11-08 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

HumMUSS: Human Motion Understanding using State Space Models

Understanding human motion from video is essential for a range of applications, including pose estimation, mesh recovery and action recognition. While state-of-the-art methods predominantly rely on transformer-based architectures, these…

Computer Vision and Pattern Recognition · Computer Science 2024-04-18 Arnab Kumar Mondal, Stefano Alletto, Denis Tome

Video Object Segmentation using Space-Time Memory Networks

We propose a novel solution for semi-supervised video object segmentation. By the nature of the problem, available cues (e.g. video frame(s) with object masks) become richer with the intermediate predictions. However, the existing methods…

Computer Vision and Pattern Recognition · Computer Science 2019-08-13 Seoung Wug Oh, Joon-Young Lee, Ning Xu, Seon Joo Kim

Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models

Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient alternative, introducing a vertically chunked…

Computation and Language · Computer Science 2026-04-21 Tobias Grantner, Emanuel Sallinger, Martin Flechl

PatchBlender: A Motion Prior for Video Transformers

Transformers have become one of the dominant architectures in the field of computer vision. However, there are still several challenges when applying such architectures to video data. Most notably, these models struggle to model the temporal…

Computer Vision and Pattern Recognition · Computer Science 2023-02-14 Gabriele Prato, Yale Song, Janarthanan Rajendran, R Devon Hjelm, Neel Joshi, Sarath Chandar

An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement

Video enhancement is a challenging problem, more than that of stills, mainly due to high computational cost, larger data volumes and the difficulty of achieving consistency in the spatio-temporal domain. In practice, these challenges are…

Image and Video Processing · Electrical Eng. & Systems 2022-12-13 Dario Fuoli, Zhiwu Huang, Danda Pani Paudel, Luc Van Gool, Radu Timofte

Adaptive Memory Management for Video Object Segmentation

Matching-based networks have achieved state-of-the-art performance for video object segmentation (VOS) tasks by storing every-k frames in an external memory bank for future inference. Storing the intermediate frames' predictions provides…

Computer Vision and Pattern Recognition · Computer Science 2022-04-15 Ali Pourganjalikhan, Charalambos Poullis

Learning Trajectory-Aware Transformer for Video Super-Resolution

Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, there are grand challenges to effectively utilize temporal dependency…

Image and Video Processing · Electrical Eng. & Systems 2022-04-21 Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

Local Frequency Domain Transformer Networks for Video Prediction

Video prediction is commonly referred to as forecasting future frames of a video sequence provided several past frames thereof. It remains a challenging domain as visual scenes evolve according to complex underlying dynamics, such as the…

Computer Vision and Pattern Recognition · Computer Science 2021-05-12 Hafez Farazi, Jan Nogga, Sven Behnke

Memformer: A Memory-Augmented Transformer for Sequence Modeling

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network…

Computation and Language · Computer Science 2022-04-14 Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, Zhou Yu

ViStripformer: A Token-Efficient Transformer for Versatile Video Restoration

Video restoration is a low-level vision task that seeks to restore clean, sharp videos from quality-degraded frames. One would use the temporal information from adjacent frames to make video restoration successful. Recently, the success of…

Computer Vision and Pattern Recognition · Computer Science 2023-12-25 Fu-Jen Tsai, Yan-Tsung Peng, Chen-Yu Chang, Chan-Yu Li, Yen-Yu Lin, Chung-Chi Tsai, Chia-Wen Lin

Vision-Language Memory for Spatial Reasoning

Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, Chen Wang