Related papers: Spatial-TTT: Streaming Visual-based Spatial Intell…

Test-Time Training on Video Streams

Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a…

Computer Vision and Pattern Recognition · Computer Science 2025-01-07 Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang

Streaming Video Model

Video understanding tasks have traditionally been modeled by two separate architectures, specially tailored for two distinct tasks. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, Zheng-Jun Zha

SViTT: Temporal Learning of Sparse Video-Text Transformers

Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards…

Computer Vision and Pattern Recognition · Computer Science 2023-04-19 Yi Li, Kyle Min, Subarna Tripathi, Nuno Vasconcelos

GISE-TTT: A Framework for Global Information Segmentation and Enhancement

This paper addresses the challenge of capturing global temporal dependencies in long video sequences for Video Object Segmentation (VOS). Existing architectures often fail to effectively model these dependencies across extended temporal…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Fenglei Hao, Yuliang Yang, Ruiyuan Su, Zhengran Zhao, Yukun Qiao, Mengyu Zhu

ViT$^3$: Unlocking Test-Time Training in Vision

Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Dongchen Han, Yining Li, Tianyu Li, Zixuan Cao, Ziming Wang, Jun Song, Yu Cheng, Bo Zheng, Gao Huang

Enhancing Spatial Reasoning through Visual and Textual Thinking

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye

Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang

Towards Long-Form Spatio-Temporal Video Grounding

In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-27 Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang, Cheng Han, Heng Fan, Libo Zhang

Learning from Multiple Cities: A Meta-Learning Approach for Spatial-Temporal Prediction

Spatial-temporal prediction is a fundamental problem for constructing smart cities, which is useful for tasks such as traffic control, taxi dispatching, and environmental policy making. Due to the data collection mechanism, it is common to see…

Machine Learning · Computer Science 2020-08-25 Huaxiu Yao, Yiding Liu, Ying Wei, Xianfeng Tang, Zhenhui Li

STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond

Video prediction aims to predict future frames by modeling the complex spatiotemporal dynamics in videos. However, most of the existing methods only model the temporal information and the spatial information for videos in an independent…

Computer Vision and Pattern Recognition · Computer Science 2022-04-21 Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Wen Gao

video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM

Long-duration streaming video understanding is fundamental for future AI agents, yet remains limited by ineffective long-term memory. We introduce video-SALMONN S, a memory-enhanced streaming audio-visual large language model that processes…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Guangzhi Sun, Yixuan Li, Xiaodong Wu, Yudong Yang, Wei Li, Zejun Ma, Chao Zhang

Visual Spatial Tuning

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-10 Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

Spatio-temporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D).…

Computer Vision and Pattern Recognition · Computer Science 2019-03-05 Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu

Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu

Learning Streaming Video Representation via Multitask Training

Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim

Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding

Vision Transformer models have shown impressive effectiveness in surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs due to processing massive…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Xixi Jiang, Chen Yang, Dong Zhang, Pingcheng Dong, Xin Yang, Kwang-Ting Cheng

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie

Instant Reality: Gaze-Contingent Perceptual Optimization for 3D Virtual Reality Streaming

Media streaming has been adopted for a variety of applications such as entertainment, visualization, and design. Unlike video/audio streaming where the content is usually consumed sequentially, 3D applications such as gaming require…

Human-Computer Interaction · Computer Science 2022-01-11 Shaoyu Chen, Budmonde Duinkharjav, Xin Sun, Li-Yi Wei, Stefano Petrangeli, Jose Echevarria, Claudio Silva, Qi Sun