Related papers: Spatial-TTT: Streaming Visual-based Spatial Intell…

200 papers

Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a…

Computer Vision and Pattern Recognition · Computer Science 2025-01-07 Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang

Video understanding tasks have traditionally been modeled by two separate architectures, specially tailored for two distinct tasks. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, Zheng-Jun Zha

Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards…

Computer Vision and Pattern Recognition · Computer Science 2023-04-19 Yi Li, Kyle Min, Subarna Tripathi, Nuno Vasconcelos

This paper addresses the challenge of capturing global temporal dependencies in long video sequences for Video Object Segmentation (VOS). Existing architectures often fail to effectively model these dependencies across extended temporal…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Fenglei Hao, Yuliang Yang, Ruiyuan Su, Zhengran Zhao, Yukun Qiao, Mengyu Zhu

Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Dongchen Han, Yining Li, Tianyu Li, Zixuan Cao, Ziming Wang, Jun Song, Yu Cheng, Bo Zheng, Gao Huang

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye

Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang

In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-27 Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang, Cheng Han, Heng Fan, Libo Zhang

Spatial-temporal prediction is a fundamental problem for constructing smart cities, and is useful for tasks such as traffic control, taxi dispatching, and environmental policy making. Due to the data collection mechanism, it is common to see…

Machine Learning · Computer Science 2020-08-25 Huaxiu Yao, Yiding Liu, Ying Wei, Xianfeng Tang, Zhenhui Li

Video prediction aims to predict future frames by modeling the complex spatiotemporal dynamics in videos. However, most of the existing methods only model the temporal information and the spatial information for videos in an independent…

Computer Vision and Pattern Recognition · Computer Science 2022-04-21 Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Wen Gao

Long-duration streaming video understanding is fundamental for future AI agents, yet remains limited by ineffective long-term memory. We introduce video-SALMONN S, a memory-enhanced streaming audio-visual large language model that processes…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Guangzhi Sun, Yixuan Li, Xiaodong Wu, Yudong Yang, Wei Li, Zejun Ma, Chao Zhang

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-10 Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao

Spatio-temporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D).…

Computer Vision and Pattern Recognition · Computer Science 2019-03-05 Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu

Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim

Vision Transformer models have shown impressive effectiveness in surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs due to processing massive…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Xixi Jiang, Chen Yang, Dong Zhang, Pingcheng Dong, Xin Yang, Kwang-Ting Cheng

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie

Media streaming has been adopted for a variety of applications such as entertainment, visualization, and design. Unlike video/audio streaming where the content is usually consumed sequentially, 3D applications such as gaming require…

Human-Computer Interaction · Computer Science 2022-01-11 Shaoyu Chen, Budmonde Duinkharjav, Xin Sun, Li-Yi Wei, Stefano Petrangeli, Jose Echevarria, Claudio Silva, Qi Sun