Related papers: Implicit Temporal Modeling with Learnable Alignmen…

MILA: Multi-Task Learning from Videos via Efficient Inter-Frame Attention

Prior work in multi-task learning has mainly focused on predictions on a single image. In this work, we present a new approach for multi-task learning from videos via efficient inter-frame local attention (MILA). Our approach contains a…

Computer Vision and Pattern Recognition · Computer Science 2021-10-12 Donghyun Kim, Tian Lan, Chuhang Zou, Ning Xu, Bryan A. Plummer, Stan Sclaroff, Jayan Eledath, Gerard Medioni

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the…

Computer Vision and Pattern Recognition · Computer Science 2026-04-30 Nikita Araslanov, Martin Sundermeyer, Hidenobu Matsuki, David Joseph Tan, Federico Tombari

Advancing Analytic Class-Incremental Learning through Vision-Language Calibration

Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy is often…

Machine Learning · Computer Science 2026-05-08 Binyu Zhao, Wei Zhang, Xingrui Yu, Zhaonian Zou, Ivor Tsang

LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling

Recent large-scale video-language pre-trained models have shown appealing performance on various downstream tasks. However, the pre-training process is computationally expensive due to the requirement of millions of video-text pairs and the…

Computer Vision and Pattern Recognition · Computer Science 2022-10-24 Dongsheng Chen, Chaofan Tao, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu

Learning by Aligning Videos in Time

We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of temporal alignment…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram Najam Syed, Andrey Konin, Muhammad Zeeshan Zia, Quoc-Huy Tran

Time-Contrastive Pretraining for In-Context Image and Video Segmentation

In-context learning (ICL) enables generalization to new tasks with minimal labeled data. However, mainstream ICL approaches rely on a gridding strategy, which lacks the flexibility required for vision applications. We introduce Temporal, a…

Computer Vision and Pattern Recognition · Computer Science 2025-06-24 Assefa Wahd, Jacob Jaremko, Abhilash Hareendranathan

Learning Implicit Temporal Alignment for Few-shot Video Classification

Few-shot video classification aims to learn new video categories with only a few labeled examples, alleviating the burden of costly annotation in real-world applications. However, it is particularly challenging to learn a class-invariant…

Computer Vision and Pattern Recognition · Computer Science 2021-05-12 Songyang Zhang, Jiale Zhou, Xuming He

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pretraining mainly focus on short-form videos (i.e., within 30 seconds) and sentences,…

Computer Vision and Pattern Recognition · Computer Science 2023-03-03 Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu

LITA: Language Instructed Temporal-Localization Assistant

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important missing piece is temporal…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz

Expanding Language-Image Pretrained Models for General Video Recognition

Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to…

Computer Vision and Pattern Recognition · Computer Science 2022-08-05 Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling

Alignment-guided Temporal Attention for Video Action Recognition

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more…

Computer Vision and Pattern Recognition · Computer Science 2023-01-03 Yizhou Zhao, Zhenyang Li, Xun Guo, Yan Lu

FILIP: Fine-grained Interactive Language-Image Pre-Training

Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either via the similarity of the global feature of each modality which…

Computer Vision and Pattern Recognition · Computer Science 2021-11-16 Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu

STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding

Pre-trained on tremendous image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Zichen Liu, Kunlun Xu, Bing Su, Xu Zou, Yuxin Peng, Jiahuan Zhou

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Lewei Lu, Bin Li, Jie Zhou, Yu Qiao, Jifeng Dai

Is Temporal Prompting All We Need For Limited Labeled Action Recognition?

Video understanding has shown remarkable improvements in recent years, largely dependent on the availability of large-scale labeled datasets. Recent advancements in visual-language models, especially based on contrastive pretraining, have…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Shreyank N Gowda, Boyan Gao, Xiao Gu, Xiaobo Jin

Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics

Implicit Neural Networks (INRs) have emerged as powerful representations to encode all forms of data, including images, videos, audios, and scenes. With video, many INRs for video have been proposed for the compression task, and recent…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Shishira R Maiya, Anubhav Gupta, Matthew Gwilliam, Max Ehrlich, Abhinav Shrivastava

Live Interactive Training for Video Segmentation

Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state-of-the-art models like SAM2 use corrections only…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Xinyu Yang, Haozheng Yu, Yihong Sun, Bharath Hariharan, Jennifer J. Sun

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from…

Robotics · Computer Science 2026-01-08 Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang

VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao

LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation

Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Mirlan Karimov, Teodora Spasojevic, Markus Braun, Julian Wiederer, Vasileios Belagiannis, Marc Pollefeys