Related papers: Diffusion Action Segmentation

Temporal Segment Transformer for Action Segmentation

Recognizing human actions from untrimmed videos is an important task in activity understanding, and poses unique challenges in modeling long-range temporal relations. Recent works adopt a predict-and-refine strategy which converts an…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-28 · Zhichao Liu, Leshan Wang, Desen Zhou, Jian Wang, Songyang Zhang, Yang Bai, Errui Ding, Rui Fan

DiffAnt: Diffusion Models for Action Anticipation

Anticipating future actions is inherently uncertain. Given an observed video segment containing ongoing actions, multiple subsequent actions can plausibly follow. This uncertainty becomes even larger when predicting far into the future.…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-28 · Zeyun Zhong, Chengzhi Wu, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer

Flexible Diffusion Modeling of Long Videos

We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can at test-time sample…

Computer Vision and Pattern Recognition · Computer Science · 2022-12-19 · William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, Frank Wood

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-06 · Dayoung Gong, Suha Kwak, Minsu Cho

ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos

We present ActionDiffusion -- a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account in a diffusion model for procedure planning. This…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-23 · Lei Shi, Paul Bürkner, Andreas Bulling

DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD in short. Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video. This presents a…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-17 · Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang

Learning Action Hierarchies via Hybrid Geometric Diffusion

Temporal action segmentation is a critical task in video understanding, where the goal is to assign action labels to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-06 · Arjun Ramesh Kaushik, Nalini K. Ratha, Venu Govindaraju

Video Diffusion Models

Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial…

Computer Vision and Pattern Recognition · Computer Science · 2022-06-24 · Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet

Distill and Collect for Semi-Supervised Temporal Action Segmentation

Recent temporal action segmentation approaches need frame annotations during training to be effective. These annotations are very expensive and time-consuming to obtain. This limits their performances when only limited annotated data is…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-04 · Sovan Biswas, Anthony Rhodes, Ramesh Manuvinakurike, Giuseppe Raffa, Richard Beckwith

Exploring Iterative Refinement with Diffusion Models for Video Grounding

Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-01 · Xiao Liang, Tao Shi, Yaoyuan Liang, Te Tao, Shao-Lun Huang

Hierarchical Attention Network for Action Segmentation

The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in the video. Several attempts have been made to capture frame-level salient aspects through attention but they lack the…

Computer Vision and Pattern Recognition · Computer Science · 2020-05-08 · Harshala Gammulle, Simon Denman, Sridha Sridharan, Clinton Fookes

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exist, capturing intricate tasks…

Machine Learning · Computer Science · 2024-10-10 · Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li

Denoising Diffusion Semantic Segmentation with Mask Prior Modeling

The evolution of semantic segmentation has long been dominated by learning more discriminative image representations for classifying each pixel. Despite the prominent advancements, the priors of segmentation masks themselves, e.g.,…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-23 · Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang

ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling

In recent years, video generation has seen significant advancements. However, challenges still persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-12 · Qihao Liu, Ju He, Qihang Yu, Liang-Chieh Chen, Alan Yuille

Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

Diffusion models have recently achieved great success in the synthesis of high-quality images and videos. However, the existing denoising techniques in diffusion models are commonly based on step-by-step noise predictions, which suffers…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-15 · Hancheng Ye, Jiakang Yuan, Renqiu Xia, Xiangchao Yan, Tao Chen, Junchi Yan, Botian Shi, Bo Zhang

AdaDiff: Adaptive Step Selection for Fast Diffusion Models

Diffusion models, as a type of generative model, have achieved impressive results in generating images and videos conditioned on textual conditions. However, the generation process of diffusion models involves denoising dozens of steps to…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-31 · Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang

FIFO-Diffusion: Generating Infinite Videos from Text without Training

We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-05 · Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han

Conditional Video Generation for High-Efficiency Video Compression

Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-26 · Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, Xuelong Li

SummDiff: Generative Modeling of Video Summarization with Diffusion

Video summarization is a task of shortening a video by choosing a subset of frames while preserving its essential moments. Despite the innate subjectivity of the task, previous works have deterministically regressed to an averaged frame…

Machine Learning · Computer Science · 2025-10-10 · Kwanseok Kim, Jaehoon Hahm, Sumin Kim, Jinhwan Sul, Byunghak Kim, Joonseok Lee

DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection

Video moment retrieval and highlight detection have received attention in the current era of video content proliferation, aiming to localize moments and estimate clip relevances based on user-specific queries. Given that the video content…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-05 · Henghao Zhao, Kevin Qinghong Lin, Rui Yan, Zechao Li