Related papers: DiffVC: A Non-autoregressive Framework Based on Di…

DiffCap: Exploring Continuous Diffusion on Image Captioning

Current image captioning works usually focus on generating descriptions in an autoregressive manner. However, there are limited works that focus on generating descriptions non-autoregressively, which brings more decoding diversity. Inspired…

Computer Vision and Pattern Recognition · Computer Science 2023-05-23 Yufeng He, Zefan Cai, Xu Gan, Baobao Chang

Masked Non-Autoregressive Image Captioning

Existing captioning models often adopt the encoder-decoder architecture, where the decoder uses autoregressive decoding to generate captions, such that each token is generated sequentially given the preceding generated tokens. However,…

Computer Vision and Pattern Recognition · Computer Science 2019-06-04 Junlong Gao, Xi Meng, Shiqi Wang, Xia Li, Shanshe Wang, Siwei Ma, Wen Gao

Non-Autoregressive Coarse-to-Fine Video Captioning

It is encouraged to see that progress has been made to bridge videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer…

Computer Vision and Pattern Recognition · Computer Science 2021-03-25 Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang

Exploring Discrete Diffusion Models for Image Captioning

The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility. Unlike image…

Computer Vision and Pattern Recognition · Computer Science 2022-12-12 Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu

Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, Long Chen

Exploring Iterative Refinement with Diffusion Models for Video Grounding

Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span…

Computer Vision and Pattern Recognition · Computer Science 2024-01-01 Xiao Liang, Tao Shi, Yaoyuan Liang, Te Tao, Shao-Lun Huang

Towards Diverse and Efficient Audio Captioning via Diffusion Models

We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive diffusion model tailored for diverse and efficient audio captioning. Although existing captioning models relying on language backbones have achieved remarkable…

Computation and Language · Computer Science 2025-06-03 Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Ruibo Fu, Wei Liang, Dong Yu

Fast Image Caption Generation with Position Alignment

Recent neural network models for image captioning usually employ an encoder-decoder architecture, where the decoder adopts a recursive sequence decoding way. However, such autoregressive decoding may result in sequential error accumulation…

Computer Vision and Pattern Recognition · Computer Science 2019-12-16 Zheng-cong Fei

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first…

Computer Vision and Pattern Recognition · Computer Science 2026-01-15 Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

DiffVC-OSD: One-Step Diffusion-based Perceptual Neural Video Compression Framework

In this work, we first propose DiffVC-OSD, a One-Step Diffusion-based Perceptual Neural Video Compression framework. Unlike conventional multi-step diffusion-based methods, DiffVC-OSD feeds the reconstructed latent representation directly…

Image and Video Processing · Electrical Eng. & Systems 2025-08-12 Wenzhuo Ma, Zhenzhong Chen

Progressive Autoregressive Video Diffusion Models

Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, Yang Zhou

Streaming Autoregressive Video Generation via Diagonal Distillation

Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but…

Computer Vision and Pattern Recognition · Computer Science 2026-03-12 Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu

Contextualized Diffusion Models for Text-Guided Image and Video Generation

Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual…

Computer Vision and Pattern Recognition · Computer Science 2024-06-05 Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, Bin Cui

Diff-3DCap: Shape Captioning with Diffusion Models

The task of 3D shape captioning occupies a significant place within the domain of computer graphics and has garnered considerable interest in recent years. Traditional approaches to this challenge frequently depend on the utilization of…

Graphics · Computer Science 2025-09-30 Zhenyu Shu, Jiawei Wen, Shiyang Li, Shiqing Xin, Ligang Liu

IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text Recognition

Nowadays, scene text recognition has attracted more and more attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Xiaomeng Yang, Zhi Qiao, Yu Zhou

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence,…

Computer Vision and Pattern Recognition · Computer Science 2025-09-25 Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, Xun Huang

Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work,…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Amirhosein Javadi, Shirin Saeedi Bidokhti, Tara Javidi

Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance

Modern video codecs and learning-based approaches struggle for semantic reconstruction at extremely low bit-rates due to reliance on low-level spatiotemporal redundancies. Generative models, especially diffusion models, offer a new paradigm…

Image and Video Processing · Electrical Eng. & Systems 2026-02-06 Maojun Zhang, Haotian Wu, Richeng Jin, Deniz Gunduz, Krystian Mikolajczyk

CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis

Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and…

Computer Vision and Pattern Recognition · Computer Science 2025-09-09 Xin Kong, Daniel Watson, Yannick Strümpler, Michael Niemeyer, Federico Tombari

DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models

Recent advances in image and video creation, especially AI-based image synthesis, have led to the production of numerous visual scenes that exhibit a high level of abstractness and diversity. Consequently, Visual Storytelling (VST), a task…

Computation and Language · Computer Science 2023-12-13 Shengguang Wu, Mei Yuan, Qi Su