Related papers: DODO: Discrete OCR Diffusion Models

DECDM: Document Enhancement using Cycle-Consistent Diffusion Models

The performance of optical character recognition (OCR) heavily relies on document image quality, which is crucial for automatic document processing and document intelligence. However, most existing document enhancement methods require…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-17 · Jiaxin Zhang, Joy Rimchala, Lalla Mouatadid, Kamalika Das, Sricharan Kumar

Efficient OCR for Building a Diverse Digital History

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR)…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-29 · Jacob Carlson, Tom Bryan, Melissa Dell

Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics…

Computation and Language · Computer Science · 2026-04-13 · Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang, Jincheng Yu, Jose M. Alvarez, Pavlo Molchanov, Ping Luo, Song Han, Ligeng Zhu, Enze Xie

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-25 · Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He

TransDocs: Optical Character Recognition with word to word translation

While OCR has been used in various applications, its output is not always accurate, leading to misfit words. This research work focuses on improving the optical character recognition (OCR) with ML techniques with integration of OCR with…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-18 · Abhishek Bamotra, Phani Krishna Uppala

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-19 · Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation…

Robotics · Computer Science · 2026-05-14 · Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen, Xiangyu Xu

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-11 · Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi

Reversible Diffusion Decoding for Diffusion Language Models

Diffusion language models enable parallel token generation through block-wise decoding, but their irreversible commitments can lead to stagnation, where the reverse diffusion process fails to make further progress under a suboptimal…

Computation and Language · Computer Science · 2026-02-03 · Xinyun Wang, Min Zhang, Sen Cui, Zhikang Chen, Bo Jiang, Kun Kuang, Mingbao Lin

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

Recently, the advent of Large Visual-Language Models (LVLMs) has received increasing attention across various domains, particularly in the field of visual document understanding (VDU). Different from conventional vision-language tasks, VDU…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-01 · Xin Li, Yunfei Wu, Xinghua Jiang, Zhihao Guo, Mingming Gong, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun

Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator

While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-24 · Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token…

Machine Learning · Computer Science · 2026-05-19 · Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu

Diffusion Models Need Visual Priors for Image Generation

Conventional class-guided diffusion models generally succeed in generating images with correct semantic content, but often struggle with texture details. This limitation stems from the usage of class priors, which only provide coarse and…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-14 · Xiaoyu Yue, Zidong Wang, Zeyu Lu, Shuyang Sun, Meng Wei, Wanli Ouyang, Lei Bai, Luping Zhou

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-06 · Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma

DocVLM: Make Your VLM an Efficient Reader

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-13 · Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-16 · Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Lianyong Qi, Shi Jin

A Learned-SVD approach for Regularization in Diffuse Optical Tomography

Diffuse Optical Tomography (DOT) is an emerging technology in medical imaging which employs light in the NIR spectrum to estimate the distribution of optical coefficients in biological tissues for diagnostic and monitoring purposes. DOT…

Numerical Analysis · Mathematics · 2022-05-27 · Alessandro Benfenati, Giuseppe Bisazza, Paola Causin

WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering…

Computation and Language · Computer Science · 2025-12-30 · Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, Jie Zhou

Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions

The challenge in fine-grained visual categorization lies in how to explore the subtle differences between different subclasses and achieve accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-16 · Tianxu Wu, Shuo Ye, Shuhuang Chen, Qinmu Peng, Xinge You

Open-Vocabulary Object Detectors: Robustness Challenges under Distribution Shifts

The challenge of Out-Of-Distribution (OOD) robustness remains a critical hurdle towards deploying deep vision models. Vision-Language Models (VLMs) have recently achieved groundbreaking results. VLM-based open-vocabulary object detection…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-09 · Prakash Chandra Chhipa, Kanjar De, Meenakshi Subhash Chippa, Rajkumar Saini, Marcus Liwicki