Related papers: Enhancing Document Key Information Localization Th…

VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding

Visually Rich Document Understanding (VRDU) has emerged as a critical field in document intelligence, enabling automated extraction of key information from complex documents across domains such as medical, financial, and educational…

Computer Vision and Pattern Recognition · Computer Science 2025-06-03 Yihao Ding, Soyeon Caren Han, Yan Li, Josiah Poon

Deep Learning based Visually Rich Document Content Understanding: A Survey

Visually Rich Documents (VRDs) play a vital role in domains such as academia, finance, healthcare, and marketing, as they convey information through a combination of text, layout, and visual elements. Traditional approaches to extracting…

Computation and Language · Computer Science 2025-06-23 Yihao Ding, Soyeon Caren Han, Jean Lee, Eduard Hovy

RDU: A Region-based Approach to Form-style Document Understanding

Key Information Extraction (KIE) is aimed at extracting structured information (e.g. key-value pairs) from form-style documents (e.g. invoices), which makes an important step towards intelligent document understanding. Previous approaches…

Artificial Intelligence · Computer Science 2022-06-15 Fengbin Zhu, Chao Wang, Wenqiang Lei, Ziyang Liu, Tat Seng Chua

ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training

Recent approaches for visually-rich document understanding (VrDU) use manually annotated semantic groups, where a semantic group encompasses all semantically relevant but not obviously grouped words. As OCR tools are unable to…

Computer Vision and Pattern Recognition · Computer Science 2024-10-17 Zhouqiang Jiang, Bowen Wang, Junhao Chen, Yuta Nakashima

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

Visually Rich Document Understanding (VRDU) has become a pivotal area of research, driven by the need to automatically interpret documents that contain intricate visual, textual, and structural elements. Recently, Multimodal Large Language…

Computer Vision and Pattern Recognition · Computer Science 2026-04-22 Yihao Ding, Siwen Luo, Yue Dai, Yanbei Jiang, Zechuan Li, Qiang Sun, Geoffrey Martin, Wei Liu, Yifan Peng

Object Detection Based Handwriting Localization

We present an object detection based approach to localize handwritten regions from documents, which initially aims to enhance the anonymization during the data transmission. The concatenated fusion of original and preprocessed images…

Computer Vision and Pattern Recognition · Computer Science 2026-02-23 Yuli Wu, Yucheng Hu, Suting Miao

VRDU: A Benchmark for Visually-rich Document Understanding

Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention both in academia and industry. Although recent multi-modal language models have achieved impressive…

Computation and Language · Computer Science 2023-09-19 Zilong Wang, Yichao Zhou, Wei Wei, Chen-Yu Lee, Sandeep Tata

MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Multimodal pre-training with text, layout, and image has made significant progress for Visually Rich Document Understanding (VRDU), especially the fixed-layout documents such as scanned document images. While, there are still a large number…

Computation and Language · Computer Science 2022-03-14 Junlong Li, Yiheng Xu, Lei Cui, Furu Wei

Modeling Visual Context is Key to Augmenting Object Detection Datasets

Performing data augmentation for learning deep neural networks is well known to be important for training visual recognition systems. By artificially increasing the number of training examples, it helps reducing overfitting and improves…

Computer Vision and Pattern Recognition · Computer Science 2018-07-20 Nikita Dvornik, Julien Mairal, Cordelia Schmid

Including Keyword Position in Image-based Models for Act Segmentation of Historical Registers

The segmentation of complex images into semantic regions has seen a growing interest these last years with the advent of Deep Learning. Until recently, most existing methods for Historical Document Analysis focused on the visual appearance…

Computer Vision and Pattern Recognition · Computer Science 2021-11-03 Mélodie Boillet, Martin Maarand, Thierry Paquet, Christopher Kermorvant

DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading

The use of visually-rich documents (VRDs) in various fields has created a demand for Document AI models that can read and comprehend documents like humans, which requires the overcoming of technical, linguistic, and cognitive barriers.…

Human-Computer Interaction · Computer Science 2023-10-24 Hao Wang, Qingxuan Wang, Yue Li, Changqing Wang, Chenhui Chu, Rui Wang

GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation

Object detection in documents is a key step to automate the structural elements identification process in a digital or scanned document through understanding the hierarchical structure and relationships between different elements. Large and…

Computer Vision and Pattern Recognition · Computer Science 2024-02-21 Ayan Banerjee, Sanket Biswas, Josep Lladós, Umapada Pal

Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition

Handwritten text and scene text suffer from various shapes and distorted patterns. Thus training a robust recognition model requires a large amount of data to cover diversity as much as possible. In contrast to data collection and…

Computer Vision and Pattern Recognition · Computer Science 2020-03-17 Canjie Luo, Yuanzhi Zhu, Lianwen Jin, Yongpan Wang

Graphical Object Detection in Document Images

Graphical elements: particularly tables and figures contain a visual summary of the most valuable information contained in a document. Therefore, localization of such graphical objects in the document images is the initial step to…

Computer Vision and Pattern Recognition · Computer Science 2020-08-26 Ranajit Saha, Ajoy Mondal, C. V. Jawahar

DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models

Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance generalizability of small VDU models by…

Computer Vision and Pattern Recognition · Computer Science 2024-10-07 Sungnyun Kim, Haofu Liao, Srikar Appalaraju, Peng Tang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan, Stefano Soatto

Augmented Reality Meets Computer Vision : Efficient Data Generation for Urban Driving Scenes

The success of deep learning in computer vision is based on availability of large annotated datasets. To lower the need for hand labeled images, virtually rendered 3D worlds have recently gained popularity. Creating realistic 3D content is…

Computer Vision and Pattern Recognition · Computer Science 2017-08-07 Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, Carsten Rother

Document Expansion by Query Prediction

One technique to improve the retrieval effectiveness of a search engine is to expand documents with terms that are related or representative of the documents' content. From the perspective of a question answering system, this might comprise…

Information Retrieval · Computer Science 2019-09-26 Rodrigo Nogueira, Wei Yang, Jimmy Lin, Kyunghyun Cho

Document Optimization for Black-Box Retrieval via Reinforcement Learning

Document expansion is a classical technique for improving retrieval quality, and is attractive since it shifts computation offline, avoiding additional query-time processing. However, when applied to modern retrievers, it has been shown to…

Computation and Language · Computer Science 2026-04-08 Omri Uzan, Ron Polonsky, Douwe Kiela, Christopher Potts

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard…

Computer Vision and Pattern Recognition · Computer Science 2025-06-19 Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si

MATrIX -- Modality-Aware Transformer for Information eXtraction

We present MATrIX - a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs,…

Computer Vision and Pattern Recognition · Computer Science 2022-05-18 Thomas Delteil, Edouard Belval, Lei Chen, Luis Goncalves, Vijay Mahadevan