Related papers: DocFormerv2: Local Features for Document Understan…

DocFormer: End-to-End Transformer for Document Understanding

We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and…

Computer Vision and Pattern Recognition · Computer Science 2021-09-21 Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós

OCR-free Document Understanding Transformer

Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods…

Machine Learning · Computer Science 2022-10-07 Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard…

Computer Vision and Pattern Recognition · Computer Science 2025-06-19 Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si

Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the…

Computation and Language · Computer Science 2022-12-20 Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu

Test-Time Adaptation for Visual Document Understanding

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet, effective adaptation of such representations to distribution shifts at test-time remains to be…

Computer Vision and Pattern Recognition · Computer Science 2023-08-25 Sayna Ebrahimi, Sercan O. Arik, Tomas Pfister

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose…

Computation and Language · Computer Science 2022-01-11 Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

MATrIX -- Modality-Aware Transformer for Information eXtraction

We present MATrIX - a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs,…

Computer Vision and Pattern Recognition · Computer Science 2022-05-18 Thomas Delteil, Edouard Belval, Lei Chen, Luis Goncalves, Vijay Mahadevan

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language…

Computer Vision and Pattern Recognition · Computer Science 2023-03-02 Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Long-Range Transformer Architectures for Document Understanding

Since their release, Transformers have revolutionized many fields from Natural Language Understanding to Computer Vision. Document Understanding (DU) was not left behind with first Transformer based models for DU dating from late 2019.…

Computation and Language · Computer Science 2023-09-12 Thibault Douzon, Stefan Duffner, Christophe Garcia, Jérémy Espinas

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki

DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models

Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance generalizability of small VDU models by…

Computer Vision and Pattern Recognition · Computer Science 2024-10-07 Sungnyun Kim, Haofu Liao, Srikar Appalaraju, Peng Tang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan, Stefano Soatto

Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning

Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal…

Computation and Language · Computer Science 2022-01-06 Subhojeet Pramanik, Shashank Mujumdar, Hima Patel

Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting

Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao, Dingkang Yang, Yongkun Du, Xuecheng Wu, Jingqun Tang, Yang Liu, Hong Chen, Can Huang

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream task. In this paper, we approach the document classification problem…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades

On Web-based Visual Corpus Construction for Visual Document Understanding

In recent years, research on visual document understanding (VDU) has grown significantly, with a particular emphasis on the development of self-supervised learning methods. However, one of the significant challenges faced in this field is…

Computer Vision and Pattern Recognition · Computer Science 2023-05-03 Donghyun Kim, Teakgyu Hong, Moonbin Yim, Yoonsik Kim, Geewook Kim

Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach

Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, with increasing research focused on enhancing VQA accuracy through advanced models such as Transformers. Despite this growing interest, limited work has…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Zhilin Zhang, Fangyu Wu

Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents

This paper introduces a deep learning model tailored for document information analysis, emphasizing document classification, entity relation extraction, and document visual question answering. The proposed model leverages transformer-based…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 Tofik Ali, Partha Pratim Roy

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

Recently, the advent of Large Visual-Language Models (LVLMs) has received increasing attention across various domains, particularly in the field of visual document understanding (VDU). Different from conventional vision-language tasks, VDU…

Computer Vision and Pattern Recognition · Computer Science 2024-03-01 Xin Li, Yunfei Wu, Xinghua Jiang, Zhihao Guo, Mingming Gong, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun