Related papers: DocFormerv2: Local Features for Document Understan…

200 papers

We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and…

Computer Vision and Pattern Recognition · Computer Science 2021-09-21 Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós

Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods…

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard…

Computer Vision and Pattern Recognition · Computer Science 2025-06-19 Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the…

Computation and Language · Computer Science 2022-12-20 Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet, effective adaptation of such representations to distribution shifts at test-time remains to be…

Computer Vision and Pattern Recognition · Computer Science 2023-08-25 Sayna Ebrahimi, Sercan O. Arik, Tomas Pfister

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose…

Computation and Language · Computer Science 2022-01-11 Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou

While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

We present MATrIX - a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs,…

Computer Vision and Pattern Recognition · Computer Science 2022-05-18 Thomas Delteil, Edouard Belval, Lei Chen, Luis Goncalves, Vijay Mahadevan

In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language…

Computer Vision and Pattern Recognition · Computer Science 2023-03-02 Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Since their release, Transformers have revolutionized many fields from Natural Language Understanding to Computer Vision. Document Understanding (DU) was not left behind with first Transformer based models for DU dating from late 2019.…

Computation and Language · Computer Science 2023-09-12 Thibault Douzon, Stefan Duffner, Christophe Garcia, Jérémy Espinas

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki

Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance generalizability of small VDU models by…

Computer Vision and Pattern Recognition · Computer Science 2024-10-07 Sungnyun Kim, Haofu Liao, Srikar Appalaraju, Peng Tang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan, Stefano Soatto

Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal…

Computation and Language · Computer Science 2022-01-06 Subhojeet Pramanik, Shashank Mujumdar, Hima Patel

Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao, Dingkang Yang, Yongkun Du, Xuecheng Wu, Jingqun Tang, Yang Liu, Hong Chen, Can Huang

Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream task. In this paper, we approach the document classification problem…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades

In recent years, research on visual document understanding (VDU) has grown significantly, with a particular emphasis on the development of self-supervised learning methods. However, one of the significant challenges faced in this field is…

Computer Vision and Pattern Recognition · Computer Science 2023-05-03 Donghyun Kim, Teakgyu Hong, Moonbin Yim, Yoonsik Kim, Geewook Kim

Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, with increasing research focused on enhancing VQA accuracy through advanced models such as Transformers. Despite this growing interest, limited work has…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Zhilin Zhang, Fangyu Wu

This paper introduces a deep learning model tailored for document information analysis, emphasizing document classification, entity relation extraction, and document visual question answering. The proposed model leverages transformer-based…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 Tofik Ali, Partha Pratim Roy

Recently, the advent of Large Visual-Language Models (LVLMs) has received increasing attention across various domains, particularly in the field of visual document understanding (VDU). Different from conventional vision-language tasks, VDU…

Computer Vision and Pattern Recognition · Computer Science 2024-03-01 Xin Li, Yunfei Wu, Xinghua Jiang, Zhihao Guo, Mingming Gong, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun