Related papers: SelfDoc: Self-Supervised Document Representation L…

Unified Pretraining Framework for Document Understanding

Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened up promising directions…

Computation and Language · Computer Science 2022-04-29 Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Nikolaos Barmpalios, Rajiv Jain, Ani Nenkova, Tong Sun

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream task. In this paper, we approach the document classification problem…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades

Multimodal Pre-training Based on Graph Attention Network for Document Understanding

Document intelligence as a relatively new research topic supports many business applications. Its main task is to automatically read, understand, and analyze documents. However, due to the diversity of formats (invoices, reports, forms,…

Computer Vision and Pattern Recognition · Computer Science 2022-10-25 Zhenrong Zhang, Jiefeng Ma, Jun Du, Licheng Wang, Jianshu Zhang

GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós

Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning

Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal…

Computation and Language · Computer Science 2022-01-06 Subhojeet Pramanik, Shashank Mujumdar, Hima Patel

XDoc: Unified Pre-training for Cross-Format Document Understanding

The surge of pre-training has witnessed the rapid development of document understanding recently. Pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts,…

Computation and Language · Computer Science 2022-10-07 Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei

Deep Learning for Technical Document Classification

In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and…

Machine Learning · Computer Science 2025-10-31 Shuo Jiang, Jie Hu, Christopher L. Magee, Jianxi Luo

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard…

Computer Vision and Pattern Recognition · Computer Science 2025-06-19 Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si

$FastDoc$: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy

In this paper, we propose $FastDoc$ (Fast Continual Pre-training Technique using Document Level Metadata and Taxonomy), a novel, compute-efficient framework that utilizes Document metadata and Domain-Specific Taxonomy as supervision signals…

Computation and Language · Computer Science 2024-11-04 Abhilash Nandy, Manav Nitin Kapadnis, Sohan Patnaik, Yash Parag Butala, Pawan Goyal, Niloy Ganguly

CogDoc: Towards Unified thinking in Documents

Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a…

Computer Vision and Pattern Recognition · Computer Science 2025-12-16 Qixin Xu, Haozhe Wang, Che Liu, Fangzhen Lin, Wenhu Chen

SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation

Document layout analysis is a known problem to the documents research community and has been vastly explored yielding a multitude of solutions ranging from text mining, and recognition to graph-based representation, visual feature…

Computer Vision and Pattern Recognition · Computer Science 2023-08-22 Subhajit Maity, Sanket Biswas, Siladittya Manna, Ayan Banerjee, Josep Lladós, Saumik Bhattacharya, Umapada Pal

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are…

Computation and Language · Computer Science 2024-11-12 Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing

MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding

Document images are a ubiquitous source of data where the text is organized in a complex hierarchical structure ranging from fine granularity (e.g., words), medium granularity (e.g., regions such as paragraphs or figures), to coarse…

Computer Vision and Pattern Recognition · Computer Science 2022-11-29 Zilong Wang, Jiuxiang Gu, Chris Tensmeyer, Nikolaos Barmpalios, Ani Nenkova, Tong Sun, Jingbo Shang, Vlad I. Morariu

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited to effectively utilizing the immense representation capabilities and…

Artificial Intelligence · Computer Science 2023-09-06 Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang

Robust Document Representations using Latent Topics and Metadata

Task specific fine-tuning of a pre-trained neural language model using a custom softmax output layer is the de facto approach of late when dealing with document classification problems. This technique is not adequate when labeled examples…

Computation and Language · Computer Science 2020-10-27 Natraj Raman, Armineh Nourbakhsh, Sameena Shah, Manuela Veloso

Multi-Modal Representation Learning with Text-Driven Soft Masks

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Jaeyoo Park, Bohyung Han

Self-Supervised Representation Learning on Document Images

This work analyses the impact of self-supervised pre-training on document images in the context of document image classification. While previous approaches explore the effect of self-supervision on natural images, we show that patch-based…

Computer Vision and Pattern Recognition · Computer Science 2020-05-28 Adrian Cosma, Mihai Ghidoveanu, Michael Panaitescu-Liess, Marius Popescu

Multimodal Self-Supervised Learning for Medical Image Analysis

Self-supervised learning approaches leverage unlabeled samples to acquire generic knowledge about different concepts, hence allowing for annotation-efficient downstream task learning. In this paper, we propose a novel self-supervised method…

Computer Vision and Pattern Recognition · Computer Science 2020-10-27 Aiham Taleb, Christoph Lippert, Tassilo Klein, Moin Nabi

Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning

Self-supervised learning is an efficient pre-training method for medical image analysis. However, current research is mostly confined to specific-modality data pre-training, consuming considerable time and resources without achieving…

Computer Vision and Pattern Recognition · Computer Science 2023-12-01 Yiwen Ye, Yutong Xie, Jianpeng Zhang, Ziyang Chen, Qi Wu, Yong Xia

FlexDoc: Flexible Document Adaptation through Optimizing both Content and Layout

Designing adaptive documents that are visually appealing across various devices and for diverse viewers is a challenging task. This is due to the wide variety of devices and different viewer requirements and preferences. Alterations to a…

Human-Computer Interaction · Computer Science 2024-10-22 Yue Jiang, Christof Lutteroth, Rajiv Jain, Christopher Tensmeyer, Varun Manjunatha, Wolfgang Stuerzlinger, Vlad Morariu