Related papers: SelfDoc: Self-Supervised Document Representation L…
Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened up promising directions…
Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream task. In this paper, we approach the document classification problem…
Document intelligence as a relatively new research topic supports many business applications. Its main task is to automatically read, understand, and analyze documents. However, due to the diversity of formats (invoices, reports, forms,…
Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and…
Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal…
The surge of pre-training has witnessed the rapid development of document understanding recently. Pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts,…
In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and…
Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard…
In this paper, we propose $FastDoc$ (Fast Continual Pre-training Technique using Document Level Metadata and Taxonomy), a novel, compute-efficient framework that utilizes Document metadata and Domain-Specific Taxonomy as supervision signals…
Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a…
Document layout analysis is a known problem to the documents research community and has been vastly explored yielding a multitude of solutions ranging from text mining, and recognition to graph-based representation, visual feature…
The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are…
Document images are a ubiquitous source of data where the text is organized in a complex hierarchical structure ranging from fine granularity (e.g., words), medium granularity (e.g., regions such as paragraphs or figures), to coarse…
In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited to effectively utilizing the immense representation capabilities and…
Task specific fine-tuning of a pre-trained neural language model using a custom softmax output layer is the de facto approach of late when dealing with document classification problems. This technique is not adequate when labeled examples…
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text…
This work analyses the impact of self-supervised pre-training on document images in the context of document image classification. While previous approaches explore the effect of self-supervision on natural images, we show that patch-based…
Self-supervised learning approaches leverage unlabeled samples to acquire generic knowledge about different concepts, hence allowing for annotation-efficient downstream task learning. In this paper, we propose a novel self-supervised method…
Self-supervised learning is an efficient pre-training method for medical image analysis. However, current research is mostly confined to specific-modality data pre-training, consuming considerable time and resources without achieving…
Designing adaptive documents that are visually appealing across various devices and for diverse viewers is a challenging task. This is due to the wide variety of devices and different viewer requirements and preferences. Alterations to a…