Related papers: DoPTA: Improving Document Layout Analysis using Pa…

Multilevel Text Alignment with Cross-Document Attention

Text alignment finds application in tasks such as citation recommendation and plagiarism detection. Existing alignment methods operate at a single, predefined level and cannot learn to align texts at, for example, sentence and document…

Computation and Language · Computer Science 2020-10-06 Xuhui Zhou, Nikolaos Pappas, Noah A. Smith

Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models

Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of…

Computer Vision and Pattern Recognition · Computer Science 2022-12-02 Lei Wang, Jiabang He, Xing Xu, Ning Liu, Hui Liu

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship. With the advent of the multimedia information age, image and text data show explosive growth, and…

Machine Learning · Computer Science 2024-06-24 Jinyin Wang, Haijing Zhang, Yihao Zhong, Yingbin Liang, Rongwei Ji, Yiru Cang

DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Text-rich document understanding (TDU) requires comprehensive analysis of documents containing substantial textual content and complex layouts. While Multimodal Large Language Models (MLLMs) have achieved fast progress in this domain,…

Computer Vision and Pattern Recognition · Computer Science 2025-03-20 Wenhui Liao, Jiapeng Wang, Hongliang Li, Chengyu Wang, Jun Huang, Lianwen Jin

Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents

This paper introduces a deep learning model tailored for document information analysis, emphasizing document classification, entity relation extraction, and document visual question answering. The proposed model leverages transformer-based…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 Tofik Ali, Partha Pratim Roy

DocLLM: A layout-aware generative language model for multimodal document understanding

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a…

Computation and Language · Computer Science 2024-01-03 Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, Xiaomo Liu

DUET: Detection Utilizing Enhancement for Text in Scanned or Captured Documents

We present a novel deep neural model for text detection in document images. For robust text detection in noisy scanned documents, the advantages of multi-task learning are adopted by adding an auxiliary task of text enhancement. Namely, our…

Computer Vision and Pattern Recognition · Computer Science 2021-06-11 Eun-Soo Jung, HyeongGwan Son, Kyusam Oh, Yongkeun Yun, Soonhwan Kwon, Min Soo Kim

dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational…

Computer Vision and Pattern Recognition · Computer Science 2025-12-18 Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang

Learning Multimodal Affinities for Textual Editing in Images

Nowadays, as cameras are rapidly adopted in our daily routine, images of documents are becoming both abundant and prevalent. Unlike natural images that capture physical objects, document-images contain a significant amount of text with…

Computer Vision and Pattern Recognition · Computer Science 2021-03-19 Or Perel, Oron Anschel, Omri Ben-Eliezer, Shai Mazor, Hadar Averbuch-Elor

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu

DocAtlas: Multilingual Document Understanding Across 80+ Languages

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs…

Computation and Language · Computer Science 2026-05-22 Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

OPAD: An Optimized Policy-based Active Learning Framework for Document Content Analysis

Documents are central to many business systems, and include forms, reports, contracts, invoices or purchase orders. The information in documents is typically in natural language, but can be organized in various layouts and formats. There…

Information Retrieval · Computer Science 2021-10-08 Sumit Shekhar, Bhanu Prakash Reddy Guda, Ashutosh Chaubey, Ishan Jindal, Avneet Jain

DocTTT: Test-Time Training for Handwritten Document Recognition Using Meta-Auxiliary Learning

Despite recent significant advancements in Handwritten Document Recognition (HDR), the efficient and accurate recognition of text against complex backgrounds, diverse handwriting styles, and varying document layouts remains a practical…

Computer Vision and Pattern Recognition · Computer Science 2025-01-23 Wenhao Gu, Li Gu, Ziqiang Wang, Ching Yee Suen, Yang Wang

DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation

Automating the annotation of scanned documents is challenging, requiring a balance between computational efficiency and accuracy. DocParseNet addresses this by combining deep learning and multi-modal learning to process both text and visual…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Ahmad Mohammadshirazi, Ali Nosrati Firoozsalari, Mengxi Zhou, Dheeraj Kulshrestha, Rajiv Ramnath

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, Lijuan Wang

Deep Unrestricted Document Image Rectification

In recent years, tremendous efforts have been made on document image rectification, but existing advanced algorithms are limited to processing restricted document images, i.e., the input images must incorporate a complete document. Once the…

Computer Vision and Pattern Recognition · Computer Science 2023-12-19 Hao Feng, Shaokai Liu, Jiajun Deng, Wengang Zhou, Houqiang Li

COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment

Vision-Language Pre-training (VLP) methods based on object detection enjoy the rich knowledge of fine-grained object-text alignment but at the cost of computationally expensive inference. Recent Visual-Transformer (ViT)-based approaches…

Multimedia · Computer Science 2024-02-27 Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Ji Zhang, Fei Huang

OLALA: Object-Level Active Learning for Efficient Document Layout Annotation

Document images often have intricate layout structures, with numerous content regions (e.g. texts, figures, tables) densely arranged on each page. This makes the manual annotation of layout datasets expensive and inefficient. These…

Machine Learning · Computer Science 2021-03-31 Zejiang Shen, Jian Zhao, Melissa Dell, Yaoliang Yu, Weining Li

DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection

The prosperity of deep learning contributes to the rapid progress in scene text detection. Among all the methods with convolutional networks, segmentation-based ones have drawn extensive attention due to their superiority in detecting text…

Computer Vision and Pattern Recognition · Computer Science 2022-08-23 Jingyu Lin, Jie Jiang, Yan Yan, Chunchao Guo, Hongfa Wang, Wei Liu, Hanzi Wang

CDA: a Cost Efficient Content-based Multilingual Web Document Aligner

We introduce a Content-based Document Alignment approach (CDA), an efficient method to align multilingual web documents based on content in creating parallel training data for machine translation (MT) systems operating at the industrial…

Computation and Language · Computer Science 2021-02-23 Thuy Vu, Alessandro Moschitti