Related papers: Visually Guided Generative Text-Layout Pre-trainin…

MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Multimodal pre-training with text, layout, and image has made significant progress for Visually Rich Document Understanding (VRDU), especially the fixed-layout documents such as scanned document images. While, there are still a large number…

Computation and Language · Computer Science 2022-03-14 Junlong Li, Yiheng Xu, Lei Cui, Furu Wei

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while…

Computation and Language · Computer Science 2020-06-17 Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou

Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation

Vision-Language Pretraining (VLP) has demonstrated remarkable capabilities in learning visual representations from textual descriptions of images without annotations. Yet, effective VLP demands large-scale image-text pairs, a resource that…

Computer Vision and Pattern Recognition · Computer Science 2023-06-09 Yinda Chen, Che Liu, Wei Huang, Sibo Cheng, Rossella Arcucci, Zhiwei Xiong

ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

Vision-language pre-training (VLP) methods are blossoming recently, and its crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of…

Computer Vision and Pattern Recognition · Computer Science 2023-09-01 Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun

Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the…

Computation and Language · Computer Science 2022-12-20 Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu

Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data,…

Computer Vision and Pattern Recognition · Computer Science 2022-03-02 Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang

Vision Grid Transformer for Document Layout Analysis

Document pre-trained models and grid-based models have proven to be very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a…

Computer Vision and Pattern Recognition · Computer Science 2023-08-30 Cheng Da, Chuwei Luo, Qi Zheng, Cong Yao

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding

This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained…

Computation and Language · Computer Science 2024-03-22 Masato Fujitake

Let ViT Speak: Generative Language-Image Pre-training

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models…

Computer Vision and Pattern Recognition · Computer Science 2026-05-04 Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei

Position-guided Text Prompt for Vision-Language Pre-training

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization…

Computer Vision and Pattern Recognition · Computer Science 2023-06-08 Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard…

Computer Vision and Pattern Recognition · Computer Science 2025-06-19 Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Visually-rich Document Understanding (VrDU) has attracted much research attention over the past years. Pre-trained models on a large number of document images with transformer-based backbones have led to significant performance gains in…

Computer Vision and Pattern Recognition · Computer Science 2023-06-12 Yi Tu, Ya Guo, Huan Chen, Jinyang Tang

DocLLM: A layout-aware generative language model for multimodal document understanding

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a…

Computation and Language · Computer Science 2024-01-03 Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, Xiaomo Liu

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from computational efficiency brought by the…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Chaoya Jiang, Haiyang Xu, Chenliang Li, Miang Yan, Wei Ye, Shikun Zhang, Bin Bi, Songfang Huang

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an unified autoregressive model capable of generating…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu

ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training

Recent approaches for visually-rich document understanding (VrDU) use manually annotated semantic groups, where a semantic group encompasses all semantically relevant but not obviously grouped words. As OCR tools are unable to…

Computer Vision and Pattern Recognition · Computer Science 2024-10-17 Zhouqiang Jiang, Bowen Wang, Junhao Chen, Yuta Nakashima

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science 2021-03-16 Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang

Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown…

Computation and Language · Computer Science 2026-04-07 Haruka Kawasaki, Ryota Tanaka, Kyosuke Nishida

3D Vision and Language Pretraining with Large-Scale Synthetic Data

3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-train model which can bridge 3D scenes with natural language, which is an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, Yang Liu

3D Scene Graph Guided Vision-Language Pre-training

3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions. Existing approaches typically follow task-specific, highly specialized paradigms.…

Computer Vision and Pattern Recognition · Computer Science 2024-12-02 Hao Liu, Yanni Ma, Yan Liu, Haihong Xiao, Ying He