Related papers: Multimodal OCR: Parse Anything from Documents

dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-18 · Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent capabilities in literacy with rich structure and…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-11 · Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, Junyang Lin

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-18 · Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion,…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-26 · Qinwu Xu, Yifan Jiang, Haoyu Ren

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

We introduce MonkeyOCR, a document parsing model that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline and…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-10 · Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-17 · Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai

GLM-OCR Technical Report

GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance…

Computation and Language · Computer Science · 2026-03-17 · Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancement across several dimensions: By adopting Shifted Window Attention with zero-initialization, we achieve cross-window…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-18 · Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai

DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding

Form understanding depends on both textual contents and organizational structure. Although modern OCR performs well, it is still challenging to realize general form understanding because forms are commonly used and of various formats. The…

Computer Vision and Pattern Recognition · Computer Science · 2020-10-23 · Zilong Wang, Mingjie Zhan, Xuebo Liu, Ding Liang

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach involves utilizing machine…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-01 · Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, Wei Tang

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications…

Computation and Language · Computer Science · 2026-05-06 · Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu, Chunyi Peng, Zubao Qin, Ze Xu, Jianqiang Wan, Jun Tang, Zhibo Yang, Shuai Bai, Dayiheng Liu

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-04 · Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream task. In this paper, we approach the document classification problem…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-12 · Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades

LOCR: Location-Guided Transformer for Optical Character Recognition

Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-05 · Yu Sun, Dongzhan Zhou, Chen Lin, Conghui He, Wanli Ouyang, Han-Sen Zhong

Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key…

Computation and Language · Computer Science · 2026-04-21 · Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, Mingming Gong

GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-06 · Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós

DocMMIR: A Framework for Document Multi-modal Information Retrieval

The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack…

Information Retrieval · Computer Science · 2025-10-20 · Zirui Li, Siwei Wu, Yizhi Li, Xingyu Wang, Yi Zhou, Chenghua Lin

MMDocIR: Benchmarking Multimodal Retrieval for Long Documents

Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of…

Information Retrieval · Computer Science · 2025-11-10 · Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu

Multimodal deep networks for text and image-based document classification

Classification of document images is a critical step for archival of old manuscripts, online subscription and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based…

Computer Vision and Pattern Recognition · Computer Science · 2019-07-16 · Nicolas Audebert, Catherine Herold, Kuider Slimani, Cédric Vidal

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when…

Computation and Language · Computer Science · 2025-07-03 · Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, Luca Soldaini