Related papers: GLM-OCR Technical Report

Multimodal OCR: Parse Anything from Documents

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei ma, Yu Chen, Yuqiu Ji, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai

Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation

This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks,…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi Zhang, Lianwen Jin

PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu

Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models

Financial documents are essential sources of information for regulators, auditors, and financial institutions, particularly for assessing the wealth and compliance of Small and Medium-sized Businesses. However, SMB documents are often…

Information Retrieval · Computer Science 2025-10-28 Yichao Jin, Yushuo Wang, Qishuai Zhong, Kent Chiu Jin-Chun, Kenneth Zhu Ke, Donald MacDonald

Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science 2025-10-14 Zilong Wang, Xiaoyu Shen

OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only…

Computation and Language · Computer Science 2026-03-04 Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier

Advanced ingestion process powered by LLM parsing for RAG system

Retrieval Augmented Generation (RAG) systems struggle with processing multimodal documents of varying structural complexity. This paper introduces a novel multi-strategy parsing approach using LLM-powered OCR to extract content from diverse…

Computation and Language · Computer Science 2024-12-23 Arnau Perez, Xavier Vizcaino

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors…

Computer Vision and Pattern Recognition · Computer Science 2026-04-16 Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Lianyong Qi, Shi Jin

PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan

FireRed-OCR Technical Report

We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, Yidong Ma, Zhichao Sun, Changhao Qiao

Ocean-OCR: Towards General OCR Application via a Vision-Language Model

Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite the rapid progress made previously, insufficient OCR…

Computer Vision and Pattern Recognition · Computer Science 2025-01-28 Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, Jianhua Xu, Zenan Zhou, Weipeng Chen

Digitization of Document and Information Extraction using OCR

Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character…

Computer Vision and Pattern Recognition · Computer Science 2025-06-16 Rasha Sinha, Rekha B S

LMV-RPA: Large Model Voting-based Robotic Process Automation

Automating high-volume unstructured data processing is essential for operational efficiency. Optical Character Recognition (OCR) is critical but often struggles with accuracy and efficiency in complex layouts and ambiguous text. These…

Robotics · Computer Science 2025-04-29 Osama Abdellatif, Ahmed Ayman, Ali Hamdi

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications…

Computation and Language · Computer Science 2026-05-06 Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu, Chunyi Peng, Zubao Qin, Ze Xu, Jianqiang Wan, Jun Tang, Zhibo Yang, Shuai Bai, Dayiheng Liu

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent capabilities in literacy with rich structure and…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, Junyang Lin

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when…

Computation and Language · Computer Science 2025-07-03 Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, Luca Soldaini

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

DocVLM: Make Your VLM an Efficient Reader

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion,…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Qinwu Xu, Yifan Jiang, Haoyu Ren

PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System

Optical character recognition (OCR) technology has been widely used in various scenes, as shown in Figure 1. Designing a practical OCR system is still a meaningful but challenging task. In previous work, considering the efficiency and…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, Yanjun Ma