Related papers: GLM-OCR Technical Report

200 papers

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped…

This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks,…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-31 · Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi Zhang, Lianwen Jin

The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-26 · Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu

Financial documents are essential sources of information for regulators, auditors, and financial institutions, particularly for assessing the wealth and compliance of Small and Medium-sized Businesses. However, SMB documents are often…

Information Retrieval · Computer Science · 2025-10-28 · Yichao Jin, Yushuo Wang, Qishuai Zhong, Kent Chiu Jin-Chun, Kenneth Zhu Ke, Donald MacDonald

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science · 2025-10-14 · Zilong Wang, Xiaoyu Shen

Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only…

Computation and Language · Computer Science · 2026-03-04 · Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier

Retrieval Augmented Generation (RAG) systems struggle with processing multimodal documents of varying structural complexity. This paper introduces a novel multi-strategy parsing approach using LLM-powered OCR to extract content from diverse…

Computation and Language · Computer Science · 2024-12-23 · Arnau Perez, Xavier Vizcaino

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-16 · Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Lianyong Qi, Shi Jin

In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-24 · Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan

We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural…

Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite the rapid progress made previously, insufficient OCR…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-28 · Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, Jianhua Xu, Zenan Zhou, Weipeng Chen

Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-16 · Rasha Sinha, Rekha B S

Automating high-volume unstructured data processing is essential for operational efficiency. Optical Character Recognition (OCR) is critical but often struggles with accuracy and efficiency in complex layouts and ambiguous text. These…

Robotics · Computer Science · 2025-04-29 · Osama Abdellatif, Ahmed Ayman, Ali Hamdi

Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications…

Computation and Language · Computer Science · 2026-05-06 · Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu, Chunyi Peng, Zubao Qin, Ze Xu, Jianqiang Wan, Jun Tang, Zhibo Yang, Shuai Bai, Dayiheng Liu

Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent capabilities in literacy with rich structure and…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-11 · Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, Junyang Lin

PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when…

In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-26 · Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-13 · Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion,…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-26 · Qinwu Xu, Yifan Jiang, Haoyu Ren

Optical character recognition (OCR) technology has been widely used in various scenes, as shown in Figure 1. Designing a practical OCR system is still a meaningful but challenging task. In previous work, considering the efficiency and…

Computer Vision and Pattern Recognition · Computer Science · 2022-06-15 · Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, Yanjun Ma