Related papers: FireRed-OCR Technical Report

Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families…

Computation and Language · Computer Science 2026-05-18 Jonathan Steinberg , Oren Gal

DocVLM: Make Your VLM an Efficient Reader

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Mor Shpigel Nacson , Aviad Aberdam , Roy Ganz , Elad Ben Avraham , Alona Golts , Yair Kittenplon , Shai Mazor , Ron Litman

Focus Anywhere for Fine-grained Multi-page Document Understanding

Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this…

Computer Vision and Pattern Recognition · Computer Science 2024-05-24 Chenglong Liu , Haoran Wei , Jinyue Chen , Lingyu Kong , Zheng Ge , Zining Zhu , Liang Zhao , Jianjian Sun , Chunrui Han , Xiangyu Zhang

PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Cheng Cui , Yubo Zhang , Ting Sun , Xueqing Wang , Hongen Liu , Manhui Lin , Yue Zhang , Tingquan Gao , Changda Zhou , Jiaxuan Liu , Zelun Zhang , Jing Zhang , Jun Zhang , Yi Liu

SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely…

Computer Vision and Pattern Recognition · Computer Science 2026-02-27 Yibo Peng , Peng Xia , Ding Zhong , Kaide Zeng , Siwei Han , Yiyang Zhou , Jiaqi Liu , Ruiyi Zhang , Huaxiu Yao

GLM-OCR Technical Report

GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance…

Computation and Language · Computer Science 2026-03-17 Shuaiqi Duan , Yadong Xue , Weihan Wang , Zhe Su , Huan Liu , Sheng Yang , Guobing Gan , Guo Wang , Zihan Wang , Shengdong Yan , Dexin Jin , Yuxuan Zhang , Guohong Wen , Yanfeng Wang , Yutao Zhang , Xiaohan Zhang , Wenyi Hong , Yukuo Cen , Da Yin , Bin Chen , Wenmeng Yu , Xiaotao Gu , Jie Tang

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when…

Computation and Language · Computer Science 2025-07-03 Jake Poznanski , Jon Borchardt , Jason Dunkelberger , Regan Huff , Daniel Lin , Aman Rangapur , Christopher Wilhelm , Kyle Lo , Luca Soldaini

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Jiarui Zhang , Yuliang Liu , Zijun Wu , Guosheng Pang , Zhili Ye , Yupei Zhong , Junteng Ma , Tao Wei , Haiyang Xu , Weikai Chen , Zeen Wang , Qiangjun Ji , Fanxi Zhou , Qi Zhang , Yuanrui Hu , Jiahao Liu , Zhang Li , Ziyang Zhang , Qiang Liu , Xiang Bai

Multimodal OCR: Parse Anything from Documents

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Handong Zheng , Yumeng Li , Kaile Zhang , Liang Xin , Guangwei Zhao , Hao Liu , Jiayu Chen , Jie Lou , Qi Fu , Rui Yang , Shuo Jiang , Weijian Luo , Weijie Su , Weijun Zhang , Xingyu Zhu , Yabin Li , Yiwei Ma , Yu Chen , Yuqiu Ji , Zhaohui Yu , Guang Yang , Colin Zhang , Lei Zhang , Yuliang Liu , Xiang Bai

Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science 2025-10-14 Zilong Wang , Xiaoyu Shen

Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR

Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to error in character segmentation, and devoid of context to exploit language models. Advances in sequence to…

Computer Vision and Pattern Recognition · Computer Science 2025-09-01 Shashank Vempati , Nishit Anand , Gaurav Talebailkar , Arpan Garai , Chetan Arora

DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Recent advances in vision-language models (VLMs) have enabled end-to-end document parsing and understanding, achieving strong performance on diverse optical character recognition (OCR) tasks. However, VLMs are prone to generate words that…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Qian Chen , Xianyin Zhang , Lifan Guo , Feng Chen , Chi Zhang

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-02-05 Yufeng Zhong , Lei Chen , Xuanle Zhao , Wenkang Han , Liming Zheng , Jing Huang , Deyang Jiang , Yilin Cao , Lin Ma , Zhixiong Zeng

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors…

Computer Vision and Pattern Recognition · Computer Science 2026-04-16 Weile Gong , Yiping Zuo , Zijian Lu , Xin He , Weibei Fan , Lianyong Qi , Shi Jin

A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model

Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations…

Computer Vision and Pattern Recognition · Computer Science 2026-01-26 Muhammad Tayyab Khan , Zane Yong , Lequn Chen , Wenhe Feng , Nicholas Yew Jin Tan , Seung Ki Moon

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Sankalp Nagaonkar , Augustya Sharma , Ashish Choithani , Ashutosh Trivedi

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of…

Machine Learning · Computer Science 2026-05-27 Mingxin Huang , Yongxin Shi , Dezhi Peng , Songxuan Lai , Zecheng Xie , Lianwen Jin

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their…

Computer Vision and Pattern Recognition · Computer Science 2025-06-06 Ling Fu , Zhebin Kuang , Jiajun Song , Mingxin Huang , Biao Yang , Yuzhe Li , Linghao Zhu , Qidi Luo , Xinyu Wang , Hao Lu , Zhang Li , Guozhi Tang , Bin Shan , Chunhui Lin , Qi Liu , Binghong Wu , Hao Feng , Hao Liu , Can Huang , Jingqun Tang , Wei Chen , Lianwen Jin , Yuliang Liu , Xiang Bai

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent capabilities in literacy with rich structure and…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Zhibo Yang , Jun Tang , Zhaohai Li , Pengfei Wang , Jianqiang Wan , Humen Zhong , Xuejing Liu , Mingkun Yang , Peng Wang , Shuai Bai , Lianwen Jin , Junyang Lin

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable…

Computation and Language · Computer Science 2026-04-28 Chengye Wang , Lin Fu , Zexi Kuang , Yilun Zhao