Related papers: FireRed-OCR Technical Report

200 papers

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families…

Computation and Language · Computer Science 2026-05-18 Jonathan Steinberg, Oren Gal

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman

Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this…

Computer Vision and Pattern Recognition · Computer Science 2024-05-24 Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang

The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu

Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely…

Computer Vision and Pattern Recognition · Computer Science 2026-02-27 Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, Huaxiu Yao

GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance…

PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when…

Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts…

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped…

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science 2025-10-14 Zilong Wang, Xiaoyu Shen

Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to errors in character segmentation and devoid of the context needed to exploit language models. Advances in sequence to…

Computer Vision and Pattern Recognition · Computer Science 2025-09-01 Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora

Recent advances in vision-language models (VLMs) have enabled end-to-end document parsing and understanding, achieving strong performance on diverse optical character recognition (OCR) tasks. However, VLMs are prone to generate words that…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang

The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-02-05 Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, Zhixiong Zeng

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors…

Computer Vision and Pattern Recognition · Computer Science 2026-04-16 Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Lianyong Qi, Shi Jin

Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations…

Computer Vision and Pattern Recognition · Computer Science 2026-01-26 Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Wenhe Feng, Nicholas Yew Jin Tan, Seung Ki Moon

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of…

Machine Learning · Computer Science 2026-05-27 Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their…

Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent capabilities in literacy with rich structure and…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, Junyang Lin

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable…

Computation and Language · Computer Science 2026-04-28 Chengye Wang, Lin Fu, Zexi Kuang, Yilun Zhao