Related papers: Multimodal OCR: Parse Anything from Documents

200 papers

Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational…

Computer Vision and Pattern Recognition · Computer Science 2025-12-18 Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang

Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent capabilities in literacy with rich structure and…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, Junyang Lin

Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts…

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion,…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Qinwu Xu, Yifan Jiang, Haoyu Ren

We introduce MonkeyOCR, a document parsing model that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline and…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai

GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance…

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancement across several dimensions: By adopting Shifted Window Attention with zero-initialization, we achieve cross-window…

Computer Vision and Pattern Recognition · Computer Science 2024-03-18 Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai

Form understanding depends on both textual contents and organizational structure. Although modern OCR performs well, it is still challenging to realize general form understanding because forms are commonly used and of various formats. The…

Computer Vision and Pattern Recognition · Computer Science 2020-10-23 Zilong Wang, Mingjie Zhan, Xuebo Liu, Ding Liang

Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach involves utilizing machine…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, Wei Tang

Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications…

Computation and Language · Computer Science 2026-05-06 Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu, Chunyi Peng, Zubao Qin, Ze Xu, Jianqiang Wan, Jun Tang, Zhibo Yang, Shuai Bai, Dayiheng Liu

Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream task. In this paper, we approach the document classification problem…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades

Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based…

Computer Vision and Pattern Recognition · Computer Science 2024-03-05 Yu Sun, Dongzhan Zhou, Chen Lin, Conghui He, Wanli Ouyang, Han-Sen Zhong

Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key…

Computation and Language · Computer Science 2026-04-21 Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, Mingming Gong

Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós

The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack…

Information Retrieval · Computer Science 2025-10-20 Zirui Li, Siwei Wu, Yizhi Li, Xingyu Wang, Yi Zhou, Chenghua Lin

Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of…

Information Retrieval · Computer Science 2025-11-10 Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu

Classification of document images is a critical step for archival of old manuscripts, online subscription and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based…

Computer Vision and Pattern Recognition · Computer Science 2019-07-16 Nicolas Audebert, Catherine Herold, Kuider Slimani, Cédric Vidal

PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when…
