Related papers: Reengineering PDF-Based Documents Targeting Comple…

Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due…

Digital Libraries · Computer Science 2022-07-14 Christoph Auer, Michele Dolfi, André Carvalho, Cesar Berrospi Ramis, Peter W. J. Staar

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG).…

Multimedia · Computer Science 2026-04-07 Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, Wentao Zhang

Unfolding the Structure of a Document using Deep Learning

Understanding and extracting of information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. Such large documents may be…

Computation and Language · Computer Science 2019-10-10 Muhammad Mahbubur Rahman, Tim Finin

Extracting Formal Specifications from Documents Using LLMs for Automated Testing

Automated testing plays a crucial role in ensuring software security. It heavily relies on formal specifications to validate the correctness of the system behavior. However, the main approach to defining these formal specifications is…

Software Engineering · Computer Science 2025-04-03 Hui Li, Zhen Dong, Siao Wang, Hui Zhang, Liwei Shen, Xin Peng, Dongdong She

Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and…

Machine Learning · Computer Science 2025-07-15 Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed

Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily…

Artificial Intelligence · Computer Science 2026-04-10 Gyuho Shim, Seongtae Hong, Heuiseok Lim

Understanding and representing the semantics of large structured documents

Understanding large, structured documents like scholarly articles, requests for proposals or business reports is a complex and difficult task. It involves discovering a document's overall purpose and subject(s), understanding the function…

Computation and Language · Computer Science 2018-07-27 Muhammad Mahbubur Rahman, Tim Finin

Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To…

Computation and Language · Computer Science 2026-04-15 Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon

Digitization of Document and Information Extraction using OCR

Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character…

Computer Vision and Pattern Recognition · Computer Science 2025-06-16 Rasha Sinha, Rekha B S

Multi-objective Software Architecture Refactoring driven by Quality Attributes

Architecture optimization is the process of automatically generating design options, typically to enhance software's quantifiable quality attributes, such as performance and reliability. Multi-objective optimization approaches have been…

Software Engineering · Computer Science 2024-01-31 Daniele Di Pompeo, Michele Tucci

DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents

The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Kun Qian, Wenjie Li, Tianyu Sun, Wenhong Wang, Wenhan Luo

Robust PDF Document Conversion Using Recurrent Neural Networks

The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document…

Machine Learning · Computer Science 2021-02-19 Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar

Improving Prolog programs: Refactoring for Prolog

Refactoring is an established technique from the object-oriented (OO) programming community to restructure code: it aims at improving software readability, maintainability and extensibility. Although refactoring is not tied to the…

Software Engineering · Computer Science 2007-05-23 Alexander Serebrenik, Tom Schrijvers, Bart Demoen

User-Centered Course Reengineering: An Analytical Approach to Enhancing Reading Comprehension in Educational Content

Delivering high-quality content is crucial for effective reading comprehension and successful learning. Ensuring educational materials are interpreted as intended by their authors is a persistent challenge, especially with the added…

Computers and Society · Computer Science 2024-12-17 Madjid Sadallah

Visualizing Object-oriented Software for Understanding and Documentation

Understanding or comprehending source code is one of the core activities of software engineering. Understanding object-oriented source code is essential and required when a programmer maintains, migrates, reuses, documents or enhances…

Software Engineering · Computer Science 2016-01-29 Ra'Fat AL-msie'deen

Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science 2025-10-14 Zilong Wang, Xiaoyu Shen

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Document content extraction is a critical task in computer vision, underpinning the data needs of large language models (LLMs) and retrieval-augmented generation (RAG) systems. Despite recent progress, current document parsing methods have…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this…

Information Retrieval · Computer Science 2026-05-27 José Guilherme Marques dos Santos, Ricardo Yang, Rui Humberto Pereira, Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Luís Reis, Luís Paulo Reis, Pedro Pimenta, José Paulo Marques dos Santos

Object-oriented Neural Programming (OONP) for Document Understanding

We propose Object-oriented Neural Programming (OONP), a framework for semantically parsing documents in specific domains. Basically, OONP reads a document and parses it into a predesigned object-oriented data structure (referred to as…

Machine Learning · Computer Science 2018-07-26 Zhengdong Lu, Xianggen Liu, Haotian Cui, Yukun Yan, Daqi Zheng

Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion

This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Bruno Rigal, Victor Dupriez, Alexis Mignon, Ronan Le Hy, Nicolas Mery