Related papers: Reengineering PDF-Based Documents Targeting Comple…

200 papers

Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due…

Digital Libraries · Computer Science 2022-07-14 Christoph Auer, Michele Dolfi, André Carvalho, Cesar Berrospi Ramis, Peter W. J. Staar

Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG).…

Understanding and extracting information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. Such large documents may be…

Computation and Language · Computer Science 2019-10-10 Muhammad Mahbubur Rahman, Tim Finin

Automated testing plays a crucial role in ensuring software security. It heavily relies on formal specifications to validate the correctness of the system behavior. However, the main approach to defining these formal specifications is…

Software Engineering · Computer Science 2025-04-03 Hui Li, Zhen Dong, Siao Wang, Hui Zhang, Liwei Shen, Xin Peng, Dongdong She

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and…

Machine Learning · Computer Science 2025-07-15 Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed

Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily…

Artificial Intelligence · Computer Science 2026-04-10 Gyuho Shim, Seongtae Hong, Heuiseok Lim

Understanding large, structured documents like scholarly articles, requests for proposals or business reports is a complex and difficult task. It involves discovering a document's overall purpose and subject(s), understanding the function…

Computation and Language · Computer Science 2018-07-27 Muhammad Mahbubur Rahman, Tim Finin

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To…

Computation and Language · Computer Science 2026-04-15 Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon

Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character…

Computer Vision and Pattern Recognition · Computer Science 2025-06-16 Rasha Sinha, Rekha B S

Architecture optimization is the process of automatically generating design options, typically to enhance software's quantifiable quality attributes, such as performance and reliability. Multi-objective optimization approaches have been…

Software Engineering · Computer Science 2024-01-31 Daniele Di Pompeo, Michele Tucci

The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Kun Qian, Wenjie Li, Tianyu Sun, Wenhong Wang, Wenhan Luo

The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document…

Refactoring is an established technique from the object-oriented (OO) programming community to restructure code: it aims at improving software readability, maintainability and extensibility. Although refactoring is not tied to the…

Software Engineering · Computer Science 2007-05-23 Alexander Serebrenik, Tom Schrijvers, Bart Demoen

Delivering high-quality content is crucial for effective reading comprehension and successful learning. Ensuring educational materials are interpreted as intended by their authors is a persistent challenge, especially with the added…

Computers and Society · Computer Science 2024-12-17 Madjid Sadallah

Understanding or comprehending source code is one of the core activities of software engineering. Understanding object-oriented source code is essential and required when a programmer maintains, migrates, reuses, documents or enhances…

Software Engineering · Computer Science 2016-01-29 Ra'Fat AL-msie'deen

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science 2025-10-14 Zilong Wang, Xiaoyu Shen

Document content extraction is a critical task in computer vision, underpinning the data needs of large language models (LLMs) and retrieval-augmented generation (RAG) systems. Despite recent progress, current document parsing methods have…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this…

We propose Object-oriented Neural Programming (OONP), a framework for semantically parsing documents in specific domains. Basically, OONP reads a document and parses it into a predesigned object-oriented data structure (referred to as…

Machine Learning · Computer Science 2018-07-26 Zhengdong Lu, Xianggen Liu, Haotian Cui, Yukun Yan, Daqi Zheng

This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Bruno Rigal, Victor Dupriez, Alexis Mignon, Ronan Le Hy, Nicolas Mery