Related papers: TWIX: Automatically Reconstructing Structured Data…

Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science 2025-10-14 Zilong Wang, Xiaoyu Shen

Iterative Document-level Information Extraction via Imitation Learning

We present a novel iterative extraction model, IterX, for extracting complex relations, or templates (i.e., N-tuples representing a mapping from named slots to spans of text) within a document. Documents may feature zero or more instances…

Computation and Language · Computer Science 2023-05-02 Yunmo Chen, William Gantt, Weiwei Gu, Tongfei Chen, Aaron Steven White, Benjamin Van Durme

Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting

Processing large amounts of data is an essential problem of the big data era. Most of the data exchange is done via direct communication (using APIs) and well-structured file formats (JSON, XML, EDI, etc.), but a significant portion of the…

Information Retrieval · Computer Science 2020-07-17 Vladimir Bernstein, Andrei Afanassenkov

TableLab: An Interactive Table Extraction System with Adaptive Deep Learning

Table extraction from PDF and image documents is a ubiquitous task in the real world. Perfect extraction quality is difficult to achieve with one single out-of-box model due to (1) the wide variety of table styles, (2) the lack of training…

Human-Computer Interaction · Computer Science 2021-02-18 Nancy Xin Ru Wang, Douglas Burdick, Yunyao Li

Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots

Automated data extraction from research texts has been steadily improving, with the emergence of large language models (LLMs) accelerating progress even further. Extracting data from plots in research papers, however, has been such a…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Maciej P. Polak, Dane Morgan

Exploring LLMs for Scientific Information Extraction Using The SciEx Framework

Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents,…

Artificial Intelligence · Computer Science 2026-01-26 Sha Li, Ayush Sadekar, Nathan Self, Yiqi Su, Lars Andersland, Mira Chaplin, Annabel Zhang, Hyoju Yang, James B Henderson, Krista Wigginton, Linsey Marr, T. M. Murali, Naren Ramakrishnan

Unfolding the Structure of a Document using Deep Learning

Understanding and extracting information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. Such large documents may be…

Computation and Language · Computer Science 2019-10-10 Muhammad Mahbubur Rahman, Tim Finin

TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two…

Computer Vision and Pattern Recognition · Computer Science 2022-07-15 Zhanzhan Cheng, Peng Zhang, Can Li, Qiao Liang, Yunlu Xu, Pengfei Li, Shiliang Pu, Yi Niu, Fei Wu

Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models

Tabular data is often hidden in text, particularly in medical diagnostic reports. Traditional machine learning (ML) models designed to work with tabular data cannot effectively process information in such form. On the other hand, large…

Machine Learning · Computer Science 2023-06-09 Aleksa Bisercic, Mladen Nikolic, Mihaela van der Schaar, Boris Delibasic, Pietro Lio, Andrija Petrovic

Data-Efficient Information Extraction from Form-Like Documents

Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge…

Machine Learning · Computer Science 2022-01-14 Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

Locating Tables in Scanned Documents for Reconstructing and Republishing (ICIAfS14)

The pool of knowledge available to mankind depends on the source of learning resources, which can vary from ancient printed documents to present-day electronic material. The rapid conversion of material available in traditional libraries to…

Computer Vision and Pattern Recognition · Computer Science 2014-12-25 Akmal Jahan Mac, Roshan G Ragel

Improving Unstructured Data Quality via Updatable Extracted Views

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring…

Databases · Computer Science 2025-02-26 Besat Kassaie, Frank Wm. Tompa

Web Template Extraction Based on Hyperlink Analysis

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful,…

Information Retrieval · Computer Science 2015-01-12 Julián Alarte, David Insa, Josep Silva, Salvador Tamarit

Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review

Extracting key information from documents represents a large portion of business workloads and therefore offers a high potential for efficiency improvements and process automation. With recent advances in Deep Learning, a plethora of Deep…

Information Retrieval · Computer Science 2025-07-21 Alexander Michael Rombach, Peter Fettke

Tablext: A Combined Neural Network And Heuristic Based Table Extractor

A significant portion of the data available today is found within tables. Therefore, it is necessary to use automated table extraction to obtain thorough results when data-mining. Today's popular state-of-the-art methods for table…

Information Retrieval · Computer Science 2021-04-26 Zach Colter, Morteza Fayazi, Zineb Benameur-El, Serafina Kamp, Shuyan Yu, Ronald Dreslinski

Team, Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation

While tabular data is fundamental to many real-world machine learning (ML) applications, acquiring high-quality tabular data is usually labor-intensive and expensive. Limited by the scarcity of observations, tabular datasets often exhibit…

Machine Learning · Computer Science 2026-02-05 Congjing Zhang, Ryan Feng Lin, Ruoxuan Bao, Shuai Huang

Information Extraction From Fiscal Documents Using LLMs

Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data…

Computation and Language · Computer Science 2025-11-25 Vikram Aggarwal, Jay Kulkarni, Aditi Mascarenhas, Aakriti Narang, Siddarth Raman, Ajay Shah, Susan Thomas

MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating Multi-Insight Multi-Document Extraction Tasks

Large language models (LLMs) have demonstrated remarkable capabilities in text analysis tasks, yet their evaluation on complex, real-world applications remains challenging. We define a set of tasks, Multi-Insight Multi-Document Extraction…

Computation and Language · Computer Science 2024-12-02 John Francis, Saba Esnaashari, Anton Poletaev, Sukankana Chakraborty, Youmna Hashem, Jonathan Bright

PP-StructureV2: A Stronger Document Analysis System

A large amount of document data exists in unstructured form such as raw images without any text information. Designing a practical document image analysis system is a meaningful but challenging task. In previous work, we proposed an…

Computer Vision and Pattern Recognition · Computer Science 2022-10-14 Chenxia Li, Ruoyu Guo, Jun Zhou, Mengtao An, Yuning Du, Lingfeng Zhu, Yi Liu, Xiaoguang Hu, Dianhai Yu

PAGED: A Benchmark for Procedural Graphs Extraction from Documents

Automatic extraction of procedural graphs from documents creates a low-cost way for users to easily understand a complex procedure by skimming visual graphs. Despite the progress in recent studies, it remains unanswered: whether the…

Computation and Language · Computer Science 2024-08-09 Weihong Du, Wenrui Liao, Hongru Liang, Wenqiang Lei