Related papers: TWIX: Automatically Reconstructing Structured Data…

200 papers

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science · 2025-10-14 · Zilong Wang, Xiaoyu Shen

We present a novel iterative extraction model, IterX, for extracting complex relations, or templates (i.e., N-tuples representing a mapping from named slots to spans of text) within a document. Documents may feature zero or more instances…

Computation and Language · Computer Science · 2023-05-02 · Yunmo Chen, William Gantt, Weiwei Gu, Tongfei Chen, Aaron Steven White, Benjamin Van Durme

Processing large amounts of data is an essential problem of the big data era. Most of the data exchange is done via direct communication (using APIs) and well-structured file formats (JSON, XML, EDI, etc.), but a significant portion of the…

Information Retrieval · Computer Science · 2020-07-17 · Vladimir Bernstein, Andrei Afanassenkov

Table extraction from PDF and image documents is a ubiquitous task in the real world. Perfect extraction quality is difficult to achieve with one single out-of-the-box model due to (1) the wide variety of table styles, (2) the lack of training…

Human-Computer Interaction · Computer Science · 2021-02-18 · Nancy Xin Ru Wang, Douglas Burdick, Yunyao Li

Automated data extraction from research texts has been steadily improving, with the emergence of large language models (LLMs) accelerating progress even further. Extracting data from plots in research papers, however, has been such a…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-18 · Maciej P. Polak, Dane Morgan

Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents,…

Understanding and extracting information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. Such large documents may be…

Computation and Language · Computer Science · 2019-10-10 · Muhammad Mahbubur Rahman, Tim Finin

Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-15 · Zhanzhan Cheng, Peng Zhang, Can Li, Qiao Liang, Yunlu Xu, Pengfei Li, Shiliang Pu, Yi Niu, Fei Wu

Tabular data is often hidden in text, particularly in medical diagnostic reports. Traditional machine learning (ML) models designed to work with tabular data cannot effectively process information in such a form. On the other hand, large…

Machine Learning · Computer Science · 2023-06-09 · Aleksa Bisercic, Mladen Nikolic, Mihaela van der Schaar, Boris Delibasic, Pietro Lio, Andrija Petrovic

Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge…

Machine Learning · Computer Science · 2022-01-14 · Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

The pool of knowledge available to mankind depends on the source of learning resources, which can vary from ancient printed documents to present-day electronic material. The rapid conversion of material available in traditional libraries to…

Computer Vision and Pattern Recognition · Computer Science · 2014-12-25 · Akmal Jahan Mac, Roshan G Ragel

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring…

Databases · Computer Science · 2025-02-26 · Besat Kassaie, Frank Wm. Tompa

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful,…

Information Retrieval · Computer Science · 2015-01-12 · Julián Alarte, David Insa, Josep Silva, Salvador Tamarit

Extracting key information from documents represents a large portion of business workloads and therefore offers a high potential for efficiency improvements and process automation. With recent advances in Deep Learning, a plethora of Deep…

Information Retrieval · Computer Science · 2025-07-21 · Alexander Michael Rombach, Peter Fettke

A significant portion of the data available today is found within tables. Therefore, it is necessary to use automated table extraction to obtain thorough results when data-mining. Today's popular state-of-the-art methods for table…

Information Retrieval · Computer Science · 2021-04-26 · Zach Colter, Morteza Fayazi, Zineb Benameur-El, Serafina Kamp, Shuyan Yu, Ronald Dreslinski

While tabular data is fundamental to many real-world machine learning (ML) applications, acquiring high-quality tabular data is usually labor-intensive and expensive. Limited by the scarcity of observations, tabular datasets often exhibit…

Machine Learning · Computer Science · 2026-02-05 · Congjing Zhang, Ryan Feng Lin, Ruoxuan Bao, Shuai Huang

Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data…

Computation and Language · Computer Science · 2025-11-25 · Vikram Aggarwal, Jay Kulkarni, Aditi Mascarenhas, Aakriti Narang, Siddarth Raman, Ajay Shah, Susan Thomas

Large language models (LLMs) have demonstrated remarkable capabilities in text analysis tasks, yet their evaluation on complex, real-world applications remains challenging. We define a set of tasks, Multi-Insight Multi-Document Extraction…

Computation and Language · Computer Science · 2024-12-02 · John Francis, Saba Esnaashari, Anton Poletaev, Sukankana Chakraborty, Youmna Hashem, Jonathan Bright

A large amount of document data exists in unstructured form such as raw images without any text information. Designing a practical document image analysis system is a meaningful but challenging task. In previous work, we proposed an…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-14 · Chenxia Li, Ruoyu Guo, Jun Zhou, Mengtao An, Yuning Du, Lingfeng Zhu, Yi Liu, Xiaoguang Hu, Dianhai Yu

Automatic extraction of procedural graphs from documents creates a low-cost way for users to easily understand a complex procedure by skimming visual graphs. Despite the progress in recent studies, it remains unanswered: whether the…

Computation and Language · Computer Science · 2024-08-09 · Weihong Du, Wenrui Liao, Hongru Liang, Wenqiang Lei