Related papers: Data-Efficient Information Extraction from Form-Li…

Jointly Learning Span Extraction and Sequence Labeling for Information Extraction from Business Documents

This paper introduces a new information extraction model for business documents. Unlike prior studies, which rely only on span extraction or sequence labeling, the model takes advantage of both span extraction and…

Computation and Language · Computer Science · 2022-05-27 · Nguyen Hong Son, Hieu M. Vu, Tuan-Anh D. Nguyen, Minh-Tien Nguyen

Improving Information Extraction on Business Documents with Specific Pre-Training Tasks

Transformer-based Language Models are widely used in Natural Language Processing related tasks. Thanks to their pre-training, they have been successfully adapted to Information Extraction in business documents. However, most pre-training…

Computation and Language · Computer Science · 2023-09-12 · Thibault Douzon, Stefan Duffner, Christophe Garcia, Jérémy Espinas

Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that…

Computation and Language · Computer Science · 2025-10-14 · Zilong Wang, Xiaoyu Shen

Schema-Driven Information Extraction from Heterogeneous Tables

In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into…

Computation and Language · Computer Science · 2024-11-22 · Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Mark Dredze, Alan Ritter

Rapid Adaptation of BERT for Information Extraction on Domain-Specific Business Documents

Techniques for automatically extracting important content elements from business documents such as contracts, statements, and filings have the potential to make business operations more efficient. This problem can be formulated as a…

Computation and Language · Computer Science · 2020-02-06 · Ruixue Zhang, Wei Yang, Luyun Lin, Zhengkai Tu, Yuqing Xie, Zihang Fu, Yuhao Xie, Luchen Tan, Kun Xiong, Jimmy Lin

Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review

Extracting key information from documents represents a large portion of business workloads and therefore offers a high potential for efficiency improvements and process automation. With recent advances in Deep Learning, a plethora of Deep…

Information Retrieval · Computer Science · 2025-07-21 · Alexander Michael Rombach, Peter Fettke

Corpus Conversion Service: A machine learning platform to ingest documents at scale [Poster abstract]

Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make their content…

Digital Libraries · Computer Science · 2018-05-25 · Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas

Transfer Learning for Information Extraction with Limited Data

This paper presents a practical approach to fine-grained information extraction. Through the authors' extensive experience in practically applying information extraction to business process automation, there can be found a couple of…

Information Retrieval · Computer Science · 2020-06-09 · Minh-Tien Nguyen, Viet-Anh Phan, Le Thai Linh, Nguyen Hong Son, Le Tien Dung, Miku Hirano, Hajime Hotta

A Span Extraction Approach for Information Extraction on Visually-Rich Documents

Information extraction (IE) for visually-rich documents (VRDs) has achieved SOTA performance recently thanks to the adaptation of Transformer-based language models, which shows the great potential of pre-training methods. In this paper, we…

Artificial Intelligence · Computer Science · 2021-07-07 · Tuan-Anh D. Nguyen, Hieu M. Vu, Nguyen Hong Son, Minh-Tien Nguyen

Learning from similarity and information extraction from structured documents

The automation of document processing is gaining recent attention due to the great potential to reduce manual work through improved methods and hardware. Neural networks have been successfully applied before - even though they have been…

Computation and Language · Computer Science · 2021-06-15 · Martin Holeček

Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows

Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve…

Computation and Language · Computer Science · 2025-03-11 · Clémence Sebe, Sarah Cohen-Boulakia, Olivier Ferret, Aurélie Névéol

Key Information Extraction From Documents: Evaluation And Generator

Extracting information from documents usually relies on natural language processing methods working on one-dimensional sequences of text. In some cases, for example, for the extraction of key information from semi-structured documents, such…

Computation and Language · Computer Science · 2021-06-29 · Oliver Bensch, Mirela Popa, Constantin Spille

Information Extraction from Visually Rich Documents with Font Style Embeddings

Information extraction (IE) from documents is an intensive area of research with a large set of industrial applications. Current state-of-the-art methods focus on scanned documents with approaches combining computer vision, natural language…

Computation and Language · Computer Science · 2022-08-16 · Ismail Oussaid, William Vanhuffel, Pirashanth Ratnamogan, Mhamed Hajaiej, Alexis Mathey, Thomas Gilles

Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting

Processing large amounts of data is an essential problem of the big data era. Most of the data exchange is done via direct communication (using APIs) and well-structured file formats (JSON, XML, EDI, etc.), but a significant portion of the…

Information Retrieval · Computer Science · 2020-07-17 · Vladimir Bernstein, Andrei Afanassenkov

Enhanced Language Representation with Label Knowledge for Span Extraction

Span extraction, aiming to extract text spans (such as words or phrases) from plain texts, is a fundamental process in Information Extraction. Recent works introduce the label knowledge to enhance the text representation by formalizing the…

Computation and Language · Computer Science · 2021-11-02 · Pan Yang, Xin Cong, Zhenyun Sun, Xingwu Liu

Assessing the quality of information extraction

Advances in large language models have notably enhanced the efficiency of information extraction from unstructured and semi-structured data sources. As these technologies become integral to various applications, establishing an objective…

Computation and Language · Computer Science · 2024-05-24 · Filip Seitl, Tomáš Kovářík, Soheyla Mirshahi, Jan Kryštůfek, Rastislav Dujava, Matúš Ondreička, Herbert Ullrich, Petr Gronat

Improving Document Image Understanding with Reinforcement Finetuning

Successful Artificial Intelligence systems often require numerous labeled data to extract information from document images. In this paper, we investigate the problem of improving the performance of Artificial Intelligence systems in…

Information Retrieval · Computer Science · 2022-09-27 · Bao-Sinh Nguyen, Dung Tien Le, Hieu M. Vu, Tuan Anh D. Nguyen, Minh-Tien Nguyen, Hung Le

Putting Question-Answering Systems into Practice: Transfer Learning for Efficient Domain Customization

Traditional information retrieval (such as that offered by web search engines) impedes users with information overload from extensive result pages and the need to manually locate the desired information therein. Conversely,…

Computation and Language · Computer Science · 2019-03-11 · Bernhard Kratzwald, Stefan Feuerriegel

Extracting Procedural Knowledge from Technical Documents

Procedures are an important knowledge component of documents that can be leveraged by cognitive assistants for automation, question-answering or driving a conversation. It is a challenging problem to parse big dense documents like product…

Artificial Intelligence · Computer Science · 2020-10-21 · Shivali Agarwal, Shubham Atreja, Vikas Agarwal

Label-Efficient Self-Training for Attribute Extraction from Semi-Structured Web Documents

Extracting structured information from HTML documents is a long-studied problem with a broad range of applications, including knowledge base construction, faceted search, and personalized recommendation. Prior works rely on a few…

Information Retrieval · Computer Science · 2022-08-30 · Ritesh Sarkhel, Binxuan Huang, Colin Lockard, Prashant Shiralkar