Related papers: Split-Correctness in Information Extraction

Grammars for Document Spanners

We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied…

Databases · Computer Science 2023-01-25 Liat Peterfreund

Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity

Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with…

Databases · Computer Science 2018-01-01 Francisco Maturana, Cristian Riveros, Domagoj Vrgoč

Span-Oriented Information Extraction -- A Unifying Perspective on Information Extraction

Information Extraction refers to a collection of tasks within Natural Language Processing (NLP) that identifies sub-sequences within text and their labels. These tasks have been used for many years to extract relevant information and…

Computation and Language · Computer Science 2024-03-26 Yifan Ding, Michael Yankoski, Tim Weninger

Extracting Procedural Knowledge from Technical Documents

Procedures are an important knowledge component of documents that can be leveraged by cognitive assistants for automation, question-answering or driving a conversation. It is a challenging problem to parse big dense documents like product…

Artificial Intelligence · Computer Science 2020-10-21 Shivali Agarwal, Shubham Atreja, Vikas Agarwal

Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean,…

Computer Vision and Pattern Recognition · Computer Science 2023-12-21 Carol Anderson, Phil Crone

An Exploratory Study of Ad Hoc Parsers in Python

Background: Ad hoc parsers are pieces of code that use common string functions like split, trim, or slice to effectively perform parsing. Whether it is handling command-line arguments, reading configuration files, parsing custom file…

Software Engineering · Computer Science 2023-04-20 Michael Schröder, Marc Goritschnig, Jürgen Cito

Structural Text Segmentation of Legal Documents

The growing complexity of legal cases has led to an increasing interest in legal information retrieval systems that can effectively satisfy user-specific information needs. However, such downstream systems typically require documents to be…

Computation and Language · Computer Science 2021-05-18 Dennis Aumiller, Satya Almasian, Sebastian Lackner, Michael Gertz

Detecting Opportunities for Differential Maintenance of Extracted Views

Semi-structured and unstructured data management is challenging, but many of the problems encountered are analogous to problems already addressed in the relational context. In the area of information extraction, for example, the shift from…

Databases · Computer Science 2020-07-07 Besat Kassaie, Frank Wm. Tompa

Text Line Segmentation of Historical Documents: a Survey

There is a huge amount of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such…

Computer Vision and Pattern Recognition · Computer Science 2007-05-23 Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet

Constant delay algorithms for regular document spanners

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants…

Databases · Computer Science 2018-03-15 Fernando Florenzano, Cristian Riveros, Martin Ugarte, Stijn Vansummeren, Domagoj Vrgoc

Word and character segmentation directly in run-length compressed handwritten document images

From the literature, it is demonstrated that performing text-line segmentation directly in the run-length compressed handwritten document images significantly reduces the computational time and memory space. In this paper, we investigate…

Computer Vision and Pattern Recognition · Computer Science 2019-09-12 Amarnath R, P. Nagabhushan, Mohammed Javed

Automatic Page Segmentation Without Decompressing the Run-Length Compressed Text Documents

Page segmentation is considered to be the crucial stage for the automatic analysis of documents with complex layouts. This has traditionally been carried out in uncompressed documents, although most of the documents in real life exist in a…

Computer Vision and Pattern Recognition · Computer Science 2020-07-03 Mohammed Javed, P. Nagabhushan

Toward Unifying Text Segmentation and Long Document Summarization

Text segmentation is important for signaling a document's structure. Without segmenting a long document into topically coherent sections, it is difficult for readers to comprehend the text, let alone find important information. The problem…

Computation and Language · Computer Science 2022-11-01 Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Fei Liu, Dong Yu

Text Segmentation Using Exponential Models

This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To…

cmp-lg · Computer Science 2008-02-03 Doug Beeferman, Adam Berger, John Lafferty

Unfolding the Structure of a Document using Deep Learning

Understanding and extracting information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. Such large documents may be…

Computation and Language · Computer Science 2019-10-10 Muhammad Mahbubur Rahman, Tim Finin

Design of Automatically Adaptable Web Wrappers

Nowadays, the huge amount of information distributed through the Web motivates studying techniques to be adopted in order to extract relevant data in an efficient and reliable way. Both academia and enterprises developed several approaches…

Artificial Intelligence · Computer Science 2013-06-06 Emilio Ferrara, Robert Baumgartner

TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks,…

Computer Vision and Pattern Recognition · Computer Science 2021-10-26 Peng Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Jing Lu, Liang Qiao, Yi Niu, Fei Wu

Navigating multilingual news collections using automatically extracted information

We are presenting a text analysis tool set that allows analysts in various fields to sieve through large collections of multilingual news items quickly and to find information that is of relevance to them. For a given document collection,…

Computation and Language · Computer Science 2007-05-23 Ralf Steinberger, Bruno Pouliquen, Camelia Ignat

A framework for extraction and transformation of documents

We present a theoretical framework for the extraction and transformation of text documents. We propose to use a two-phase process where the first phase extracts span-tuples from a document, and the second phase maps the content of the…

Databases · Computer Science 2024-05-22 Cristian Riveros, Markus L. Schmid, Nicole Schweikardt

Information Extraction Using the Structured Language Model

The paper presents a data-driven approach to information extraction (viewed as template filling) using the structured language model (SLM) as a statistical parser. The task of template filling is cast as constrained parsing using the SLM.…

Computation and Language · Computer Science 2007-05-23 Ciprian Chelba, Milind Mahajan