Related papers: Constant-Delay Enumeration for Nondeterministic Do…

Constant-Delay Enumeration for Nondeterministic Document Spanners

We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential…

Databases · Computer Science 2020-12-08 Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth

Constant delay algorithms for regular document spanners

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants…

Databases · Computer Science 2018-03-15 Fernando Florenzano, Cristian Riveros, Martin Ugarte, Stijn Vansummeren, Domagoj Vrgoc

Spanner Evaluation over SLP-Compressed Documents

We consider the problem of evaluating regular spanners over compressed documents, i.e., we wish to solve evaluation tasks directly on the compressed data, without decompression. As compressed forms of the documents we use straight-line…

Data Structures and Algorithms · Computer Science 2021-01-27 Markus L. Schmid, Nicole Schweikardt

Constant-delay enumeration for SLP-compressed documents

We study the problem of enumerating results from a query over a compressed document. The model we use for compression are straight-line programs (SLPs), which are defined by a context-free grammar that produces a single string. For our…

Data Structures and Algorithms · Computer Science 2025-02-26 Martín Muñoz, Cristian Riveros

Streaming enumeration on nested documents

Some of the most relevant document schemas used online, such as XML and JSON, have a nested format. In the last decade, the task of extracting data from nested documents over streams has become especially relevant. We focus on the streaming…

Databases · Computer Science 2022-01-11 Martín Muñoz, Cristian Riveros

Revisiting Weighted Information Extraction: A Simpler and Faster Algorithm for Ranked Enumeration

Information extraction from textual data, where the query is represented by a finite transducer and the task is to enumerate all results without repetition, and its extension to the weighted case, where each output element has a weight and…

Data Structures and Algorithms · Computer Science 2024-10-08 Pawel Gawrychowski, Florin Manea, Markus L. Schmid

Grammars for Document Spanners

We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied…

Databases · Computer Science 2023-01-25 Liat Peterfreund

A Span Extraction Approach for Information Extraction on Visually-Rich Documents

Information extraction (IE) for visually-rich documents (VRDs) has achieved SOTA performance recently thanks to the adaptation of Transformer-based language models, which shows the great potential of pre-training methods. In this paper, we…

Artificial Intelligence · Computer Science 2021-07-07 Tuan-Anh D. Nguyen, Hieu M. Vu, Nguyen Hong Son, Minh-Tien Nguyen

Enumeration on Trees with Tractable Combined Complexity and Efficient Updates

We give an algorithm to enumerate the results on trees of monadic second-order (MSO) queries represented by nondeterministic tree automata. After linear time preprocessing (in the input tree), we can enumerate answers with linear delay (in…

Databases · Computer Science 2019-08-28 Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth

Weight Annotation in Information Extraction

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For…

Databases · Computer Science 2023-06-22 Johannes Doleschal, Benny Kimelfeld, Wim Martens, Liat Peterfreund

Complexity Bounds for Relational Algebra over Document Spanners

We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular…

Databases · Computer Science 2019-02-07 Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, Markus Kröll

Extended Formulations via Decision Diagrams

We propose a general algorithm of constructing an extended formulation for any given set of linear constraints with integer coefficients. Our algorithm consists of two phases: first construct a decision diagram $(V,E)$ that somehow…

Data Structures and Algorithms · Computer Science 2023-09-07 Yuta Kurokawa, Ryotaro Mitsuboshi, Haruki Hamasaki, Kohei Hatano, Eiji Takimoto, Holakou Rahmanian

Document Expansion by Query Prediction

One technique to improve the retrieval effectiveness of a search engine is to expand documents with terms that are related or representative of the documents' content. From the perspective of a question answering system, this might comprise…

Information Retrieval · Computer Science 2019-09-26 Rodrigo Nogueira, Wei Yang, Jimmy Lin, Kyunghyun Cho

Progressively Sampled Equality-Constrained Optimization

An algorithm is proposed, analyzed, and tested for solving continuous nonlinear-equality-constrained optimization problems where the objective and constraint functions are defined by expectations or averages over large, finite numbers of…

Optimization and Control · Mathematics 2026-05-14 Frank E. Curtis, Lingjun Guo, Daniel P. Robinson

Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents (Extended Version)

We address the extraction of mathematical statements and their proofs from scholarly PDF articles as a multimodal classification problem, utilizing text, font features, and bitmap image renderings of PDFs as distinct modalities. We propose…

Artificial Intelligence · Computer Science 2024-10-14 Shrey Mishra, Antoine Gauquier, Pierre Senellart

Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks

Unconstrained text recognition is an important computer vision task, featuring a wide variety of different sub-tasks, each with its own set of challenges. One of the biggest promises of deep neural networks has been the convergence and…

Computer Vision and Pattern Recognition · Computer Science 2019-01-01 Mohamed Yousef, Khaled F. Hussain, Usama S. Mohammed

ABS: Enforcing Constraint Satisfaction On Generated Sequences Via Automata-Guided Beam Search

Sequence generation and prediction form a cornerstone of modern machine learning, with applications spanning natural language processing, program synthesis, and time-series forecasting. These tasks are typically modeled in an autoregressive…

Machine Learning · Computer Science 2025-11-05 Vincenzo Collura, Karim Tit, Laura Bussi, Eleonora Giunchiglia, Maxime Cordy

A framework for extraction and transformation of documents

We present a theoretical framework for the extraction and transformation of text documents. We propose to use a two-phase process where the first phase extracts span-tuples from a document, and the second phase maps the content of the…

Databases · Computer Science 2024-05-22 Cristian Riveros, Markus L. Schmid, Nicole Schweikardt

Constraint-based Sequential Pattern Mining with Decision Diagrams

Constrained sequential pattern mining aims at identifying frequent patterns on a sequential database of items while observing constraints defined over the item attributes. We introduce novel techniques for constraint-based sequential…

Machine Learning · Computer Science 2019-01-01 Amin Hosseininasab, Willem-Jan van Hoeve, Andre A. Cire

Compact enumeration for scheduling one machine

A Variable Parameter (VP) analysis, that we introduce here, aims to give a precise algorithm time complexity expression in which an exponent appears solely in terms of a variable parameter. A variable parameter is the number of objects with…

Data Structures and Algorithms · Computer Science 2025-07-08 Nodari Vakhania