Related papers: Unsupervised Data Extraction from Computer-generat…

Data-Efficient Information Extraction from Form-Like Documents

Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge…

Machine Learning · Computer Science 2022-01-14 Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

Extracting Procedural Knowledge from Technical Documents

Procedures are an important knowledge component of documents that can be leveraged by cognitive assistants for automation, question-answering or driving a conversation. It is a challenging problem to parse big dense documents like product…

Artificial Intelligence · Computer Science 2020-10-21 Shivali Agarwal, Shubham Atreja, Vikas Agarwal

Unsupervised Keyphrase Extraction via Interpretable Neural Networks

Keyphrase extraction aims at automatically extracting a list of "important" phrases representing the key concepts in a document. Prior approaches for unsupervised keyphrase extraction resorted to heuristic notions of phrase importance via…

Computation and Language · Computer Science 2023-02-20 Rishabh Joshi, Vidhisha Balachandran, Emily Saldanha, Maria Glenski, Svitlana Volkova, Yulia Tsvetkov

Improving Unstructured Data Quality via Updatable Extracted Views

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring…

Databases · Computer Science 2025-02-26 Besat Kassaie, Frank Wm. Tompa

Information Extraction - A User Guide

This technical memo describes Information Extraction from the point-of-view of a potential user of the technology. No knowledge of language processing is assumed. Information Extraction is a process which takes unseen texts as input and…

cmp-lg · Computer Science 2008-02-03 Hamish Cunningham

Simple Unsupervised Keyphrase Extraction using Sentence Embeddings

Keyphrase extraction is the task of automatically selecting a small set of phrases that best describe a given free text document. Supervised keyphrase extraction requires large amounts of labeled training data and generalizes very poorly…

Computation and Language · Computer Science 2018-09-07 Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl, Martin Jaggi

Optimising Human-Machine Collaboration for Efficient High-Precision Information Extraction from Text Documents

While humans can extract information from unstructured text with high precision and recall, this is often too time-consuming to be practical. Automated approaches, on the other hand, produce nearly-immediate results, but may not be reliable…

Computation and Language · Computer Science 2023-02-21 Bradley Butcher, Miri Zilka, Darren Cook, Jiri Hron, Adrian Weller

Key Information Extraction From Documents: Evaluation And Generator

Extracting information from documents usually relies on natural language processing methods working on one-dimensional sequences of text. In some cases, for example, for the extraction of key information from semi-structured documents, such…

Computation and Language · Computer Science 2021-06-29 Oliver Bensch, Mirela Popa, Constantin Spille

Information Extraction from Unstructured data using Augmented-AI and Computer Vision

Information extraction (IE) from unstructured documents remains a critical challenge in data processing pipelines. Traditional optical character recognition (OCR) methods and conventional parsing engines demonstrate limited effectiveness…

Computer Vision and Pattern Recognition · Computer Science 2025-07-28 Aditya Parikh

A Semi-automatic Data Extraction System for Heterogeneous Data Sources: A Case Study from Cotton Industry

With the recent developments in digitisation, there is an increasing number of documents available online. There are several information extraction tools that are available to extract information from digitised documents. However, identifying…

Information Retrieval · Computer Science 2021-11-08 Richi Nayak, Thirunavukarasu Balasubramaniam, Sangeetha Kutty, Sachindra Banduthilaka, Erin Peterson

Unsupervised and Distributional Detection of Machine-Generated Text

The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human or machine-authored. The problem so far has been framed in a standard supervised way and consists…

Computation and Language · Computer Science 2021-11-05 Matthias Gallé, Jos Rozen, Germán Kruszewski, Hady Elsahar

Assessing the quality of information extraction

Advances in large language models have notably enhanced the efficiency of information extraction from unstructured and semi-structured data sources. As these technologies become integral to various applications, establishing an objective…

Computation and Language · Computer Science 2024-05-24 Filip Seitl, Tomáš Kovářík, Soheyla Mirshahi, Jan Kryštůfek, Rastislav Dujava, Matúš Ondreička, Herbert Ullrich, Petr Gronat

TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two…

Computer Vision and Pattern Recognition · Computer Science 2022-07-15 Zhanzhan Cheng, Peng Zhang, Can Li, Qiao Liang, Yunlu Xu, Pengfei Li, Shiliang Pu, Yi Niu, Fei Wu

TWIX: Automatically Reconstructing Structured Data from Templatized Documents

Many documents, that we call templatized documents, are programmatically generated by populating fields in a visual template. Effective data extraction from these documents is crucial to supporting downstream analytical tasks. Current data…

Databases · Computer Science 2025-01-14 Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, Aditya G. Parameswaran

A New Sentence Extraction Strategy for Unsupervised Extractive Summarization Methods

In recent years, text summarization methods have again attracted much attention thanks to research on neural network models. Most of the current text summarization methods based on neural network models are supervised methods which…

Computation and Language · Computer Science 2024-01-25 Dehao Tao, Yingzhu Xiong, Zhongliang Yang, Yongfeng Huang

An efficient domain-independent approach for supervised keyphrase extraction and ranking

We present a supervised learning approach for automatic extraction of keyphrases from single documents. Our solution uses simple to compute statistical and positional features of candidate phrases and does not rely on any external knowledge…

Information Retrieval · Computer Science 2024-04-12 Sriraghavendra Ramaswamy

TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks,…

Computer Vision and Pattern Recognition · Computer Science 2021-10-26 Peng Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Jing Lu, Liang Qiao, Yi Niu, Fei Wu

Unsupervised, Efficient and Semantic Expertise Retrieval

We introduce an unsupervised discriminative model for the task of retrieving experts in online document collections. We exclusively employ textual evidence and avoid explicit feature engineering by learning distributed word representations…

Information Retrieval · Computer Science 2017-09-19 Christophe Van Gysel, Maarten de Rijke, Marcel Worring

Retrieval-efficiency trade-off of Unsupervised Keyword Extraction

Efficiently identifying keyphrases that represent a given document is a challenging task. In recent years, a plethora of keyword detection approaches have been proposed. These approaches can be based on statistical (frequency-based) properties…

Information Retrieval · Computer Science 2023-12-25 Blaž Škrlj, Boshko Koloski, Senja Pollak

A Review of Keyphrase Extraction

Keyphrase extraction is a textual information processing task concerned with the automatic extraction of representative and characteristic phrases from a document that express all the key aspects of its content. Keyphrases constitute a…

Computation and Language · Computer Science 2019-07-31 Eirini Papagiannopoulou, Grigorios Tsoumakas