Related papers: SCRIBES: Web-Scale Script-Based Semi-Structured Da…

CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high…

Artificial Intelligence · Computer Science 2018-04-13 Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar

Label-Efficient Self-Training for Attribute Extraction from Semi-Structured Web Documents

Extracting structured information from HTML documents is a long-studied problem with a broad range of applications, including knowledge base construction, faceted search, and personalized recommendation. Prior works rely on a few…

Information Retrieval · Computer Science 2022-08-30 Ritesh Sarkhel, Binxuan Huang, Colin Lockard, Prashant Shiralkar

WebFormer: The Web-page Transformer for Structure Information Extraction

Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page including product title, description, brand and price. It is an important…

Computation and Language · Computer Science 2022-02-02 Qifan Wang, Yi Fang, Anirudh Ravula, Fuli Feng, Xiaojun Quan, Dongfang Liu

Reverse method for labeling the information from semi-structured web pages

We propose a new technique to infer the structure and extract the tokens of data from the semi-structured web sources which are generated using a consistent template or layout with some implicit regularities. The attributes are extracted…

Information Retrieval · Computer Science 2009-08-06 Z. Akbar, L. T. Handoko

Extraction of Core Contents from Web Pages

The information available on web pages mostly contains semi-structured text documents which are represented either in XML, or HTML, or XHTML format that lacks formatted document structure. The document does not discriminate between the text…

Information Retrieval · Computer Science 2014-03-11 Sandeep Sirsat

Information Extraction Using the Structured Language Model

The paper presents a data-driven approach to information extraction (viewed as template filling) using the structured language model (SLM) as a statistical parser. The task of template filling is cast as constrained parsing using the SLM.…

Computation and Language · Computer Science 2007-05-23 Ciprian Chelba, Milind Mahajan

SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often…

Information Retrieval · Computer Science 2026-04-24 Mike Rainey, Umut Acar, Muhammed Sezer

HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM

In this paper, we propose a novel method for extracting information from HTML tables with similar contents but with a different structure. We aim to integrate multiple HTML tables into a single table for retrieval of information containing…

Information Retrieval · Computer Science 2024-10-01 Kazuki Kawamura, Akihiro Yamamoto

A Graph Representation of Semi-structured Data for Web Question Answering

The abundant semi-structured data on the Web, such as HTML-based tables and lists, provide commercial search engines a rich information source for question answering (QA). Different from plain text passages in Web documents, Web tables and…

Computation and Language · Computer Science 2020-10-15 Xingyao Zhang, Linjun Shou, Jian Pei, Ming Gong, Lijie Wen, Daxin Jiang

Combining Language and Graph Models for Semi-structured Information Extraction on the Web

Relation extraction is an efficient way of mining the extraordinary wealth of human knowledge on the Web. Existing methods rely on domain-specific training data or produce noisy outputs. We focus here on extracting targeted relations from…

Information Retrieval · Computer Science 2024-02-23 Zhi Hong, Kyle Chard, Ian Foster

Data-Efficient Information Extraction from Form-Like Documents

Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge…

Machine Learning · Computer Science 2022-01-14 Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

SLIDE: Sliding Localized Information for Document Extraction

Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks. This problem is amplified in low-resource settings…

Computation and Language · Computer Science 2025-03-25 Divyansh Singh, Manuel Nunez Martinez, Bonnie J. Dorr, Sonja Schmer Galunder

OCR++: A Robust Framework For Information Extraction from Scholarly Articles

This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks from scholarly articles including metadata (title, author names, affiliation and e-mail), structure (section headings and body text,…

Digital Libraries · Computer Science 2016-09-26 Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, Animesh Mukherjee

SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

Training reliable tool-augmented agents remains a significant challenge, largely due to the difficulty of credit assignment in multi-step reasoning. While process-level reward models offer a promising direction, existing LLM-based judges…

Artificial Intelligence · Computer Science 2026-04-28 Yuxuan Jiang, Francis Ferraro

Efficient Crawling for Scalable Web Data Acquisition (Extended Version)

Journalistic fact-checking, as well as social or economic research, require analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they…

Information Retrieval · Computer Science 2026-02-13 Antoine Gauquier, Ioana Manolescu, Pierre Senellart

Simplified DOM Trees for Transferable Attribute Extraction from the Web

There has been a steady need to precisely extract structured knowledge from the web (i.e. HTML documents). Given a web page, extracting a structured object along with various attributes of interest (e.g. price, publisher, author, and genre…

Machine Learning · Computer Science 2021-01-08 Yichao Zhou, Ying Sheng, Nguyen Vo, Nick Edmonds, Sandeep Tata

Scientific Information Extraction with Semi-supervised Neural Tagging

This paper addresses the problem of extracting keyphrases from scientific articles and categorizing them as corresponding to a task, process, or material. We cast the problem as sequence tagging and introduce semi-supervised methods to a…

Computation and Language · Computer Science 2017-08-22 Yi Luan, Mari Ostendorf, Hannaneh Hajishirzi

GraphRank Pro+: Advancing Talent Analytics Through Knowledge Graphs and Sentiment-Enhanced Skill Profiling

The extraction of information from semi-structured text, such as resumes, has long been a challenge due to the diverse formatting styles and subjective content organization. Conventional solutions rely on specialized logic tailored for…

Artificial Intelligence · Computer Science 2025-02-26 Sirisha Velampalli, Chandrashekar Muniyappa

INFOTABS: Inference on Tables as Semi-structured Data

In this paper, we observe that semi-structured tabulated text is ubiquitous; understanding them requires not only comprehending the meaning of text fragments, but also implicit relationships between them. We argue that such data can prove…

Computation and Language · Computer Science 2020-05-14 Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, Vivek Srikumar

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like…

Computation and Language · Computer Science 2020-10-22 Bill Yuchen Lin, Ying Sheng, Nguyen Vo, Sandeep Tata