Related papers: Web Template Extraction Based on Hyperlink Analysi…

Automatic Detection of Webpages that Share the Same Web Template

Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpages development, content extraction, block detection, and webpages indexing. One of the main goals of…

Information Retrieval · Computer Science 2014-09-10 Julián Alarte, David Insa, Josep Silva, Salvador Tamarit

A Benchmark Suite for Template Detection and Content Extraction

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document. However, their…

Information Retrieval · Computer Science 2022-07-19 Julián Alarte, Josep Silva

A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page

Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant Web pages. One reason is because search engines also look at non-informative…

Information Retrieval · Computer Science 2019-11-27 Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, The Duy Bui

Using the DOM Tree for Content Extraction

The main information of a webpage is usually mixed between menus, advertisements, panels, and other not necessarily related information; and it is often difficult to automatically isolate this information. This is precisely the objective of…

Information Retrieval · Computer Science 2012-10-24 Sergio López, Josep Silva, David Insa

Navigation Objects Extraction for Better Content Structure Understanding

Existing works for extracting navigation objects from webpages focus on navigation menus, so as to reveal the information architecture of the site. However, web 2.0 sites such as social networks, e-commerce portals etc. are making the…

Artificial Intelligence · Computer Science 2017-08-29 Kui Zhao, Bangpeng Li, Zilun Peng, Jiajun Bu, Can Wang

Reverse method for labeling the information from semi-structured web pages

We propose a new technique to infer the structure and extract the tokens of data from the semi-structured web sources which are generated using a consistent template or layout with some implicit regularities. The attributes are extracted…

Information Retrieval · Computer Science 2009-08-06 Z. Akbar, L. T. Handoko

A language independent web data extraction using vision based page segmentation algorithm

Web usage mining is a process of extracting useful information from server logs i.e. users history. Web usage mining is a process of finding out what users are looking for on the internet. Some users might be looking at only textual data,…

Information Retrieval · Computer Science 2013-10-25 P YesuRaju, P KiranSree

Pattern discovery for semi-structured web pages using bar-tree representation

Many websites with an underlying database containing structured data provide the richest and most dense source of information relevant for topical data integration. The real data integration requires sustainable and reliable pattern…

Information Retrieval · Computer Science 2015-03-19 Z. Akbar, L. T. Handoko

Cross-Domain Web Information Extraction at Pinterest

The internet offers a massive repository of unstructured information, but it's a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce…

Computation and Language · Computer Science 2025-08-05 Michael Farag, Patrick Halina, Andrey Zaytsev, Alekhya Munagala, Imtihan Ahmed, Junhao Wang

HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM

In this paper, we propose a novel method for extracting information from HTML tables with similar contents but with a different structure. We aim to integrate multiple HTML tables into a single table for retrieval of information containing…

Information Retrieval · Computer Science 2024-10-01 Kazuki Kawamura, Akihiro Yamamoto

The Impact of Main Content Extraction on Near-Duplicate Detection

Commercial web search engines employ near-duplicate detection to ensure that users see each relevant result only once, albeit the underlying web crawls typically include (near-)duplicates of many web pages. We revisit the risks and…

Information Retrieval · Computer Science 2021-11-23 Maik Fröbe, Matthias Hagen, Janek Bevendorff, Michael Völske, Benno Stein, Christopher Schröder, Robby Wagner, Lukas Gienapp, Martin Potthast

Web Table Extraction, Retrieval and Augmentation: A Survey

Tables are a powerful and popular tool for organizing and manipulating data. A vast number of tables can be found on the Web, which represents a valuable knowledge resource. The objective of this survey is to synthesize and present two…

Information Retrieval · Computer Science 2020-02-06 Shuo Zhang, Krisztian Balog

Boilerplate Removal using a Neural Sequence Labeling Model

The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing.…

Machine Learning · Computer Science 2020-04-30 Jurek Leonhardt, Avishek Anand, Megha Khosla

Effective Blog Pages Extractor for Better UGC Accessing

Blog is becoming an increasingly popular media for information publishing. Besides the main content, most of blog pages nowadays also contain noisy information such as advertisements etc. Removing these unrelated elements can improve user…

Information Retrieval · Computer Science 2017-08-29 Kui Zhao, Yi Wang, Xia Hu, Can Wang

Webpage Segmentation for Extracting Images and Their Surrounding Contextual Information

Web images come in hand with valuable contextual information. Although this information has long been mined for various uses such as image annotation, clustering of images, inference of image semantic content, etc., insufficient attention…

Multimedia · Computer Science 2020-05-21 F. Fauzi, H. J. Long, M. Belkhatir

Web2Text: Deep Structured Boilerplate Removal

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this…

Information Retrieval · Computer Science 2018-03-28 Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff

A Model for Personalized Keyword Extraction from Web Pages using Segmentation

The World Wide Web caters to the needs of billions of users in heterogeneous groups. Each user accessing the World Wide Web might have his / her own specific interest and would expect the web to respond to the specific requirements. The…

Information Retrieval · Computer Science 2017-11-22 K. S. Kuppusamy, G. Aghila

An Index-based Approach for Efficient and Effective Web Content Extraction

As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management -- under large token budgets and low signal density -- emerges as a foundational, high-importance,…

Information Retrieval · Computer Science 2025-12-09 Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao

CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high…

Artificial Intelligence · Computer Science 2018-04-13 Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar

"What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces

Large language models (LLMs) that have been trained on a corpus that includes large amount of code exhibit a remarkable ability to understand HTML code. As web interfaces are primarily constructed using HTML, we design an in-depth study to…

Computation and Language · Computer Science 2023-12-12 Faria Huq, Jeffrey P. Bigham, Nikolas Martelaro