Related papers: Web Template Extraction Based on Hyperlink Analysi…

200 papers

Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpages development, content extraction, block detection, and webpages indexing. One of the main goals of…

Information Retrieval · Computer Science 2014-09-10 Julián Alarte, David Insa, Josep Silva, Salvador Tamarit

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document. However, their…

Information Retrieval · Computer Science 2022-07-19 Julián Alarte, Josep Silva

Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant Web pages. One reason is because search engines also look at non-informative…

Information Retrieval · Computer Science 2019-11-27 Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, The Duy Bui

The main information of a webpage is usually mixed between menus, advertisements, panels, and other not necessarily related information; and it is often difficult to automatically isolate this information. This is precisely the objective of…

Information Retrieval · Computer Science 2012-10-24 Sergio López, Josep Silva, David Insa

Existing works for extracting navigation objects from webpages focus on navigation menus, so as to reveal the information architecture of the site. However, web 2.0 sites such as social networks, e-commerce portals etc. are making the…

Artificial Intelligence · Computer Science 2017-08-29 Kui Zhao, Bangpeng Li, Zilun Peng, Jiajun Bu, Can Wang

We propose a new technique to infer the structure and extract the tokens of data from the semi-structured web sources which are generated using a consistent template or layout with some implicit regularities. The attributes are extracted…

Information Retrieval · Computer Science 2009-08-06 Z. Akbar, L. T. Handoko

Web usage mining is a process of extracting useful information from server logs i.e. users history. Web usage mining is a process of finding out what users are looking for on the internet. Some users might be looking at only textual data,…

Information Retrieval · Computer Science 2013-10-25 P YesuRaju, P KiranSree

Many websites with an underlying database containing structured data provide the richest and most dense source of information relevant for topical data integration. The real data integration requires sustainable and reliable pattern…

Information Retrieval · Computer Science 2015-03-19 Z. Akbar, L. T. Handoko

The internet offers a massive repository of unstructured information, but it's a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce…

Computation and Language · Computer Science 2025-08-05 Michael Farag, Patrick Halina, Andrey Zaytsev, Alekhya Munagala, Imtihan Ahmed, Junhao Wang

In this paper, we propose a novel method for extracting information from HTML tables with similar contents but with a different structure. We aim to integrate multiple HTML tables into a single table for retrieval of information containing…

Information Retrieval · Computer Science 2024-10-01 Kazuki Kawamura, Akihiro Yamamoto

Commercial web search engines employ near-duplicate detection to ensure that users see each relevant result only once, albeit the underlying web crawls typically include (near-)duplicates of many web pages. We revisit the risks and…

Tables are a powerful and popular tool for organizing and manipulating data. A vast number of tables can be found on the Web, which represents a valuable knowledge resource. The objective of this survey is to synthesize and present two…

Information Retrieval · Computer Science 2020-02-06 Shuo Zhang, Krisztian Balog

The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing.…

Machine Learning · Computer Science 2020-04-30 Jurek Leonhardt, Avishek Anand, Megha Khosla

Blogs are becoming an increasingly popular medium for information publishing. Besides the main content, most blog pages nowadays also contain noisy information such as advertisements etc. Removing these unrelated elements can improve user…

Information Retrieval · Computer Science 2017-08-29 Kui Zhao, Yi Wang, Xia Hu, Can Wang

Web images come in hand with valuable contextual information. Although this information has long been mined for various uses such as image annotation, clustering of images, inference of image semantic content, etc., insufficient attention…

Multimedia · Computer Science 2020-05-21 F. Fauzi, H. J. Long, M. Belkhatir

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this…

Information Retrieval · Computer Science 2018-03-28 Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff

The World Wide Web caters to the needs of billions of users in heterogeneous groups. Each user accessing the World Wide Web might have his / her own specific interest and would expect the web to respond to the specific requirements. The…

Information Retrieval · Computer Science 2017-11-22 K. S. Kuppusamy, G. Aghila

As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management -- under large token budgets and low signal density -- emerges as a foundational, high-importance,…

Information Retrieval · Computer Science 2025-12-09 Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high…

Artificial Intelligence · Computer Science 2018-04-13 Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar

Large language models (LLMs) that have been trained on a corpus that includes large amount of code exhibit a remarkable ability to understand HTML code. As web interfaces are primarily constructed using HTML, we design an in-depth study to…

Computation and Language · Computer Science 2023-12-12 Faria Huq, Jeffrey P. Bigham, Nikolas Martelaro