Related papers: Inscriptis -- A Python-based HTML to text conversi…

200 papers

Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging.…

Computation and Language · Computer Science 2025-10-03 Shicheng Liu, Kai Sun, Lisheng Fu, Xilun Chen, Xinyuan Zhang, Zhaojiang Lin, Rulin Shao, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

This technical memo describes Information Extraction from the point-of-view of a potential user of the technology. No knowledge of language processing is assumed. Information Extraction is a process which takes unseen texts as input and…

cmp-lg · Computer Science 2008-02-03 Hamish Cunningham

HTML documents are an important medium for disseminating information on the Web for human consumption. An HTML document presents information in multiple text formats including unstructured text, structured key-value pairs, and tables.…

Computation and Language · Computer Science 2022-01-27 Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun

Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page including product title, description, brand and price. It is an important…

Computation and Language · Computer Science 2022-02-02 Qifan Wang, Yi Fang, Anirudh Ravula, Fuli Feng, Xiaojun Quan, Dongfang Liu

Data exploration is an important step of every data science and machine learning project, including those involving textual data. We provide a novel language tool, in the form of a publicly available Python library for extracting patterns…

Computation and Language · Computer Science 2022-06-20 Piyawat Lertvittayakumjorn, Leshem Choshen, Eyal Shnarch, Francesca Toni

In this paper, we present SATIS, a framework to derive Web Service specifications from end-user's requirements in order to operationalise business processes in the context of a specific application domain. The aim of SATIS is to provide to…

Software Engineering · Computer Science 2015-04-24 Isabelle Mirbel, Pierre Crescenzo

This paper presents the InScript corpus (Narrative Texts Instantiating Script structure). InScript is a corpus of 1,000 stories centered around 10 different scenarios. Verbs and noun phrases are annotated with event and participant types,…

Computation and Language · Computer Science 2017-03-16 Ashutosh Modi, Tatjana Anikina, Simon Ostermann, Manfred Pinkal

This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks from scholarly articles including metadata (title, author names, affiliation and e-mail), structure (section headings and body text,…

Phenotyping consists in applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically out of a collection of Electronic Health Records (EHRs). Because a lot of the clinical…

Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained…

Computation and Language · Computer Science 2026-05-25 Tianhao Qiu, Xiaojun Chen

Digital humanities are rooted in text analysis. However, most visualization paradigms use only categoric, ordered or quantitative data. Literal text must be considered a base data type to encode into visualizations. Literal text offers…

Human-Computer Interaction · Computer Science 2020-09-08 Richard Brath

Offline-to-online script conversion is a challenging and ill-posed problem. The interest in offline-to-online conversion exists because there is a plethora of robust algorithms in the online script literature which cannot be…

Computer Vision and Pattern Recognition · Computer Science 2015-04-08 Sunil Kopparapu, Devanuj, Akhilesh Srivastava, P. V. S. Rao

The evolution of web applications relies on iterative code modifications, a process that is traditionally manual and time-consuming. While Large Language Models (LLMs) can generate UI code, their ability to edit existing code from new…

Software Engineering · Computer Science 2025-10-31 Truong Hai Dang, Jingyu Xiao, Yintong Huo

This paper highlights the challenges, current trends, and open issues related to the representation, querying and analytics of content extracted from texts. The internet contains vast text-based information on various subjects, including…

Databases · Computer Science 2023-10-11 Genoveva Vargas-Solar, Mirian Halfeld Ferrari Alves, Anne-Lyse Minard Forst

Data-to-text systems are powerful in generating reports from data automatically and thus they simplify the presentation of complex data. Rather than presenting data using visualisation techniques, data-to-text systems use natural (human)…

Computation and Language · Computer Science 2016-10-27 Dimitra Gkatzia

The rapid advancement of information technology has introduced a noticeable shift from traditional offline practices to more efficient and interconnected online environments. This transition, while offering convenience, has also increased…

Cryptography and Security · Computer Science 2026-05-20 Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed, Ameer Al-Nemrat

This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors…

Computation and Language · Computer Science 2024-06-25 Max Dallabetta, Conrad Dobberstein, Adrian Breiding, Alan Akbik

Computational notebooks have become increasingly popular for exploratory data analysis due to their ability to support data exploration and explanation within a single document. Effective documentation for explaining chart findings during…

Human-Computer Interaction · Computer Science 2023-07-18 Yanna Lin, Haotian Li, Leni Yang, Aoyu Wu, Huamin Qu

Text Simplification (TS) is the task of converting a text into a form that is easier to read while maintaining the meaning of the original text. A sub-task of TS is Cognitive Simplification (CS), converting text to a form that is readily…

Computation and Language · Computer Science 2022-11-17 Eytan Chamovitz, Omri Abend

The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable…

Computation and Language · Computer Science 2026-05-11 Hanqing Zhao