Related papers: Inscriptis -- A Python-based HTML to text conversi…

SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging.…

Computation and Language · Computer Science 2025-10-03 Shicheng Liu, Kai Sun, Lisheng Fu, Xilun Chen, Xinyuan Zhang, Zhaojiang Lin, Rulin Shao, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

Information Extraction - A User Guide

This technical memo describes Information Extraction from the point-of-view of a potential user of the technology. No knowledge of language processing is assumed. Information Extraction is a process which takes unseen texts as input and…

cmp-lg · Computer Science 2008-02-03 Hamish Cunningham

DOM-LM: Learning Generalizable Representations for HTML Documents

HTML documents are an important medium for disseminating information on the Web for human consumption. An HTML document presents information in multiple text formats including unstructured text, structured key-value pairs, and tables.…

Computation and Language · Computer Science 2022-01-27 Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun

WebFormer: The Web-page Transformer for Structure Information Extraction

Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page including product title, description, brand and price. It is an important…

Computation and Language · Computer Science 2022-02-02 Qifan Wang, Yi Fang, Anirudh Ravula, Fuli Feng, Xiaojun Quan, Dongfang Liu

GrASP: A Library for Extracting and Exploring Human-Interpretable Textual Patterns

Data exploration is an important step of every data science and machine learning project, including those involving textual data. We provide a novel language tool, in the form of a publicly available Python library for extracting patterns…

Computation and Language · Computer Science 2022-06-20 Piyawat Lertvittayakumjorn, Leshem Choshen, Eyal Shnarch, Francesca Toni

From End-User's Requirements to Web Services Retrieval: A Semantic and Intention-Driven Approach

In this paper, we present SATIS, a framework to derive Web Service specifications from end-user's requirements in order to operationalise business processes in the context of a specific application domain. The aim of SATIS is to provide to…

Software Engineering · Computer Science 2015-04-24 Isabelle Mirbel, Pierre Crescenzo

InScript: Narrative texts annotated with script information

This paper presents the InScript corpus (Narrative Texts Instantiating Script structure). InScript is a corpus of 1,000 stories centered around 10 different scenarios. Verbs and noun phrases are annotated with event and participant types,…

Computation and Language · Computer Science 2017-03-16 Ashutosh Modi, Tatjana Anikina, Simon Ostermann, Manfred Pinkal

OCR++: A Robust Framework For Information Extraction from Scholarly Articles

This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks from scholarly articles including metadata (title, author names, affiliation and e-mail), structure (section headings and body text,…

Digital Libraries · Computer Science 2016-09-26 Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, Animesh Mukherjee

Facilitating phenotyping from clinical texts: the medkit library

Phenotyping consists in applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically out of a collection of Electronic Health Records (EHRs). Because a lot of the clinical…

Computation and Language · Computer Science 2024-09-05 Antoine Neuraz, Ghislain Vaillant, Camila Arias, Olivier Birot, Kim-Tam Huynh, Thibaut Fabacher, Alice Rogier, Nicolas Garcelon, Ivan Lerner, Bastien Rance, Adrien Coulet

Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model

Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained…

Computation and Language · Computer Science 2026-05-25 Tianhao Qiu, Xiaojun Chen

Literal Encoding: Text is a first-class data encoding

Digital humanities are rooted in text analysis. However, most visualization paradigms use only categoric, ordered or quantitative data. Literal text must be considered a base data type to encode into visualizations. Literal text offers…

Human-Computer Interaction · Computer Science 2020-09-08 Richard Brath

Knowledge driven Offline to Online Script Conversion

Offline to online script conversion is a challenging and ill-posed problem. The interest in offline to online conversion exists because there are a plethora of robust algorithms in online script literature which cannot be…

Computer Vision and Pattern Recognition · Computer Science 2015-04-08 Sunil Kopparapu, Devanuj, Akhilesh Srivastava, P. V. S. Rao

Envisioning Future Interactive Web Development: Editing Webpage with Natural Language

The evolution of web applications relies on iterative code modifications, a process that is traditionally manual and time-consuming. While Large Language Models (LLMs) can generate UI code, their ability to edit existing code from new…

Software Engineering · Computer Science 2025-10-31 Truong Hai Dang, Jingyu Xiao, Yintong Huo

From Text to Knowledge with Graphs: modelling, querying and exploiting textual content

This paper highlights the challenges, current trends, and open issues related to the representation, querying and analytics of content extracted from texts. The internet contains vast text-based information on various subjects, including…

Databases · Computer Science 2023-10-11 Genoveva Vargas-Solar, Mirian Halfeld Ferrari Alves, Anne-Lyse Minard Forst

Content Selection in Data-to-Text Systems: A Survey

Data-to-text systems are powerful in generating reports from data automatically and thus they simplify the presentation of complex data. Rather than presenting data using visualisation techniques, data-to-text systems use natural (human)…

Computation and Language · Computer Science 2016-10-27 Dimitra Gkatzia

reconCTI: A Proactive Approach to Cyber-Threat Intelligence

The rapid advancement of information technology has introduced a noticeable shift from traditional offline practices to more efficient and interconnected online environments. This transition, while offering convenience, has also increased…

Cryptography and Security · Computer Science 2026-05-20 Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed, Ameer Al-Nemrat

Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions

This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors…

Computation and Language · Computer Science 2024-06-25 Max Dallabetta, Conrad Dobberstein, Adrian Breiding, Alan Akbik

InkSight: Leveraging Sketch Interaction for Documenting Chart Findings in Computational Notebooks

Computational notebooks have become increasingly popular for exploratory data analysis due to their ability to support data exploration and explanation within a single document. Effective documentation for explaining chart findings during…

Human-Computer Interaction · Computer Science 2023-07-18 Yanna Lin, Haotian Li, Leni Yang, Aoyu Wu, Huamin Qu

Cognitive Simplification Operations Improve Text Simplification

Text Simplification (TS) is the task of converting a text into a form that is easier to read while maintaining the meaning of the original text. A sub-task of TS is Cognitive Simplification (CS), converting text to a form that is readily…

Computation and Language · Computer Science 2022-11-17 Eytan Chamovitz, Omri Abend

TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable…

Computation and Language · Computer Science 2026-05-11 Hanqing Zhao