Related papers: Removing Manually-Generated Boilerplate from Elect…

A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full…

Computation and Language · Computer Science 2018-12-20 Martin Gerlach, Francesc Font-Clos

Cleaning English Abstracts of Scientific Publications

Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information, such as publisher copyright statements, section…

Computation and Language · Computer Science 2026-01-01 Michael E. Rose, Nils A. Herrmann, Sebastian Erhardt

A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications

Automatic text generation based on neural language models has achieved performance levels that make the generated text almost indistinguishable from those written by humans. Despite the value that text generation can have in various…

Computation and Language · Computer Science 2022-05-02 Vijini Liyanage, Davide Buscaldi, Adeline Nazarenko

Data Wrangling Task Automation Using Code-Generating Language Models

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, cannot often understand the semantic context and deep learning…

Machine Learning · Computer Science 2025-02-25 Ashlesha Akella, Krishnasuri Narayanam

German Text Simplification: Finetuning Large Language Models with Semi-Synthetic Data

This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts. We demonstrate the effectiveness of our approach with real-world online texts. Addressing the…

Computation and Language · Computer Science 2024-02-19 Lars Klöser, Mika Beele, Jan-Niklas Schagen, Bodo Kraft

Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set

The Gutenberg Literary English Corpus (GLEC) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics. However, so far only a small subcorpus, the Gutenberg English…

Computation and Language · Computer Science 2020-10-22 Arthur M. Jacobs, Annette Kinder

Automatic Detection of Inauthentic Templated Responses in English Language Assessments

In high-stakes English Language Assessments, low-skill test takers may employ memorized materials called "templates" on essay questions to "game" or fool the automated scoring system. In this study, we introduce the automated detection…

Computation and Language · Computer Science 2025-09-11 Yashad Samant, Lee Becker, Scott Hellman, Bradley Behan, Sarah Hughes, Joshua Southerland

From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory

Embedding models group text by semantic content, what text is about. We show that temporal co-occurrence within texts discovers a different kind of structure: recurrent transition-structure concepts or what text does. We train a…

Artificial Intelligence · Computer Science 2026-03-20 Jason Dury

Web2Text: Deep Structured Boilerplate Removal

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this…

Information Retrieval · Computer Science 2018-03-28 Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff

Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition

Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially in low-resource languages other than English. Yet, a few research studies have studied the three Arabic Wikipedia…

Computation and Language · Computer Science 2024-04-02 Saied Alshahrani, Hesham Haroon, Ali Elfilali, Mariama Njie, Jeanna Matthews

Boilerplate Removal using a Neural Sequence Labeling Model

The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing.…

Machine Learning · Computer Science 2020-04-30 Jurek Leonhardt, Avishek Anand, Megha Khosla

Improving Document Clustering by Eliminating Unnatural Language

Technical documents contain a fair amount of unnatural language, such as tables, formulas, pseudo-codes, etc. Unnatural language can be an important factor of confusing existing NLP tools. This paper presents an effective method of…

Information Retrieval · Computer Science 2017-03-20 Myungha Jang, Jinho D. Choi, James Allan

CorpusStudio: Surfacing Emergent Patterns in a Corpus of Prior Work while Writing

Many communities, including the scientific community, develop implicit writing norms. Understanding them is crucial for effective communication with that community. Writers gradually develop an implicit understanding of norms by reading…

Human-Computer Interaction · Computer Science 2025-03-18 Hai Dang, Chelse Swoopes, Daniel Buschek, Elena L. Glassman

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled…

Computation and Language · Computer Science 2026-04-16 Joel Niklaus, Atsuki Yamaguchi, Michal Štefánik, Guilherme Penedo, Hynek Kydlíček, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf

The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining

Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions…

Computation and Language · Computer Science 2026-01-26 Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp, Jianfei Yang, Yao Lu

Data-to-Text Generation with Style Imitation

Recent neural approaches to data-to-text generation have mostly focused on improving content fidelity while lacking explicit control over writing styles (e.g., word choices, sentence structures). More traditional systems use templates to…

Computation and Language · Computer Science 2020-10-12 Shuai Lin, Wentao Wang, Zichao Yang, Xiaodan Liang, Frank F. Xu, Eric Xing, Zhiting Hu

How to Write Summaries with Patterns? Learning towards Abstractive Summarization through Prototype Editing

Under special circumstances, summaries should conform to a particular style with patterns, such as court judgments and abstracts in academic papers. To this end, the prototype document-summary pairs can be utilized to generate better…

Computation and Language · Computer Science 2019-09-20 Shen Gao, Xiuying Chen, Piji Li, Zhangming Chan, Dongyan Zhao, Rui Yan

Unveiling the semantic structure of text documents using paragraph-aware Topic Models

Classic Topic Models are built under the Bag Of Words assumption, in which word position is ignored for simplicity. Besides, symmetric priors are typically used in most applications. In order to easily learn topics with different properties…

Computation and Language · Computer Science 2018-06-27 Simón Roca-Sotelo, Jerónimo Arenas-García

Neural Models for Documents with Metadata

Most real-world document collections involve various types of metadata, such as author, source, and date, and yet the most commonly-used approaches to modeling text corpora ignore this information. While specialized models have been…

Machine Learning · Statistics 2018-10-25 Dallas Card, Chenhao Tan, Noah A. Smith

The Gutenberg Dialogue Dataset

Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by…

Computation and Language · Computer Science 2021-01-25 Richard Csaky, Gabor Recski