Related papers: Removing Manually-Generated Boilerplate from Elect…

200 papers

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full…

Computation and Language · Computer Science 2018-12-20 Martin Gerlach, Francesc Font-Clos

Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information, such as publisher copyright statements, section…

Computation and Language · Computer Science 2026-01-01 Michael E. Rose, Nils A. Herrmann, Sebastian Erhardt

Automatic text generation based on neural language models has achieved performance levels that make the generated text almost indistinguishable from those written by humans. Despite the value that text generation can have in various…

Computation and Language · Computer Science 2022-05-02 Vijini Liyanage, Davide Buscaldi, Adeline Nazarenko

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, cannot often understand the semantic context and deep learning…

Machine Learning · Computer Science 2025-02-25 Ashlesha Akella, Krishnasuri Narayanam

This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts. We demonstrate the effectiveness of our approach with real-world online texts. Addressing the…

Computation and Language · Computer Science 2024-02-19 Lars Klöser, Mika Beele, Jan-Niklas Schagen, Bodo Kraft

The Gutenberg Literary English Corpus (GLEC) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics. However, so far only a small subcorpus, the Gutenberg English…

Computation and Language · Computer Science 2020-10-22 Arthur M. Jacobs, Annette Kinder

In high-stakes English Language Assessments, low-skill test takers may employ memorized materials called "templates" on essay questions to "game" or fool the automated scoring system. In this study, we introduce the automated detection…

Computation and Language · Computer Science 2025-09-11 Yashad Samant, Lee Becker, Scott Hellman, Bradley Behan, Sarah Hughes, Joshua Southerland

Embedding models group text by semantic content, what text is about. We show that temporal co-occurrence within texts discovers a different kind of structure: recurrent transition-structure concepts or what text does. We train a…

Artificial Intelligence · Computer Science 2026-03-20 Jason Dury

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this…

Information Retrieval · Computer Science 2018-03-28 Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff

Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially in low-resource languages other than English. Yet, a few research studies have studied the three Arabic Wikipedia…

Computation and Language · Computer Science 2024-04-02 Saied Alshahrani, Hesham Haroon, Ali Elfilali, Mariama Njie, Jeanna Matthews

The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing.…

Machine Learning · Computer Science 2020-04-30 Jurek Leonhardt, Avishek Anand, Megha Khosla

Technical documents contain a fair amount of unnatural language, such as tables, formulas, pseudo-codes, etc. Unnatural language can be an important factor of confusing existing NLP tools. This paper presents an effective method of…

Information Retrieval · Computer Science 2017-03-20 Myungha Jang, Jinho D. Choi, James Allan

Many communities, including the scientific community, develop implicit writing norms. Understanding them is crucial for effective communication with that community. Writers gradually develop an implicit understanding of norms by reading…

Human-Computer Interaction · Computer Science 2025-03-18 Hai Dang, Chelse Swoopes, Daniel Buschek, Elena L. Glassman

Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled…

Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions…

Computation and Language · Computer Science 2026-01-26 Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp, Jianfei Yang, Yao Lu

Recent neural approaches to data-to-text generation have mostly focused on improving content fidelity while lacking explicit control over writing styles (e.g., word choices, sentence structures). More traditional systems use templates to…

Computation and Language · Computer Science 2020-10-12 Shuai Lin, Wentao Wang, Zichao Yang, Xiaodan Liang, Frank F. Xu, Eric Xing, Zhiting Hu

Under special circumstances, summaries should conform to a particular style with patterns, such as court judgments and abstracts in academic papers. To this end, the prototype document-summary pairs can be utilized to generate better…

Computation and Language · Computer Science 2019-09-20 Shen Gao, Xiuying Chen, Piji Li, Zhangming Chan, Dongyan Zhao, Rui Yan

Classic Topic Models are built under the Bag Of Words assumption, in which word position is ignored for simplicity. Besides, symmetric priors are typically used in most applications. In order to easily learn topics with different properties…

Computation and Language · Computer Science 2018-06-27 Simón Roca-Sotelo, Jerónimo Arenas-García

Most real-world document collections involve various types of metadata, such as author, source, and date, and yet the most commonly-used approaches to modeling text corpora ignore this information. While specialized models have been…

Machine Learning · Statistics 2018-10-25 Dallas Card, Chenhao Tan, Noah A. Smith

Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by…

Computation and Language · Computer Science 2021-01-25 Richard Csaky, Gabor Recski