Related papers: LMD3: Language Model Data Density Dependence

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's…

Computation and Language · Computer Science 2023-11-16 Gregory Yauney, Emily Reif, David Mimno

Analyzing Persuasive Strategies in Meme Texts: A Fusion of Language Models with Paraphrase Enrichment

This paper describes our approach to hierarchical multi-label detection of persuasion techniques in meme texts. Our model, developed as a part of the recent SemEval task, is based on fine-tuning individual language models (BERT,…

Computation and Language · Computer Science 2024-07-04 Kota Shamanth Ramanath Nayak, Leila Kosseim

Paraphrase Types Elicit Prompt Engineering Capabilities

Much of the success of modern language models depends on finding a suitable prompt to instruct the model. Until now, it has been largely unknown how variations in the linguistic expression of prompts affect these models. This study…

Computation and Language · Computer Science 2026-02-17 Jan Philip Wahle, Terry Ruas, Yang Xu, Bela Gipp

Third-Party Language Model Performance Prediction from Instruction

Language model-based instruction-following systems have lately shown increasing performance on many benchmark tasks, demonstrating the capability of adapting to a broad variety of instructions. However, such systems are often not designed…

Computation and Language · Computer Science 2024-03-20 Rahul Nadkarni, Yizhong Wang, Noah A. Smith

A Little Pretraining Goes a Long Way: A Case Study on Dependency Parsing Task for Low-resource Morphologically Rich Languages

Neural dependency parsing has achieved remarkable performance for many domains and languages. The bottleneck of massive labeled data limits the effectiveness of these approaches for low resource languages. In this work, we focus on…

Computation and Language · Computer Science 2021-04-13 Jivnesh Sandhan, Amrith Krishna, Ashim Gupta, Laxmidhar Behera, Pawan Goyal

Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training

Recently published work on rephrasing natural text data for pre-training LLMs has shown promising results when combining the original dataset with the synthetically rephrased data. We build upon previous work by replicating existing results…

Computation and Language · Computer Science 2024-10-29 Michael Pieler, Marco Bellagente, Hannah Teufel, Duy Phung, Nathan Cooper, Jonathan Tow, Paulo Rocha, Reshinth Adithyan, Zaid Alyafeai, Nikhil Pinnaparaju, Maksym Zhuravinskyi, Carlos Riquelme

Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study

Large Language Models (LLMs) are highly vulnerable to input perturbations, as even a small prompt change may result in a substantially different output. Existing methods to enhance LLM robustness are primarily focused on perturbed data…

Computation and Language · Computer Science 2025-04-04 Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Marek Rei

Demystifying Prompts in Language Models via Perplexity Estimation

Language models can be prompted to perform a wide variety of zero- and few-shot learning problems. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens or how to pick the best…

Computation and Language · Computer Science 2024-09-16 Hila Gonen, Srini Iyer, Terra Blevins, Noah A. Smith, Luke Zettlemoyer

Diversity-oriented Data Augmentation with Large Language Models

Data augmentation is an essential technique in natural language processing (NLP) for enriching training datasets by generating diverse samples. This process is crucial for improving the robustness and generalization capabilities of NLP…

Computation and Language · Computer Science 2025-10-16 Zaitian Wang, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu, Pengfei Wang, Yuanchun Zhou

How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks

We investigate the performance of large language models on repetitive deterministic prediction tasks and study how the sequence accuracy rate scales with output length. Each such task involves repeating the same operation n times. Examples…

Artificial Intelligence · Computer Science 2025-11-25 Wanda Hou, Leon Zhou, Hong-Ye Hu, Yubei Chen, Yi-Zhuang You, Xiao-Liang Qi

Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities

The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could…

Computation and Language · Computer Science 2025-02-03 Yaping Chai, Haoran Xie, Joe S. Qin

Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data…

Computation and Language · Computer Science 2024-02-21 Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin

Investigating Training and Generalization in Faithful Self-Explanations of Large Language Models

Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models'…

Computation and Language · Computer Science 2025-12-09 Tomoki Doi, Masaru Isonuma, Hitomi Yanaka

Training Bilingual LMs with Data Constraints in the Targeted Language

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high…

Computation and Language · Computer Science 2025-02-07 Skyler Seto, Maartje ter Hoeve, Richard He Bai, Natalie Schluter, David Grangier

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model…

Computation and Language · Computer Science 2024-02-07 Lukáš Mikula, Michal Štefánik, Marek Petrovič, Petr Sojka

Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models

Long-context modeling capabilities are important for large language models (LLMs) in various applications. However, directly training LLMs with long context windows is insufficient to enhance this capability since some training samples do…

Computation and Language · Computer Science 2024-05-29 Longze Chen, Ziqiang Liu, Wanwei He, Yunshui Li, Run Luo, Min Yang

Compact Example-Based Explanations for Language Models

Training data influence estimation methods quantify the contribution of training documents to a model's output, making them a promising source of information for example-based explanations. As humans cannot interpret thousands of documents,…

Computation and Language · Computer Science 2026-04-10 Loris Schoenegger, Benjamin Roth

Self-Influence Guided Data Reweighting for Language Model Pre-training

Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks. Once the pre-training corpus has been assembled, all data samples in the…

Computation and Language · Computer Science 2023-11-03 Megh Thakkar, Tolga Bolukbasi, Sriram Ganapathy, Shikhar Vashishth, Sarath Chandar, Partha Talukdar

Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs

Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance…

Machine Learning · Computer Science 2025-07-16 Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang

Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language

We are exposed to much information trying to influence us, such as teaser messages, debates, politically framed news, and propaganda - all of which use persuasive language. With the recent interest in Large Language Models (LLMs), we study…

Computation and Language · Computer Science 2025-02-24 Amalie Brogaard Pauli, Isabelle Augenstein, Ira Assent