Related papers: Automatic Document Selection for Efficient Encoder…

Data Selection with Cluster-Based Language Difference Models and Cynical Selection

We present and apply two methods for addressing the problem of selecting relevant training data out of a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models, we…

Computation and Language · Computer Science 2019-04-11 Lucía Santamaría, Amittai Axelrod

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on…

Computation and Language · Computer Science 2021-09-16 Vin Sachidananda, Jason S. Kessler, Yi-an Lai

Large Language Model-guided Document Selection

Large Language Model (LLM) pre-training exhausts an ever growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts…

Computation and Language · Computer Science 2024-06-10 Xiang Kong, Tom Gunter, Ruoming Pang

Classifying Long Clinical Documents with Pre-trained Transformers

Automatic phenotyping is a task of identifying cohorts of patients that match a predefined set of criteria. Phenotyping typically involves classifying long clinical documents that contain thousands of tokens. At the same time, recent…

Computation and Language · Computer Science 2021-05-17 Xin Su, Timothy Miller, Xiyu Ding, Majid Afshar, Dmitriy Dligach

Need a Small Specialized Language Model? Plan Early!

Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a…

Machine Learning · Computer Science 2024-11-01 David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

Downstream Datasets Make Surprisingly Good Pretraining Corpora

For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent…

Computation and Language · Computer Science 2023-05-29 Kundan Krishna, Saurabh Garg, Jeffrey P. Bigham, Zachary C. Lipton

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient…

Computation and Language · Computer Science 2025-08-05 Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Large pre-trained models have achieved great success in many natural language processing tasks. However, when they are applied in specific domains, these models suffer from domain shift and bring challenges in fine-tuning and online serving…

Computation and Language · Computer Science 2021-06-30 Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, Furu Wei

In-context Pretraining: Language Modeling Beyond Document Boundaries

Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining…

Computation and Language · Computer Science 2024-06-25 Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Gergely Szilvasy, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis

Self-training Improves Pre-training for Natural Language Understanding

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a…

Computation and Language · Computer Science 2020-10-06 Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Ves Stoyanov, Alexis Conneau

Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models

Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on a general corpus and then fine-tuned on…

Computation and Language · Computer Science 2024-07-15 Mohammadreza Tayaranian, Seyyed Hasan Mozafari, Brett H. Meyer, James J. Clark, Warren J. Gross

MC-BERT: Efficient Language Pre-Training via a Meta Controller

Pre-trained contextual representations (e.g., BERT) have become the foundation to achieve state-of-the-art results on many NLP tasks. However, large-scale pre-training is computationally expensive. ELECTRA, an early attempt to accelerate…

Computation and Language · Computer Science 2020-06-17 Zhenhui Xu, Linyuan Gong, Guolin Ke, Di He, Shuxin Zheng, Liwei Wang, Jiang Bian, Tie-Yan Liu

A Compact Pretraining Approach for Neural Language Models

Domain adaptation for large neural language models (NLMs) is coupled with massive amounts of unstructured data in the pretraining phase. In this study, however, we show that pretrained NLMs learn in-domain information more effectively and…

Computation and Language · Computer Science 2022-08-30 Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour

Synthetic continued pretraining

Pretraining on large-scale, unstructured internet text enables language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient--to learn a given fact, models must be trained on…

Machine Learning · Computer Science 2024-10-04 Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto

Unsupervised Data Selection via Discrete Speech Representation for ASR

Self-supervised learning of speech representations has achieved impressive results in improving automatic speech recognition (ASR). In this paper, we show that data selection is important for self-supervised learning. We propose a simple…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-06 Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani

Large Language Models are Demonstration Pre-Selectors for Themselves

In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores…

Computation and Language · Computer Science 2025-06-09 Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving…

Computer Vision and Pattern Recognition · Computer Science 2024-03-21 Siddharth Joshi, Arnav Jain, Ali Payani, Baharan Mirzasoleiman

DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens which a large target model verifies once per speculation step. As vocabularies scale past 10e5 tokens, verification cost in the target model is…

Computation and Language · Computer Science 2026-02-04 Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Pretraining sentence encoders with language modeling and related unsupervised tasks has recently been shown to be very effective for language understanding tasks. By supplementing language model-style pretraining with further training on…

Computation and Language · Computer Science 2019-03-01 Jason Phang, Thibault Févry, Samuel R. Bowman

Language Models Improve When Pretraining Data Matches Target Tasks

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…

Computation and Language · Computer Science 2025-07-17 David Mizrahi, Anders Boesen Lindbo Larsen, Jesse Allardice, Suzie Petryk, Yuri Gorokhov, Jeffrey Li, Alex Fang, Josh Gardner, Tom Gunter, Afshin Dehghan