Related papers: Automatic Document Selection for Efficient Encoder…

200 papers

We present and apply two methods for addressing the problem of selecting relevant training data out of a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models, we…

Computation and Language · Computer Science 2019-04-11 Lucía Santamaría, Amittai Axelrod

Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on…

Computation and Language · Computer Science 2021-09-16 Vin Sachidananda, Jason S. Kessler, Yi-an Lai

Large Language Model (LLM) pre-training exhausts an ever growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts…

Computation and Language · Computer Science 2024-06-10 Xiang Kong, Tom Gunter, Ruoming Pang

Automatic phenotyping is a task of identifying cohorts of patients that match a predefined set of criteria. Phenotyping typically involves classifying long clinical documents that contain thousands of tokens. At the same time, recent…

Computation and Language · Computer Science 2021-05-17 Xin Su, Timothy Miller, Xiyu Ding, Majid Afshar, Dmitriy Dligach

Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a…

Machine Learning · Computer Science 2024-11-01 David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent…

Computation and Language · Computer Science 2023-05-29 Kundan Krishna, Saurabh Garg, Jeffrey P. Bigham, Zachary C. Lipton

Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient…

Computation and Language · Computer Science 2025-08-05 Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He

Large pre-trained models have achieved great success in many natural language processing tasks. However, when they are applied in specific domains, these models suffer from domain shift and bring challenges in fine-tuning and online serving…

Computation and Language · Computer Science 2021-06-30 Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, Furu Wei

Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining…

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a…

Computation and Language · Computer Science 2020-10-06 Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Ves Stoyanov, Alexis Conneau

Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on a general corpus and then fine-tuned on…

Computation and Language · Computer Science 2024-07-15 Mohammadreza Tayaranian, Seyyed Hasan Mozafari, Brett H. Meyer, James J. Clark, Warren J. Gross

Pre-trained contextual representations (e.g., BERT) have become the foundation to achieve state-of-the-art results on many NLP tasks. However, large-scale pre-training is computationally expensive. ELECTRA, an early attempt to accelerate…

Computation and Language · Computer Science 2020-06-17 Zhenhui Xu, Linyuan Gong, Guolin Ke, Di He, Shuxin Zheng, Liwei Wang, Jiang Bian, Tie-Yan Liu

Domain adaptation for large neural language models (NLMs) is coupled with massive amounts of unstructured data in the pretraining phase. In this study, however, we show that pretrained NLMs learn in-domain information more effectively and…

Computation and Language · Computer Science 2022-08-30 Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour

Pretraining on large-scale, unstructured internet text enables language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient--to learn a given fact, models must be trained on…

Machine Learning · Computer Science 2024-10-04 Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto

Self-supervised learning of speech representations has achieved impressive results in improving automatic speech recognition (ASR). In this paper, we show that data selection is important for self-supervised learning. We propose a simple…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-06 Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani

In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores…

Computation and Language · Computer Science 2025-06-09 Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving…

Computer Vision and Pattern Recognition · Computer Science 2024-03-21 Siddharth Joshi, Arnav Jain, Ali Payani, Baharan Mirzasoleiman

Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens which a large target model verifies once per speculation step. As vocabularies scale past 10e5 tokens, verification cost in the target model is…

Computation and Language · Computer Science 2026-02-04 Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Pretraining sentence encoders with language modeling and related unsupervised tasks has recently been shown to be very effective for language understanding tasks. By supplementing language model-style pretraining with further training on…

Computation and Language · Computer Science 2019-03-01 Jason Phang, Thibault Févry, Samuel R. Bowman

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…
