Related papers: Entropy-Based Data Selection for Language Models

Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research…

Computation and Language · Computer Science 2025-02-25 Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li

RL-Guided Data Selection for Language Model Finetuning

Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally…

Machine Learning · Computer Science 2025-10-01 Animesh Jha, Harshit Gupta, Ananjan Nandi

A Survey on Data Selection for Language Models

A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as…

Computation and Language · Computer Science 2024-08-05 Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

Entropy Law: The Story Behind Data Compression and LLM Performance

Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most methods concentrate on…

Machine Learning · Computer Science 2024-07-12 Mingjia Yin, Chuhan Wu, Yufei Wang, Hao Wang, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen

Data Selection via Optimal Control for Language Models

This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved…

Computation and Language · Computer Science 2025-03-20 Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang

EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling

Recently, Large Language Models (LLMs) have demonstrated outstanding performance across a wide range of downstream language tasks. Temperature sampling is a commonly used decoding strategy for LLMs' generation process. However, a fixed…

Computation and Language · Computer Science 2024-04-04 Shimao Zhang, Yu Bao, Shujian Huang

Entropy-UID: A Method for Optimizing Information Density

Balanced and efficient information flow is essential for optimizing language generation models. In this work, we propose Entropy-UID, a new token selection method that balances entropy and Uniform Information Density (UID) principles for…

Computation and Language · Computer Science 2025-02-21 Xinpeng Shou

LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection

Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for…

Computation and Language · Computer Science 2025-09-09 Jian Wu, Hang Yu, Bingchang Liu, Wenjie Yang, Peng Di, Jianguo Li, Yue Zhang

Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias…

Machine Learning · Computer Science 2026-02-03 Heming Zou, Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji

Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide

Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained LLMs provide strong foundations, effective…

Computation and Language · Computer Science 2025-10-29 Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer

Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation

Low-resourced data presents a significant challenge for neural machine translation. In most cases, the low-resourced environment is caused by high costs due to the need for domain experts or the lack of language experts. Therefore,…

Computation and Language · Computer Science 2024-05-22 Seunghyun Ji, Hagai Raja Sinulingga, Darongsae Kwon

MoDS: Model-oriented Data Selection for Instruction Tuning

Instruction tuning has become the de facto method to equip large language models (LLMs) with the ability of following user instructions. Usually, hundreds of thousands or millions of instruction-following pairs are employed to fine-tune the…

Computation and Language · Computer Science 2023-11-28 Qianlong Du, Chengqing Zong, Jiajun Zhang

Unsupervised Data Validation Methods for Efficient Model Training

This paper investigates the challenges and potential solutions for improving machine learning systems for low-resource languages. State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT), and…

Computation and Language · Computer Science 2024-10-11 Yurii Paniv

Compute-Constrained Data Selection

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the…

Machine Learning · Computer Science 2025-04-09 Junjie Oscar Yin, Alexander M. Rush

Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm

Selecting high-quality and diverse training samples from extensive datasets plays a crucial role in reducing training overhead and enhancing the performance of Large Language Models (LLMs). However, existing studies fall short in assessing…

Computation and Language · Computer Science 2025-10-14 Zhuo Li, Yuhao Du, Xiaoqi Jiao, Yiwen Guo, Yuege Feng, Xiang Wan, Anningzhe Gao, Jinpeng Hu

Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy

Compute-efficient training of language models has become an important issue. We consider data pruning for data-efficient training of LLMs. In this work, we consider a data pruning method based on information entropy. We propose that the…

Artificial Intelligence · Computer Science 2024-12-13 Minsang Kim, Seungjun Baek

Optimising Language Models for Downstream Tasks: A Post-Training Perspective

Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often…

Computation and Language · Computer Science 2025-06-27 Zhengyan Shi

Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models

The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in…

Machine Learning · Computer Science 2024-12-31 Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Xinyuan Song, Carl Yang, Yue Cheng, Liang Zhao

Large Language Models: An Applied Econometric Framework

Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost. Researchers can now revisit old questions and tackle novel ones with rich data. We provide an econometric framework for realizing this…

Econometrics · Economics 2025-12-08 Jens Ludwig, Sendhil Mullainathan, Ashesh Rambachan

EvoP: Robust LLM Inference via Evolutionary Pruning

Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing model pruning…

Computation and Language · Computer Science 2025-08-14 Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-Wei Kuo, Nan Guan, Chun Jason Xue