Related papers: Entropy-Based Data Selection for Language Models

200 papers

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research…

Computation and Language · Computer Science 2025-02-25 Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li

Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally…

Machine Learning · Computer Science 2025-10-01 Animesh Jha, Harshit Gupta, Ananjan Nandi

A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as…

Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most methods concentrate on…

Machine Learning · Computer Science 2024-07-12 Mingjia Yin, Chuhan Wu, Yufei Wang, Hao Wang, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen

This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved…

Computation and Language · Computer Science 2025-03-20 Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang

Recently, Large Language Models (LLMs) have demonstrated outstanding performance across a wide range of downstream language tasks. Temperature sampling is a commonly used decoding strategy for LLMs' generation process. However, a fixed…

Computation and Language · Computer Science 2024-04-04 Shimao Zhang, Yu Bao, Shujian Huang

Balanced and efficient information flow is essential for optimizing language generation models. In this work, we propose Entropy-UID, a new token selection method that balances entropy and Uniform Information Density (UID) principles for…

Computation and Language · Computer Science 2025-02-21 Xinpeng Shou

Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for…

Computation and Language · Computer Science 2025-09-09 Jian Wu, Hang Yu, Bingchang Liu, Wenjie Yang, Peng Di, Jianguo Li, Yue Zhang

Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias…

Machine Learning · Computer Science 2026-02-03 Heming Zou, Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji

Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained LLMs provide strong foundations, effective…

Computation and Language · Computer Science 2025-10-29 Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer

Low-resourced data presents a significant challenge for neural machine translation. In most cases, the low-resourced environment is caused by high costs due to the need for domain experts or the lack of language experts. Therefore,…

Computation and Language · Computer Science 2024-05-22 Seunghyun Ji, Hagai Raja Sinulingga, Darongsae Kwon

Instruction tuning has become the de facto method to equip large language models (LLMs) with the ability of following user instructions. Usually, hundreds of thousands or millions of instruction-following pairs are employed to fine-tune the…

Computation and Language · Computer Science 2023-11-28 Qianlong Du, Chengqing Zong, Jiajun Zhang

This paper investigates the challenges and potential solutions for improving machine learning systems for low-resource languages. State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT), and…

Computation and Language · Computer Science 2024-10-11 Yurii Paniv

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the…

Machine Learning · Computer Science 2025-04-09 Junjie Oscar Yin, Alexander M. Rush

Selecting high-quality and diverse training samples from extensive datasets plays a crucial role in reducing training overhead and enhancing the performance of Large Language Models (LLMs). However, existing studies fall short in assessing…

Computation and Language · Computer Science 2025-10-14 Zhuo Li, Yuhao Du, Xiaoqi Jiao, Yiwen Guo, Yuege Feng, Xiang Wan, Anningzhe Gao, Jinpeng Hu

Compute-efficient training of language models has become an important issue. We consider data pruning for data-efficient training of LLMs. In this work, we consider a data pruning method based on information entropy. We propose that the…

Artificial Intelligence · Computer Science 2024-12-13 Minsang Kim, Seungjun Baek

Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often…

Computation and Language · Computer Science 2025-06-27 Zhengyan Shi

The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in…

Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost. Researchers can now revisit old questions and tackle novel ones with rich data. We provide an econometric framework for realizing this…

Econometrics · Economics 2025-12-08 Jens Ludwig, Sendhil Mullainathan, Ashesh Rambachan

Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing model pruning…

Computation and Language · Computer Science 2025-08-14 Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-Wei Kuo, Nan Guan, Chun Jason Xue