Related papers: Compute-Constrained Data Selection

A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)

Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on…

Machine Learning · Computer Science 2026-02-17 Nihal V. Nayak , Paula Rodriguez-Diaz , Neha Hulkund , Sara Beery , David Alvarez-Melis

RL-Guided Data Selection for Language Model Finetuning

Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally…

Machine Learning · Computer Science 2025-10-01 Animesh Jha , Harshit Gupta , Ananjan Nandi

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired…

Machine Learning · Computer Science 2024-05-07 Feiyang Kang , Hoang Anh Just , Yifan Sun , Himanshu Jahagirdar , Yuanzhi Zhang , Rongxing Du , Anit Kumar Sahu , Ruoxi Jia

Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But…

Computation and Language · Computer Science 2024-06-11 Ming Li , Yong Zhang , Shwai He , Zhitao Li , Hongyu Zhao , Jianzong Wang , Ning Cheng , Tianyi Zhou

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom , Axel Feldmann , Aleksander Madry

Data-efficient LLM Fine-tuning for Code Generation

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically…

Computation and Language · Computer Science 2025-04-18 Weijie Lv , Xuan Xia , Sheng-Jun Huang

Data Selection for LLM Alignment Using Fine-Grained Preferences

Large language models (LLMs) alignment aims to ensure that the behavior of LLMs meets human preferences. While collecting data from multiple fine-grained, aspect-specific preferences becomes more and more feasible, existing alignment…

Machine Learning · Computer Science 2026-03-03 Jia Zhang , Yao Liu , Chen-Xi Zhang , Yi Liu , Yi-Xuan Jin , Lan-Zhe Guo , Yu-Feng Li

A Survey on Data Selection for LLM Instruction Tuning

Instruction tuning is a vital step of training large language models (LLMs), so how to enhance the effect of instruction tuning has received increased attention. Existing works indicate that the quality of the dataset is more crucial than…

Computation and Language · Computer Science 2025-08-27 Bolin Zhang , Jiahao Wang , Qianlong Du , Jiajun Zhang , Zhiying Tu , Dianhui Chu

Data-efficient Fine-tuning for LLM-based Recommendation

Leveraging Large Language Models (LLMs) for recommendation has recently garnered considerable attention, where fine-tuning plays a key role in LLMs' adaptation. However, the cost of fine-tuning LLMs on rapidly expanding recommendation data…

Information Retrieval · Computer Science 2024-06-05 Xinyu Lin , Wenjie Wang , Yongqi Li , Shuo Yang , Fuli Feng , Yinwei Wei , Tat-Seng Chua

LASER: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy

Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this…

Computation and Language · Computer Science 2025-09-25 Paramita Mirza , Lucas Weber , Fabian Küch

Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Instruction tuning for large language models (LLMs) has gained attention from researchers due to its ability to unlock the potential of LLMs in following instructions. While instruction tuning offers advantages for facilitating the…

Artificial Intelligence · Computer Science 2023-05-17 Hao Chen , Yiming Zhang , Qi Zhang , Hantao Yang , Xiaomeng Hu , Xuetao Ma , Yifan Yanggong , Junbo Zhao

A Survey on Data Selection for Language Models

A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as…

Computation and Language · Computer Science 2024-08-05 Alon Albalak , Yanai Elazar , Sang Michael Xie , Shayne Longpre , Nathan Lambert , Xinyi Wang , Niklas Muennighoff , Bairu Hou , Liangming Pan , Haewon Jeong , Colin Raffel , Shiyu Chang , Tatsunori Hashimoto , William Yang Wang

Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning

Recent work targeting large language models (LLMs) for code generation demonstrated that increasing the amount of training data through synthetic code generation often leads to exceptional performance. In this paper we explore data pruning…

Software Engineering · Computer Science 2024-07-09 Yun-Da Tsai , Mingjie Liu , Haoxing Ren

How to Train Data-Efficient LLMs

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data…

Machine Learning · Computer Science 2024-02-16 Noveen Sachdeva , Benjamin Coleman , Wang-Cheng Kang , Jianmo Ni , Lichan Hong , Ed H. Chi , James Caverlee , Julian McAuley , Derek Zhiyuan Cheng

Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection

Data selection in instruction tuning emerges as a pivotal process for acquiring high-quality data and training instruction-following large language models (LLMs), but it is still a new and unexplored research area for vision-language models…

Computation and Language · Computer Science 2024-02-21 Ruibo Chen , Yihan Wu , Lichang Chen , Guodong Liu , Qi He , Tianyi Xiong , Chenxi Liu , Junfeng Guo , Heng Huang

Exploring Data Redundancy in Real-world Image Classification through Data Selection

Deep learning models often require large amounts of data for training, leading to increased costs. It is particularly challenging in medical imaging, i.e., gathering distributed data for centralized training, and meanwhile, obtaining…

Computer Vision and Pattern Recognition · Computer Science 2023-06-27 Zhenyu Tang , Shaoting Zhang , Xiaosong Wang

Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research…

Computation and Language · Computer Science 2025-02-25 Ziche Liu , Rui Ke , Yajiao Liu , Feng Jiang , Haizhou Li

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web…

Computation and Language · Computer Science 2023-09-12 Max Marion , Ahmet Üstün , Luiza Pozzobon , Alex Wang , Marzieh Fadaee , Sara Hooker

Accelerating data-driven algorithm selection for combinatorial partitioning problems

Data-driven algorithm selection is a powerful approach for choosing effective heuristics for computational problems. It operates by evaluating a set of candidate algorithms on a collection of representative training instances and selecting…

Machine Learning · Computer Science 2025-12-04 Vaggos Chatziafratis , Ishani Karmarkar , Yingxi Li , Ellen Vitercik

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from the larger pool, such that…

Computation and Language · Computer Science 2024-12-10 Tingyu Xia , Bowen Yu , Kai Dang , An Yang , Yuan Wu , Yuan Tian , Yi Chang , Junyang Lin