Related papers: TSDS: Data Selection for Task-Specific Model Finet…

Task-Specific Data Selection for Instruction Tuning via Monosemantic Neuronal Activations

Instruction tuning improves the ability of large language models (LLMs) to follow diverse human instructions, but achieving strong performance on specific target tasks remains challenging. A critical bottleneck is selecting the most…

Machine Learning · Computer Science · 2025-05-19 · Da Ma, Gonghu Shang, Zhi Chen, Libo Qin, Yijie Luo, Lei Pan, Shuai Fan, Lu Chen, Kai Yu

MoDS: Model-oriented Data Selection for Instruction Tuning

Instruction tuning has become the de facto method to equip large language models (LLMs) with the ability of following user instructions. Usually, hundreds of thousands or millions of instruction-following pairs are employed to fine-tune the…

Computation and Language · Computer Science · 2023-11-28 · Qianlong Du, Chengqing Zong, Jiajun Zhang

Instruction Matters: A Simple yet Effective Task Selection for Optimized Instruction Tuning of Specific Tasks

Instruction tuning has been proven effective in enhancing zero-shot generalization across various tasks and in improving the performance of specific tasks. For task-specific improvements, strategically selecting and training on related…

Computation and Language · Computer Science · 2024-10-18 · Changho Lee, Janghoon Han, Seonghyeon Ye, Stanley Jungkyu Choi, Honglak Lee, Kyunghoon Bae

Selecting task with optimal transport self-supervised learning for few-shot classification

Few-shot classification aims at solving problems in which only a few samples are available in the training process. Due to the lack of samples, researchers generally employ a set of training tasks from other domains to assist the target task,…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-04 · Renjie Xu, Xinghao Yang, Baodi Liu, Kai Zhang, Weifeng Liu

Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach

We study the problem of fine-tuning a language model (LM) for a target task by optimally using the information from $n$ auxiliary tasks. This problem has broad applications in NLP, such as targeted instruction tuning and data selection in…

Computation and Language · Computer Science · 2025-06-03 · Dongyue Li, Ziniu Zhang, Lu Wang, Hongyang R. Zhang

Efficient Training of Deep Networks using Guided Spectral Data Selection: A Step Toward Learning What You Need

Effective data curation is essential for optimizing neural network training. In this paper, we present the Guided Spectrally Tuned Data Selection (GSTDS) algorithm, which dynamically adjusts the subset of data points used for training using…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-08 · Mohammadreza Sharifi, Ahad Harati

RL-Guided Data Selection for Language Model Finetuning

Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally…

Machine Learning · Computer Science · 2025-10-01 · Animesh Jha, Harshit Gupta, Ananjan Nandi

ProDS: Preference-oriented Data Selection for Instruction Tuning

Instruction data selection aims to identify a high-quality subset from the training set that matches or exceeds the performance of the full dataset on target tasks. Existing methods focus on the instruction-to-response mapping, but neglect…

Machine Learning · Computer Science · 2025-05-20 · Wenya Guo, Zhengkun Zhang, Xumeng Liu, Ying Zhang, Ziyu Lu, Haoze Zhu, Xubo Liu, Ruxue Yan

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired…

Machine Learning · Computer Science · 2024-05-07 · Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia

Data-Efficient Finetuning Using Cross-Task Nearest Neighbors

Obtaining labeled data to train a model for a task of interest is often expensive. Prior work shows training models on multitask data augmented with task descriptions (prompts) effectively transfers knowledge to new tasks. Towards…

Computation and Language · Computer Science · 2023-05-26 · Hamish Ivison, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science · 2024-01-24 · Logan Engstrom, Axel Feldmann, Aleksander Madry

SwiftTS: A Swift Selection Framework for Time Series Pre-trained Models via Multi-task Meta-Learning

Pre-trained models exhibit strong generalization to various downstream tasks. However, given the numerous models available in the model hub, identifying the most suitable one by individually fine-tuning is time-consuming. In this paper, we…

Machine Learning · Computer Science · 2026-03-10 · Tengxue Zhang, Biao Ouyang, Yang Shu, Xinyang Chen, Chenjuan Guo, Bin Yang

TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training

Large-scale multimodal pre-trained models like CLIP rely heavily on high-quality training data, yet raw web-crawled datasets are often noisy, misaligned, and redundant, leading to inefficient training and suboptimal generalization. Existing…

Machine Learning · Computer Science · 2026-02-06 · Guanjie Cheng, Boyi Li, Lingyu Sun, Mengying Zhu, Yangyang Wu, Xinkui Zhao, Shuiguang Deng

Language Models Improve When Pretraining Data Matches Target Tasks

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…

Computation and Language · Computer Science · 2025-07-17 · David Mizrahi, Anders Boesen Lindbo Larsen, Jesse Allardice, Suzie Petryk, Yuri Gorokhov, Jeffrey Li, Alex Fang, Josh Gardner, Tom Gunter, Afshin Dehghan

Task-Specific Skill Localization in Fine-tuned Language Models

Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-shot settings. Thus fine-tuning allows the model to quickly pick up task-specific "skills," but there has been limited study of where these…

Computation and Language · Computer Science · 2023-07-04 · Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, Sanjeev Arora

Transductive Data-Selection Algorithms for Fine-Tuning Neural Machine Translation

Machine Translation models are trained to translate a variety of documents from one language into another. However, models trained specifically for particular characteristics of the documents tend to perform better. Fine-tuning is a…

Computation and Language · Computer Science · 2019-10-09 · Alberto Poncelas, Gideon Maillette de Buy Wenniger, Andy Way

TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning

Large pre-trained models achieve remarkable performance in vision tasks but are impractical for fine-tuning due to high computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-31 · Siqi Luo, Haoran Yang, Yi Xin, Mingyang Yi, Guangyang Wu, Guangtao Zhai, Xiaohong Liu

Optimizing Data Usage via Differentiable Rewards

To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, informs them of how much attention they should pay to particular content or practice problems. Similarly, a machine learning model…

Machine Learning · Computer Science · 2021-06-18 · Xinyi Wang, Hieu Pham, Paul Michel, Antonios Anastasopoulos, Jaime Carbonell, Graham Neubig

ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning

Instruction tuning has underscored the significant potential of large language models (LLMs) in producing more human controllable and effective outputs in various domains. In this work, we focus on the data selection problem for…

Machine Learning · Computer Science · 2025-09-01 · Yang Wu, Huayi Zhang, Yizheng Jiao, Lin Ma, Xiaozhong Liu, Jinhong Yu, Dongyu Zhang, Dezhi Yu, Wei Xu

On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation

Domain adaptation of neural networks commonly relies on three training phases: pretraining, selected data training and then fine tuning. Data selection improves target domain generalization by training further on pretraining data identified…

Computation and Language · Computer Science · 2021-09-17 · Dan Iter, David Grangier