Related papers: Diversify and Conquer: Diversity-Centric Data Sele…

Boosting LLM via Learning from Data Iteratively and Selectively

Datasets nowadays are generally constructed from multiple sources and using different synthetic techniques, making data de-noising and de-duplication crucial before being used for post-training. In this work, we propose to perform…

Computation and Language · Computer Science 2024-12-24 Qi Jia, Siyu Ren, Ziheng Qin, Fuzhao Xue, Jinjie Ni, Yang You

Increasing Data Diversity with Iterative Sampling to Improve Performance

As a part of the Data-Centric AI Competition, we propose a data-centric approach to improve the diversity of the training samples by iterative sampling. The method itself relies strongly on the fidelity of augmented samples and the…

Machine Learning · Computer Science 2021-11-09 Devrim Cavusoglu, Ogulcan Eryuksel, Sinan Altinuc

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses. Although numerous instruction-tuning datasets have been…

Computation and Language · Computer Science 2024-10-18 Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao

Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research…

Computation and Language · Computer Science 2025-02-25 Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li

Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data…

Artificial Intelligence · Computer Science 2024-10-08 Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, Conghui He

LASER: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy

Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this…

Computation and Language · Computer Science 2025-09-25 Paramita Mirza, Lucas Weber, Fabian Küch

A Survey on Data Selection for LLM Instruction Tuning

Instruction tuning is a vital step of training large language models (LLMs), so how to enhance the effect of instruction tuning has received increased attention. Existing works indicate that the quality of the dataset is more crucial than…

Computation and Language · Computer Science 2025-08-27 Bolin Zhang, Jiahao Wang, Qianlong Du, Jiajun Zhang, Zhiying Tu, Dianhui Chu

Diversity Measurement and Subset Selection for Instruction Tuning Datasets

We aim to select data subsets for the fine-tuning of large language models to more effectively follow instructions. Prior work has emphasized the importance of diversity in dataset curation but relied on heuristics such as the number of…

Machine Learning · Computer Science 2024-02-07 Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, Rameswar Panda

Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets

Advancements in Large Language Models (LLMs) have significantly enhanced instruction-following capabilities. However, most Instruction Fine-Tuning (IFT) datasets are predominantly in English, limiting model performance in other languages.…

Computation and Language · Computer Science 2024-07-03 Sathish Reddy Indurthi, Wenxuan Zhou, Shamil Chollampatt, Ravi Agrawal, Kaiqiang Song, Lingxiao Zhao, Chenguang Zhu

G-DIG: Towards Gradient-based Diverse and High-quality Instruction Data Selection for Machine Translation

Large Language Models (LLMs) have demonstrated remarkable abilities in general scenarios. Instruction finetuning empowers them to align with humans in various tasks. Nevertheless, the Diversity and Quality of the instruction data remain two…

Computation and Language · Computer Science 2024-07-09 Xingyuan Pan, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Shanbo Cheng

Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection

Visual instruction datasets from various distributors are released at different times and often contain a significant number of semantically redundant text-image pairs, depending on their task compositions (i.e., skills) or reference…

Machine Learning · Computer Science 2025-03-25 Adyasha Maharana, Jaehong Yoon, Tianlong Chen, Mohit Bansal

LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning

Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as…

Machine Learning · Computer Science 2025-05-13 Xiaotian Lin, Yanlin Qi, Yizhang Zhu, Themis Palpanas, Chengliang Chai, Nan Tang, Yuyu Luo

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom, Axel Feldmann, Aleksander Madry

D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning

Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by…

Machine Learning · Computer Science 2025-05-20 Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, Lan-Zhe Guo

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning

Enhancing the instruction-following ability of Large Language Models (LLMs) primarily demands substantial instruction-tuning datasets. However, the sheer volume of these imposes a considerable computational burden and annotation cost. To…

Computation and Language · Computer Science 2023-11-15 Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, Chang Zhou

Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection…

Computation and Language · Computer Science 2025-11-07 Mohammad Amin Ghanizadeh, Mohammad Javad Dousti

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has…

Computation and Language · Computer Science 2024-10-24 Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin

Exploring Instruction Data Quality for Explainable Image Quality Assessment

In recent years, with the rapid development of powerful multimodal large language models (MLLMs), explainable image quality assessment (IQA) has gradually become popular, aiming at providing quality-related descriptions and answers of…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Yunhao Li, Sijing Wu, Huiyu Duan, Yucheng Zhu, Qi Jia, Guangtao Zhai

Rethinking Representativeness and Diversity in Dynamic Data Selection

Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric…

Artificial Intelligence · Computer Science 2026-03-06 Yuzhe Zhou, Zhenglin Hua, Haiyun Guo, Yuheng Jia