Related papers: Diversify and Conquer: Diversity-Centric Data Sele…

200 papers

Datasets nowadays are generally constructed from multiple sources and using different synthetic techniques, making data de-noising and de-duplication crucial before being used for post-training. In this work, we propose to perform…

Computation and Language · Computer Science 2024-12-24 Qi Jia, Siyu Ren, Ziheng Qin, Fuzhao Xue, Jinjie Ni, Yang You

As a part of the Data-Centric AI Competition, we propose a data-centric approach to improve the diversity of the training samples by iterative sampling. The method itself relies strongly on the fidelity of augmented samples and the…

Machine Learning · Computer Science 2021-11-09 Devrim Cavusoglu, Ogulcan Eryuksel, Sinan Altinuc

As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses. Although numerous instruction-tuning datasets have been…

Computation and Language · Computer Science 2024-10-18 Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research…

Computation and Language · Computer Science 2025-02-25 Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li

Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data…

Artificial Intelligence · Computer Science 2024-10-08 Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, Conghui He

Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this…

Computation and Language · Computer Science 2025-09-25 Paramita Mirza, Lucas Weber, Fabian Küch

Instruction tuning is a vital step of training large language models (LLMs), so how to enhance the effect of instruction tuning has received increased attention. Existing works indicate that the quality of the dataset is more crucial than…

Computation and Language · Computer Science 2025-08-27 Bolin Zhang, Jiahao Wang, Qianlong Du, Jiajun Zhang, Zhiying Tu, Dianhui Chu

We aim to select data subsets for the fine-tuning of large language models to more effectively follow instructions. Prior work has emphasized the importance of diversity in dataset curation but relied on heuristics such as the number of…

Machine Learning · Computer Science 2024-02-07 Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, Rameswar Panda

Advancements in Large Language Models (LLMs) have significantly enhanced instruction-following capabilities. However, most Instruction Fine-Tuning (IFT) datasets are predominantly in English, limiting model performance in other languages.…

Computation and Language · Computer Science 2024-07-03 Sathish Reddy Indurthi, Wenxuan Zhou, Shamil Chollampatt, Ravi Agrawal, Kaiqiang Song, Lingxiao Zhao, Chenguang Zhu

Large Language Models (LLMs) have demonstrated remarkable abilities in general scenarios. Instruction finetuning empowers them to align with humans in various tasks. Nevertheless, the Diversity and Quality of the instruction data remain two…

Computation and Language · Computer Science 2024-07-09 Xingyuan Pan, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Shanbo Cheng

Visual instruction datasets from various distributors are released at different times and often contain a significant number of semantically redundant text-image pairs, depending on their task compositions (i.e., skills) or reference…

Machine Learning · Computer Science 2025-03-25 Adyasha Maharana, Jaehong Yoon, Tianlong Chen, Mohit Bansal

Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as…

Machine Learning · Computer Science 2025-05-13 Xiaotian Lin, Yanlin Qi, Yizhang Zhu, Themis Palpanas, Chengliang Chai, Nan Tang, Yuyu Luo

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom, Axel Feldmann, Aleksander Madry

Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by…

Machine Learning · Computer Science 2025-05-20 Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, Lan-Zhe Guo

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

Enhancing the instruction-following ability of Large Language Models (LLMs) primarily demands substantial instruction-tuning datasets. However, the sheer volume of these imposes a considerable computational burden and annotation cost. To…

Computation and Language · Computer Science 2023-11-15 Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, Chang Zhou

Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection…

Computation and Language · Computer Science 2025-11-07 Mohammad Amin Ghanizadeh, Mohammad Javad Dousti

The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has…

Computation and Language · Computer Science 2024-10-24 Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin

In recent years, with the rapid development of powerful multimodal large language models (MLLMs), explainable image quality assessment (IQA) has gradually become popular, aiming at providing quality-related descriptions and answers of…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Yunhao Li, Sijing Wu, Huiyu Duan, Yucheng Zhu, Qi Jia, Guangtao Zhai

Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric…

Artificial Intelligence · Computer Science 2026-03-06 Yuzhe Zhou, Zhenglin Hua, Haiyun Guo, Yuheng Jia