Related papers: Group-Level Data Selection for Efficient Pretraini…

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger…

Computation and Language · Computer Science 2024-11-19 Zichun Yu, Spandan Das, Chenyan Xiong

MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to…

Computation and Language · Computer Science 2025-07-08 Jiazheng Li, Lu Yu, Qing Cui, Zhiqiang Zhang, Jun Zhou, Yanfang Ye, Chuxu Zhang

Efficient Online Data Mixing For Language Model Pre-Training

The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for…

Computation and Language · Computer Science 2023-12-12 Alon Albalak, Liangming Pan, Colin Raffel, William Yang Wang

Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration

Efficient data selection is crucial to accelerate the pretraining of language models (LMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches…

Computation and Language · Computer Science 2025-06-10 Tianyi Bai, Ling Yang, Zhen Hao Wong, Fupeng Sun, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Jiantao Qiu, Wentao Zhang, Binhang Yuan, Conghui He

Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by…

Computation and Language · Computer Science 2026-01-01 Ziqing Fan, Yuqiao Xian, Yan Sun, Li Shen

Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data…

Artificial Intelligence · Computer Science 2024-10-08 Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, Conghui He

Topic Over Source: The Key to Effective Data Mixing for Language Models Pre-training

The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various languages, sources, and topics. Effectively integrating these…

Computation and Language · Computer Science 2025-08-11 Jiahui Peng, Xinlin Zhuang, Jiantao Qiu, Ren Ma, Jing Yu, He Zhu, Conghui He

RoundTable: Investigating Group Decision-Making Mechanism in Multi-Agent Collaboration

Effective group decision-making is critical in Multi-Agent Systems (MAS). Yet, how different mechanisms for reaching consensus impact collaboration quality and efficiency remains understudied. We conduct a systematic study on group…

Multiagent Systems · Computer Science 2025-06-05 Young-Min Cho, Raphael Shu, Nilaksh Das, Tamer Alkhouli, Yi-An Lai, Jason Cai, Monica Sunkara, Yi Zhang, Dan Roth

Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories

Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman

Language Models Improve When Pretraining Data Matches Target Tasks

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…

Computation and Language · Computer Science 2025-07-17 David Mizrahi, Anders Boesen Lindbo Larsen, Jesse Allardice, Suzie Petryk, Yuri Gorokhov, Jeffrey Li, Alex Fang, Josh Gardner, Tom Gunter, Afshin Dehghan

BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models,…

Machine Learning · Computer Science 2026-02-04 Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection…

Computation and Language · Computer Science 2025-11-07 Mohammad Amin Ghanizadeh, Mohammad Javad Dousti

ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs

Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many…

Computation and Language · Computer Science 2025-06-13 Zige Wang, Qi Zhu, Fei Mi, Minghui Xu, Ruochun Jin, Wenjing Yang

Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection

Large language models (LLMs) are very proficient text generators. We leverage this capability of LLMs to generate task-specific data via zero-shot prompting and promote cross-lingual transfer for low-resource target languages. Given…

Computation and Language · Computer Science 2024-07-16 Barah Fazili, Ashish Sunil Agrawal, Preethi Jyothi

LESS: Selecting Influential Data for Targeted Instruction Tuning

Instruction tuning has unlocked powerful capabilities in large language models (LLMs), effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills…

Computation and Language · Computer Science 2024-06-14 Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen

CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy…

Multiagent Systems · Computer Science 2026-05-29 Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

Improving LLM Group Fairness on Tabular Data via In-Context Learning

Large language models (LLMs) have been shown to be effective on tabular prediction tasks in the low-data regime, leveraging their internal knowledge and ability to learn from instructions and examples. However, LLMs can fail to generate…

Machine Learning · Computer Science 2024-12-09 Valeriia Cherepanova, Chia-Jung Lee, Nil-Jana Akpinar, Riccardo Fogliato, Martin Andres Bertran, Michael Kearns, James Zou

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes…

Computation and Language · Computer Science 2024-09-18 Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training…

Computation and Language · Computer Science 2026-02-10 Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models

The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new…

Computation and Language · Computer Science 2023-08-01 Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, Christopher Ré