Related papers: Group-Level Data Selection for Efficient Pretraini…

200 papers

Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger…

Computation and Language · Computer Science 2024-11-19 Zichun Yu, Spandan Das, Chenyan Xiong

High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to…

Computation and Language · Computer Science 2025-07-08 Jiazheng Li, Lu Yu, Qing Cui, Zhiqiang Zhang, Jun Zhou, Yanfang Ye, Chuxu Zhang

The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for…

Computation and Language · Computer Science 2023-12-12 Alon Albalak, Liangming Pan, Colin Raffel, William Yang Wang

Efficient data selection is crucial to accelerate the pretraining of language models (LMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches…

Computation and Language · Computer Science 2025-06-10 Tianyi Bai, Ling Yang, Zhen Hao Wong, Fupeng Sun, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Jiantao Qiu, Wentao Zhang, Binhang Yuan, Conghui He

A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by…

Computation and Language · Computer Science 2026-01-01 Ziqing Fan, Yuqiao Xian, Yan Sun, Li Shen

Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data…

Artificial Intelligence · Computer Science 2024-10-08 Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, Conghui He

The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various languages, sources, and topics. Effectively integrating these…

Computation and Language · Computer Science 2025-08-11 Jiahui Peng, Xinlin Zhuang, Jiantao Qiu, Ren Ma, Jing Yu, He Zhu, Conghui He

Effective group decision-making is critical in Multi-Agent Systems (MAS). Yet, how different mechanisms for reaching consensus impact collaboration quality and efficiency remains understudied. We conduct a systematic study on group…

Multiagent Systems · Computer Science 2025-06-05 Young-Min Cho, Raphael Shu, Nilaksh Das, Tamer Alkhouli, Yi-An Lai, Jason Cai, Monica Sunkara, Yi Zhang, Dan Roth

Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…

Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models,…

Machine Learning · Computer Science 2026-02-04 Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection…

Computation and Language · Computer Science 2025-11-07 Mohammad Amin Ghanizadeh, Mohammad Javad Dousti

Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many…

Computation and Language · Computer Science 2025-06-13 Zige Wang, Qi Zhu, Fei Mi, Minghui Xu, Ruochun Jin, Wenjing Yang

Large language models (LLMs) are very proficient text generators. We leverage this capability of LLMs to generate task-specific data via zero-shot prompting and promote cross-lingual transfer for low-resource target languages. Given…

Computation and Language · Computer Science 2024-07-16 Barah Fazili, Ashish Sunil Agrawal, Preethi Jyothi

Instruction tuning has unlocked powerful capabilities in large language models (LLMs), effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills…

Computation and Language · Computer Science 2024-06-14 Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen

Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy…

Multiagent Systems · Computer Science 2026-05-29 Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

Large language models (LLMs) have been shown to be effective on tabular prediction tasks in the low-data regime, leveraging their internal knowledge and ability to learn from instructions and examples. However, LLMs can fail to generate…

Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes…

Computation and Language · Computer Science 2024-09-18 Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training…

Computation and Language · Computer Science 2026-02-10 Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new…

Computation and Language · Computer Science 2023-08-01 Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, Christopher Ré