Related papers: An Empirical Study on Influence-Based Pretraining …

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLMs pre-training. While there has been anecdotal consensus among practitioners that code data plays a…

Computation and Language · Computer Science 2024-08-21 Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Seed-Coder: Let the Code Model Curate Data for Itself

Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code…

Computation and Language · Computer Science 2025-06-06 ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu

Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data…

Computation and Language · Computer Science 2024-02-21 Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin

Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning

Through pretraining on a corpus with various sources, Large Language Models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, the organization of the…

Computation and Language · Computer Science 2024-08-29 Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Zhouhao Sun, Jun Shi, Ting Liu, Bing Qin

Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging…

Computation and Language · Computer Science 2025-07-04 Weijie Lyu, Sheng-Jun Huang, Xuan Xia

Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions

Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding, often bolstered by post-training on the chain-of-thoughts (CoTs) generated by stronger models. However, existing strategies for curating…

Machine Learning · Computer Science 2025-05-27 Siqi Kou, Qingyuan Tian, Hanwen Xu, Zihao Zeng, Zhijie Deng

Influence-driven Curriculum Learning for Pre-training on Limited Data

Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we…

Computation and Language · Computer Science 2025-09-29 Loris Schoenegger, Lukas Thoma, Terra Blevins, Benjamin Roth

Influence Functions for Efficient Data Selection in Reasoning

Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes "quality" remains ill-defined. Existing reasoning methods rely on…

Machine Learning · Computer Science 2025-12-02 Prateek Humane, Paolo Cudrano, Daniel Z. Kaplan, Matteo Matteucci, Supriyo Chakraborty, Irina Rish

Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks

Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model…

Software Engineering · Computer Science 2025-04-29 Kang Yang, Xinjun Mao, Shangwen Wang, Yanlin Wang, Tanghaoran Zhang, Bo Lin, Yihao Qin, Zhang Zhang, Yao Lu, Kamal Al-Sabahi

LLM-Aided Customizable Profiling of Code Data Based On Programming Language Concepts

Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of…

Software Engineering · Computer Science 2025-03-21 Pankaj Thorat, Adnan Qidwai, Adrija Dhar, Aishwariya Chakraborty, Anand Eswaran, Hima Patel, Praveen Jayachandran

In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to the extensive parameter count and substantial data requirements for model generalization. Accessibility to computing resources remains a barrier for the…

Machine Learning · Computer Science 2024-10-04 Ayrton San Joaquin, Bin Wang, Zhengyuan Liu, Nicholas Asher, Brian Lim, Philippe Muller, Nancy F. Chen

At Which Training Stage Does Code Data Help LLMs Reasoning?

Large Language Models (LLMs) have exhibited remarkable reasoning capabilities and become the foundation of language technologies. Inspired by the great success of code data in training LLMs, we naturally wonder at which training stage…

Computation and Language · Computer Science 2023-10-03 Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, Shanshan Li

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we…

Computation and Language · Computer Science 2023-11-14 Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito

Improving the Ability of Pre-trained Language Model by Imparting Large Language Model's Experience

Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks (e.g., code completion and code generation). By leveraging huge existing code corpora (e.g., GitHub),…

Software Engineering · Computer Science 2025-01-16 Xin Yin, Chao Ni, Xiaodan Xu, Xinrui Li, Xiaohu Yang

The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to…

Machine Learning · Computer Science 2025-10-03 Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin

Topic Over Source: The Key to Effective Data Mixing for Language Models Pre-training

The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various languages, sources, and topics. Effectively integrating these…

Computation and Language · Computer Science 2025-08-11 Jiahui Peng, Xinlin Zhuang, Jiantao Qiu, Ren Ma, Jing Yu, He Zhu, Conghui He

Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset

The Large Language Models (LLMs) have demonstrated great potential in code-related tasks. However, most research focuses on improving the output quality of LLMs (e.g., correctness), and less attention has been paid to the LLM input (e.g.,…

Software Engineering · Computer Science 2025-08-19 Zhipeng Xue, Xiaoting Zhang, Zhipeng Gao, Xing Hu, Shan Gao, Xin Xia, Shanping Li

What Are They Filtering Out? An Experimental Benchmark of Filtering Strategies for Harm Reduction in Pretraining Datasets

Data filtering strategies are a crucial component to develop safe Large Language Models (LLM), since they support the removal of harmful contents from pretraining datasets. There is a lack of research on the actual impact of these…

Computation and Language · Computer Science 2026-03-24 Marco Antonio Stranisci, Christian Hardmeier

Preference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning

As LLMs continue to scale, improving training efficiency increasingly depends on using data more effectively. Data selection addresses this problem by allocating a limited training budget to samples that best promote a target behavior.…

Machine Learning · Computer Science 2026-05-21 Qihao Lin, Guanxu Chen, Dongrui Liu, Jing Shao

Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But…

Computation and Language · Computer Science 2024-06-11 Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, Tianyi Zhou