Related papers: An Empirical Study on Influence-Based Pretraining …

200 papers

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there has been anecdotal consensus among practitioners that code data plays a…

Computation and Language · Computer Science 2024-08-21 Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code…

The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data…

Computation and Language · Computer Science 2024-02-21 Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin

Through pretraining on a corpus with various sources, Large Language Models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, the organization of the…

Computation and Language · Computer Science 2024-08-29 Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Zhouhao Sun, Jun Shi, Ting Liu, Bing Qin

Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging…

Computation and Language · Computer Science 2025-07-04 Weijie Lyu, Sheng-Jun Huang, Xuan Xia

Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding, often bolstered by post-training on the chain-of-thoughts (CoTs) generated by stronger models. However, existing strategies for curating…

Machine Learning · Computer Science 2025-05-27 Siqi Kou, Qingyuan Tian, Hanwen Xu, Zihao Zeng, Zhijie Deng

Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we…

Computation and Language · Computer Science 2025-09-29 Loris Schoenegger, Lukas Thoma, Terra Blevins, Benjamin Roth

Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes "quality" remains ill-defined. Existing reasoning methods rely on…

Machine Learning · Computer Science 2025-12-02 Prateek Humane, Paolo Cudrano, Daniel Z. Kaplan, Matteo Matteucci, Supriyo Chakraborty, Irina Rish

Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model…

Software Engineering · Computer Science 2025-04-29 Kang Yang, Xinjun Mao, Shangwen Wang, Yanlin Wang, Tanghaoran Zhang, Bo Lin, Yihao Qin, Zhang Zhang, Yao Lu, Kamal Al-Sabahi

Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of…

Software Engineering · Computer Science 2025-03-21 Pankaj Thorat, Adnan Qidwai, Adrija Dhar, Aishwariya Chakraborty, Anand Eswaran, Hima Patel, Praveen Jayachandran

Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to the extensive parameter count and substantial data requirements for model generalization. Accessibility to computing resources remains a barrier for the…

Machine Learning · Computer Science 2024-10-04 Ayrton San Joaquin, Bin Wang, Zhengyuan Liu, Nicholas Asher, Brian Lim, Philippe Muller, Nancy F. Chen

Large Language Models (LLMs) have exhibited remarkable reasoning capabilities and become the foundation of language technologies. Inspired by the great success of code data in training LLMs, we naturally wonder at which training stage…

Computation and Language · Computer Science 2023-10-03 Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, Shanshan Li

Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we…

Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks (e.g., code completion and code generation). By leveraging huge existing code corpora (e.g., GitHub),…

Software Engineering · Computer Science 2025-01-16 Xin Yin, Chao Ni, Xiaodan Xu, Xinrui Li, Xiaohu Yang

Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to…

Machine Learning · Computer Science 2025-10-03 Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin

The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various languages, sources, and topics. Effectively integrating these…

Computation and Language · Computer Science 2025-08-11 Jiahui Peng, Xinlin Zhuang, Jiantao Qiu, Ren Ma, Jing Yu, He Zhu, Conghui He

The Large Language Models (LLMs) have demonstrated great potential in code-related tasks. However, most research focuses on improving the output quality of LLMs (e.g., correctness), and less attention has been paid to the LLM input (e.g.,…

Software Engineering · Computer Science 2025-08-19 Zhipeng Xue, Xiaoting Zhang, Zhipeng Gao, Xing Hu, Shan Gao, Xin Xia, Shanping Li

Data filtering strategies are a crucial component to develop safe Large Language Models (LLM), since they support the removal of harmful contents from pretraining datasets. There is a lack of research on the actual impact of these…

Computation and Language · Computer Science 2026-03-24 Marco Antonio Stranisci, Christian Hardmeier

As LLMs continue to scale, improving training efficiency increasingly depends on using data more effectively. Data selection addresses this problem by allocating a limited training budget to samples that best promote a target behavior.…

Machine Learning · Computer Science 2026-05-21 Qihao Lin, Guanxu Chen, Dongrui Liu, Jing Shao

Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But…

Computation and Language · Computer Science 2024-06-11 Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, Tianyi Zhou