Thinking Augmented Pre-training

Computation and Language 2025-10-20 v4 Machine Learning

Abstract

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.
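
The augmentation step described above can be illustrated with a minimal sketch. The abstract does not specify the generator model, the prompt, or how a document and its trajectory are joined, so everything named here is an assumption made for illustration: the `THINKING_SEPARATOR` string, the `augment_with_thinking` and `augment_corpus` helpers, and the `stub_generator` that stands in for a real LLM call.

```python
from typing import Callable, Iterable, List

# Hypothetical separator between the original text and its trajectory.
# The abstract does not say how the two parts are joined.
THINKING_SEPARATOR = "\n\n[thinking]\n"


def augment_with_thinking(
    document: str,
    generate_thinking: Callable[[str], str],
    separator: str = THINKING_SEPARATOR,
) -> str:
    """Append an automatically generated thinking trajectory to a document.

    `generate_thinking` is any callable mapping a document to a trajectory;
    in practice it would wrap an LLM call (assumed, not shown here).
    """
    thinking = generate_thinking(document).strip()
    if not thinking:
        # Nothing was generated: keep the original sample unchanged.
        return document
    return document.rstrip() + separator + thinking


def augment_corpus(
    documents: Iterable[str],
    generate_thinking: Callable[[str], str],
) -> List[str]:
    """Augment every document; the results serve as training samples."""
    return [augment_with_thinking(d, generate_thinking) for d in documents]


def stub_generator(document: str) -> str:
    """Placeholder for an LLM: returns a fixed step-by-step template."""
    return (
        "Step 1: identify the main claim of the text.\n"
        "Step 2: work out why the claim holds."
    )


if __name__ == "__main__":
    corpus = ["The sum of the first n odd numbers is n squared."]
    for sample in augment_corpus(corpus, stub_generator):
        print(sample)
```

Each augmented sample is longer than its source document, which is one way to read the claim that the augmentation increases the volume of training data. How the augmented samples are then tokenized and weighted during training is not covered by this sketch.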

Cite

@article{arxiv.2509.20186,
  title  = {Thinking Augmented Pre-training},
  author = {Liang Wang and Nan Yang and Shaohan Huang and Li Dong and Furu Wei},
  journal= {arXiv preprint arXiv:2509.20186},
  year   = {2025}
}

Comments

19 pages; v4 fixes an issue with the HumanEval scores