Related papers: Efficient Pre-Training with Token Superposition

Thinking Augmented Pre-training

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an…

Computation and Language · Computer Science 2025-10-20 Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei

Text Simplification by Tagging

Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are…

Computation and Language · Computer Science 2022-05-11 Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi

Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of…

Machine Learning · Computer Science 2024-08-22 Pihe Hu, Shaolong Li, Longbo Huang

Reinforcement Mid-Training

The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training…

Computation and Language · Computer Science 2025-09-30 Yijun Tian, Shaoyu Chen, Zhichao Xu, Yawei Wang, Jinhe Bi, Peng Han, Wei Wang

Revisiting Token Dropping Strategy in Efficient BERT Pretraining

Token dropping is a recently-proposed strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers. It can effectively reduce the training…

Computation and Language · Computer Science 2023-05-25 Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao

Token Dropping for Efficient BERT Pretraining

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT,…

Computation and Language · Computer Science 2022-03-25 Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou

FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness

Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC), for example through…

Computation and Language · Computer Science 2026-01-06 Hossam Amer, Maryam Dialameh, Hossein Rajabzadeh, Walid Ahmed, Weiwei Zhang, Yang Liu

Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining

Pretraining large language models effectively requires strategic data selection, blending and ordering. However, key details about data mixtures, especially their scalability to longer token horizons and larger model sizes, remain…

Computation and Language · Computer Science 2024-12-23 Steven Feng, Shrimai Prabhumoye, Kezhi Kong, Dan Su, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement

Recent research enhances language model reasoning by scaling test-time compute via longer chain-of-thought traces. This often improves accuracy but also introduces redundancy and high computational cost, especially for small language models…

Machine Learning · Computer Science 2025-05-26 Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, Samet Oymak

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better…

Computation and Language · Computer Science 2026-02-12 Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano

Learning to Skip for Language Modeling

Overparameterized large-scale language models have impressive generalization performance on in-context few-shot learning. However, most language models allocate the same amount of parameters or computation to each token, disregarding the…

Computation and Language · Computer Science 2023-11-28 Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui

Turbo Training with Token Dropout

The objective of this paper is an efficient training method for video tasks. We make three contributions: (1) We propose Turbo training, a simple and versatile training paradigm for Transformers on multiple video tasks. (2) We illustrate…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Tengda Han, Weidi Xie, Andrew Zisserman

Token-Efficient Leverage Learning in Large Language Models

Large Language Models (LLMs) have excelled in various tasks but perform better in high-resource scenarios, which presents challenges in low-resource scenarios. Data scarcity and the inherent difficulty of adapting LLMs to specific tasks…

Computation and Language · Computer Science 2024-04-02 Yuanhao Zeng, Min Wang, Yihang Wang, Yingxia Shao

Pretraining with Token-Level Adaptive Latent Chain-of-Thought

Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation…

Computation and Language · Computer Science 2026-03-11 Boyi Zeng, Yiqin Hao, He Li, Shixiang Song, Feichen Song, Zitong Wang, Siyuan Huang, Yi Xu, ZiWei He, Xinbing Wang, Zhouhan Lin

STEP: Staged Parameter-Efficient Pre-training for Large Language Models

Pre-training large language models (LLMs) faces significant memory challenges due to the large size of model parameters. We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques…

Computation and Language · Computer Science 2025-04-08 Kazuki Yano, Takumi Ito, Jun Suzuki

FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has…

Machine Learning · Computer Science 2025-09-24 Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen

Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference

Long-context inference enhances the reasoning capability of Large Language Models (LLMs), but incurs significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown great promise in reducing inference…

Computation and Language · Computer Science 2026-02-03 Zimeng Wu, Donghao Wang, Chaozhe Jin, Jiaxin Chen, Yunhong Wang

SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection

Pre-training large language models is known to be extremely resource intensive and oftentimes inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training…

Machine Learning · Computer Science 2024-01-25 Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar

MST: Masked Self-Supervised Transformer for Visual Representation

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only…

Computer Vision and Pattern Recognition · Computer Science 2021-10-26 Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang

Cross-Lingual Supervision improves Large Language Models Pre-training

The recent rapid progress in pre-training Large Language Models has relied on using self-supervised language modeling objectives like next token prediction or span corruption. On the other hand, Machine Translation Systems are mostly…

Computation and Language · Computer Science 2023-05-22 Andrea Schioppa, Xavier Garcia, Orhan Firat