Related papers: Efficient Pre-Training with Token Superposition

200 papers

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an…

Computation and Language · Computer Science 2025-10-20 Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei

Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are…

Computation and Language · Computer Science 2022-05-11 Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of…

Machine Learning · Computer Science 2024-08-22 Pihe Hu, Shaolong Li, Longbo Huang

The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training…

Computation and Language · Computer Science 2025-09-30 Yijun Tian, Shaoyu Chen, Zhichao Xu, Yawei Wang, Jinhe Bi, Peng Han, Wei Wang

Token dropping is a recently-proposed strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers. It can effectively reduce the training…

Computation and Language · Computer Science 2023-05-25 Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT,…

Computation and Language · Computer Science 2022-03-25 Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou

Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC), for example through…

Computation and Language · Computer Science 2026-01-06 Hossam Amer, Maryam Dialameh, Hossein Rajabzadeh, Walid Ahmed, Weiwei Zhang, Yang Liu

Pretraining large language models effectively requires strategic data selection, blending, and ordering. However, key details about data mixtures, especially their scalability to longer token horizons and larger model sizes, remain…

Computation and Language · Computer Science 2024-12-23 Steven Feng, Shrimai Prabhumoye, Kezhi Kong, Dan Su, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

Recent research enhances language model reasoning by scaling test-time compute via longer chain-of-thought traces. This often improves accuracy but also introduces redundancy and high computational cost, especially for small language models…

Machine Learning · Computer Science 2025-05-26 Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, Samet Oymak

Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better…

Computation and Language · Computer Science 2026-02-12 Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano

Overparameterized large-scale language models have impressive generalization performance of in-context few-shot learning. However, most language models allocate the same amount of parameters or computation to each token, disregarding the…

Computation and Language · Computer Science 2023-11-28 Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui

The objective of this paper is an efficient training method for video tasks. We make three contributions: (1) We propose Turbo training, a simple and versatile training paradigm for Transformers on multiple video tasks. (2) We illustrate…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Tengda Han, Weidi Xie, Andrew Zisserman

Large Language Models (LLMs) have excelled in various tasks but perform better in high-resource scenarios, which presents challenges in low-resource scenarios. Data scarcity and the inherent difficulty of adapting LLMs to specific tasks…

Computation and Language · Computer Science 2024-04-02 Yuanhao Zeng, Min Wang, Yihang Wang, Yingxia Shao

Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation…

Computation and Language · Computer Science 2026-03-11 Boyi Zeng, Yiqin Hao, He Li, Shixiang Song, Feichen Song, Zitong Wang, Siyuan Huang, Yi Xu, ZiWei He, Xinbing Wang, Zhouhan Lin

Pre-training large language models (LLMs) faces significant memory challenges due to the large size of model parameters. We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques…

Computation and Language · Computer Science 2025-04-08 Kazuki Yano, Takumi Ito, Jun Suzuki

As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits practical deployment. While Multi-Token Prediction (MTP) has…

Machine Learning · Computer Science 2025-09-24 Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen

Long-context inference enhances the reasoning capability of Large Language Models (LLMs), but incurs significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown great promise in reducing inference…

Computation and Language · Computer Science 2026-02-03 Zimeng Wu, Donghao Wang, Chaozhe Jin, Jiaxin Chen, Yunhong Wang

Pre-training large language models is known to be extremely resource-intensive and often inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training…

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only…

Computer Vision and Pattern Recognition · Computer Science 2021-10-26 Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang

The recent rapid progress in pre-training Large Language Models has relied on using self-supervised language modeling objectives like next token prediction or span corruption. On the other hand, Machine Translation Systems are mostly…

Computation and Language · Computer Science 2023-05-22 Andrea Schioppa, Xavier Garcia, Orhan Firat