Related papers: Efficient Sequence Packing without Cross-contamina…

Refining Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models

Packing and shuffling tokens is a common practice in training auto-regressive language models (LMs) to prevent overfitting and improve efficiency. Typically documents are concatenated to chunks of maximum sequence length (MSL) and then…

Computation and Language · Computer Science · 2024-08-20 · Yanbing Chen, Ruilin Wang, Zihao Yang, Lavender Yao Jiang, Eric Karl Oermann

Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a…

Computation and Language · Computer Science · 2025-01-08 · Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel

BLoad: Enhancing Neural Network Training with Efficient Sequential Data Handling

The increasing complexity of modern deep neural network models and the expanding sizes of datasets necessitate the development of optimized and scalable training methods. In this white paper, we addressed the challenge of efficiently…

Machine Learning · Computer Science · 2024-04-29 · Raphael Ruschel, A. S. M. Iftekhar, B. S. Manjunath, Suya You

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference…

Computation and Language · Computer Science · 2025-06-03 · Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang

Beyond Fixed Length: Bucket Pre-training is All You Need

Large Language Models (LLMs) have demonstrated exceptional performance across various tasks, with the pre-training stage serving as the cornerstone of their capabilities. However, the conventional fixed-length data composition strategy for…

Computation and Language · Computer Science · 2025-06-30 · Qing Yang, Qiyao Peng, Hongtao Liu, Kai Liu, Bing Qin, Ting Liu

Fewer Truncations Improve Language Modeling

In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it…

Computation and Language · Computer Science · 2024-05-03 · Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto

Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models

During inference for transformer-based large language models (LLM), prefilling is the computation of the key-value (KV) cache for input tokens in the prompt prior to autoregressive generation. For longer input prompt lengths, prefilling…

Machine Learning · Computer Science · 2024-04-16 · Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover

FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism

Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-02-12 · Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui

Beyond Next Token Prediction: Patch-Level Training for Large Language Models

The prohibitive training costs of Large Language Models (LLMs) have emerged as a significant bottleneck in the development of next-generation LLMs. In this paper, we show that it is possible to significantly reduce the training costs of…

Computation and Language · Computer Science · 2025-05-16 · Chenze Shao, Fandong Meng, Jie Zhou

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-12-31 · Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang

SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training

The efficient distributed training of Large Language Models (LLMs) is severely hampered by the extreme variance in context lengths. This data heterogeneity, amplified by conventional packing strategies and asymmetric forward-backward costs,…

Artificial Intelligence · Computer Science · 2025-10-01 · Yuliang Liu, Guohao Wu, Shenglong Zhang, Wei Zhang, Qianchao Zhu, Zhouyang Li, Chenyu Wang

Silent Tokens, Loud Effects: Padding in LLMs

Padding tokens are widely used in large language models (LLMs) to equalize sequence lengths during batched inference. While they should be fully masked, implementation errors can cause them to influence computation, and the extent of this…

Computation and Language · Computer Science · 2025-10-07 · Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson

Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning

Curriculum learning (organizing training data from easy to hard) has improved efficiency across machine learning domains, yet remains underexplored for language model pretraining. We present the first systematic investigation of curriculum…

Computation and Language · Computer Science · 2026-01-29 · Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, Michalis Vazirgiannis

Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining

Pretraining large language models effectively requires strategic data selection, blending, and ordering. However, key details about data mixtures, especially their scalability to longer token horizons and larger model sizes, remain…

Computation and Language · Computer Science · 2024-12-23 · Steven Feng, Shrimai Prabhumoye, Kezhi Kong, Dan Su, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

Data-efficient LLM Fine-tuning for Code Generation

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically…

Computation and Language · Computer Science · 2025-04-18 · Weijie Lv, Xuan Xia, Sheng-Jun Huang

Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning

Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP), achieving impressive performance in text generation. Their token-level representations capture rich, human-aligned semantics. However, pooling…

Computation and Language · Computer Science · 2025-09-25 · Benedikt Roth, Stephan Rappensperger, Tianming Qiu, Hamza Imamović, Julian Wörmann, Hao Shen

SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning

Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). However, such methods require extensive data and compute, making them impractical under many realistic training budgets.…

Machine Learning · Computer Science · 2026-04-17 · Dai Do, Manh Nguyen, Svetha Venkatesh, Hung Le

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding…

Machine Learning · Computer Science · 2024-09-25 · Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo

GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length

The evolving sophistication and intricacies of Large Language Models (LLMs) yield unprecedented advancements, yet they simultaneously demand considerable computational resources and incur significant costs. To alleviate these challenges,…

Computation and Language · Computer Science · 2023-10-03 · Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Chia-Yuan Chang, Xia Hu

SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention…

Computation and Language · Computer Science · 2026-04-10 · Jie Sun, Yu Liu, Lu Han, Qiwen Deng, Xiang Shu, Yang Xiao, Xingyu Lu, Jun Zhou, Pengfei Liu, Lintao Ma, Jiancan Wu, Xiang Wang