Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training

Machine Learning 2024-11-12 v4

Abstract

We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both forward and backward passes. In experiments with the Llama3-8B model, we measure no degradation in throughput or convergence with MsT, even with sequences 12x longer than those of standard implementations. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks. Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
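
The partition-and-iterate idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a position-wise feed-forward block as the layer with the large intermediate tensor (the abstract does not say which layers are partitioned), uses ReLU for simplicity, and omits activation recomputation and the backward pass. The names `mlp_full`, `mlp_mini_sequence`, and `num_chunks` are hypothetical.

```python
import numpy as np


def mlp_full(x, w_up, w_down):
    """Position-wise MLP over the whole sequence at once.

    The intermediate `h` has shape (seq_len, d_ff), so its size grows
    linearly with the sequence length.
    """
    h = np.maximum(x @ w_up, 0.0)  # ReLU chosen only for illustration
    return h @ w_down


def mlp_mini_sequence(x, w_up, w_down, num_chunks):
    """Same MLP, computed one mini-sequence at a time.

    Only one chunk's intermediate (about seq_len / num_chunks rows) is
    alive at any moment. Returns the output and the largest number of
    intermediate rows that was materialized at once.
    """
    out = np.empty((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    peak_rows = 0
    start = 0
    for chunk in np.array_split(x, num_chunks, axis=0):
        h = np.maximum(chunk @ w_up, 0.0)  # intermediate for this chunk only
        peak_rows = max(peak_rows, h.shape[0])
        out[start:start + chunk.shape[0]] = h @ w_down
        start += chunk.shape[0]
    return out, peak_rows


# Toy sizes, assumed for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 64, 8, 32
x = rng.standard_normal((seq_len, d_model))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

full = mlp_full(x, w_up, w_down)
chunked, peak_rows = mlp_mini_sequence(x, w_up, w_down, num_chunks=4)

print("outputs match:", np.allclose(full, chunked))
print("peak intermediate rows:", peak_rows, "of", seq_len)
```

Because the layer in this sketch acts on each position independently, splitting along the sequence axis leaves the result unchanged up to floating-point rounding, while the largest intermediate held at once shrinks by roughly the number of chunks.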

Cite

@article{arxiv.2407.15892,
  title  = {Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training},
  author = {Cheng Luo and Jiawei Zhao and Zhuoming Chen and Beidi Chen and Anima Anandkumar},
  journal= {arXiv preprint arXiv:2407.15892},
  year   = {2024}
}