Related papers: Mesa: A Memory-saving Training Framework for Trans…

Backprop with Approximate Activations for Memory-efficient Network Training

Training convolutional neural network models is memory intensive since back-propagation requires storing activations of all intermediate layers. This presents a practical concern when seeking to deploy very deep architectures in production,…

Machine Learning · Computer Science 2019-10-30 Ayan Chakrabarti , Benjamin Moseley

Memory-Efficient Fine-Tuning of Transformers via Token Selection

Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing…

Computation and Language · Computer Science 2025-02-03 Antoine Simoulin , Namyong Park , Xiaoyi Liu , Grey Yang

Transformers learn in-context by gradient descent

At present, the mechanisms of in-context learning in Transformers are not well understood and remain mostly an intuition. In this paper, we suggest that training Transformers on auto-regressive objectives is closely related to…

Machine Learning · Computer Science 2023-06-01 Johannes von Oswald , Eyvind Niklasson , Ettore Randazzo , João Sacramento , Alexander Mordvintsev , Andrey Zhmoginov , Max Vladymyrov

Memory-Efficient Backpropagation through Large Linear Layers

In modern neural networks like Transformers, linear layers require significant memory to store activations during backward pass. This study proposes a memory reduction approach to perform backpropagation through linear layers. Since the…

Machine Learning · Computer Science 2022-02-04 Daniel Bershatsky , Aleksandr Mikhalev , Alexandr Katrutsa , Julia Gusak , Daniil Merkulov , Ivan Oseledets

Uncovering mesa-optimization algorithms in Transformers

Some autoregressive models exhibit in-context learning capabilities: being able to learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so. The origins of this…

Machine Learning · Computer Science 2024-10-16 Johannes von Oswald , Maximilian Schlegel , Alexander Meulemans , Seijin Kobayashi , Eyvind Niklasson , Nicolas Zucchet , Nino Scherrer , Nolan Miller , Mark Sandler , Blaise Agüera y Arcas , Max Vladymyrov , Razvan Pascanu , João Sacramento

On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies suggest that transformers learn a…

Machine Learning · Computer Science 2024-10-29 Chenyu Zheng , Wei Huang , Rongzhen Wang , Guoqiang Wu , Jun Zhu , Chongxuan Li

Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms

The impact of transformer networks is booming, yet, they come with significant computational complexity. It is therefore essential to understand how to optimally map and execute these networks on modern neural processor hardware. So far,…

Hardware Architecture · Computer Science 2024-06-17 Steven Colleman , Arne Symons , Victor J. B. Jung , Marian Verhelst

DAF: An Efficient End-to-End Dynamic Activation Framework for on-Device DNN Training

Recent advancements in on-device training for deep neural networks have underscored the critical need for efficient activation compression to overcome the memory constraints of mobile and edge devices. As activations dominate memory usage…

Networking and Internet Architecture · Computer Science 2025-07-11 Renyuan Liu , Yuyang Leng , Kaiyan Liu , Shaohan Hu , Chun-Fu Chen , Peijun Zhao , Heechul Yun , Shuochao Yao

Linear Self-Attention Approximation via Trainable Feedforward Kernel

In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches -- models attaining sub-quadratic attention complexity can utilize a notion of sparsity or a low-rank approximation of inputs to reduce…

Machine Learning · Computer Science 2022-11-09 Uladzislau Yorsh , Alexander Kovalenko

A Survey on Efficient Training of Transformers

Recent advances in Transformers have come with a huge requirement on computing resources, highlighting the importance of developing efficient training techniques to make Transformer training faster, at lower cost, and to higher accuracy by…

Machine Learning · Computer Science 2023-05-05 Bohan Zhuang , Jing Liu , Zizheng Pan , Haoyu He , Yuetian Weng , Chunhua Shen

Reducing Activation Recomputation in Large Transformer Models

Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation.…

Machine Learning · Computer Science 2022-05-12 Vijay Korthikanti , Jared Casper , Sangkug Lym , Lawrence McAfee , Michael Andersch , Mohammad Shoeybi , Bryan Catanzaro

Zen-Attention: A Compiler Framework for Dynamic Attention Folding on AMD NPUs

Transformer-based deep learning models are increasingly deployed on energy, and DRAM bandwidth constrained devices such as laptops and gaming consoles, which presents significant challenges in meeting the latency requirements of the models.…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-26 Aadesh Deshmukh , Venkata Yaswanth Raparti , Samuel Hsu

Stepping Forward on the Last Mile

Continuously adapting pre-trained models to local data on resource constrained edge devices is the last mile for model deployment. However, as models increase in size and depth, backpropagation requires a large amount of memory,…

Machine Learning · Computer Science 2024-11-07 Chen Feng , Shaojie Zhuo , Xiaopeng Zhang , Ramchalam Kinattinkara Ramakrishnan , Zhaocong Yuan , Andrew Zou Li

Allocation of Parameters in Transformers

Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters -- mainly attention…

Machine Learning · Computer Science 2025-10-07 Ruoxi Yu , Haotian Jiang , Jingpu Cheng , Penghao Yu , Qianxiao Li , Zhong Li

AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems

Transformer models gain popularity because of their superior inference accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. The existing works on transformer inference…

Performance · Computer Science 2023-04-19 Yuan Feng , Hyeran Jeon , Filip Blagojevic , Cyril Guyot , Qing Li , Dong Li

Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile and TVs. Existing PTQ schemes, however, consume…

Machine Learning · Computer Science 2024-11-06 Junhan Kim , Chungman Lee , Eulrang Cho , Kyungphil Park , Ho-young Kim , Joonyoung Kim , Yongkweon Jeon

MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling

The substantial memory demands of pre-training and fine-tuning large language models (LLMs) require memory-efficient optimization algorithms. One promising approach is layer-wise optimization, which treats each transformer block as a single…

Machine Learning · Computer Science 2026-01-15 Yuxi Liu , Renjia Deng , Yutong He , Xue Wang , Tao Yao , Kun Yuan

An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators

Transformer-based models have become the de facto backbone across many fields, such as computer vision and natural language processing. However, as these models scale in size, external memory access (EMA) for weight and activations…

Machine Learning · Computer Science 2025-03-26 Tseng-Jen Li , Tian-Sheuan Chang

Exploring Quantization for Efficient Pre-Training of Transformer Language Models

The increasing scale of Transformer models has led to an increase in their pre-training computational requirements. While quantization has proven to be effective after pre-training and during fine-tuning, applying quantization in…

Machine Learning · Computer Science 2024-10-14 Kamran Chitsaz , Quentin Fournier , Gonçalo Mordido , Sarath Chandar

Inverted Activations: Reducing Memory Footprint in Neural Network Training

The scaling of neural networks with increasing data and model sizes necessitates the development of more efficient deep learning algorithms. A significant challenge in neural network training is the memory footprint associated with…

Machine Learning · Computer Science 2024-10-08 Georgii Novikov , Ivan Oseledets