Related papers: Layer-Parallel Training for Transformers

A multilevel approach to accelerate the training of Transformers

In this article, we investigate the potential of multilevel approaches to accelerate the training of transformer architectures. Using an ordinary differential equation (ODE) interpretation of these architectures, we propose an appropriate…

Machine Learning · Computer Science 2025-04-29 Guillaume Lauga, Maël Chaumette, Edgar Desainte-Maréville, Étienne Lasalle, Arthur Lebeurrier

A Practical Layer-Parallel Training Algorithm for Residual Networks

Gradient-based algorithms for training ResNets typically require a forward pass of the input data, followed by back-propagating the objective gradient to update parameters, which are time-consuming for deep ResNets. To break the…

Machine Learning · Computer Science 2021-02-19 Qi Sun, Hexin Dong, Zewei Chen, Weizhen Dian, Jiacheng Sun, Yitong Sun, Zhenguo Li, Bin Dong

A Neural ODE Interpretation of Transformer Layers

Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems. As the transformer layers use residual connections…

Machine Learning · Computer Science 2022-12-13 Yaofeng Desmond Zhong, Tongtao Zhang, Amit Chakraborty, Biswadip Dey

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from unbearable overall computational expenses. Current…

Machine Learning · Computer Science 2020-10-27 Minjia Zhang, Yuxiong He

Layer-Parallel Training of Deep Residual Neural Networks

Residual neural networks (ResNets) are a promising class of deep neural networks that have shown excellent performance for a number of learning tasks, e.g., image classification and recognition. Mathematically, ResNet architectures can be…

Optimization and Control · Mathematics 2019-07-26 S. Günther, L. Ruthotto, J. B. Schroder, E. C. Cyr, N. R. Gauger

Layer-Wise Partitioning and Merging for Efficient and Scalable Deep Learning

Deep Neural Network (DNN) models are usually trained sequentially from one layer to another, which causes forward, backward, and update locking problems, leading to poor performance in terms of training time. The existing parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-25 Samson B. Akintoye, Liangxiu Han, Huw Lloyd, Xin Zhang, Darren Dancey, Haoming Chen, Daoqiang Zhang

Transformers learn to implement preconditioned gradient descent for in-context learning

Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of…

Machine Learning · Computer Science 2023-11-13 Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, Suvrit Sra

A Multi-Level Framework for Accelerating Training Transformer Models

The fast-growing capabilities of large-scale deep learning models, such as BERT, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing…

Machine Learning · Computer Science 2024-04-15 Longwei Zou, Han Zhang, Yangdong Deng

Ouroboros: On Accelerating Training of Transformer-Based Language Models

Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language…

Computation and Language · Computer Science 2019-09-17 Qian Yang, Zhouyuan Huo, Wenlin Wang, Heng Huang, Lawrence Carin

Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?

The remarkable capability of Transformers to do reasoning and few-shot learning, without any fine-tuning, is widely conjectured to stem from their ability to implicitly simulate multi-step algorithms -- such as gradient descent -- with…

Machine Learning · Computer Science 2024-10-14 Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g.,…

Machine Learning · Computer Science 2018-06-12 Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models

Recent state-of-the-art language models utilize a two-phase training procedure comprised of (i) unsupervised pre-training on unlabeled text, and (ii) fine-tuning for a specific supervised task. More recently, many studies have been focused…

Computation and Language · Computer Science 2019-11-15 Itzik Malkiel, Lior Wolf

Deep Progressive Training: scaling up depth capacity of zero/one-layer models

Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To efficiently train models at scale, an effective strategy is progressive training, which scales up…

Machine Learning · Computer Science 2025-11-10 Zhiqi Bu

Gradient Layer: Enhancing the Convergence of Adversarial Training for Generative Models

We propose a new technique that boosts the convergence of training generative adversarial networks. Generally, the rate of training deep models reduces severely after multiple iterations. A key reason for this phenomenon is that a deep…

Machine Learning · Statistics 2018-06-15 Atsushi Nitanda, Taiji Suzuki

Linear Transformers are Versatile In-Context Learners

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in…

Machine Learning · Computer Science 2024-10-31 Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

Block-wise Training of Residual Networks via the Minimizing Movement Scheme

End-to-end backpropagation has a few shortcomings: it requires loading the entire model during training, which can be impossible in constrained settings, and suffers from three locking problems (forward locking, update locking and backward…

Machine Learning · Computer Science 2023-06-07 Skander Karkar, Ibrahim Ayed, Emmanuel de Bézenac, Patrick Gallinari

Parallel Training of Deep Networks with Local Updates

Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count so have the compute budgets and times…

Machine Learning · Computer Science 2021-06-16 Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

A Bi-layered Parallel Training Architecture for Large-scale Convolutional Neural Networks

Benefitting from large-scale training datasets and the complex training network, Convolutional Neural Networks (CNNs) are widely applied in various fields with high accuracy. However, the training process of CNNs is very time-consuming,…

Machine Learning · Computer Science 2019-11-26 Jianguo Chen, Kenli Li, Kashif Bilal, Xu Zhou, Keqin Li, Philip S. Yu

Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in…

Machine Learning · Computer Science 2025-06-03 Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, Alexander Long

Orthogonalising gradients to speed up neural network optimisation

The optimisation of neural networks can be sped up by orthogonalising the gradients before the optimisation step, ensuring the diversification of the learned representations. We orthogonalise the gradients of the layer's components/filters…

Machine Learning · Computer Science 2022-02-16 Mark Tuddenham, Adam Prügel-Bennett, Jonathan Hare