Related papers: Completed Hyperparameter Transfer across Modules, …

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters…

Machine Learning · Computer Science 2026-05-21 Dayal Singh Kalra, Maissam Barkeshli

Understanding the Mechanisms of Fast Hyperparameter Transfer

The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from…

Machine Learning · Computer Science 2025-12-30 Nikhil Ghosh, Denny Wu, Alberto Bietti

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the…

Machine Learning · Statistics 2023-12-11 Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan

$\mu$pscaling small models: Principled warm starts and hyperparameter transfer

Modern large-scale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing larger models from trained smaller…

Machine Learning · Computer Science 2026-02-12 Yuxin Ma, Nan Chen, Mateo Díaz, Soufiane Hayou, Dmitriy Kunisky, Soledad Villar

An Empirical Study of $\mu$P Learning Rate Transfer

Deep learning models have become a cornerstone of modern AI research, yet their initializations and learning rates may at times be set in an opaque or ad-hoc fashion due to the high cost of hyperparameter sweeps. The $\mu$-Parameterization…

Machine Learning · Computer Science 2025-02-17 Lucas Lingle

Don't be lazy: CompleteP enables compute-efficient deep transformers

We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as…

Machine Learning · Computer Science 2026-01-21 Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness

Extending $\mu$P: Spectral Conditions for Feature Learning Across Optimizers

Several variations of adaptive first-order and second-order optimization methods have been proposed to accelerate and scale the training of large language models. The performance of these optimization routines is highly sensitive to the…

Machine Learning · Computer Science 2026-02-25 Akshita Gupta, Marieme Ngom, Sam Foreman, Venkatram Vishwanath

$\mu$-Parametrization for Mixture of Experts

Recent years have seen a growing interest and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models reach over $1$T parameters. At such…

Machine Learning · Computer Science 2025-10-10 Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski

Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks

Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization ($\mu$P) helps explain why many hyperparameters transfer across width. Yet depth scaling…

Machine Learning · Computer Science 2026-02-10 Shenxi Wu, Haosong Zhang, Xingjian Ma, Shirui Bian, Yichi Zhang, Xi Chen, Wei Lin

GQA-$\mu$P: The maximal parameterization update for grouped query attention

Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization ($\mu$P) ensures transfer through principled mathematical…

Machine Learning · Computer Science 2026-05-18 Kyle R. Chickering, Huijuan Wang, Mengxi Wu, Alexander Moreno, Muhao Chen, Xuezhe Ma, Daria Soboleva, Joel Hestness, Zhengzhong Liu, Eric Xing

Towards a Unified View of Parameter-Efficient Transfer Learning

Fine-tuning large pre-trained language models on downstream tasks has become the de-facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pre-trained model, which becomes prohibitive as the…

Computation and Language · Computer Science 2022-02-03 Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig

Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often…

Machine Learning · Computer Science 2026-03-18 Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto

Weight Decay may matter more than muP for Learning Rate Transfer in Practice

Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP)…

Machine Learning · Computer Science 2026-02-16 Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen

A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

Although deep learning has produced dazzling successes for applications of image, speech, and video processing in the past few years, most trainings are with suboptimal hyper-parameters, requiring unnecessarily long training times. Setting…

Machine Learning · Computer Science 2018-04-25 Leslie N. Smith

Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts…

Machine Learning · Computer Science 2026-01-21 Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs…

Machine Learning · Computer Science 2022-03-29 Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, Jianfeng Gao

Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks

State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing…

Computation and Language · Computer Science 2021-06-09 Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, James Henderson

On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice

Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine…

Machine Learning · Computer Science 2022-10-06 Li Yang, Abdallah Shami

Parameter-Efficient Transfer Learning for NLP

Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we…

Machine Learning · Computer Science 2019-06-14 Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly

Parameter Transfer Unit for Deep Neural Networks

Parameters in deep neural networks which are trained on large-scale databases can generalize across multiple domains, which is referred as "transferability". Unfortunately, the transferability is usually defined as discrete states and it…

Machine Learning · Computer Science 2018-04-25 Yinghua Zhang, Yu Zhang, Qiang Yang