Related papers

Related papers: Completed Hyperparameter Transfer across Modules, …

200 papers

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters…

Machine Learning · Computer Science 2026-05-21 Dayal Singh Kalra, Maissam Barkeshli

The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from…

Machine Learning · Computer Science 2025-12-30 Nikhil Ghosh, Denny Wu, Alberto Bietti

The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the…

Machine Learning · Statistics 2023-12-11 Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan

Modern large-scale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing larger models from trained smaller…

Machine Learning · Computer Science 2026-02-12 Yuxin Ma, Nan Chen, Mateo Díaz, Soufiane Hayou, Dmitriy Kunisky, Soledad Villar

Deep learning models have become a cornerstone of modern AI research, yet their initializations and learning rates may at times be set in an opaque or ad-hoc fashion due to the high cost of hyperparameter sweeps. The $\mu$-Parameterization…

Machine Learning · Computer Science 2025-02-17 Lucas Lingle

We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as…

Machine Learning · Computer Science 2026-01-21 Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness

Several variations of adaptive first-order and second-order optimization methods have been proposed to accelerate and scale the training of large language models. The performance of these optimization routines is highly sensitive to the…

Machine Learning · Computer Science 2026-02-25 Akshita Gupta, Marieme Ngom, Sam Foreman, Venkatram Vishwanath

Recent years have seen a growing interest and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models reach over $1$T parameters. At such…

Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization ($\mu$P) helps explain why many hyperparameters transfer across width. Yet depth scaling…

Machine Learning · Computer Science 2026-02-10 Shenxi Wu, Haosong Zhang, Xingjian Ma, Shirui Bian, Yichi Zhang, Xi Chen, Wei Lin

Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization ($\mu$P) ensures transfer through principled mathematical…

Fine-tuning large pre-trained language models on downstream tasks has become the de-facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pre-trained model, which becomes prohibitive as the…

Computation and Language · Computer Science 2022-02-03 Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig

Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often…

Machine Learning · Computer Science 2026-03-18 Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto

Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP)…

Machine Learning · Computer Science 2026-02-16 Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen

Although deep learning has produced dazzling successes for applications of image, speech, and video processing in the past few years, most trainings are with suboptimal hyper-parameters, requiring unnecessarily long training times. Setting…

Machine Learning · Computer Science 2018-04-25 Leslie N. Smith

Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts…

Machine Learning · Computer Science 2026-01-21 Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs…

State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing…

Computation and Language · Computer Science 2021-06-09 Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, James Henderson

Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine…

Machine Learning · Computer Science 2022-10-06 Li Yang, Abdallah Shami

Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we…

Parameters in deep neural networks which are trained on large-scale databases can generalize across multiple domains, which is referred to as "transferability". Unfortunately, the transferability is usually defined as discrete states and it…

Machine Learning · Computer Science 2018-04-25 Yinghua Zhang, Yu Zhang, Qiang Yang