Related papers: Learning under Quantization for High-Dimensional L…

Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding

LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a…

Machine Learning · Computer Science 2025-11-04 Taowen Liu, Marta Andronic, Deniz Gündüz, George A. Constantinides

DQ-SGD: Dynamic Quantization in SGD for Communication-Efficient Distributed Learning

Gradient quantization is an emerging technique for reducing communication costs in distributed learning. Existing gradient quantization algorithms often rely on engineering heuristics or empirical observations, lacking a systematic approach…

Machine Learning · Computer Science 2021-08-02 Guangfeng Yan, Shao-Lun Huang, Tian Lan, Linqi Song

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to excellent scalability properties of this algorithm, and to its efficiency in the context of training deep neural networks.…

Machine Learning · Computer Science 2017-12-07 Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic

Special Properties of Gradient Descent with Large Learning Rates

When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well…

Machine Learning · Computer Science 2023-02-17 Amirkeivan Mohtashami, Martin Jaggi, Sebastian Stich

Scaling Laws for Precision in High-Dimensional Linear Regression

Low-precision training is critical for optimizing the trade-off between model quality and training costs, necessitating the joint allocation of model size, dataset size, and numerical precision. While empirical scaling laws suggest that…

Machine Learning · Statistics 2026-02-27 Dechen Zhang, Xuan Tang, Yingyu Liang, Difan Zou

DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

Differentially-Private SGD (DP-SGD) and its adaptive variant DP-Adam are powerful techniques to protect user privacy when using sensitive data to train neural networks. During training, converting model weights and activations into…

Machine Learning · Computer Science 2026-04-17 Yubo Gao, Renbo Tu, Gennady Pekhimenko, Nandita Vijaykumar

MQGrad: Reinforcement Learning of Gradient Quantization in Parameter Server

One of the most significant bottlenecks in training large-scale machine learning models on a parameter server (PS) is the communication overhead, because it needs to frequently exchange the model gradients between the workers and servers…

Machine Learning · Computer Science 2018-04-25 Guoxin Cui, Jun Xu, Wei Zeng, Yanyan Lan, Jiafeng Guo, Xueqi Cheng

Effect of Weight Quantization on Learning Models by Typical Case Analysis

This paper examines the quantization methods used in large-scale data analysis models and their hyperparameter choices. The recent surge in data analysis scale has significantly increased computational resource requirements. To address…

Machine Learning · Statistics 2024-01-31 Shuhei Kashiwamura, Ayaka Sakata, Masaaki Imaizumi

RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

With the rapid increase in the size of neural networks, model compression has become an important area of research. Quantization is an effective technique for decreasing the model size, memory access, and compute load of large models.…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-26 David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is…

Machine Learning · Computer Science 2017-02-13 Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang

Neural Networks with Quantization Constraints

Enabling low precision implementations of deep learning models, without considerable performance degradation, is necessary in resource and latency constrained settings. Moreover, exploiting the differences in sensitivity to quantization…

Machine Learning · Computer Science 2022-10-28 Ignacio Hounie, Juan Elenter, Alejandro Ribeiro

Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients

Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing learning rates over time on…

Machine Learning · Computer Science 2013-03-28 Tom Schaul, Yann LeCun

Exploiting Explainable Metrics for Augmented SGD

Explaining the generalization characteristics of deep learning is an emerging topic in advanced machine learning. There are several unanswered questions about how learning under stochastic optimization really works and why certain…

Machine Learning · Computer Science 2022-04-01 Mahdi S. Hosseini, Mathieu Tuli, Konstantinos N. Plataniotis

A Statistical Framework for Low-bitwidth Training of Deep Neural Networks

Fully quantized training (FQT), which uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model, is a promising approach to accelerate the training of deep neural networks. One major…

Machine Learning · Computer Science 2020-10-28 Jianfei Chen, Yu Gai, Zhewei Yao, Michael W. Mahoney, Joseph E. Gonzalez

In-Hindsight Quantization Range Estimation for Quantized Training

Quantization techniques applied to the inference of deep neural networks have enabled fast and efficient execution on resource-constrained devices. The success of quantization during inference has motivated the academic community to explore…

Machine Learning · Computer Science 2021-05-11 Marios Fournarakis, Markus Nagel

Understanding Forgetting in Continual Learning with Linear Regression

Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. Despite the tremendous progress made in the past, the theoretical understanding, especially factors contributing to catastrophic…

Machine Learning · Computer Science 2024-05-29 Meng Ding, Kaiyi Ji, Di Wang, Jinhui Xu

On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD

One crucial factor behind the success of deep learning lies in the implicit bias induced by noise inherent in gradient-based training algorithms. Motivated by empirical observations that training with noisy labels improves model…

Machine Learning · Computer Science 2026-03-12 Tongcheng Zhang, Zhanpeng Zhou, Mingze Wang, Andi Han, Wei Huang, Taiji Suzuki, Junchi Yan

Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization

Large-scale distributed optimization is of great importance in various applications. For data-parallel based distributed learning, the inter-node gradient communication often becomes the performance bottleneck. In this paper, we propose the…

Computer Vision and Pattern Recognition · Computer Science 2018-06-22 Jiaxiang Wu, Weidong Huang, Junzhou Huang, Tong Zhang

The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization

Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training. Recent studies on SGD for canonical quadratic optimization or linear regression show it attains good generalization…

Machine Learning · Computer Science 2024-09-17 Haihan Zhang, Yuanshi Liu, Qianwen Chen, Cong Fang

Implicit Regularization of Stochastic Gradient Descent in Natural Language Processing: Observations and Implications

Deep neural networks with remarkably strong generalization performance are usually over-parameterized. Although practitioners use explicit regularization strategies to avoid over-fitting, the impacts are often small. Some…

Computation and Language · Computer Science 2018-11-05 Deren Lei, Zichen Sun, Yijun Xiao, William Yang Wang