Related papers: Distributed Low Precision Training Without Mixed P…

Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters

In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution…

Machine Learning · Computer Science 2019-12-03 Alexey Svyatkovskiy, Julian Kates-Harbeck, William Tang

PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit

Low-precision formats have proven to be an efficient way to reduce not only the memory footprint but also the hardware resources and power consumption of deep learning computations. Under this premise, the posit numerical format appears to…

Machine Learning · Computer Science 2021-05-17 Gonçalo Raposo, Pedro Tomás, Nuno Roma

A Hitchhiker's Guide On Distributed Training of Deep Neural Networks

Deep learning has led to tremendous advancements in the field of Artificial Intelligence. One caveat however is the substantial amount of compute needed to train these deep learning models. Training a benchmark dataset like ImageNet on a…

Machine Learning · Computer Science 2018-10-30 Karanbir Chahal, Manraj Singh Grover, Kuntal Dey

Mixed Precision Training

Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models…

Artificial Intelligence · Computer Science 2018-02-19 Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu

Training with Mixed-Precision Floating-Point Assignments

When training deep neural networks, keeping all tensors in high precision (e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy loss.…

Machine Learning · Computer Science 2023-06-26 Wonyeol Lee, Rahul Sharma, Alex Aiken

Revisiting 16-bit Neural Network Training: A Practical Approach for Resource-Limited Learning

With the increasing complexity of machine learning models, managing computational resources like memory and processing power has become a critical concern. Mixed precision techniques, which leverage different numerical precisions during…

Machine Learning · Computer Science 2026-04-20 Juyoung Yun, Sol Choi, Francois Rameau, Byungkon Kang, Zhoulai Fu

FP8-LM: Training FP8 Large Language Models

In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables, such as gradients and optimizer states, in LLM training can employ low-precision data formats…

Machine Learning · Computer Science 2023-12-20 Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng

Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs

Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks, limiting the productivity and experimentation of deep learning practitioners. As networks grow in size and complexity,…

Computer Vision and Pattern Recognition · Computer Science 2020-06-18 Aditya Rajagopal, Diederik Adriaan Vink, Stylianos I. Venieris, Christos-Savvas Bouganis

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Machine Learning · Computer Science 2023-06-21 Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt

Deep Learning with Limited Numerical Precision

Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of…

Machine Learning · Computer Science 2015-02-11 Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan

Mixed Precision Training With 8-bit Floating Point

Reduced precision computation for deep neural networks is one of the key areas addressing the widening compute gap driven by an exponential growth in model size. In recent years, deep learning training has largely migrated to 16-bit…

Machine Learning · Computer Science 2019-05-30 Naveen Mellempudi, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul

Training Deep Neural Networks Using Posit Number System

With the increasing size of Deep Neural Network (DNN) models, the high memory space requirements and computational complexity have become an obstacle for efficient DNN implementations. To ease this problem, using reduced-precision…

Machine Learning · Computer Science 2019-09-10 Jinming Lu, Siyuan Lu, Zhisheng Wang, Chao Fang, Jun Lin, Zhongfeng Wang, Li Du

Data optimization for large batch distributed training of deep neural networks

Distributed training in deep learning (DL) is common practice as data and models grow. The current practice for distributed training of deep neural networks faces the challenges of communication bottlenecks when operating at scale, and…

Machine Learning · Computer Science 2020-12-21 Shubhankar Gahlot, Junqi Yin, Mallikarjun Shankar

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although using larger mini-batch sizes can improve the system scalability by reducing the…

Machine Learning · Computer Science 2018-07-31 Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, Xiaowen Chu

Low Precision Neural Networks using Subband Decomposition

Large-scale deep neural networks (DNN) have been successfully used in a number of tasks from image recognition to natural language processing. They are trained using large training sets on large models, making them computationally and…

Machine Learning · Computer Science 2017-03-28 Sek Chai, Aswin Raghavan, David Zhang, Mohamed Amer, Tim Shields

Revisiting BFloat16 Training

State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit hardware compute units alone are not enough to maximize model accuracy. As a result, deep learning…

Machine Learning · Computer Science 2021-03-09 Pedram Zamirai, Jian Zhang, Christopher R. Aberger, Christopher De Sa

Exploration of Low Numeric Precision Deep Learning Inference Using Intel FPGAs

CNNs have been shown to maintain reasonable classification accuracy when quantized to lower precisions. Quantizing to sub 8-bit activations and weights can result in accuracy falling below an acceptable threshold. Techniques exist for…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-02 Philip Colangelo, Nasibeh Nasiri, Asit Mishra, Eriko Nurvitadhi, Martin Margala, Kevin Nealis

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-21 Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu, Xiaowen Chu

Decentralized Diffusion Models

Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up…

Computer Vision and Pattern Recognition · Computer Science 2025-01-13 David McAllister, Matthew Tancik, Jiaming Song, Angjoo Kanazawa

Deep Neural Network Training without Multiplications

Is multiplication really necessary for deep neural networks? Here we propose just adding two IEEE754 floating-point numbers with an integer-add instruction in place of a floating-point multiplication instruction. We show that ResNet can be…

Machine Learning · Computer Science 2020-12-08 Tsuguo Mogami