Related papers: EFloat: Entropy-coded Floating Point Format for Co…

AdaptivFloat: A Floating-point based Data Type for Resilient Deep Learning Inference

Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low word sizes as their shrinking dynamic ranges cannot adequately capture the wide data distributions commonly seen in…

Machine Learning · Computer Science 2020-02-12 Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, Gu-Yeon Wei

A Study of BFLOAT16 for Deep Learning Training

This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language…

Machine Learning · Computer Science 2019-06-14 Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, Pradeep Dubey

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)

Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava

Data Compression with Relative Entropy Coding

Over the last few years, machine learning unlocked previously infeasible features for compression, such as providing guarantees for users' privacy or tailoring compression to specific data statistics (e.g., satellite images or audio…

Information Theory · Computer Science 2026-03-25 Gergely Flamich

A Transprecision Floating-Point Platform for Ultra-Low Power Computing

In modern low-power embedded platforms, floating-point (FP) operations emerge as a major contributor to the energy consumption of compute-intensive applications with large dynamic range. Experimental evidence shows that 50% of the energy…

Hardware Architecture · Computer Science 2017-11-29 Giuseppe Tagliavini, Stefan Mach, Davide Rossi, Andrea Marongiu, Luca Benini

To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision…

Machine Learning · Computer Science 2025-10-06 Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava

EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

Large Language Models (LLMs) achieve strong performance across tasks, but face storage and compute challenges on edge devices. We propose EntroLLM, a compression framework combining mixed quantization and entropy coding to reduce storage…

Machine Learning · Computer Science 2026-05-05 Arnab Sanyal, Gourav Datta, Prithwish Mukherjee, Sandeep P. Chinchali, Michael Orshansky

Revisiting BFloat16 Training

State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit hardware compute units alone are not enough to maximize model accuracy. As a result, deep learning…

Machine Learning · Computer Science 2021-03-09 Pedram Zamirai, Jian Zhang, Christopher R. Aberger, Christopher De Sa

Ascend HiFloat8 Format for Deep Learning

This preliminary white paper proposes a novel 8-bit floating-point data format HiFloat8 (abbreviated as HiF8) for deep learning. HiF8 features tapered precision. For normal value encoding, it provides 7 exponent values with 3-bit mantissa,…

Machine Learning · Computer Science 2024-09-27 Yuanyong Luo, Zhongxing Zhang, Richard Wu, Hu Liu, Ying Jin, Kai Zheng, Minmin Wang, Zhanying He, Guipeng Hu, Luyao Chen, Tianchi Hu, Junsong Wang, Minqi Chen, Mikhaylov Dmitry, Korviakov Vladimir, Bobrin Maxim, Yuhao Hu, Guanfu Chen, Zeyi Huang

MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V cores

Low-precision formats have recently driven major breakthroughs in neural network (NN) training and inference by reducing the memory footprint of the NN models and improving the energy efficiency of the underlying hardware architectures.…

Hardware Architecture · Computer Science 2024-10-28 Luca Bertaccini, Gianna Paulin, Tim Fischer, Stefan Mach, Luca Benini

Float8@2bits: Entropy Coding Enables Data-Free Model Compression

Post-training compression is currently divided into two contrasting regimes. On the one hand, fast, data-free, and model-agnostic methods (e.g., NF4 or HQQ) offer maximum accessibility but suffer from functional collapse at extreme…

Machine Learning · Computer Science 2026-02-02 Patrick Putzky, Martin Genzel, Mattes Mollenhauer, Sebastian Schulze, Thomas Wollmann, Stefan Dietzel

Embedding Compression via Spherical Coordinates

We present an $\epsilon$-bounded compression method for unit-norm embeddings that achieves 1.5$\times$ compression, 25% better than the best prior lossless method. The method exploits that spherical coordinates of high-dimensional unit…

Machine Learning · Computer Science 2026-03-27 Han Xiao

All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and Memory-Efficient Inference of Deep Neural Networks

Modern deep neural network (DNN) models generally require a huge amount of weight and activation values to achieve good inference outcomes. Those data inevitably demand a massive off-chip memory capacity/bandwidth, and the situation gets…

Machine Learning · Computer Science 2021-04-27 Cheng-Wei Huang, Tim-Wei Chen, Juinn-Dar Huang

An Efficient Gradient-Aware Error-Bounded Lossy Compressor for Federated Learning

Federated learning (FL) enables collaborative model training without exposing clients' private data, but its deployment is often constrained by the communication cost of transmitting gradients between clients and the central server,…

Machine Learning · Computer Science 2025-11-11 Zhijing Ye, Sheng Di, Jiamin Wang, Zhiqing Zhong, Zhaorui Zhang, Xiaodong Yu

Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats

As deep learning models grow and deployment becomes more widespread, reducing the storage and transmission costs of neural network weights has become increasingly important. While prior work such as ZipNN has shown that lossless compression…

Machine Learning · Computer Science 2025-08-28 Anat Heilper, Doron Singer

Compressed Real Numbers for AI: a case-study using a RISC-V CPU

As recently demonstrated, Deep Neural Networks (DNN), usually trained using single precision IEEE 754 floating point numbers (binary32), can also work using lower precision. Therefore, 16-bit and 8-bit compressed format have attracted…

Machine Learning · Computer Science 2023-09-15 Federico Rossi, Marco Cococcioni, Roger Ferrer Ibàñez, Jesùs Labarta, Filippo Mantovani, Marc Casas, Emanuele Ruffaldi, Sergio Saponara

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

Complex Block Floating-Point Format with Box Encoding For Wordlength Reduction in Communication Systems

We propose a new complex block floating-point format to reduce implementation complexity. The new format achieves wordlength reduction by sharing an exponent across the block of samples, and uses box encoding for the shared exponent to…

Information Theory · Computer Science 2017-10-26 Yeong Foong Choo, Brian L. Evans, Alan Gatherer

Microscaling Floating Point Formats for Large Language Models

The increasing computational and memory demands of large language models (LLMs) necessitate innovative approaches to optimize resource usage without compromising performance. This paper leverages microscaling floating-point formats, a novel…

Neural and Evolutionary Computing · Computer Science 2025-10-03 Marco Cococcioni, Dario Pagani, Federico Rossi

The Weight of a Bit: EMFI Sensitivity Analysis of Embedded Deep Learning Models

Fault injection attacks on embedded neural network models have been shown as a potent threat. Numerous works studied resilience of models from various points of view. As of now, there is no comprehensive study that would evaluate the…

Cryptography and Security · Computer Science 2026-04-14 Jakub Breier, Štefan Kučerák, Xiaolu Hou