Related papers: On Approximate 8-bit Floating-Point Operations Usi…

Radix Conversion for IEEE754-2008 Mixed Radix Floating-Point Arithmetic

Conversion between binary and decimal floating-point representations is ubiquitous. Floating-point radix conversion means converting both the exponent and the mantissa. We develop an atomic operation for FP radix conversion with simple…

Mathematical Software · Computer Science 2014-07-21 O. Kupriianova, Ch. Lauter, J.-M. Muller

Computing Integer Powers in Floating-Point Arithmetic

We introduce two algorithms for accurately evaluating powers to a positive integer in floating-point arithmetic, assuming a fused multiply-add (fma) instruction is available. We show that our log-time algorithm always produce…

Numerical Analysis · Computer Science 2007-06-13 Peter Kornerup, Vincent Lefèvre, Jean-Michel Muller

Easy Accurate Reading and Writing of Floating-Point Numbers

Presented here are algorithms for converting between (decimal) scientific-notation and (binary) IEEE-754 double-precision floating-point numbers. By employing a rounding integer quotient operation these algorithms are much simpler than…

Numerical Analysis · Computer Science 2018-08-08 Aubrey Jaffer

Representing numeric data in 32 bits while preserving 64-bit precision

Data files often consist of numbers having only a few significant decimal digits, whose information content would allow storage in only 32 bits. However, we may require that arithmetic operations involving these numbers be done with 64-bit…

Computation · Statistics 2015-04-14 Radford M. Neal

Floating-floating point: a highly accurate number representation with flexible Counting ranges

Efficient number representation is essential for federated learning, natural language processing, and network measurement solutions. Due to timing, area, and power constraints, such applications use narrow bit-width (e.g., 8-bit) number…

Networking and Internet Architecture · Computer Science 2024-10-08 Itamar Cohen, Gil Einziger

Precision-Aware Iterative Algorithms Based on Group-Shared Exponents of Floating-Point Numbers

Iterative solvers are frequently used in scientific applications and engineering computations. However, the memory-bound Sparse Matrix-Vector (SpMV) kernel computation hinders the efficiency of iterative algorithms. As modern hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-08 Jianhua Gao, Jiayuan Shen, Yuxiang Zhang, Weixing Ji, Hua Huang

Efficient Floating-Point Arithmetic on Fault-Tolerant Quantum Computers

We propose a novel floating-point encoding scheme that builds on prior work involving fixed-point encodings. We encode floating-point numbers using Two's Complement fixed-point mantissas and Two's Complement integral exponents. We used our…

Quantum Physics · Physics 2025-10-24 José E. Cruz Serrallés, Oluwadara Ogunkoya, Doğa Murat Kürkçüoğlu, Nicholas Bornman, Norm M. Tubman, Anna Grassellino, Silvia Zorzetti, Riccardo Lattanzi

Hardware-Efficient CNNs: Interleaved Approximate FP32 Multipliers for Kernel Computation

The single-precision floating-point (FP32) data format, defined by the IEEE 754 standard, is widely employed in scientific computing, signal processing, and deep learning training, where precision is critical. However, FP32 multiplication is…

Hardware Architecture · Computer Science 2025-10-09 Bindu G Gowda, Yogesh Goyal, Yash Gupta, Madhav Rao

On the precision attainable with various floating-point number systems

For scientific computations on a digital computer the set of real numbers is usually approximated by a finite set F of "floating-point" numbers. We compare the numerical accuracy possible with different choices of F having approximately the…

Numerical Analysis · Computer Science 2010-04-21 Richard P. Brent

FP8 Quantization: The Power of the Exponent

When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale…

Machine Learning · Computer Science 2024-02-26 Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

Procrastination Is All You Need: Exponent Indexed Accumulators for Floating Point, Posits and Logarithmic Numbers

This paper discusses a simple and effective method for the summation of long sequences of floating point numbers. The method comprises two phases: an accumulation phase where the mantissas of the floating point numbers are added to…

Computer Vision and Pattern Recognition · Computer Science 2024-06-11 Vincenzo Liguori

Customizing Number Representation and Precision

There is a growing interest in the use of reduced-precision arithmetic, exacerbated by the recent interest in artificial intelligence, especially with deep learning. Most architectures already provide reduced-precision capabilities (e.g.,…

Hardware Architecture · Computer Science 2022-12-09 Olivier Sentieys, Daniel Menard

Addition is All You Need for Energy-efficient Language Models

Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity…

Computation and Language · Computer Science 2024-10-03 Hongyin Luo, Wei Sun

Accelerating Scientific Computations with Mixed Precision Algorithms

On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and…

Mathematical Software · Computer Science 2015-05-13 Marc Baboulin, Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Julie Langou, Julien Langou, Piotr Luszczek, Stanimire Tomov

Numerical analysis of Givens rotation

Generating 2-by-2 unitary matrices in floating-precision arithmetic is a delicate task. One way to reduce the accumulation error is to use fewer floating-point operations to compute each of the entries in the 2-by-2 unitary matrix. This…

Numerical Analysis · Mathematics 2022-11-09 Weslley da Silva Pereira, Ali Lotfi, Julien Langou

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

Correct Approximation of IEEE 754 Floating-Point Arithmetic for Program Verification

Verification of programs using floating-point arithmetic is challenging on several accounts. One of the difficulties of reasoning about such programs is due to the peculiarities of floating-point arithmetic: rounding errors, infinities,…

Programming Languages · Computer Science 2022-06-23 Roberto Bagnara, Abramo Bagnara, Fabio Biselli, Michele Chiari, Roberta Gori

Implementation of float-float operators on graphics hardware

The Graphic Processing Unit (GPU) has evolved into a powerful and flexible processor. The latest graphic processors provide fully programmable vertex and pixel processing units that support vector operations up to single floating-point…

Hardware Architecture · Computer Science 2007-05-23 Guillaume Da Graça, David Defour

Run-time reconfigurable multi-precision floating point multiplier design for high speed, low-power applications

Floating-point multiplication is one of the crucial operations in many application domains, such as image processing and signal processing. However, every application requires different working features. Some need high precision, some need low…

Hardware Architecture · Computer Science 2020-12-08 S. Arish, R. K. Sharma

Performance and Numerical Aspects of Decompositional Factorizations with FP64 Floating-Point Emulation in INT8

Mixing precisions for performance has been an ongoing trend as modern hardware accelerators have started including new, and mostly lower-precision, data formats. The advantage of using them is great potential for performance gain and energy…

Numerical Analysis · Mathematics 2025-09-30 Piotr Luszczek, Vijay Gadepally, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel J. Burrill, Chansup Byun, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Julia Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Jeremy Kepner