English
Related papers

Related papers: On Approximate 8-bit Floating-Point Operations Usi…

200 papers

Conversion between binary and decimal floating-point representations is ubiquitous. Floating-point radix conversion means converting both the exponent and the mantissa. We develop an atomic operation for FP radix conversion with simple…

Mathematical Software · Computer Science 2014-07-21 O. Kupriianova , Ch. Lauter , J. -M. Muller

We introduce two algorithms for accurately evaluating powers to a positive integer in floating-point arithmetic, assuming a fused multiply-add (fma) instruction is available. We show that our log-time algorithm always produce…

Numerical Analysis · Computer Science 2007-06-13 Peter Kornerup , Vincent Lefèvre , Jean-Michel Muller

Presented here are algorithms for converting between (decimal) scientific-notation and (binary) IEEE-754 double-precision floating-point numbers. By employing a rounding integer quotient operation these algorithms are much simpler than…

Numerical Analysis · Computer Science 2018-08-08 Aubrey Jaffer

Data files often consist of numbers having only a few significant decimal digits, whose information content would allow storage in only 32 bits. However, we may require that arithmetic operations involving these numbers be done with 64-bit…

Computation · Statistics 2015-04-14 Radford M. Neal

Efficient number representation is essential for federated learning, natural language processing, and network measurement solutions. Due to timing, area, and power constraints, such applications use narrow bit-width (e.g., 8-bit) number…

Networking and Internet Architecture · Computer Science 2024-10-08 Itamar Cohen , Gil Einziger

Iterative solvers are frequently used in scientific applications and engineering computations. However, the memory-bound Sparse Matrix-Vector (SpMV) kernel computation hinders the efficiency of iterative algorithms. As modern hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-08 Jianhua Gao , Jiayuan Shen , Yuxiang Zhang , Weixing Ji , Hua Huang

We propose a novel floating-point encoding scheme that builds on prior work involving fixed-point encodings. We encode floating-point numbers using Two's Complement fixed-point mantissas and Two's Complement integral exponents. We used our…

Single-precision floating point (FP32) data format, defined by the IEEE 754 standard, is widely employed in scientific computing, signal processing, and deep learning training, where precision is critical. However, FP32 multiplication is…

Hardware Architecture · Computer Science 2025-10-09 Bindu G Gowda , Yogesh Goyal , Yash Gupta , Madhav Rao

For scientific computations on a digital computer the set of real number is usually approximated by a finite set F of "floating-point" numbers. We compare the numerical accuracy possible with difference choices of F having approximately the…

Numerical Analysis · Computer Science 2010-04-21 Richard P. Brent

When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale…

Machine Learning · Computer Science 2024-02-26 Andrey Kuzmin , Mart Van Baalen , Yuwei Ren , Markus Nagel , Jorn Peters , Tijmen Blankevoort

This paper discusses a simple and effective method for the summation of long sequences of floating point numbers. The method comprises two phases: an accumulation phase where the mantissas of the floating point numbers are added to…

Computer Vision and Pattern Recognition · Computer Science 2024-06-11 Vincenzo Liguori

There is a growing interest in the use of reduced-precision arithmetic, exacerbated by the recent interest in artificial intelligence, especially with deep learning. Most architectures already provide reduced-precision capabilities (e.g.,…

Hardware Architecture · Computer Science 2022-12-09 Olivier Sentieys , Daniel Menard

Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity…

Computation and Language · Computer Science 2024-10-03 Hongyin Luo , Wei Sun

On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and…

Mathematical Software · Computer Science 2015-05-13 Marc Baboulin , Alfredo Buttari , Jack Dongarra , Jakub Kurzak , Julie Langou , Julien Langou , Piotr Luszczek , Stanimire Tomov

Generating 2-by-2 unitary matrices in floating-precision arithmetic is a delicate task. One way to reduce the accumulation error is to use less floating-point operations to compute each of the entries in the 2-by-2 unitary matrix. This…

Numerical Analysis · Mathematics 2022-11-09 Weslley da Silva Pereira , Ali Lotfi , Julien Langou

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Shivam Aggarwal , Hans Jakob Damsgaard , Alessandro Pappalardo , Giuseppe Franco , Thomas B. Preußer , Michaela Blott , Tulika Mitra

Verification of programs using floating-point arithmetic is challenging on several accounts. One of the difficulties of reasoning about such programs is due to the peculiarities of floating-point arithmetic: rounding errors, infinities,…

Programming Languages · Computer Science 2022-06-23 Roberto Bagnara , Abramo Bagnara , Fabio Biselli , Michele Chiari , Roberta Gori

The Graphic Processing Unit (GPU) has evolved into a powerful and flexible processor. The latest graphic processors provide fully programmable vertex and pixel processing units that support vector operations up to single floating-point…

Hardware Architecture · Computer Science 2007-05-23 Guillaume Da Graçca , David Defour

Floating point multiplication is one of the crucial operations in many application domains such as image processing, signal processing etc. But every application requires different working features. Some need high precision, some need low…

Hardware Architecture · Computer Science 2020-12-08 S. Arish , R. K. Sharma

Mixing precisions for performance has been an ongoing trend as the modern hardware accelerators started including new, and mostly lower-precision, data formats. The advantage of using them is a great potential of performance gain and energy…

‹ Prev 1 2 3 10 Next ›