Related papers: Efficient Multi-Cycle Folded Integer Multipliers
Today every circuit has to face the power consumption issue for both portable device aiming at large battery life and high end circuits avoiding cooling packages and reliability issues that are too complex. It is generally accepted that…
The data transfer between a processor and memory has become a design bottleneck in data-intensive applications. Processing-In-Memory (PIM) is a practical approach to overcome the memory wall bottleneck. The 4:2 compressor is suitable for…
Multiple Constant Multiplication (MCM) over integers is a frequent operation arising in embedded systems that require highly optimized hardware. An efficient way is to replace costly generic multiplication by bit-shifts and additions, i.e.…
Matrix multiplication is the dominant computation during Machine Learning (ML) inference. To efficiently perform such multiplication operations, Compute-in-memory (CiM) paradigms have emerged as a highly energy efficient solution. However,…
Memristive Processing In-Memory (PIM) is one of the promising techniques for overcoming the Von-Neumann bottleneck. Reduction of data transfer between processor and memory and data processing by memristors in data-intensive applications…
Elliptic curve cryptography (ECC) has emerged as the dominant public-key protocol, with NIST standardizing parameters for binary field GF(2^m) ECC systems. This work presents a hardware implementation of a Hybrid Multiplication technique…
In recent years, various computing-in-memory (CIM) processors have been presented, showing superior performance over traditional architectures. To unleash the potential of various CIM architectures, such as device precision, crossbar size,…
Given the stringent requirements of energy efficiency for Internet-of-Things edge devices, approximate multipliers, as a basic component of many processors and accelerators, have been constantly proposed and studied for decades, especially…
Computing-in-Memory (CIM) accelerators are a promising solution for accelerating Machine Learning (ML) workloads, as they perform Matrix-Vector Multiplications (MVMs) on crossbar arrays directly in memory. Although the bit widths of the…
Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands are likely to become a fundamental kernel of many important workloads including neural networks and machine learning. While existing SIMD…
The ever-increasing quest for data-level parallelism and variable precision in ubiquitous multimedia and Deep Neural Network (DNN) applications has motivated the use of Single Instruction, Multiple Data (SIMD) architectures. To alleviate…
Multiplication is an indispensable operation in most of digital signal processing systems. Recently, many systems need to execute different types of algorithms on a multiplier. Therefore, it needs complicated computation and large area…
This paper describes several new improvements of modular arithmetic and how to exploit them in order to gain more efficient implementations of commonly used algorithms, especially in cryptographic applications. We further present a new…
The ever-increasing size and computational complexity of today's machine-learning algorithms pose an increasing strain on the underlying hardware. In this light, novel and dedicated architectural solutions are required to optimize energy…
Single instruction, multiple data (SIMD) is a popular design style of in-memory computing (IMC) architectures, which enables memory arrays to perform logic operations to achieve low energy consumption and high parallelism. To implement a…
In this work faster unsigned multiplication has been achieved by using a combination of High Performance Multiplication [HPM] column reduction technique and implementing a N-bit multiplier using 4 N/2-bit multipliers (recursive…
Approximate multipliers are widely being advocated for energy-efficient computing in applications that exhibit an inherent tolerance to inaccuracy. However, the inclusion of accuracy as a key design parameter, besides the performance, area…
This work presents a method to maximize power-efficiency of fixed point multiplier units by decomposing them into sub-components. First, an encoder block converts the operands from a two's complement to a sign magnitude representation,…
Vector multiplication is a fundamental operation for AI acceleration, responsible for over 85% of computational load in convolution tasks. While essential, these operations are primary drivers of area, power, and delay in modern datapath…
While reduction in feature size makes computation cheaper in terms of latency, area, and power consumption, performance of emerging data-intensive applications is determined by data movement. These trends have introduced the concept of…