Related papers: A Hardware-oriented Algorithm for Complex-valued C…
Fast algorithms have been used in calculating integral or discrete transforms to multiply vectors by matrices whose elements are values of special functions (Chebyshev, Legendre, Laguerre, etc.). The currently…
Matrix multiplication is a fundamental operation in both training and inference of neural networks. To accelerate matrix multiplication, Graphics Processing Units (GPUs) implement it in hardware. Due to the increased throughput…
In this paper, we offer and discuss three efficient structural solutions for the hardware-oriented implementation of discrete quaternion Fourier transform basic operations with reduced implementation complexities. The first solution: a…
This paper presents a structural design of a hardware-efficient module for implementing the basic operation of a convolutional neural network (CNN) with reduced implementation complexity. For this purpose we utilize a modification of the…
In this work, a rationalized algorithm for calculating the quotient of two quaternions is presented which reduces the number of underlying real multiplications. Hardware for fast multiplication is much more expensive than hardware for fast…
Vector-Matrix Multiplication (VMM) is a fundamental and frequently required computation in the inference of Neural Networks (NN). Due to the large data movement required during inference, VMM can benefit greatly from in-memory computing.…
Multiple Constant Multiplication (MCM) over integers is a frequent operation arising in embedded systems that require highly optimized hardware. An efficient way is to replace costly generic multiplication by bit-shifts and additions, i.e.…
This document describes an algorithm to scale a complex vector by the reciprocal of a complex value. The algorithm computes the reciprocal of the complex value and then scales the vector by the reciprocal. Some scaling may be necessary due…
Approximate computing is a promising approach to reduce the power, delay, and area in hardware design for many error-resilient applications such as machine learning (ML) and digital signal processing (DSP) systems, in which multipliers…
Kernel matrices are crucial in many learning tasks such as support vector machines or kernel ridge regression. The kernel matrix is typically dense and large-scale. Depending on the dimension of the feature space even the computation of all…
In this paper, we present several resource-efficient algorithmic solutions for the fully parallel hardware implementation of the basic filtering operation performed in the convolutional layers of convolutional neural networks. In fact,…
Matrix multiplication consumes a large fraction of the time taken in many machine-learning algorithms. Thus, accelerator chips that perform matrix multiplication faster than conventional processors or even GPUs are of increasing interest.…
Studies on time and memory costs of products in geometric algebra have been limited to cases where multivectors with multiple grades have only non-zero elements. This allows efficient general-purpose algorithms to be designed; however, it…
In this work, a rationalized algorithm for calculating the quotient of two complex numbers is presented that reduces the number of underlying real multiplications. Performing a complex number division using the naive method takes 4…
In recent years, a new kind of accelerated hardware has gained popularity in the Artificial Intelligence (AI) and Machine Learning (ML) communities which enables extremely high-performance tensor contractions in reduced precision for deep…
In this paper we propose a fast optimization algorithm for approximately minimizing convex quadratic functions over the intersection of affine and separable constraints (i.e., the Cartesian product of possibly nonconvex real sets). This…
Recently, the demand for low-power deep-learning hardware for industrial applications has been increasing. Most existing artificial intelligence (AI) chips have evolved to rely on new chip technologies rather than on radically new hardware…
Kernel matrix-vector product is ubiquitous in many science and engineering applications. However, a naive method requires $O(N^2)$ operations, which becomes prohibitive for large-scale problems. We introduce a parallel method that provably…
Quantum-dot cellular automata (QCA) shows promise as a post-silicon CMOS, low-power computational technology. Nevertheless, to generalize QCA for next-generation digital devices, the ability to implement conventional programmable circuits…
Ootomo, Ozaki, and Yokota [Int. J. High Perform. Comput. Appl., 38 (2024), pp. 297-313] have proposed a strategy to recast a floating-point matrix multiplication in terms of integer matrix products. The factors A and B are split into integer…
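The last entry describes recasting a floating-point matrix product as integer matrix products. As a rough illustration of that general idea only, the sketch below splits each factor into small-integer slices after power-of-two scaling, multiplies the slices with exact integer arithmetic, and recombines the results. It is a simplified sketch and not the cited authors' algorithm: the function names `split_int_slices` and `sliced_matmul`, the slice width `bits=7`, and the slice count `num_slices=8` are all assumptions chosen for this example.

```python
import numpy as np

def split_int_slices(M, axis, num_slices=8, bits=7):
    """Split M into integer slices S_k so that
    M ~= scale * sum_k S_k * 2**(-bits * (k + 1)).
    Illustrative helper (hypothetical name), not a library function."""
    M = np.asarray(M, dtype=np.float64)
    # Power-of-two scale per row (axis=1) or column (axis=0), chosen so
    # that every scaled entry has magnitude strictly below 1.
    max_abs = np.max(np.abs(M), axis=axis, keepdims=True)
    _, exp = np.frexp(max_abs)        # max_abs = m * 2**exp with m < 1
    scale = np.ldexp(1.0, exp)
    R = M / scale                     # exact: division by a power of two
    slices = []
    for _ in range(num_slices):
        R = R * (1 << bits)           # shift the next `bits` bits left of the point
        S = np.trunc(R)               # integer part, |S| < 2**bits
        slices.append(S.astype(np.int64))
        R = R - S                     # fractional part, computed exactly
    return slices, scale

def sliced_matmul(A, B, num_slices=8, bits=7):
    """Approximate A @ B using only integer products of the slices."""
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    A_slices, a_scale = split_int_slices(A, axis=1, num_slices=num_slices, bits=bits)
    B_slices, b_scale = split_int_slices(B, axis=0, num_slices=num_slices, bits=bits)
    C = np.zeros((A.shape[0], B.shape[1]))
    for i, Ai in enumerate(A_slices):
        for j, Bj in enumerate(B_slices):
            P = Ai @ Bj               # integer matrix product (int64)
            C += np.ldexp(P.astype(np.float64), -bits * (i + j + 2))
    # Undo the row scaling of A and the column scaling of B.
    return C * a_scale * b_scale

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(sliced_matmul(A, B))
```

With 8 slices of 7 bits, each scaled entry is captured to 56 bits, which covers a double-precision significand for the largest entry in each row or column; smaller entries lose some low-order bits. The sketch forms every slice pair, which is the simplest choice rather than an efficient one.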