Related papers: RPU: The Ring Processing Unit
By exploiting the modular RISC-V ISA, this paper presents a customization of the instruction set with posit™ arithmetic instructions to provide improved numerical accuracy, well-defined behavior and increased…
The Ring Learning With Error (RLWE) algorithm is used in Post-Quantum Cryptography (PQC) and Homomorphic Encryption (HE) algorithms. Existing classical crypto algorithms may be broken by quantum computers. Adversaries can store all…
Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory bandwidth bound workloads. This…
In this work, we propose an open-source, first-of-its-kind, arithmetic hardware library with a focus on accelerating the arithmetic operations involved in Ring Learning with Error (RLWE)-based somewhat homomorphic encryption (SHE). We…
Machine learning applications are computationally demanding and power intensive. Hardware acceleration of these software tools is a natural step being explored using various technologies. A recurrent processing unit (RPU) is fast and…
The Ring-Learning With Errors (RLWE) problem forms the backbone of highly efficient Fully Homomorphic Encryption (FHE) schemes. A significant component of the RLWE public key and ciphertext of the form $(b,a)$ is the uniformly random…
In our previous work, we have shown that resistive cross-point devices, so-called Resistive Processing Unit (RPU) devices, can provide significant power and speed benefits when training deep fully connected networks as well as convolutional…
Edge AI deployment faces critical challenges balancing computational performance, energy efficiency, and resource constraints. This paper presents FPGA-accelerated RISC-V instruction set architecture (ISA) extensions for efficient neural…
Processor design and verification require a synergistic approach that combines instruction-level functional simulations with precise hardware emulations. The trade-off between speed and accuracy in the instruction set simulation poses a…
Vector processor architectures offer an efficient solution for accelerating data-parallel workloads (e.g., ML, AI), reducing instruction count, and enhancing processing efficiency. This is evidenced by the increasing adoption of vector…
RC4 can be made more secure if an additional RC4-like Post-KSA Random Shuffling (PKRS) process is introduced between KSA and PRGA. It can also be made significantly faster if RC4 bytes are processed in an FPGA embedded system using multiple…
This paper presents an optimized methodology to design and deploy Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a state-of-the-art MicroController Unit (MCU), with 1+8 general-purpose RISC-V cores. To…
Processing-in-memory (PIM) has shown extraordinary potential in accelerating neural networks. To evaluate the performance of PIM accelerators, we present an ISA-based simulation framework including a dedicated ISA targeting neural networks…
The Random Phase Approximation (RPA) for correlation energy in the grid-based projector augmented wave (GPAW) code is accelerated by porting to the Graphics Processing Unit (GPU) architecture. The acceleration is achieved by grouping…
The exponential growth of Internet of Things (IoT) applications has intensified the demand for efficient, high-throughput, and energy-efficient data processing at the edge. Conventional CPU-centric encryption methods suffer from performance…
RISC-V is a RISC-based, open and royalty-free instruction set architecture that has been under development since 2010 and can be used for cost-effective soft processors on FPGAs. The basic 32-bit integer instruction set in RISC-V is defined as…
The recent emergence of novel computational devices, such as adiabatic quantum computers, CMOS annealers, and optical parametric oscillators, presents new opportunities for hybrid-optimization algorithms that are hardware accelerated by…
This paper introduces a computer architecture, where part of the instruction set architecture (ISA) is implemented on small highly-integrated field-programmable gate arrays (FPGAs). Small FPGAs inside a general-purpose processor (CPU) can…
Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We…
This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count…