Related papers: Multi-Dimensional Vector ISA Extension for Mobile …

End-to-end 100-TOPS/W Inference With Analog In-Memory Computing: Are We There Yet?

In-Memory Acceleration (IMA) promises major efficiency improvements in deep neural network (DNN) inference, but challenges remain in the integration of IMA within a digital system. We propose a heterogeneous architecture coupling 8 RISC-V…

Hardware Architecture · Computer Science 2021-09-06 Gianmarco Ottavi , Geethan Karunaratne , Francesco Conti , Irem Boybat , Luca Benini , Davide Rossi

A Flexible Instruction Set Architecture for Efficient GEMMs

GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set…

Hardware Architecture · Computer Science 2025-07-08 Alexandre de Limas Santana , Adrià Armejach , Francesc Martinez , Erich Focht , Marc Casas

A Heterogeneous In-Memory Computing Cluster For Flexible End-to-End Inference of Real-World Deep Neural Networks

Deployment of modern TinyML tasks on small battery-constrained IoT devices requires high computational energy efficiency. Analog In-Memory Computing (IMC) using non-volatile memory (NVM) promises major efficiency improvements in deep neural…

Hardware Architecture · Computer Science 2022-01-05 Angelo Garofalo , Gianmarco Ottavi , Francesco Conti , Geethan Karunaratne , Irem Boybat , Luca Benini , Davide Rossi

Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations

Recent advancements in quantization and mixed-precision approaches offers substantial opportunities to improve the speed and energy efficiency of Neural Networks (NN). Research has shown that individual parameters with varying low…

Hardware Architecture · Computer Science 2024-08-14 Giorgos Armeniakos , Alexis Maras , Sotirios Xydis , Dimitrios Soudris

Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions

This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count…

Hardware Architecture · Computer Science 2021-06-15 Philippos Papaphilippou , Paul H. J. Kelly , Wayne Luk

Short reasons for long vectors in HPC CPUs: a study based on RISC-V

For years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially-available SIMD units process up to 8 double-precision elements with one instruction.…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-14 Pablo Vizcaino , Georgios Ieronymakis , Nikolaos Dimou , Vassilis Papaefstathiou , Jesus Labarta , Filippo Mantovani

MaRVIn: A Cross-Layer Mixed-Precision RISC-V Framework for DNN Inference, from ISA Extension to Hardware Acceleration

The evolution of quantization and mixed-precision techniques has unlocked new possibilities for enhancing the speed and energy efficiency of NNs. Several recent studies indicate that adapting precision levels across different parameters can…

Machine Learning · Computer Science 2025-09-19 Giorgos Armeniakos , Alexis Maras , Sotirios Xydis , Dimitrios Soudris

SVE-enabling Lattice QCD Codes

Optimization of applications for supercomputers of the highest performance class requires parallelization at multiple levels using different techniques. In this contribution we focus on parallelization of particle physics simulations…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-23 Nils Meyer , Peter Georg , Dirk Pleiter , Stefan Solbrig , Tilo Wettig

Vector In Memory Architecture for simple and high efficiency computing

Data movement is one of the main challenges of contemporary system architectures. Near-Data Processing (NDP) mitigates this issue by moving computation closer to the memory, avoiding excessive data movement. Our proposal, Vector-In-Memory…

Hardware Architecture · Computer Science 2022-03-29 Marco Antonio Zanata Alves , Sairo Santos , Aline S. Cordeiro , Francis B. Moreira , Paulo C. Santos , Luigi Carro

The ARM Scalable Vector Extension

This article describes the ARM Scalable Vector Extension (SVE). Several goals guided the design of the architecture. First was the need to extend the vector processing capability associated with the ARM AArch64 execution state to better…

Hardware Architecture · Computer Science 2018-03-19 Nigel Stephens , Stuart Biles , Matthias Boettcher , Jacob Eapen , Mbou Eyole , Giacomo Gabrielli , Matt Horsnell , Grigorios Magklis , Alejandro Martinez , Nathanael Premillieu , Alastair Reid , Alejandro Rico , Paul Walker

A Variable Vector Length SIMD Architecture for HW/SW Co-designed Processors

Hardware/Software (HW/SW) co-designed processors provide a promising solution to the power and complexity problems of the modern microprocessors by keeping their hardware simple. Moreover, they employ several runtime optimizations to…

Hardware Architecture · Computer Science 2021-03-01 Rakesh Kumar , Alejandro Martinez , Antonio Gonzalez

Approximate ADCs for In-Memory Computing

In memory computing (IMC) architectures for deep learning (DL) accelerators leverage energy-efficient and highly parallel matrix vector multiplication (MVM) operations, implemented directly in memory arrays. Such IMC designs have been…

Emerging Technologies · Computer Science 2024-08-14 Arkapravo Ghosh , Hemkar Reddy Sadana , Mukut Debnath , Panthadip Maji , Shubham Negi , Sumeet Gupta , Mrigank Sharad , Kaushik Roy

Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension

Modern central processing units (CPUs) feature single-instruction, multiple-data pipelines to accelerate compute-intensive floating-point and fixed-point workloads. Traditionally, these pipelines and corresponding instruction set…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-30 Stefan Remke , Alexander Breuer

MoVE: Mixture of Value Embeddings -- A New Axis for Scaling Parametric Memory in Autoregressive Models

Autoregressive sequence modeling stands as the cornerstone of modern Generative AI, powering results across diverse modalities ranging from text generation to image generation. However, a fundamental limitation of this paradigm is the rigid…

Machine Learning · Computer Science 2026-02-02 Yangyan Li

FPGA-Accelerated RISC-V ISA Extensions for Efficient Neural Network Inference on Edge Devices

Edge AI deployment faces critical challenges balancing computational performance, energy efficiency, and resource constraints. This paper presents FPGA-accelerated RISC-V instruction set architecture (ISA) extensions for efficient neural…

Hardware Architecture · Computer Science 2025-11-11 Arya Parameshwara , Santosh Hanamappa Mokashi

VMXDOTP: A RISC-V Vector ISA Extension for Efficient Microscaling (MX) Format Acceleration

Compared to the first generation of deep neural networks, dominated by regular, compute-intensive kernels such as matrix multiplications (MatMuls) and convolutions, modern decoder-based transformers interleave attention, normalization, and…

Hardware Architecture · Computer Science 2026-03-06 Max Wipfli , Gamze İslamoğlu , Navaneeth Kunhi Purayil , Angelo Garofalo , Luca Benini

Klessydra-T: Designing Vector Coprocessors for Multi-Threaded Edge-Computing Cores

Computation intensive kernels, such as convolutions, matrix multiplication and Fourier transform, are fundamental to edge-computing AI, signal processing and cryptographic applications. Interleaved-Multi-Threading (IMT) processor cores are…

Hardware Architecture · Computer Science 2021-02-09 Abdallah Cheikh , Stefano Sordillo , Antonio Mastrandrea , Francesco Menichelli , Giuseppe Scotti , Mauro Olivieri

Performance of SSE and AVX Instruction Sets

SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are SIMD (single instruction multiple data streams) instruction sets supported by recent CPUs manufactured in Intel and AMD. This SIMD programming allows parallel…

High Energy Physics - Lattice · Physics 2013-11-05 Hwancheol Jeong , Sunghoon Kim , Weonjong Lee , Seok-Ho Myung

LIME:Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices

Large language models (LLMs) have emerged as a powerful foundation for intelligent reasoning and decision-making, demonstrating substantial impact across a wide range of domains and applications. However, their massive parameter scales and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-29 Mingyu Sun , Xiao Zhang , Shen Qu , Yan Li , Mengbai Xiao , Yuan Yuan , Dongxiao Yu

In-Pipeline Integration of Digital In-Memory-Computing into RISC-V Vector Architecture to Accelerate Deep Learning

Expanding Deep Learning applications toward edge computing demands architectures capable of delivering high computational performance and efficiency while adhering to tight power and memory constraints. Digital In-Memory Computing (DIMC)…

Hardware Architecture · Computer Science 2026-02-03 Tommaso Spagnolo , Cristina Silvano , Riccardo Massa , Filippo Grillotti , Thomas Boesch , Giuseppe Desoli