Related papers: Efficient Architecture for RISC-V Vector Memory Ac…

Multi-Strided Access Patterns to Boost Hardware Prefetching

Important memory-bound kernels, such as linear algebra, convolutions, and stencils, rely on SIMD instructions as well as optimizations targeting improved vectorized data traversal and data re-use to attain satisfactory performance. On on…

Performance · Computer Science 2024-12-23 Miguel O. Blom, Kristian F. D. Rietveld, Rob V. van Nieuwpoort

Vector Search for the Future: From Memory-Resident, Static Heterogeneous Storage, to Cloud-Native Architectures

Vector search (VS) has become a fundamental component in multimodal data management, enabling core functionalities such as image, video, and code retrieval. As vector data scales rapidly, VS faces growing challenges in balancing search,…

Databases · Computer Science 2026-01-06 Yitong Song, Xuanhe Zhou, Christian S. Jensen, Jianliang Xu

Efficient Data Access Paths for Mixed Vector-Relational Search

The rapid growth of machine learning capabilities and the adoption of data processing methods using vector embeddings sparked a great interest in creating systems for vector data management. While the predominant approach of vector data…

Databases · Computer Science 2024-03-26 Viktor Sanca, Anastasia Ailamaki

Evaluating IOMMU-Based Shared Virtual Addressing for RISC-V Embedded Heterogeneous SoCs

Embedded heterogeneous systems-on-chip (SoCs) rely on domain-specific hardware accelerators to improve performance and energy efficiency. In particular, programmable multi-core accelerators feature a cluster of processing elements and…

Hardware Architecture · Computer Science 2025-02-25 Cyril Koenig, Enrico Zelioli, Luca Benini

Fine-Grained Vectorized Merge Sorting on RISC-V: From Register to Cache

Merge sort as a divide-sort-merge paradigm has been widely applied in computer science fields. As modern reduced instruction set computing architectures like the fifth generation (RISC-V) regard multiple registers as a vector register group…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 Jin Zhang, Jincheng Zhou, Xiang Zhang, Di Ma, Chunye Gong

FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching

The inference of ML models composed of diverse structures, types, and sizes boils down to the execution of different dataflows (i.e. different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of…

Hardware Architecture · Computer Science 2026-04-07 Jianming Tong, Anirudh Itagi, Prasanth Chatarasi, Tushar Krishna

ARCANE: Adaptive RISC-V Cache Architecture for Near-memory Extensions

Modern data-driven applications expose limitations of von Neumann architectures - extensive data movement, low throughput, and poor energy efficiency. Accelerators improve performance but lack flexibility and require data transfers.…

Hardware Architecture · Computer Science 2025-04-09 Vincenzo Petrolo, Flavia Guella, Michele Caon, Pasquale Davide Schiavone, Guido Masera, Maurizio Martina

VWR2A: A Very-Wide-Register Reconfigurable-Array Architecture for Low-Power Embedded Devices

Edge-computing requires high-performance energy-efficient embedded systems. Fixed-function or custom accelerators, such as FFT or FIR filter engines, are very efficient at implementing a particular functionality for a given set of…

Hardware Architecture · Computer Science 2022-06-03 Benoît Walter Denkinger, Miguel Peón-Quirós, Mario Konijnenburg, David Atienza, Francky Catthoor

3D-TrIM: A Memory-Efficient Spatial Computing Architecture for Convolution Workloads

The Von Neumann bottleneck, which relates to the energy cost of moving data from memory to on-chip core and vice versa, is a serious challenge in state-of-the-art AI architectures, like Convolutional Neural Networks' (CNNs) accelerators.…

Hardware Architecture · Computer Science 2025-02-27 Cristian Sestito, Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis

Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment

High-dimensional vector similarity search (HVSS) is gaining prominence as a powerful tool for various data science and AI applications. As vector data scales up, in-memory indexes pose a significant challenge due to the substantial increase…

Databases · Computer Science 2024-03-05 Mengzhao Wang, Weizhi Xu, Xiaomeng Yi, Songlin Wu, Zhangyang Peng, Xiangyu Ke, Yunjun Gao, Xiaoliang Xu, Rentong Guo, Charles Xie

TCDM Burst Access: Breaking the Bandwidth Barrier in Shared-L1 RVV Clusters Beyond 1000 FPUs

As computing demand and memory footprint of deep learning applications accelerate, clusters of cores sharing local (L1) multi-banked memory are widely used as key building blocks in large-scale architectures. When the cluster's core count…

Hardware Architecture · Computer Science 2025-01-27 Diyou Shen, Yichao Zhang, Marco Bertuletti, Luca Benini

Crescent: Taming Memory Irregularities for Accelerating Deep Point Cloud Analytics

3D perception in point clouds is transforming the perception ability of future intelligent machines. Point cloud algorithms, however, are plagued by irregular memory accesses, leading to massive inefficiencies in the memory sub-system,…

Hardware Architecture · Computer Science 2022-04-25 Yu Feng, Gunnar Hammonds, Yiming Gan, Yuhao Zhu

Efficient Implementation of RISC-V Vector Permutation Instructions

RISC-V CPUs leverage the RVV (RISC-V Vector) extension to accelerate data-parallel workloads. In addition to arithmetic operations, RVV includes powerful permutation instructions that enable flexible element rearrangement within vector…

Hardware Architecture · Computer Science 2025-06-02 Vasileios Titopoulos, George Alexakis, Chrysostomos Nicopoulos, Giorgos Dimitrakopoulos

Accelerating Force-Directed Graph Drawing with RT Cores

Graph drawing with spring embedders employs a V x V computation phase over the graph's vertex set to compute repulsive forces. Here, the efficacy of forces diminishes with distance: a vertex can effectively only influence other vertices in…

Data Structures and Algorithms · Computer Science 2020-08-27 Stefan Zellmann, Martin Weier, Ingo Wald

A Scalable RISC-V Vector Processor Enabling Efficient Multi-Precision DNN Inference

RISC-V processors encounter substantial challenges in deploying multi-precision deep neural networks (DNNs) due to their restricted precision support, constrained throughput, and suboptimal dataflow design. To tackle these challenges, a…

Hardware Architecture · Computer Science 2024-07-16 Chuanning Wang, Chao Fang, Xiao Wu, Zhongfeng Wang, Jun Lin

Revet: A Language and Compiler for Dataflow Threads

Spatial dataflow architectures such as reconfigurable dataflow accelerators (RDA) can provide much higher performance and efficiency than CPUs and GPUs. In particular, vectorized reconfigurable dataflow accelerators (vRDA) in recent…

Hardware Architecture · Computer Science 2024-02-01 Alexander Rucker, Shiv Sundram, Coleman Smith, Matthew Vilim, Raghu Prabhakar, Fredrik Kjolstad, Kunle Olukotun

Persistent Data Layout and Infrastructure for Efficient Selective Retrieval of Event Data in ATLAS

The ATLAS detector at CERN has completed its first full year of recording collisions at 7 TeV, resulting in billions of events and petabytes of data. At these scales, physicists must have the capability to read only the data of interest to…

Data Analysis, Statistics and Probability · Physics 2015-03-13 Peter van Gemmeren, David Malon

TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Architecture and Hardware Implementation

Modern hardware architectures for Convolutional Neural Networks (CNNs), other than targeting high performance, aim at dissipating limited energy. Reducing the data movement cost between the computing cores and the memory is a way to…

Hardware Architecture · Computer Science 2025-01-15 Cristian Sestito, Shady Agwa, Themis Prodromakis

RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI

The increasing demand for on-device intelligence in Edge AI and TinyML applications requires the efficient execution of modern Convolutional Neural Networks (CNNs). While lightweight architectures like MobileNetV2 employ Depthwise Separable…

Hardware Architecture · Computer Science 2025-11-27 Muhammed Yildirim, Ozcan Ozturk

Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors

Modern RISC vector processors rely on the synergy of multi-lane parallelism and chaining to achieve high sustained throughput, yet their achieved performance often falls substantially short of the theoretical performance bound due to…

Hardware Architecture · Computer Science 2026-04-27 Weiying Wang, Zhiwei Zhang