Related papers: Soft Tiles: Capturing Physical Implementation Flex…

A Soft Processor Overlay with Tightly-coupled FPGA Accelerator

FPGA overlays are commonly implemented as coarse-grained reconfigurable architectures with a goal to improve designers' productivity through balancing flexibility and ease of configuration of the underlying fabric. To truly facilitate full…

Hardware Architecture · Computer Science 2016-06-22 Ho-Cheung Ng , Cheng Liu , Hayden Kwok-Hay So

Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters

Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid…

Hardware Architecture · Computer Science 2024-04-25 Sergio Mazzola , Samuel Riedel , Luca Benini

A transprecision floating-point cluster for efficient near-sensor data analytics

Recent applications in the domain of near-sensor computing require the adoption of floating-point arithmetic to reconcile high precision results with a wide dynamic range. In this paper, we propose a multi-core computing cluster that…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-12 Fabio Montagna , Stefan Mach , Simone Benatti , Angelo Garofalo , Gianmarco Ottavi , Luca Benini , Davide Rossi , Giuseppe Tagliavini

REPTILES: Repeated Tiles of Sargantana, a RISC-V multicore based on OpenPiton

Chip industry continues advancing and expanding modern computing systems, resulting in more complex multi-core processors. Conversely, academic projects face scalability challenges due to limited resources, highlighting the need for…

Hardware Architecture · Computer Science 2026-05-12 Noelia Oliete-Escuín , Arnau Bigas , Narcís Rodas , Albert Aguilera , Sajjad Ahmad , Jonathan Balkind , Xavier Carril , Max Doblas , Ivan Díaz , Roger Figueras , Alireza Foroodnia , Cesar Fuguet , Ignacio Genovese , Raúl Gilabert , Abbas Haghi , Alexander Kropotov , Neiel Leyva , Oscar Lostes-Cazorla , Lorién López-Villellas , Davy Million , Alireza Monemi , Sérik Pérez , Juan Antonio Rodríguez , Víctor Soria-Pardos , Behzad Salami , Francesc Moll , Oscar Palomar , Miquel Moretó , Lluc Alvarez

MemPool Flavors: Between Versatility and Specialization in a RISC-V Manycore Cluster

As computational paradigms evolve, applications such as attention-based models, wireless telecommunications, and computer vision impose increasingly challenging requirements on computer architectures: significant memory footprints and…

Hardware Architecture · Computer Science 2025-04-08 Sergio Mazzola , Yichao Zhang , Marco Bertuletti , Diyou Shen , Luca Benini

Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application

Graphics Processing Units (GPUs) are becoming popular accelerators in modern High-Performance Computing (HPC) clusters. Installing GPUs on each node of the cluster is not efficient, resulting in high costs and power consumption as well as…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-15 Javier Prades , Blesson Varghese , Carlos Reano , Federico Silla

TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link

Shared L1-memory clusters of streamlined instruction processors (processing elements - PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g. GP-GPUs). Scaling out these architectures by…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-03 Yichao Zhang , Marco Bertuletti , Chi Zhang , Samuel Riedel , Diyou Shen , Bowen Wang , Alessandro Vanelli-Coralli , Luca Benini

Flexible Vector Integration in Embedded RISC-V SoCs for End to End CNN Inference Acceleration

The emergence of heterogeneity and domain-specific architectures targeting deep learning inference shows great potential for enabling the deployment of modern CNNs on resource-constrained embedded platforms. A significant development is the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-25 Dmitri Lyalikov

Cache-aware Parallel Programming for Manycore Processors

With rapidly evolving technology, multicore and manycore processors have emerged as promising architectures to benefit from increasing transistor numbers. The transition towards these parallel architectures makes today an exciting time to…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-01 Ashkan Tousimojarad , Wim Vanderbauwhede

Optimizing Scalable Multi-Cluster Architectures for Next-Generation Wireless Sensing and Communication

Next-generation wireless technologies (for immersive-massive communication, joint communication and sensing) demand highly parallel architectures for massive data processing. A common architectural template scales up by grouping tens to…

Hardware Architecture · Computer Science 2025-07-08 Samuel Riedel , Yichao Zhang , Marco Bertuletti , Luca Benini

Fused-Tiled Layers: Minimizing Data Movement on RISC-V SoCs with Software-Managed Caches

The success of DNNs and their high computational requirements pushed for large codesign efforts aiming at DNN acceleration. Since DNNs can be represented as static computational graphs, static memory allocation and tiling are two crucial…

Hardware Architecture · Computer Science 2025-04-08 Victor J. B. Jung , Alessio Burrello , Francesco Conti , Luca Benini

On the scaling of computational particle physics codes on cluster computers

Many applications in computational science are sufficiently compute-intensive that they depend on the power of parallel computing for viability. For all but the "embarrassingly parallel" problems, the performance depends upon the level of…

High Energy Physics - Lattice · Physics 2009-09-29 Z. Sroczynski , N. Eicker , Th. Lippert , B. Orth , K. Schilling

Switchboard: An Open-Source Framework for Modular Simulation of Large Hardware Systems

Scaling up hardware systems has become an important tactic for improving performance as Moore's law fades. Unfortunately, simulations of large hardware systems are often a design bottleneck due to slow throughput and long build times. In…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-31 Steven Herbst , Noah Moroze , Edgar Iglesias , Andreas Olofsson

Contiguous Storage of Grid Data for Heterogeneous Computing

Structured Cartesian grids are a fundamental component in numerical simulations. Although these grids facilitate straightforward discretization schemes, their naïve use in sparse domains leads to excessive memory overhead and…

Computational Engineering, Finance, and Science · Computer Science 2025-12-15 Fan Gu , Xiangyu Hu

Fast, Multicore-Scalable, Low-Fragmentation Memory Allocation through Large Virtual Memory and Global Data Structures

We demonstrate that general-purpose memory allocation involving many threads on many cores can be done with high performance, multicore scalability, and low memory consumption. For this purpose, we have designed and implemented scalloc, a…

Programming Languages · Computer Science 2015-08-26 Martin Aigner , Christoph M. Kirsch , Michael Lippautz , Ana Sokolova

Improving Locality in Sparse and Dense Matrix Multiplications

Consecutive matrix multiplications are commonly used in graph neural networks and sparse linear solvers. These operations frequently access the same matrices for both reading and writing. While reusing these matrices improves data locality,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-02 Mohammad Mahdi Salehi Dezfuli , Kazem Cheshmi

A Scalable and Modular Software Architecture for Finite Elements on Hierarchical Hybrid Grids

In this article, a new generic higher-order finite-element framework for massively parallel simulations is presented. The modular software architecture is carefully designed to exploit the resources of modern and future supercomputers.…

Mathematical Software · Computer Science 2018-05-28 Nils Kohl , Dominik Thönnes , Daniel Drzisga , Dominik Bartuschat , Ulrich Rüde

Implementation of relativistic coupled cluster theory for massively parallel GPU-accelerated computing architectures

In this paper, we report a reimplementation of the core algorithms of relativistic coupled cluster theory aimed at modern heterogeneous high-performance computational infrastructures. The code is designed for efficient parallel execution on…

Chemical Physics · Physics 2023-09-18 Johann V. Pototschnig , Anastasios Papadopoulos , Dmitry I. Lyakh , Michal Repisky , Loïc Halbert , André Severo Pereira Gomes , Hans Jørgen Aa. Jensen , Lucas Visscher

MultiVic: A Time-Predictable RISC-V Multi-Core Processor Optimized for Neural Network Inference

Real-time systems, particularly those used in domains like automated driving, are increasingly adopting neural networks. From this trend arises the need for high-performance hardware exhibiting predictable timing behavior. While…

Hardware Architecture · Computer Science 2026-02-26 Maximilian Kirschner , Konstantin Dudzik , Ben Krusekamp , Jürgen Becker

Ripple: Simplified Large-Scale Computation on Heterogeneous Architectures with Polymorphic Data Layout

GPUs are now used for a wide range of problems within HPC. However, making efficient use of the computational power available with multiple GPUs is challenging. The main challenges in achieving good performance are memory layout, affecting…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-20 Robert Clucas , Philip Blakely , Nikolaos Nikiforakis