Related papers: Optimized Spatial Architecture Mapping Flow for Tr…

MatrixFlow: System-Accelerator co-design for high-performance transformer applications

Transformers are central to advances in artificial intelligence (AI), excelling in fields ranging from computer vision to natural language processing. Despite their success, their large parameter count and computational demands challenge…

Hardware Architecture · Computer Science 2025-03-10 Qunyou Liu, Marina Zapater, David Atienza

Understanding the Design-Space of Sparse/Dense Multiphase GNN dataflows on Spatial Accelerators

Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-08 Raveesh Garg, Eric Qin, Francisco Muñoz-Martínez, Robert Guirado, Akshay Jain, Sergi Abadal, José L. Abellán, Manuel E. Acacio, Eduard Alarcón, Sivasankaran Rajamanickam, Tushar Krishna

Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design

A low-latency and energy-efficient tensor algebra accelerator design must optimize how data movement and operations are scheduled (i.e., mapped) in the accelerator architecture. A key mapping optimization is fusion, meaning holding data…

Hardware Architecture · Computer Science 2026-05-05 Tanner Andrulis, Michael Gilbert, Vivienne Sze, Joel S. Emer

A Scalable FPGA-based Architecture for Depth Estimation in SLAM

The current state of the art of Simultaneous Localisation and Mapping, or SLAM, on low power embedded systems is about sparse localisation and mapping with low resolution results in the name of efficiency. Meanwhile, research in this field…

Robotics · Computer Science 2019-02-14 Konstantinos Boikos, Christos-Savvas Bouganis

Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators

As the landscape of deep neural networks evolves, heterogeneous dataflow accelerators, in the form of multi-core architectures or chiplet-based designs, promise more flexibility and higher inference performance through scalability. So far,…

Hardware Architecture · Computer Science 2025-10-08 Arne Symons, Linyan Mei, Steven Colleman, Pouya Houshmand, Sebastian Karl, Marian Verhelst

SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs

Efficiently supporting long context length is crucial for Transformer models. The quadratic complexity of the self-attention computation plagues traditional Transformers. Sliding window-based static sparse attention mitigates the problem by…

Hardware Architecture · Computer Science 2024-05-28 Zhenyu Bai, Pranav Dangi, Huize Li, Tulika Mitra

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference

Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. The majority of existing approaches rely on temporal architectures that…

Machine Learning · Computer Science 2024-04-09 Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

COMET: A Framework for Modeling Compound Operation Dataflows with Explicit Collectives

Modern machine learning accelerators are designed to efficiently execute deep neural networks (DNNs) by optimizing data movement, memory hierarchy, and compute throughput. However, emerging DNN models such as large language models, state…

Hardware Architecture · Computer Science 2025-09-03 Shubham Negi, Manik Singhal, Aayush Ankit, Sudeep Bhoja, Kaushik Roy

Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-22 Gordon E. Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, Tushar Krishna

SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads

In recent years, there have been tremendous advances in hardware acceleration of deep neural networks. However, most of the research has focused on optimizing accelerator microarchitecture for higher performance and energy efficiency on a…

Machine Learning · Computer Science 2019-12-12 Sam Likun Xi, Yuan Yao, Kshitij Bhardwaj, Paul Whatmough, Gu-Yeon Wei, David Brooks

SparseMap: A Sparse Tensor Accelerator Framework Based on Evolution Strategy

The growing demand for sparse tensor algebra (SpTA) in machine learning and big data has driven the development of various sparse tensor accelerators. However, most existing manually designed accelerators are limited to specific scenarios,…

Machine Learning · Computer Science 2025-08-19 Boran Zhao, Haiming Zhai, Zihang Yuan, Hetian Liu, Tian Xia, Wenzhe Zhao, Pengju Ren

Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration

State Space Models (SSMs) offer a promising alternative to transformers for long-sequence processing. However, their efficiency remains hindered by memory-bound operations, particularly in the prefill stage. While MARCA, a recent first…

Hardware Architecture · Computer Science 2026-04-10 Robin Geens, Arne Symons, Marian Verhelst

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

The energy and latency of an accelerator running a deep neural network (DNN) depend on how the computation and data movement are scheduled in the accelerator (i.e., mapping), and picking an optimal mapping is essential to achieve…

Hardware Architecture · Computer Science 2026-05-05 Michael Gilbert, Tanner Andrulis, Vivienne Sze, Joel S. Emer

Meta-Optimization and Program Search using Language Models for Task and Motion Planning

Intelligent interaction with the real world requires robotic agents to jointly reason over high-level plans and low-level controls. Task and motion planning (TAMP) addresses this by combining symbolic planning and continuous trajectory…

Robotics · Computer Science 2025-09-18 Denis Shcherba, Eckart Cobo-Briesewitz, Cornelius V. Braun, Marc Toussaint

Designing Efficient and High-performance AI Accelerators with Customized STT-MRAM

In this paper, we demonstrate the design of efficient and high-performance AI/Deep Learning accelerators with customized STT-MRAM and a reconfigurable core. Based on model-driven detailed design space exploration, we present the design…

Hardware Architecture · Computer Science 2021-04-07 Kaniz Mishty, Mehdi Sadi

Dato: A Task-Based Programming Model for Dataflow Accelerators

Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate…

Programming Languages · Computer Science 2025-09-09 Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang

Finding Fast Transformers: One-Shot Neural Architecture Search by Component Composition

Transformer-based models have achieved state-of-the-art results in many tasks in natural language processing. However, such models are usually slow at inference time, making deployment difficult. In this paper, we develop an efficient…

Machine Learning · Computer Science 2020-08-18 Henry Tsai, Jayden Ooi, Chun-Sung Ferng, Hyung Won Chung, Jason Riesa

Mitigating the Bandwidth Wall via Data-Streaming System-Accelerator Co-Design

Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by…

Hardware Architecture · Computer Science 2026-03-20 Qunyou Liu, Marina Zapater, David Atienza

SAMO: Optimised Mapping of Convolutional Neural Networks to Streaming Architectures

Significant effort has been placed on the development of toolflows that map Convolutional Neural Network (CNN) models to Field Programmable Gate Arrays (FPGAs) with the aim of automating the production of high performing designs for a…

Hardware Architecture · Computer Science 2022-08-10 Alexander Montgomerie-Corcoran, Zhewen Yu, Christos-Savvas Bouganis

Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads

Large Language Models (LLMs) impose massive computational demands, driving the need for scalable multi-chiplet accelerators. However, existing mapping space exploration efforts for such accelerators primarily focus on traditional…

Hardware Architecture · Computer Science 2026-04-02 Boyu Li, Zongwei Zhu, Yi Xiong, Qianyue Cao, Jiawei Geng, Xiaonan Zhang, Xi Li