Related papers: TileLoom: Automatic Dataflow Planning for Tile-Bas…

TileLang: A Composable Tiled Programming Model for AI Systems

Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations…

Machine Learning · Computer Science 2025-04-29 Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhi Yang

StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs

Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve…

Hardware Architecture · Computer Science 2025-09-24 Hanchen Ye, Deming Chen

CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures

The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as resistive random access memory (RRAM)-based tiled computing-in-memory (CIM) architectures. CIM…

Hardware Architecture · Computer Science 2024-01-18 Rebecca Pelke, Jose Cubero-Cascante, Nils Bosbach, Felix Staudigl, Rainer Leupers, Jan Moritz Joseph

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like cuBLAS and NCCL provide optimized primitives, they lack the flexibility required for…

Programming Languages · Computer Science 2026-05-06 Size Zheng, Xuegui Zheng, Hanshi Sun, Qi Hou, Wenlei Bao, Shiyu Li, Haojie Duanmu, Jin Fang, Chenli Xue, Chenhui Huang, Yuanqiang Liu, Renze Chen, Ningxin Zheng, Dongyang Wang, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, Xin Liu

SpeedLLM: An FPGA Co-design of Large Language Model Inference Accelerator

This paper introduces SpeedLLM, a neural network accelerator designed on the Xilinx Alveo U280 platform and optimized for the TinyLlama framework to enhance edge computing performance. Key innovations include data stream parallelism, a…

Hardware Architecture · Computer Science 2025-07-22 Peipei Wang, Wu Guan, Liping Liang, Zhijun Wang, Hanqing Luo, Zhibin Zhang

StreamDCIM: A Tile-based Streaming Digital CIM Accelerator with Mixed-stationary Cross-forwarding Dataflow for Multimodal Transformer

Multimodal Transformers are emerging artificial intelligence (AI) models designed to process a mixture of signals from diverse modalities. Digital computing-in-memory (CIM) architectures are considered promising for achieving high…

Hardware Architecture · Computer Science 2025-02-11 Shantian Qin, Ziqing Qiang, Zhihua Fan, Wenming Li, Xuejun An, Xiaochun Ye, Dongrui Fan

Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-22 Gordon E. Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, Tushar Krishna

Dato: A Task-Based Programming Model for Dataflow Accelerators

Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate…

Programming Languages · Computer Science 2025-09-09 Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces…

Computation and Language · Computer Science 2025-03-27 Dewei Wang, Wei Zhu, Liyang Ling, Ettore Tiotto, Quintin Wang, Whitney Tsang, Julian Opperman, Jacky Deng

Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation

Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key…

Machine Learning · Computer Science 2025-09-03 Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko

AccelCIM: Systematic Dataflow Exploration for SRAM Compute-in-Memory Accelerator

SRAM-based compute-in-memory (CIM) offers high computational density and energy efficiency for deep neural network (DNN) accelerators, but its limited capacity causes on/off-chip data movement overhead for large DNN models. Existing CIM…

Hardware Architecture · Computer Science 2026-04-21 Chenhao Xue, Yukun Wang, An Guo, Yuhui Shi, Jinwei Zhou, Xiping Dong, Yihan Yin, Yuanpeng Zhang, Tianyu Jia, Wei Gao, Qiang Wu, Xin Si, Jun Yang, Guangyu Sun

TensorLib: A Spatial Accelerator Generation Framework for Tensor Algebra

Tensor algebra finds applications in various domains, and these applications, especially when accelerated on spatial hardware accelerators, can deliver high performance and low power. Spatial hardware accelerators exhibit complex design…

Hardware Architecture · Computer Science 2021-04-27 Liancheng Jia, Zizhang Luo, Liqiang Lu, Yun Liang

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives

Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-04 Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Xin Liu

Dataflow & Tiling Strategies in Edge-AI FPGA Accelerators: A Comprehensive Literature Review

Edge-AI applications demand high-throughput, low-latency inference on FPGAs under tight resource and power constraints. This survey provides a comprehensive review of two key architectural decisions for FPGA-based neural network…

Hardware Architecture · Computer Science 2025-06-03 Richie Li

Stream-HLS: Towards Automatic Dataflow Acceleration

High-level synthesis (HLS) has enabled the rapid development of custom hardware circuits for many software applications. However, developing high-performance hardware circuits using HLS is still a non-trivial task requiring expertise in…

Hardware Architecture · Computer Science 2025-01-17 Suhail Basalama, Jason Cong

LEAP: LLM Inference on Scalable PIM-NoC Architecture with Balanced Dataflow and Fine-Grained Parallelism

Large language model (LLM) inference is in prevalent demand in daily life and industry. The large tensor sizes and computational complexity of LLMs pose challenges for memory, compute, and the data bus. This paper proposes a…

Hardware Architecture · Computer Science 2025-09-19 Yimin Wang, Yue Jiet Chong, Xuanyao Fong

TDO-CIM: Transparent Detection and Offloading for Computation In-memory

Computation in-memory is a promising non-von Neumann approach that aims to eliminate data transfer to and from the memory subsystem. Although many architectures have been proposed, compiler support for such architectures…

Hardware Architecture · Computer Science 2020-07-02 Kanishkan Vadivel, Lorenzo Chelini, Ali BanaGozar, Gagandeep Singh, Stefano Corda, Roel Jordans, Henk Corporaal

Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity

Sparse tensor algebra is a challenging class of workloads to accelerate due to low arithmetic intensity and varying sparsity patterns. Prior sparse tensor algebra accelerators have explored tiling sparse data to increase exploitable data…

Hardware Architecture · Computer Science 2024-06-27 Zi Yu Xue, Yannan Nellie Wu, Joel S. Emer, Vivienne Sze

PALM: A Efficient Performance Simulator for Tiled Accelerators with Large-scale Model Training

Deep learning (DL) models are piquing high interest and scaling at an unprecedented rate. To this end, a handful of tiled accelerators have been proposed to support such large-scale training tasks. However, these accelerators often…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Jiahao Fang, Huizheng Wang, Qize Yang, Dehao Kong, Xu Dai, Jinyi Deng, Yang Hu, Shouyi Yin

FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators

Attention accounts for an increasingly dominant fraction of total computation during inference for mixture-of-experts (MoE) models, making efficient acceleration critical. Emerging domain-specific accelerators for large model inference are…

Hardware Architecture · Computer Science 2026-04-03 Chi Zhang, Luca Colagrande, Renzo Andri, Luca Benini