English
Related papers

Related papers: TileLoom: Automatic Dataflow Planning for Tile-Bas…

200 papers

Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations…

Machine Learning · Computer Science 2025-04-29 Lei Wang , Yu Cheng , Yining Shi , Zhengju Tang , Zhiwen Mo , Wenhao Xie , Lingxiao Ma , Yuqing Xia , Jilong Xue , Fan Yang , Zhi Yang

Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve…

Hardware Architecture · Computer Science 2025-09-24 Hanchen Ye , Deming Chen

The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as resistive random access memory (RRAM)-based tiled computing-in-memory (CIM) architectures. CIM…

Hardware Architecture · Computer Science 2024-01-18 Rebecca Pelke , Jose Cubero-Cascante , Nils Bosbach , Felix Staudigl , Rainer Leupers , Jan Moritz Joseph

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like CuBLAS and NCCL provide optimized primitives, they lack the flexibility required for…

This paper introduces SpeedLLM, a neural network accelerator designed on the Xilinx Alevo U280 platform and optimized for the Tinyllama framework to enhance edge computing performance. Key innovations include data stream parallelism, a…

Hardware Architecture · Computer Science 2025-07-22 Peipei Wang , Wu Guan , Liping Liang , Zhijun Wang , Hanqing Luo , Zhibin Zhang

Multimodal Transformers are emerging artificial intelligence (AI) models designed to process a mixture of signals from diverse modalities. Digital computing-in-memory (CIM) architectures are considered promising for achieving high…

Hardware Architecture · Computer Science 2025-02-11 Shantian Qin , Ziqing Qiang , Zhihua Fan , Wenming Li , Xuejun An , Xiaochun Ye , Dongrui Fan

There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-22 Gordon E. Moon , Hyoukjun Kwon , Geonhwa Jeong , Prasanth Chatarasi , Sivasankaran Rajamanickam , Tushar Krishna

Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate…

Programming Languages · Computer Science 2025-09-09 Shihan Fang , Hongzheng Chen , Niansong Zhang , Jiajie Li , Han Meng , Adrian Liu , Zhiru Zhang

In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tilebased approach. While traditional GPU programming often relies on low level interfaces…

Computation and Language · Computer Science 2025-03-27 Dewei Wang , Wei Zhu , Liyang Ling , Ettore Tiotto , Quintin Wang , Whitney Tsang , Julian Opperman , Jacky Deng

Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key…

Machine Learning · Computer Science 2025-09-03 Yaoyao Ding , Bohan Hou , Xiao Zhang , Allan Lin , Tianqi Chen , Cody Yu Hao , Yida Wang , Gennady Pekhimenko

SRAM-based compute-in-memory (CIM) offers high computational density and energy efficiency for deep neural network (DNN) accelerators, but its limited capacity causes on/off-chip data movement overhead for large DNN models. Existing CIM…

Hardware Architecture · Computer Science 2026-04-21 Chenhao Xue , Yukun Wang , An Guo , Yuhui Shi , Jinwei Zhou , Xiping Dong , Yihan Yin , Yuanpeng Zhang , Tianyu Jia , Wei Gao , Qiang Wu , Xin Si , Jun Yang , Guangyu Sun

Tensor algebra finds applications in various domains, and these applications, especially when accelerated on spatial hardware accelerators, can deliver high performance and low power. Spatial hardware accelerator exhibits complex design…

Hardware Architecture · Computer Science 2021-04-27 Liancheng Jia , Zizhang Luo , Liqiang Lu , Yun Liang

Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-04 Size Zheng , Jin Fang , Xuegui Zheng , Qi Hou , Wenlei Bao , Ningxin Zheng , Ziheng Jiang , Dongyang Wang , Jianxi Ye , Haibin Lin , Li-Wen Chang , Xin Liu

Edge-AI applications demand high-throughput, low-latency inference on FPGAs under tight resource and power constraints. This survey provides a comprehensive review of two key architectural decisions for FPGA-based neural network…

Hardware Architecture · Computer Science 2025-06-03 Richie Li

High-level synthesis (HLS) has enabled the rapid development of custom hardware circuits for many software applications. However, developing high-performance hardware circuits using HLS is still a non-trivial task requiring expertise in…

Hardware Architecture · Computer Science 2025-01-17 Suhail Basalama , Jason Cong

Large language model (LLM) inference has been a prevalent demand in daily life and industries. The large tensor sizes and computing complexities in LLMs have brought challenges to memory, computing, and databus. This paper proposes a…

Hardware Architecture · Computer Science 2025-09-19 Yimin Wang , Yue Jiet Chong , Xuanyao Fong

Computation in-memory is a promising non-von Neumann approach aiming at completely diminishing the data transfer to and from the memory subsystem. Although a lot of architectures have been proposed, compiler support for such architectures…

Hardware Architecture · Computer Science 2020-07-02 Kanishkan Vadivel , Lorenzo Chelini , Ali BanaGozar , Gagandeep Singh , Stefano Corda , Roel Jordans , Henk Corporaal

Sparse tensor algebra is a challenging class of workloads to accelerate due to low arithmetic intensity and varying sparsity patterns. Prior sparse tensor algebra accelerators have explored tiling sparse data to increase exploitable data…

Hardware Architecture · Computer Science 2024-06-27 Zi Yu Xue , Yannan Nellie Wu , Joel S. Emer , Vivienne Sze

Deep learning (DL) models are piquing high interest and scaling at an unprecedented rate. To this end, a handful of tiled accelerators have been proposed to support such large-scale training tasks. However, these accelerators often…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Jiahao Fang , Huizheng Wang , Qize Yang , Dehao Kong , Xu Dai , Jinyi Deng , Yang Hu , Shouyi Yin

Attention accounts for an increasingly dominant fraction of total computation during inference for mixture-of-experts (MoE) models, making efficient acceleration critical. Emerging domain-specific accelerators for large model inference are…

Hardware Architecture · Computer Science 2026-04-03 Chi Zhang , Luca Colagrande , Renzo Andri , Luca Benini
‹ Prev 1 2 3 10 Next ›