Related papers: TPU-Gen: LLM-Driven Custom Tensor Processing Unit …

TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design

Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant,…

Hardware Architecture · Computer Science 2026-02-16 Jonghun Lee , Junghoon Lee , Hyeonjin Kim , Seoho Jeon , Jisup Yoon , Hyunbin Park , Meejeong Park , Heonjae Ha

Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML…

Hardware Architecture · Computer Science 2024-07-12 Mohammed Elbtity , Peyton Chandarana , Ramtin Zand

AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units

To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure. However, unlocking their full potential requires developing high-performance compute kernels…

Artificial Intelligence · Computer Science 2026-04-20 Xinzi Cao , Jianyang Zhai , Pengfei Li , Zhiheng Hu , Cen Yan , Bingxu Mu , Guanghuan Fang , Bin She , Jiayu Li , Yihan Su , Dongyang Tao , Xiansong Huang , Fan Xu , Feidiao Yang , Yao Lu , Chang-Dong Wang , Yutong Lu , Weicheng Xue , Bin Zhou , Yonghong Tian

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science 2025-06-12 Wentao Chen , Jiace Zhu , Qi Fan , Yehan Ma , An Zou

A Learned Performance Model for Tensor Processing Units

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for…

Performance · Computer Science 2021-03-19 Samuel J. Kaufman , Phitchaya Mangpo Phothilimthana , Yanqi Zhou , Charith Mendis , Sudip Roy , Amit Sabne , Mike Burrows

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba , Anne Ouyang , Xujie Si , Fan Long

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

Deep neural networks (DNNs) are of critical use in different domains. To accelerate DNN computation, tensor compilers are proposed to generate efficient code on different domain-specific accelerators. Existing tensor compilers mainly focus…

Machine Learning · Computer Science 2023-07-12 Zixuan Ma , Haojie Wang , Jingze Xing , Liyan Zheng , Chen Zhang , Huanqi Cao , Kezhao Huang , Shizhi Tang , Penghan Wang , Jidong Zhai

LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

The rapid development of large language models (LLM) has greatly enhanced everyday applications. While many FPGA-based accelerators, with flexibility for fine-grained data control, exhibit superior speed and energy efficiency compared to…

Hardware Architecture · Computer Science 2026-03-24 Zifan He , Shengyu Ye , Rui Ma , Yang Wang , Jason Cong

Towards a Design Framework for TNN-Based Neuromorphic Sensory Processing Units

Temporal Neural Networks (TNNs) are spiking neural networks that exhibit brain-like sensory processing with high energy efficiency. This work presents the ongoing research towards developing a custom design framework for designing efficient…

Emerging Technologies · Computer Science 2022-05-31 Prabhu Vellaisamy , John Paul Shen

Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs

With the rapid advent of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption…

Hardware Architecture · Computer Science 2025-03-04 Zhantong Zhu , Hongou Li , Wenjie Ren , Meng Wu , Le Ye , Ru Huang , Tianyu Jia

From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs

With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei…

Hardware Architecture · Computer Science 2025-10-08 Tianhao Zhu , Dahu Feng , Erhu Feng , Yubin Xia

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this…

Machine Learning · Computer Science 2023-06-13 Ying Sheng , Lianmin Zheng , Binhang Yuan , Zhuohan Li , Max Ryabinin , Daniel Y. Fu , Zhiqiang Xie , Beidi Chen , Clark Barrett , Joseph E. Gonzalez , Percy Liang , Christopher Ré , Ion Stoica , Ce Zhang

QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is…

Machine Learning · Computer Science 2025-05-13 Xuzhi Zhang , Shaohui Peng , Qirui Zhou , Yuanbo Wen , Qi Guo , Ruizhi Chen , Xinguo Zhu , Weiqiang Xiong , Haixin Chen , Congying Ma , Ke Gao , Chen Zhao , Yanjun Wu , Yunji Chen , Ling Li

Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference

Ternary weight quantization (e.g., BitNet b1.58) offers a promising path to mitigate the memory bandwidth bottleneck in Large Language Model (LLM) inference. However, conventional compute platforms lack native support for ternary-weight…

Hardware Architecture · Computer Science 2026-04-29 Robin Geens , Joran Heldens , Joren Dumoulin , Marian Verhelst

A Computational Model for Tensor Core Units

To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…

Data Structures and Algorithms · Computer Science 2020-07-10 Rezaul Chowdhury , Francesco Silvestri , Flavio Vella

AGON: Automated Design Framework for Customizing Processors from ISA Documents

Customized processors are attractive solutions for vast domain-specific applications due to their high energy efficiency. However, designing a processor in traditional flows is time-consuming and expensive. To address this, researchers have…

Hardware Architecture · Computer Science 2025-01-22 Chongxiao Li , Di Huang , Pengwei Jin , Tianyun Ma , Husheng Han , Shuyao Cheng , Yifan Hao , Yongwei Zhao , Guanglin Xu , Zidong Du , Rui Zhang , Xiaqing Li , Yuanbo Wen , Xing Hu , Qi Guo

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen , Lianmin Zheng , Eddie Yan , Ziheng Jiang , Thierry Moreau , Luis Ceze , Carlos Guestrin , Arvind Krishnamurthy

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

Recent studies from several hyperscalars pinpoint to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of…

Machine Learning · Computer Science 2019-08-27 Youngeun Kwon , Yunjae Lee , Minsoo Rhu

Tensor Product Generation Networks for Deep NLP Modeling

We present a new approach to the design of deep networks for natural language processing (NLP), based on the general technique of Tensor Product Representations (TPRs) for encoding and processing symbol structures in distributed neural…

Computer Vision and Pattern Recognition · Computer Science 2017-12-19 Qiuyuan Huang , Paul Smolensky , Xiaodong He , Li Deng , Dapeng Wu

Towards Memory-Efficient Neural Networks via Multi-Level in situ Generation

Deep neural networks (DNN) have shown superior performance in a variety of tasks. As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices. Though…

Machine Learning · Computer Science 2021-09-07 Jiaqi Gu , Hanqing Zhu , Chenghao Feng , Mingjie Liu , Zixuan Jiang , Ray T. Chen , David Z. Pan