Related papers: RPU -- A Reasoning Processing Unit

HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing

The attention layer, a core component of Transformer-based LLMs, brings out inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth…

Hardware Architecture · Computer Science 2025-12-19 Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hoshik Kim

Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

This work elaborates on a High performance computing (HPC) architecture based on Simple Linux Utility for Resource Management (SLURM) [1] for deploying heterogeneous Large Language Models (LLMs) into a scalable inference engine. Dynamic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-26 Anderson de Lima Luiz, Shubham Vijay Kurlekar, Munir Georges

LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language model (LLM), which consists of billions of pretrained parameters that embodies the aspects of syntax and semantics. HyperAccel introduces latency…

Hardware Architecture · Computer Science 2024-08-15 Seungjae Moon, Jung-Hoon Kim, Junsoo Kim, Seongmin Hong, Junseo Cha, Minsu Kim, Sukbin Lim, Gyubin Choi, Dongjin Seo, Jongho Kim, Hunjong Lee, Hyunjun Park, Ryeowook Ko, Soongyu Choi, Jongse Park, Jinwon Lee, Joo-Young Kim

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

LLMs now form the backbone of AI agents across a diverse range of applications, including tool use, command-line interfaces, and web or computer interaction. These agentic LLM inference tasks are fundamentally different from chatbot-focused…

Hardware Architecture · Computer Science 2026-04-14 Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Chengyang Ai, Timi Adeniran, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, Aaron Zhao

The Recurrent Processing Unit: Hardware for High Speed Machine Learning

Machine learning applications are computationally demanding and power intensive. Hardware acceleration of these software tools is a natural step being explored using various technologies. A recurrent processing unit (RPU) is fast and…

Emerging Technologies · Computer Science 2019-12-17 Heidi Komkov, Alessandro Restelli, Brian Hunt, Liam Shaughnessy, Itamar Shani, Daniel P. Lathrop

The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis

Large-language models (LLMs) are rapidly being applied to radiology, enabling automated image interpretation and report generation tasks. Their deployment in clinical practice requires both high diagnostic accuracy and low inference…

Tissues and Organs · Quantitative Biology 2025-11-11 Jyun-Ping Kao

MPU: Towards Bandwidth-abundant SIMT Processor via Near-bank Computing

With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed…

Hardware Architecture · Computer Science 2021-03-12 Xinfeng Xie, Peng Gu, Yufei Ding, Dimin Niu, Hongzhong Zheng, Yuan Xie

From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs

With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei…

Hardware Architecture · Computer Science 2025-10-08 Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia

PIM-AI: A Novel Architecture for High-Efficiency LLM Inference

Large Language Models (LLMs) have become essential in a variety of applications due to their advanced language understanding and generation capabilities. However, their computational and memory requirements pose significant challenges to…

Hardware Architecture · Computer Science 2024-12-02 Cristobal Ortega, Yann Falevoz, Renaud Ayrignac

Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-05 Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren

Hardware-based Heterogeneous Memory Management for Large Language Model Inference

A large language model (LLM) is one of the most important emerging machine learning applications nowadays. However, due to its huge model size and runtime increase of the memory footprint, LLM inferences suffer from the lack of memory…

Hardware Architecture · Computer Science 2025-04-22 Soojin Hwang, Jungwoo Kim, Sanghyeon Lee, Hongbeen Kim, Jaehyuk Huh

Fast On-device LLM Inference with NPUs

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably…

Artificial Intelligence · Computer Science 2024-12-17 Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu

Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and…

Computation and Language · Computer Science 2025-07-23 Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass

RPU: The Ring Processing Unit

Ring-Learning-with-Errors (RLWE) has emerged as the foundation of many important techniques for improving security and privacy, including homomorphic encryption and post-quantum cryptography. While promising, these techniques have received…

Hardware Architecture · Computer Science 2023-04-14 Deepraj Soni, Negar Neda, Naifeng Zhang, Benedict Reynwar, Homer Gamil, Benjamin Heyman, Mohammed Nabeel, Ahmad Al Badawi, Yuriy Polyakov, Kellie Canida, Massoud Pedram, Michail Maniatakos, David Bruce Cousins, Franz Franchetti, Matthew French, Andrew Schmidt, Brandon Reagen

RAPID-LLM: Resilience-Aware Performance analysis of Infrastructure for Distributed LLM Training and Inference

RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an…

Performance · Computer Science 2025-12-23 George Karfakis, Faraz Tahmasebi, Binglu Chen, Lime Yao, Saptarshi Mitra, Tianyue Pan, Hyoukjun Kwon, Puneet Gupta

Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends

Edge computing processes data where it is generated, enabling faster decisions, lower bandwidth usage, and improved privacy. However, edge devices typically operate under strict constraints on processing power, memory, and energy…

Performance · Computer Science 2025-12-10 Pablo Prieto, Pablo Abad

TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design

Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant,…

Hardware Architecture · Computer Science 2026-02-16 Jonghun Lee, Junghoon Lee, Hyeonjin Kim, Seoho Jeon, Jisup Yoon, Hyunbin Park, Meejeong Park, Heonjae Ha

Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment

Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) by generating natural language (NL) rationales that lead to the final answer. However, it struggles with numerical…

Artificial Intelligence · Computer Science 2025-02-13 Cheryl Li, Tianyuan Xu, Yiwen Guo

LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

The rapid development of large language models (LLM) has greatly enhanced everyday applications. While many FPGA-based accelerators, with flexibility for fine-grained data control, exhibit superior speed and energy efficiency compared to…

Hardware Architecture · Computer Science 2026-03-24 Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong

On Performance Analysis of Graphcore IPUs: Analyzing Squared and Skewed Matrix Multiplication

In recent decades, High Performance Computing (HPC) has undergone significant enhancements, particularly in the realm of hardware platforms, aimed at delivering increased processing power while keeping power consumption within reasonable…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-03 S.-Kazem Shekofteh, Christian Alles, Nils Kochendörfer, Holger Fröning