Related papers: RPU -- A Reasoning Processing Unit

200 papers

The attention layer, a core component of Transformer-based LLMs, brings out inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth…

Hardware Architecture · Computer Science 2025-12-19 Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hoshik Kim

This work elaborates on a High performance computing (HPC) architecture based on Simple Linux Utility for Resource Management (SLURM) [1] for deploying heterogeneous Large Language Models (LLMs) into a scalable inference engine. Dynamic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-26 Anderson de Lima Luiz, Shubham Vijay Kurlekar, Munir Georges

The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language model (LLM), which consists of billions of pretrained parameters that embodies the aspects of syntax and semantics. HyperAccel introduces latency…

LLMs now form the backbone of AI agents across a diverse range of applications, including tool use, command-line interfaces, and web or computer interaction. These agentic LLM inference tasks are fundamentally different from chatbot-focused…

Machine learning applications are computationally demanding and power intensive. Hardware acceleration of these software tools is a natural step being explored using various technologies. A recurrent processing unit (RPU) is fast and…

Emerging Technologies · Computer Science 2019-12-17 Heidi Komkov, Alessandro Restelli, Brian Hunt, Liam Shaughnessy, Itamar Shani, Daniel P. Lathrop

Large-language models (LLMs) are rapidly being applied to radiology, enabling automated image interpretation and report generation tasks. Their deployment in clinical practice requires both high diagnostic accuracy and low inference…

Tissues and Organs · Quantitative Biology 2025-11-11 Jyun-Ping Kao

With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed…

Hardware Architecture · Computer Science 2021-03-12 Xinfeng Xie, Peng Gu, Yufei Ding, Dimin Niu, Hongzhong Zheng, Yuan Xie

With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei…

Hardware Architecture · Computer Science 2025-10-08 Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia

Large Language Models (LLMs) have become essential in a variety of applications due to their advanced language understanding and generation capabilities. However, their computational and memory requirements pose significant challenges to…

Hardware Architecture · Computer Science 2024-12-02 Cristobal Ortega, Yann Falevoz, Renaud Ayrignac

Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-05 Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren

A large language model (LLM) is one of the most important emerging machine learning applications nowadays. However, due to its huge model size and runtime increase of the memory footprint, LLM inferences suffer from the lack of memory…

Hardware Architecture · Computer Science 2025-04-22 Soojin Hwang, Jungwoo Kim, Sanghyeon Lee, Hongbeen Kim, Jaehyuk Huh

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably…

Artificial Intelligence · Computer Science 2024-12-17 Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu

To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and…

Computation and Language · Computer Science 2025-07-23 Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass

Ring-Learning-with-Errors (RLWE) has emerged as the foundation of many important techniques for improving security and privacy, including homomorphic encryption and post-quantum cryptography. While promising, these techniques have received…

RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an…

Edge computing processes data where it is generated, enabling faster decisions, lower bandwidth usage, and improved privacy. However, edge devices typically operate under strict constraints on processing power, memory, and energy…

Performance · Computer Science 2025-12-10 Pablo Prieto, Pablo Abad

Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant,…

Hardware Architecture · Computer Science 2026-02-16 Jonghun Lee, Junghoon Lee, Hyeonjin Kim, Seoho Jeon, Jisup Yoon, Hyunbin Park, Meejeong Park, Heonjae Ha

Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) by generating natural language (NL) rationales that lead to the final answer. However, it struggles with numerical…

Artificial Intelligence · Computer Science 2025-02-13 Cheryl Li, Tianyuan Xu, Yiwen Guo

The rapid development of large language models (LLM) has greatly enhanced everyday applications. While many FPGA-based accelerators, with flexibility for fine-grained data control, exhibit superior speed and energy efficiency compared to…

Hardware Architecture · Computer Science 2026-03-24 Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong

In recent decades, High Performance Computing (HPC) has undergone significant enhancements, particularly in the realm of hardware platforms, aimed at delivering increased processing power while keeping power consumption within reasonable…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-03 S.-Kazem Shekofteh, Christian Alles, Nils Kochendörfer, Holger Fröning