Hardware Architecture

Provisioning to Runtime Optimization of a 100 MW-Scale AI Cluster

The electric power supply for AI data centers is now the most significant bottleneck in the race toward Artificial General Intelligence, surpassing even the constraint of AI accelerator availability. To our knowledge, this paper is the…

Hardware Architecture · Computer Science 2026-05-27 Ehsan K. Ardestani, Leonardo Piga, Jovan Stojkovic, Pavan Balaji, Mustafa Ozdal, Mikel Jimenez Fernandez, Mihaela Dimovska, Luka Tadic, Hao Shen, Devika Vishwanath, Richa Mishra, Melaku Mihret, Valentin Andrei, Mauricio Cespedes, Julien Prigent, James Monahan, Tyler Graf, Bin Li, Charles Marquez, Shobhit Kanaujia, Kaushik Veeraraghavan, Chunqiang Tang

μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP

Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale…

Hardware Architecture · Computer Science 2026-05-27 Shixin Ji, Jinming Zhuang, Zhuoping Yang, Xingzhen Chen, Wei Zhang, Peipei Zhou

Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations

Always-on AI applications, from environmental sensors to biomedical implants, require ultra-low power consumption. Analog circuits offer a path to sub-microwatt inference, yet existing analog implementations are limited to feedforward…

Hardware Architecture · Computer Science 2026-05-27 Arthur Fyon, Julien Brandoit, Loris Mendolia, Damien Ernst, Jean-Michel Redouté, Guillaume Drion

NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures

Spiking neural networks (SNNs) are a promising paradigm for energy-efficient event-driven computation, but large-scale SNN execution remains challenging because sparse spike communication and synchronization can dominate runtime. Existing…

Hardware Architecture · Computer Science 2026-05-27 Muhammad Ihsan Al Hafiz, Artur Podobas

RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning

Floorplanning determines the coordinates and shape of each module in integrated circuits. With the scaling of technology nodes, the floorplanning stage, especially in 3D scenarios with multiple stacked layers, has become increasingly…

Hardware Architecture · Computer Science 2026-05-27 Ruizhe Zhong, Xingbo Du, Junchi Yan

Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning

Modern high-performance computing (HPC) environments rely on hybrid storage systems (HSS) that combine multiple storage devices with diverse latency, bandwidth, endurance, and capacity characteristics to meet the performance, capacity, and…

Hardware Architecture · Computer Science 2026-05-27 Rakesh Nadig, Vamanan Arulchelvan, Rahul Bera, Taha Shahroodi, Gagandeep Singh, Andreas Kakolyris, Ismail Emir Yuksel, Mohammad Sadrosadati, Jisung Park, Onur Mutlu

DiSC: Resolution-Scalable Acceleration of Diffusion Models by Exploiting Sparsity and Cached Token Reuse with Hash-based Distribution

Transformer-based diffusion models offer superior scalability and performance but suffer from high computational overhead due to the iterative nature and quadratic complexity of self-attention at high resolutions. In this paper, we propose…

Hardware Architecture · Computer Science 2026-05-26 Jieon Yoon, Hangyeol Lee, Jaehoon Heo, Joo-Young Kim

Code size reduction by advanced near addressing modes

To enable debugging and calibration of real-time systems that interact with the real plant, the software used on those systems often has a huge number of global variables. The huge number of global variables exceeds the range…

Hardware Architecture · Computer Science 2026-05-26 Kajetan Nuernberger, Thomas Roecker, Gergely Fueto, Gabor Spaits, Horst Lehser

Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory

Approximate Nearest Neighbor Search (ANNS) is a core primitive in modern AI systems, and graph-based methods currently offer the best accuracy-efficiency trade-off at scale. The workload is fundamentally memory-bound: graph traversal…

Hardware Architecture · Computer Science 2026-05-26 Sitian Chen, Yusen Li, Yao Chen, Minwen Deng, Jintao Meng, Amelie Chi Zhou

Architectural Limits of Cloud TPUs in Finite-Field Cryptography

We empirically characterise the cost-efficiency deficit between cloud Tensor Processing Units and GPUs for finite-field cryptography. Against A100 GPU baselines (cuZK), we measure a $[5{,}558\times, 6{,}908\times]$ deficit across v5p and v4…

Hardware Architecture · Computer Science 2026-05-26 Hung Dang, Xuan Phu Dang, Tue Nguyen

XL-HD: Extended Learning in Hyperdimensional Computing via Deterministic Projections for In-Memory Accelerators

Hyperdimensional computing (HDC) is a promising approach for energy-efficient edge machine learning (ML), where low latency, low power, and tight memory budgets are essential. However, traditional HDC relies on symbolic binding and…

Hardware Architecture · Computer Science 2026-05-26 Sabrina Hassan Moon, Abu Kaisar Mohammad Masum, Sercan Aygun, Dayane Reis

An Energy-Efficient Approximate Posit Multiply-Divide Unit

In modern computing units, division operations are generally slower than other arithmetic operations and require more resources, such as area and power, than multiplication. To reduce the delay, fast division algorithms use an initial…

Hardware Architecture · Computer Science 2026-05-26 Rishi Thotli, Aditya Anirudh Jonnalagadda, Rishabh Hulsurkar, Anil Kumar Uppugunduru, Sreehari Veeramachaneni, Syed Ershad Ahmed, John Gustafson

MX-SAFE: Versatile Inference- and Training-Proof Microscaling Format with On-the-Fly Exponent and Mantissa Bit Allocation

As the demand for deep learning grows, cost reduction through quantization has become essential for both training and inference. In 2022, the Open Compute Project (OCP) consortium standardized narrow precision formats for deep learning,…

Hardware Architecture · Computer Science 2026-05-26 Dahoon Park, Jahyun Koo, Sangwoo Hwang, Jaeha Kung

EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture

Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which employs compute-bound GEMM operations, decoding…

Hardware Architecture · Computer Science 2026-05-26 Bowen Duan, Cong Guo, Chiyue Wei, Haoxuan Shan, Yuzhe Fu, Xinhua Chen, Yifan Xu, Ziyue Zhang, Changchun Zhou, Hai Li, Yiran Chen

A Per-Access Upper Bound for Shared-Resource Interference in Direct-Mapped Multicore Architectures

We present a formal bounding analysis for maximum credible interference in multicore processors under strict architectural invariants: direct-mapped L2 cache (1-way associativity), disabled Miss Status Handling Registers (MSHRs),…

Hardware Architecture · Computer Science 2026-05-26 Felipe T. Pedroni

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving

In long-context Large Language Model (LLM) inference, the Time-To-First-Token (TTFT) latency incurred by the prefill stage has become the foremost bottleneck limiting interactive performance and deployment cost. KV Cache reuse offers a…

Hardware Architecture · Computer Science 2026-05-26 Fei li, Song Liu, Yan Liu, Jinhua Cui, Shiqiang Nie, Jinyu Wang, Weiguo Wu

CMAX-CAMEL: A Coarse-to-Fine Adaptive, Memory-Efficient, and Low-Power Edge Processor for Contrast Maximization

Contrast maximization (CMAX) is a direct geometric framework for event-based motion estimation, but its iterative warp-and-accumulate pipeline incurs input-dependent computation and frequent memory accesses, challenging real-time, low-power…

Hardware Architecture · Computer Science 2026-05-26 Kyeongpil Min, Jongin Choi, Kyeongwon Lee, Woojoo Lee

SA-Kura: An Energy-Efficient Systolic Array Accelerator for Locally-Coupled Kuramoto Drift in Diffusion Sampling

Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial…

Hardware Architecture · Computer Science 2026-05-26 Jeongmin Jin, Kyeongwon Lee, Mundo Jeong, Jongin Choi, Woojoo Lee

Decompose, Optimize, and Reconstruct: Very Large Constant Multiplication at Scale

Efficient arithmetic circuit design for resource-constrained hardware involves challenging combinatorial optimization problems, among which Multiple Constant Multiplication (MCM) is a prominent example. MCM aims at implementing…

Hardware Architecture · Computer Science 2026-05-26 Théo Cantaloube, Nicolai Fiege, Anastasia Volkova, Christine Solnon

Predictive Software Scheduling as an Early-Warning Hint Layer for Optical Engine Thermal Drift in Heterogeneous SoIC Packaging

As semiconductor scaling reaches the A16 / 2 nm node, the integration of co-packaged optics (CPO) via TSMC's Co-Packaged Optics Ultra Engine (COUPE) architecture introduces critical thermal-optical coupling challenges. Micro-ring resonators…

Hardware Architecture · Computer Science 2026-05-26 Chi Fei Chung