Hardware Architecture

Not All Faults Are Equal: Transient-Fault Sensitivity Characterization of an Open-Source RISC-V Vector Cluster

We present a transient-fault sensitivity study of the open-source RISC-V vector cluster Spatz under SET and SEU fault models. Across 100,000 fault injections on six MatMul and Widening MatMul configurations, faulty data corruption (FD) is…

Hardware Architecture · Computer Science 2026-05-07 Maoyuan Cai, Amirhossein Kiamarzi, Davide Rossi, Angelo Garofalo

Ultra Low-Power SDM-based Circuit-Switching for Networks-on-Chip

In many modern AI chips and multicore systems-on-chip, embedded applications exhibit predictable inter-core traffic behavior that can be characterized at design time. For such applications, a variety of design-time traffic management and…

Hardware Architecture · Computer Science 2026-05-07 Meysam Zaeemi, Mehdi Modarressi

RangeGuard: Efficient, Bounded Approximate Error Correction for Reliable DNNs

As DRAM scales in density and adopts 3D integration, raw fault rates increase and multi-bit errors are no longer rare. Such errors can severely impact Deep Neural Networks (DNNs): although DNNs tolerate small numerical perturbations, random…

Hardware Architecture · Computer Science 2026-05-07 Hanum Ko, Sangheum Yeon, Jong Hwan Ko, Jungrae Kim

The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance

Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this…

Hardware Architecture · Computer Science 2026-05-07 Chung-Hsuan Tung, Yanxiang Huang, Nirmal Saxena, Philip Shirvani, Saurabh Hukerikar, Twinkle Jain, Abhishek Tyagi, Sanjay Gongalore

täkōFormal: Enabling Robust Software for Programmable Memory Hierarchies (Extended Version)

Accelerators provide large performance and energy-efficiency benefits, but can significantly change the hardware-software interface. The täkō programmable memory hierarchy accelerates data movement by enabling programmers to run…

Hardware Architecture · Computer Science 2026-05-07 Pranav Srinivasan, Manos Kapritsos, Yatin A. Manerkar

Resource Utilization of Differentiable Logic Gate Networks Deployed on FPGAs

On-edge machine learning (ML) often strives to maximize the intelligence of small models while miniaturizing the circuit size and power needed to perform inference. Meeting these needs, differentiable Logic Gate Networks (LGN) have…

Hardware Architecture · Computer Science 2026-05-07 Stephen Wormald, Gilon Kravatsky, Damon Woodard, Domenic Forte

RV-IM100: Quantifying ISA Extension, Datapath Width, and Pipeline Depth Trade-offs in RISC-V Microarchitectures

While functional RISC-V implementations are readily available in academia, controlled empirical studies that extend a single baseline architecture along multiple design axes and quantify the resulting trade-offs at each step remain scarce.…

Hardware Architecture · Computer Science 2026-05-06 Hyunwoo Kang

Lottery BP: Unlocking Quantum Error Decoding at Scale

To enable fault tolerance on millions of qubits in real time, scalable decoding is necessary, which motivates this paper. Existing decoding algorithms (decoders), such as clustering, matching, belief propagation (BP), and neural networks,…

Hardware Architecture · Computer Science 2026-05-06 Yanzhang Zhu, Chen-Yu Peng, Yun Hao Chen, Yeong-Luh Ueng, Di Wu

DARTH-PUM: A Hybrid Processing-Using-Memory Architecture

Analog processing-using-memory (PUM; a.k.a. in-memory computing) makes use of electrical interactions inside memory arrays to perform bulk matrix-vector multiplication (MVM) operations. However, many popular matrix-based kernels need to…

Hardware Architecture · Computer Science 2026-05-06 Ryan Wong, Ben Feinberg, Saugata Ghose

Fletch: File-System Metadata Caching in Programmable Switches

Fast and scalable metadata management across multiple metadata servers is crucial for distributed file systems to handle numerous files and directories. Client-side caching of frequently accessed metadata can mitigate server loads, but…

Hardware Architecture · Computer Science 2026-05-06 Qingxiu Liu, Jiazhen Cai, Siyuan Sheng, Yuhui Chen, Lu Tang, Zhirong Shen, Patrick P. C. Lee

Performance and Energy Benefits of MRDIMMs

Multiplexed Rank DIMMs (MRDIMMs) have recently emerged as memory devices that enable higher bandwidth without increasing DRAM chip frequencies. This paper presents a detailed performance, power and energy evaluation of a production server…

Hardware Architecture · Computer Science 2026-05-05 Pau Díaz, Mariana Carmin, Pouya Esmaili-Dokht, Victor Xirau, Felippe Zacarias, Henrique Potter, Harald Servat, Miquel Moreto, Eduard Ayguadé, Petar Radojković

Monolithic 3D Integration for Null Convention Logic (NCL)-Based Asynchronous Circuits

As the demand for high-speed and low-power electronics continues to grow, the quasi-delay-insensitive (QDI) asynchronous domain of digital design has emerged as a promising alternative to traditional clock-based designs. However, the…

Hardware Architecture · Computer Science 2026-05-05 Xiameng Zhang, Kushal Ponugoti, Ashiq Sakib, Madhava Vemuri

ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA

Vision Mamba (ViM) models offer a compelling efficiency advantage over Transformers by leveraging the linear complexity of State Space Models (SSMs), yet efficiently deploying them on FPGAs remains challenging. Linear layers struggle with…

Hardware Architecture · Computer Science 2026-05-05 Shengzhe Lyu, Yuhan She, Patrick S. Y. Hung, Ray C. C. Cheung, Weitao Xu

PipeRTL: Timing-Aware Pipeline Optimization at IR-Level for RTL Generation

Modern hardware compilers increasingly rely on rich intermediate representations (IRs) to preserve optimization-relevant semantics before generating RTL code. However, one important optimization is still largely deferred to backend tools:…

Hardware Architecture · Computer Science 2026-05-05 Shuo Yin, Fangzhou Liu, Lancheng Zou, Rongliang Fu, Wenqian Zhao, Chen Bai, Tsung-Yi Ho, Yuan Xie, Bei Yu

MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis

Principal Component Analysis (PCA) is widely used for dimensionality reduction in hyperspectral imaging, genomics, and neurosciences. However, it suffers from computational bottlenecks in matrix multiplication and singular value…

Hardware Architecture · Computer Science 2026-05-05 Srivaths Ramasubramanian, Anjali Devarajan, Kousthub P Kaivar, Vibha Shrestta, Shashank D, Sowmyarani C. N, Govinda Raju M, K. S Geetha

Understanding Simulated Architecture via gem5 Call-Stack Profiling

Understanding the behavior of simulated architectures in gem5 is critical for studying complex, deeply integrated computing systems. However, conventional analysis methods provide only an indirect view of the simulated system internals. In…

Hardware Architecture · Computer Science 2026-05-05 Johan Söderström, Rashid Aligholipour, Yuan Yao

AMSnet-q: Unsupervised Circuit Identification and Performance Labeling for AMS Circuits

Analog and mixed-signal (AMS) circuit design remains heavily reliant on expert knowledge. While recent AI-driven automation tools can generate candidate topologies, they critically depend on manually curated datasets with functional and…

Hardware Architecture · Computer Science 2026-05-05 Ze Zhang, Junzhuo Zhou, Yichen Shi, Zhuofu Tao, Rui Ji, Zhiping Yu, Quan Chen, Ting-Jung Lin, Lei He

Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis

To efficiently support Large Language Models (LLMs), modern GPGPU architectures have introduced new features and programming paradigms, such as warp specialization. These features enable temporal overlap between the producer and consumer,…

Hardware Architecture · Computer Science 2026-05-05 Zhongchun Zhou, Yuhang Gu, Chengtao Lai, Ya Wang, Wei Zhang

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Deploying Large Language Models (LLMs) on resource-constrained edge devices faces critical bottlenecks in memory bandwidth and power consumption. While ternary quantization (e.g., BitNet b1.58) significantly reduces model size, its direct…

Hardware Architecture · Computer Science 2026-05-05 Zi-Wei Lin, Tian-Sheuan Chang

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

The energy and latency of an accelerator running a deep neural network (DNN) depend on how the computation and data movement are scheduled in the accelerator (i.e., mapping), and picking an optimal mapping is essential to achieve…

Hardware Architecture · Computer Science 2026-05-05 Michael Gilbert, Tanner Andrulis, Vivienne Sze, Joel S. Emer