Hardware Architecture

CARMEN: CORDIC-Accelerated Resource-Efficient Multi-Precision Inference Engine for Deep Learning

This paper presents CARMEN, a runtime-adaptive, CORDIC-accelerated multi-precision vector engine for resource-efficient deep learning inference. The key insight is that CORDIC iteration depth directly governs computational accuracy,…

Hardware Architecture · Computer Science 2026-05-11 Sonu Kumar, Mukul Lokhande, Santosh Kumar Vishvakarma, Adam Teman
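
The CARMEN abstract above ties CORDIC iteration depth to computational accuracy. As a rough illustration of that general relationship only, here is a minimal floating-point sketch of textbook rotation-mode CORDIC for sine and cosine. It is not CARMEN's datapath: the function name `cordic_sin_cos` and its `iterations` parameter are assumptions made for this example, and a hardware engine would use fixed-point shifts and adds rather than Python floats.

```python
import math

def cordic_sin_cos(theta, iterations):
    """Textbook rotation-mode CORDIC for theta in [-pi/2, pi/2].

    Hypothetical helper for illustration; not taken from the CARMEN paper.
    """
    # Elementary rotation angles atan(2^-i), one per iteration.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; pre-scale by the inverse gain.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate toward zero residual angle
        # In hardware the 2^-i factors are right shifts, not multiplications.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x  # (sin(theta), cos(theta)) approximations

if __name__ == "__main__":
    for n in (4, 8, 16):
        s, _ = cordic_sin_cos(0.6, n)
        print(n, abs(s - math.sin(0.6)))
```

The residual angle after n iterations is bounded by atan(2^-(n-1)), so each extra iteration buys roughly one more bit of accuracy. That is the kind of accuracy-versus-depth knob a runtime-adaptive design could trade against latency and energy.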

EULER-ADAS: Energy-Efficient & SIMD-Unified Logarithmic-Posit Engine for Precision-Reconfigurable Approximate ADAS Acceleration

Advanced driver-assistance systems (ADAS) require neural compute engines that deliver low-latency inference under strict power and area constraints. Posit arithmetic is attractive for such accelerators because it provides high numerical…

Hardware Architecture · Computer Science 2026-05-11 Mukul Lokhande, Ratko Pilipovic, Omkar Kokane, Adam Teman, Santosh Kumar Vishvakarma

AnalogToBi: Device-Level Analog Circuit Topology Generation via Bipartite Graph and Grammar Guided Decoding

Analog circuit design remains highly dependent on expert knowledge due to the complexity of device-level interactions and topology design. Recent transformer-based approaches for device-level topology generation have shown promise, yet they…

Hardware Architecture · Computer Science 2026-05-11 Seungmin Kim, Mingun Kim, Yuna Lee, Yulhwa Kim

Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0

A MICRO 2024 best paper runner-up publication (the Mess paper) with all three artifact badges awarded (including "Reproducible") proposes a new benchmark to evaluate real and simulated memory system performance. The publication contends…

Hardware Architecture · Computer Science 2026-05-11 F. Nisa Bostanci, Haocong Luo, Ataberk Olgun, Maria Makeenkova, Geraldo F. Oliveira, A. Giray Yaglikci, Onur Mutlu

PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs

Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks…

Hardware Architecture · Computer Science 2026-05-08 Rappy Saha, Jude Haris, Nicolas Bohm Agostini, David Kaeli, José Cano
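
The PoTAcc abstract above notes that power-of-two quantization replaces multiplications with bit shifts. The sketch below illustrates that general idea under stated assumptions and does not reproduce the PoTAcc pipeline: the helper names `pot_quantize` and `shift_multiply`, the exponent range `min_exp`/`max_exp`, and the choice to round in the log2 domain are all hypothetical.

```python
import math

def pot_quantize(w, min_exp=-8, max_exp=0):
    """Round a weight to sign * 2**e and return (sign, e).

    Rounding is done in the log2 domain and the exponent is clamped to
    [min_exp, max_exp]; both choices are assumptions for this sketch.
    """
    if w == 0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    e = round(math.log2(abs(w)))
    e = max(min_exp, min(max_exp, e))
    return sign, e

def shift_multiply(x, sign, e):
    """Multiply integer activation x by sign * 2**e using shifts only."""
    if sign == 0:
        return 0
    # Negative exponents become right shifts (which floor the result).
    shifted = x << e if e >= 0 else x >> -e
    return sign * shifted

if __name__ == "__main__":
    sign, e = pot_quantize(0.23)       # nearest power of two is 2**-2
    print(sign, e, shift_multiply(64, sign, e))
```

Storing only a sign and a small exponent per weight is what shrinks the model, and the shift stands in for the multiplier in each multiply-accumulate.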

XtraMAC: An Efficient MAC Architecture for Mixed-Precision LLM Inference on FPGA

The widespread adoption of mixed-precision quantization in large language models (LLMs) has created demand for hardware that can efficiently perform multiply-accumulate (MAC) operations across mixed datatypes and switch datatypes at…

Hardware Architecture · Computer Science 2026-05-08 Feng Yu, Hongshi Tan, Yao Chen, Weng-Fai Wong, Bingsheng He

A virtually connected probabilistic computer as a solver for higher-order, densely connected, or reconfigurable combinatorial optimisation problems

Recently, there has been growing interest in unconventional computing as an approach for solving NP-hard problems, by developing dedicated hardware to find solutions more efficiently than conventional CPUs. In many of these approaches,…

Hardware Architecture · Computer Science 2026-05-08 Amy J. Searle, Harry Youel, Fredrik Hasselgren, Annika Möslein, Ramy Aboushelbaya, Marko von der Leyen

LLM-Driven Design Space Exploration of FPGA-based Accelerators

Designing field-programmable gate array (FPGA)-based accelerators for modern artificial intelligence workloads requires navigating a large and complex hardware design space encompassing architectural parameters, dataflow strategies, and…

Hardware Architecture · Computer Science 2026-05-08 Vinamra Sharma, Xingjian Fu, Jude Haris, José Cano

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

The Mixture-of-Experts (MoE) architecture is crucial for scaling large language models, but its scalability is severely limited by inter-GPU communication bottlenecks in multi-GPU systems. Although overlapping communication with computation…

Hardware Architecture · Computer Science 2026-05-08 Zhuoshan Zhou, Chen Zhang, Shuyi Zhang, Qijun Zhang, Haibo Wang, Zhe Zhou, Zhipeng Tu, Guangyu Sun, Yijia Diao, Zhigang Ji, Jingwen Leng, Guanghui He, Minyi Guo

TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference

Large language model (LLM) serving is now limited by the key-value (KV) cache. During decode, each new token rereads prior KV state, so attention becomes a bandwidth- and capacity-heavy memory task. HBM-PIM helps by moving attention closer…

Hardware Architecture · Computer Science 2026-05-08 Zhuoran Li, Zhuohang Bian, Zihao Huang, Guangyu Sun, Yun Liang, Youwei Zhuo

Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems

Tensor parallelism (TP) in large-scale LLM inference and training introduces frequent collective operations that dominate inter-GPU communication. While in-switch computing, exemplified by NVLink SHARP (NVLS), accelerates collective…

Hardware Architecture · Computer Science 2026-05-08 Chen Zhang, Qijun Zhang, Zhuoshan Zhou, Yijia Diao, Haibo Wang, Zhe Zhou, Zhipeng Tu, Zhiyao Li, Guangyu Sun, Zhuoran Song, Zhigang Ji, Jingwen Leng, Minyi Guo

Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs

Mixture-of-Experts (MoE) has been adopted by many leading large models to reduce computational requirements. However, frequent inter-GPU communication in MoE expert parallelism (EP) becomes a performance challenge. We observe substantial…

Hardware Architecture · Computer Science 2026-05-08 Qijun Zhang, Chen Zhang, Zhuoshan Zhou, Haibo Wang, Zhe Zhou, Zhipeng Tu, Guangyu Sun, Zhiyao Xie, Yijia Diao, Zhigang Ji, Jingwen Leng, Guanghui He, Minyi Guo

DICE: Enabling Efficient General-Purpose SIMT Execution with Statically Scheduled Coarse-Grained Reconfigurable Arrays

While GPUs dominate massively parallel computing through the single-instruction, multiple-thread (SIMT) programming model, their underlying single-instruction, multiple-data (SIMD) execution incurs substantial energy overhead from frequent…

Hardware Architecture · Computer Science 2026-05-08 Jiayi Wang, Ang Da Lu, Zhichen Zeng, Ang Li

Beyond Static Policies: Exploring Dynamic Policy Selection for Single-Thread Performance Optimization

For over a decade, processor design has focused on implementing sophisticated policies for various components of the out-of-order pipeline, including cache replacement and prefetching. The prevailing design philosophy has been to build…

Hardware Architecture · Computer Science 2026-05-08 Yanxin Zhang, Ian McDougall, Junnan Li, Shayne Wadle, Vikas Singh, Karthikeyan Sankaralingam

An Open-Source Flow for Single-Phase, Edge-Triggered to Two-Phase, Non-Overlapping Clocking Conversion

Two-phase clocking offers significant advantages in timing margin and clock flexibility, yet its adoption remains limited due to the absence of automation in modern design flows. Managing strict non-overlap and 180° phase separation…

Hardware Architecture · Computer Science 2026-05-08 Paolo Pedroso, Lee-Way Wang, Matthew Guthaus

UVMarvel: an Automated LLM-aided UVM Machine for Subsystem-level RTL Verification

Verification presents a major bottleneck in Integrated Circuit (IC) development, consuming nearly 70% of total effort. While the Universal Verification Methodology (UVM) improves reuse through structured verification environments,…

Hardware Architecture · Computer Science 2026-05-08 Junhao Ye, Dingrong Pan, Hanyuan Liu, Yuchen Hu, Jie Zhou, Ke Xu, Xinwei Fang, Xi Wang, Nan Guan, Zhe Jiang

SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

Specialized accelerators dominate AI workloads, but CPUs remain critical for orchestrating these accelerators and running datacenter services. As a result, CPU performance increasingly shapes end-to-end system efficiency, making it…

Hardware Architecture · Computer Science 2026-05-08 Ruihao Li, Andrew Jacob, Neeraja J. Yadwadkar, Lizy K. John

Duet: Creating Harmony between Processors and Embedded FPGAs

The demise of Moore's Law has led to the rise of hardware acceleration. However, the focus on accelerating stable algorithms in their entirety neglects the abundant fine-grained acceleration opportunities available in broader domains and…

Hardware Architecture · Computer Science 2026-05-08 Ang Li, August Ning, David Wentzlaff

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

Driven by a rapid co-evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced "Design Conductor" (or just "Conductor"), a system capable of…

Hardware Architecture · Computer Science 2026-05-07 The Verkor Team, Ravi Krishna, Suresh Krishna, David Chin

MCFlash: Bulk Bitwise Processing in 3D NAND with Dynamic Sensing and Multi-level Encoding

This paper presents MCFlash, a practical and immediately deployable technique for executing bulk bitwise operations directly within commercial off-the-shelf (COTS) 3D NAND flash chips. MCFlash relies solely on standard user-mode…

Hardware Architecture · Computer Science 2026-05-07 Habib Ur Rahman, Tharini Suresh, Sudeep Pasricha, Biswajit Ray