Hardware Architecture

A detailed algorithmic study on a reuse-aware, near memory, all-digital Ising machine

Recently, nature-inspired computing approaches have gained significant attention for solving difficult optimization problems, particularly through Ising machines for NP-complete applications. Existing Ising accelerators range from quantum…

Hardware Architecture · Computer Science 2026-05-26 Siddhartha Raman Sundara Raman, Lizy K. John, Jaydeep P. Kulkarni

A comparative study on power delivery aspects of compute-in/near-memory approaches using DRAM

Compute-in-memory (PIM) mitigates the memory wall by performing computation within memory, reducing data movement and improving energy efficiency. DRAM-based PIM is particularly attractive due to its high density, mature manufacturing…

Hardware Architecture · Computer Science 2026-05-26 Siddhartha Raman Sundara Raman, Siyuan Ma, Lizy Kurian John

Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference

Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving…

Hardware Architecture · Computer Science 2026-05-26 Euijun Chung, Yuxiao Jia, Aaron Jezghani, Hyesoon Kim

ABI: A tightly integrated, unified, sparsity-aware, reconfigurable, compute near-register file/cache GPU architecture with light-weight softmax for deep learning, linear algebra, and Ising compute

We present a tightly integrated and unified near-memory GPU architecture that delivers 6 to 16 times speedup and 6 to 13 times energy savings across Convolutional Neural Networks, Graph Convolutional Networks, Linear Programming, Large…

Hardware Architecture · Computer Science 2026-05-26 Siddhartha Raman Sundara Raman, Jaydeep P. Kulkarni

DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration

As deep neural networks grow significantly more diverse and complex, achieving high performance and efficiency on complicated DNN models faces pressing challenges. Modern DNN workloads are increasingly diverse in operation types, tensor…

Hardware Architecture · Computer Science 2026-05-25 Xingzhen Chen, Zhuoping Yang, Jinming Zhuang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, Peipei Zhou

To Overlay or to Customize? Revisiting Architectural Choices in Heterogeneous Systems

In this work, we present a systematic study of this trade-off from a deployment-centric perspective, focusing on an autonomous driving scenario. Instead of treating overlay and customized acceleration as isolated design points, we analyze…

Hardware Architecture · Computer Science 2026-05-25 Xingzhen Chen, Shixin Ji, Zheng Dong, Peipei Zhou

DAE4HLS: Exposing Memory-Level Parallelism for High-Level Synthesis using Explicit Decoupling

High-level synthesis (HLS) performs well for simple memory access patterns, such as for sequential accesses that can be turned into bursts, or for memory accesses into small datasets that can be stored in scratchpads. This limits HLS to…

Hardware Architecture · Computer Science 2026-05-25 David Metz, Magnus Själander

NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference

The Mixture-of-Experts (MoE) models have emerged as the state-of-the-art paradigm for scaling up large language models (LLMs) without proportionally increased computational cost. However, its on-device deployment faces a critical challenge…

Hardware Architecture · Computer Science 2026-05-25 Weikai Xu, Meng Li, Shuzhang Zhong, Tianyang Luo, Dongxue Zhao, Ling Liang, Zongwei Wang, Qianqian Huang, Yimao Cai, Ru Huang

MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization

Masked diffusion enables region-specific image synthesis but suffers from computational redundancy, since the entire image is processed each timestep even though only the masked region requires generation. To address this, we introduce…

Hardware Architecture · Computer Science 2026-05-25 Seeyeon Kim, Jaehun Lee, Sungyeob Yoo, Joo-Young Kim

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

Architectural simulation has become the critical bottleneck limiting design space exploration for high-performance computing systems. Modern GPUs and AI accelerators -- with hundreds to thousands of tightly-coupled components -- demand…

Hardware Architecture · Computer Science 2026-05-25 Wei-Fen Lin, Jen-Chien Chang, Yen-Po Chen, Zi-Yi Tai, Yu-Cheng Chang, Chia-Pao Chiang, Yu-Yang Lee, Yu-Jie Wan

Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing

LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets after tool runs. Existing…

Hardware Architecture · Computer Science 2026-05-25 Pengju Liu, Nuo Xu, Jinwei Tang, Yu Cao, Caiwen Ding

DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory…

Hardware Architecture · Computer Science 2026-05-25 Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing

As large language models (LLMs) continue to advance, retrieval-augmented generation (RAG) has become the key mechanism for expanding model knowledge and reducing hallucinations. Central to RAG is approximate nearest neighbor search (ANNS),…

Hardware Architecture · Computer Science 2026-05-22 Cheng Zou, Shuo Yang, Chen Nie, Yu Zou, Yu He, Chao Jiang, Limin Xiao, Weifeng Zhang, Zhezhi He

Emerging memory technologies at room/cryogenic temperature

As conventional technology scaling approaches physical and power limitations, modern computing systems increasingly face performance bottlenecks arising from memory latency, energy consumption, scalability constraints, and data movement…

Hardware Architecture · Computer Science 2026-05-22 Siddhartha Raman Sundara Raman

CompPow: A Case for Component-level GPU Power Management

The ever increasing demand for ML-driven intelligence in a wide spectrum of domains has led to ubiquity of GPUs. At the same time, GPUs are notorious for their power consumption needs and often dominate power allocation in a typical ML…

Hardware Architecture · Computer Science 2026-05-22 Shaizeen Aga, Mohamed Assem Ibrahim

Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs

Hardware verification is one of the most challenging stages of the hardware design process, requiring significant time and resources to ensure a design is fully validated and production-ready. Verification teams aim to maximize design…

Hardware Architecture · Computer Science 2026-05-22 Sean Lowe, Elias Hilaneh, Alma Babbit, Nakul Gopalan, Vidya Chhabria, Aman Arora

FASE: FPGA-Assisted Syscall Emulation for Rapid End-to-End Processor Performance Validation

The rapid advancement of AI workloads and domain-specific architectures has led to increasingly diverse processor microarchitectures, whose design exploration requires fast and accurate performance validation. However, traditional workflows…

Hardware Architecture · Computer Science 2026-05-22 Chengzhen Meng, Xiuzhuang Chen, Bingcai Sui, Zhenyu Zhao, Tun Li, Hongjun Dai

Supporting Dynamic Control-Flow Execution for Runtime Reconfigurable Processors

As the need for more computing power grows, traditional methods are hitting limits. To boost performance, we're expanding Central Processing Unit (CPU) capabilities and using specialized hardware accelerators. For example, mobile devices…

Hardware Architecture · Computer Science 2026-05-21 Hassan Nassar, Rafik Youssef, Lars Bauer, Jörg Henkel

ELSA: An ELastic SNN Inference Architecture for Efficient Neuromorphic Computing

Spiking neural networks (SNNs) exploit event-driven and addition-only computation to substantially improve efficiency for intelligent computation. A key temporal property of SNNs, elastic inference, allows outputs to emerge progressively,…

Hardware Architecture · Computer Science 2026-05-21 Kang You, Chen Nie, Lee Jun Yan, Ziling Wei, Cheng Zou, Zekai Xu, Yu Feng, Honglan Jiang, Zhezhi He

HyDRA: Deadline and Reuse-Aware Cacheability for Hardware Accelerators

The system-level cache is a critical resource shared by processor cores and domain-specific accelerators in heterogeneous systems on chips (SoCs). The strict QoS requirements of accelerators, such as deadlines, can lead to severe…

Hardware Architecture · Computer Science 2026-05-21 Ayushi Agarwal, Anannya Mathur, Preeti Ranjan Panda