Hardware Architecture

Towards an End-To-End System for Real-Time Gesture Recognition from Surface Vibrations

Sensing surface vibrations promises unobtrusive interaction for smart home systems by enabling gesture recognition on existing everyday surfaces without disturbing living-space design. Existing approaches typically address only parts of the…

Hardware Architecture · Computer Science 2026-05-12 Florian Hettstedt, Cedric Giese, Tianheng Ling, Keiichi Yasumoto, Gregor Schiele, Andreas Erbslöh

RFAmpDesigner: A Self-Evolving Multi-Agent LLM Framework for Automated Radio Frequency Amplifier Design

Automating radio frequency (RF) amplifier design remains challenging because existing methods suffer from the curse of dimensionality, weak use of domain knowledge, and poor transferability, leading to low data efficiency. Meanwhile,…

Hardware Architecture · Computer Science 2026-05-12 Hang Lu, Guochang Li, Qianyu Chen, Huiyan Gao, Shaogang Wang, Xuanyu He, Yiwei Liu, Gaopeng Chen, Nayu Li, Xiaokang Qi, Chunyi Song, Zhiwei Xu

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical…

Hardware Architecture · Computer Science 2026-05-12 Zhiqing Zhong, Zhijing Ye, Jian Zhang, Weijian Zheng, Bolun Sun, Xiaodong Yu

Emerging 2D Materials for Beyond von Neumann Computing: A Perspective

The end of conventional Dennard scaling and the widening gap between memory bandwidth and arithmetic throughput have made the von Neumann partition a structural bottleneck rather than a transient one. Two-dimensional (2D) materials, with…

Hardware Architecture · Computer Science 2026-05-12 Yaser Banad

31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

This work presents a 55nm speculative decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware PNM…

Hardware Architecture · Computer Science 2026-05-12 Pingcheng Dong, Yonghao Tan, Xuejiao Liu, Peng Luo, Yu Liu, Di Pang, Songchen Ma, Xijie Huang, Shih-Yang Liu, Dong Zhang, Zhichao Lu, Luhong Liang, Chi-Ying Tsui, Fengbin Tu, Liang Zhao, Kwang-Ting Cheng

A Reconfigurable Multiplier Architecture for Error-Resilient Applications in RISC-V Core

Neural Networks (NNs) have been widely adopted due to their outstanding efficacy and adaptability across computer vision and deep learning applications. The optimization of NNs is necessary to enable their deployment on energy-constrained…

Hardware Architecture · Computer Science 2026-05-12 Pragun Jaswal, L. Hemanth Krishna, B. Srinivasu

Single 32-bit Sub-Channel DDR5 DIMMs: Architecture, Performance Bounds, and Standardisation

DDR5 SDRAM partitions each 64-bit memory channel into two independent 32-bit sub-channels. A DIMM populating only one sub-channel halves the die count required for a given module, enabling 8 GB modules with current 16 Gbit dies that the…

Hardware Architecture · Computer Science 2026-05-12 Chih-Hua Ke

DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing

In recent years, DeepSeek has achieved strong inference performance but remains hard to deploy on energy-constrained edge devices. This paper presents the DeepSeek Processing Element (DSPE), an edge-oriented architecture that alleviates the…

Hardware Architecture · Computer Science 2026-05-12 Yuhan Zhang, Zhou Wang, Zhou Shu, Jiuren Zhou, Yanqing Xu, Xiaonan Tang, Shushan Qiao, Tianchun Ye, Yang Liu, Anil A. Bharath, Emm Mic Drakakis

FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors

Systolic arrays are the dominant compute fabric for neural network inference. Prior work has addressed column-level fault detection efficiently with uniform test patterns, but row-level (PE-level) fault localization within a faulty column…

Hardware Architecture · Computer Science 2026-05-12 Logashree Venkatasubramanian, Zishen Wan, Viveck Cadambe

REPTILES: Repeated Tiles of Sargantana, a RISC-V multicore based on OpenPiton

The chip industry continues to advance and expand modern computing systems, resulting in increasingly complex multi-core processors. Academic projects, by contrast, face scalability challenges due to limited resources, highlighting the need for…

Hardware Architecture · Computer Science 2026-05-12 Noelia Oliete-Escuín, Arnau Bigas, Narcís Rodas, Albert Aguilera, Sajjad Ahmad, Jonathan Balkind, Xavier Carril, Max Doblas, Ivan Díaz, Roger Figueras, Alireza Foroodnia, Cesar Fuguet, Ignacio Genovese, Raúl Gilabert, Abbas Haghi, Alexander Kropotov, Neiel Leyva, Oscar Lostes-Cazorla, Lorién López-Villellas, Davy Million, Alireza Monemi, Sérik Pérez, Juan Antonio Rodríguez, Víctor Soria-Pardos, Behzad Salami, Francesc Moll, Oscar Palomar, Miquel Moretó, Lluc Alvarez

Design and Implementation of BNN-Based Object Detection on FPGA

This paper implements a Binary Neural Network (BNN) based YOLOv3-tiny-like object detector on a low-cost FPGA. The network takes 320×320×3 RGB images as input. Its main convolution layers use 1-bit weights and 8-bit activations, while Conv1…

Hardware Architecture · Computer Science 2026-05-12 Xuyu Zhao, Yunpeng Wu, Mengyuan Zhu, Haoyu Huang, Xiaoyu Xu, Yanjing Li, Gaolong Zhang, Baochang Zhang

Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes

RTL generation is more than code synthesis. Designs must be syntactically valid, synthesizable, correct, and hardware-efficient. State-of-the-art evaluations stop at functional correctness and do not measure synthesis and implementation quality. This paper…

Hardware Architecture · Computer Science 2026-05-12 Weimin Fu, Zeng Wang, Minghao Shao, Ramesh Karri, Muhammad Shafique, Johann Knechtel, Ozgur Sinanoglu, Xiaolong Guo

Five-Minute Rule 40 Years Later: A First-Principles Revisit for Modern Memory Hierarchy

In 1987, Jim Gray and Gianfranco Putzolu introduced the five-minute rule, a simple heuristic based on storage and memory economics for deciding when data should live in DRAM rather than on storage. Subsequent revisits to the rule largely retained…

Hardware Architecture · Computer Science 2026-05-12 Tong Zhang, Vikram Sharma Mailthody, Fei Sun, Linsen Ma, Chris J. Newburn, Teresa Zhang, Yang Liu, Jiangpeng Li, Hao Zhong, Wen-Mei Hwu

VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair

Large language models (LLMs) have demonstrated immense potential in computer-aided design (CAD), particularly for automated debugging and verification within electronic design automation (EDA) tools. However, Design for Testability (DFT)…

Hardware Architecture · Computer Science 2026-05-12 Haomin Qi, Yuyang Du, Lihao Zhang, Soung Chang Liew, Kexin Chen, Yining Du

AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs

AI accelerator operators are compiled into multi-stage pipeline programs where DMA, vector, matrix, and scalar units execute concurrently on shared on-chip buffers. A missing or misplaced synchronization primitive introduces…

Hardware Architecture · Computer Science 2026-05-11 Hangcheng An, Rui Wang, Depei Qian

Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling

Modern large language model workloads put increasing demands on parallel compute capability and on-chip memory capacity, while also stressing fine-grained data movement and synchronization. These trends motivate exploring and designing…

Hardware Architecture · Computer Science 2026-05-11 Yinrong Li, Zexin Fu, Yichao Zhang, Germain Haugou, Chi Zhang, Marco Bertuletti, Bowen Wang, Luca Benini

Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs

Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads…

Hardware Architecture · Computer Science 2026-05-11 Mohammad Hasan Ahmadilivani, Marten Roots, Marco Restifo, Sven-Markus Loorits, Luca Di Mauro, Jaan Raik

TREA: Low-precision Time-Multiplexed, Resource-Efficient Edge Accelerator for Object Detection and Classification

This work presents TREA, a low-precision time-multiplexed and resource-efficient edge-AI accelerator for object detection and classification, targeting stringent area-power-latency constraints of edge vision platforms. The proposed…

Hardware Architecture · Computer Science 2026-05-11 Vijay Pratap Sharma, Mukul Lokhande, Ratko Pilipovic, Omkar Kokane, Santosh Kumar Vishvakarma

TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines

Commercial FPGAs, such as AMD Versal devices, increasingly incorporate AI engines that exploit low-precision packed-SIMD fused multiply-accumulate (FMA) to achieve proportional throughput gains. However, trans-precision FMA (e.g.,…

Hardware Architecture · Computer Science 2026-05-11 Jiayi Wang, Maohua Nie, Sin-Chen Lin, C.-J. Richard Shi, Ang Li

EDA-Schema-V2: A Multimodal Schema, Open Datasets, and Benchmarks for Machine Learning in Digital Physical Design

The continuous scaling of CMOS technology has significantly increased the complexity of very large-scale integrated circuits, driving interest in applying machine learning (ML) to electronic design automation (EDA). However, the limited…

Hardware Architecture · Computer Science 2026-05-11 Pratik Shrestha, Alec Aversa, Ioannis Savidis