硬件体系结构
Sensing surface vibrations promise unobtrusive interaction for smart home systems by enabling gesture recognition on existing everyday surfaces without disturbing living-space design. Existing approaches typically address only parts of the…
Automating radio frequency (RF) amplifier design remains challenging because existing methods suffer from the curse of dimensionality, weak use of domain knowledge, and poor transferability, leading to low data efficiency. Meanwhile,…
Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical…
The end of conventional Dennard scaling and the widening gap between memory bandwidth and arithmetic throughput have made the von Neumann partition a structural bottleneck rather than a transient one. Two-dimensional (2D) materials, with…
This work presents a 55nm speculative decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware PNM…
Neural Networks (NNs) have been widely adopted due to their outstanding efficacy and adaptability across computer vision and deep learning applications. The optimization of NNs is necessary to enable their deployment on energy constrained…
DDR5 SDRAM partitions each 64-bit memory channel into two independent 32-bit sub-channels. A DIMM populating only one sub-channel halves the die count required for a given module, enabling 8 GB modules with current 16 Gbit dies that the…
In recent years, DeepSeek has achieved strong inference performance but remains hard to deploy on energy-constrained edge devices. This paper presents the DeepSeek Processing Element (DSPE), an edge-oriented architecture that alleviates the…
Systolic arrays are the dominant compute fabric for neural network inference. Prior work has addressed column-level fault detection efficiently with uniform test patterns, but row-level (PE-level) fault localization within a faulty column…
Chip industry continues advancing and expanding modern computing systems, resulting in more complex multi-core processors. Conversely, academic projects face scalability challenges due to limited resources, highlighting the need for…
This paper implements a Binary Neural Network (BNN) based YOLOv3-tiny-like object detector on a low-cost FPGA. The network takes 320*320*3 RGB images as input. Its main convolution layers use 1-bit weights and 8-bit activations, while Conv1…
RTL generation is more than code synthesis. Designs must be syntactically valid, synthesizable, correct, hardware-efficient. SOTA evaluations stop at functional correctness and do not measure synthesis and implementation quality. This paper…
In 1987, Jim Gray and Gianfranco Putzolu introduced the five-minute rule, a simple, storage-memory-economics-based heuristic for deciding when data should live in DRAM rather than on storage. Subsequent revisits to the rule largely retained…
Large language models (LLMs) have demonstrated immense potential in computer-aided design (CAD), particularly for automated debugging and verification within electronic design automation (EDA) tools. However, Design for Testability (DFT)…
AI accelerator operators are compiled into multi-stage pipeline programs where DMA, vector, matrix, and scalar units execute concurrently on shared on-chip buffers. A missing or misplaced synchronization primitive introduces…
Modern large language model workloads put increasing demands on parallel compute capability and on-chip memory capacity, while also stressing fine-grained data movement and synchronization. These trends motivate exploring and designing…
Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads…
This work presents TREA, a low-precision time-multiplexed and resource-efficient edge-AI accelerator for object detection and classification, targeting stringent area-power-latency constraints of edge vision platforms. The proposed…
Commercial FPGAs, such as AMD Versal devices, increasingly incorporate AI engines that exploit low-precision packed-SIMD fused multiply-accumulate (FMA) to achieve proportional throughput gains. However, trans-precision FMA (e.g.,…
The continuous scaling of CMOS technology has significantly increased the complexity of very large-scale integrated circuits, driving interest in applying machine learning (ML) to electronic design automation (EDA). However, the limited…