硬件体系结构
The electric power supply for AI data centers is now the most significant bottleneck in the race toward Artificial General Intelligence, surpassing even the constraint of AI accelerator availability. To our knowledge, this paper is the…
Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale…
Always-on AI applications, from environmental sensors to biomedical implants, require ultra-low power consumption. Analog circuits offer a path to sub-microwatt inference, yet existing analog implementations are limited to feedforward…
Spiking neural networks (SNNs) are a promising paradigm for energy-efficient event-driven computation, but large-scale SNN execution remains challenging because sparse spike communication and synchronization can dominate runtime. Existing…
Floorplanning determines the coordinate and shape of each module in Integrated Circuits. With the scaling of technology nodes, in floorplanning stage especially 3D scenarios with multiple stacked layers, it has become increasingly…
Modern high-performance computing (HPC) environments rely on hybrid storage systems (HSS) that combine multiple storage devices with diverse latency, bandwidth, endurance, and capacity characteristics to meet the performance, capacity, and…
Transformer-based diffusion models offer superior scalability and performance but suffer from high computational overhead due to the iterative nature and quadratic complexity of self-attention at high resolutions. In this paper, we propose…
To enable debugging and calibration of real time systems, which are in interaction with the real plant, the software used on those systems often has a huge number of global variables. The huge number of global variables exceed the range…
Approximate Nearest Neighbor Search (ANNS) is a core primitive in modern AI systems, and graph-based methods currently offer the best accuracy-efficiency trade-off at scale. The workload is fundamentally memory-bound: graph traversal…
We empirically characterise the cost-efficiency deficit between cloud Tensor Processing Units and GPUs for finite-field cryptography. Against A100 GPU baselines (cuZK), we measure a $[5{,}558\times, 6{,}908\times]$ deficit across v5p and v4…
Hyperdimensional computing (HDC) is a promising approach for energy-efficient edge machine learning (ML), where low latency, low power, and tight memory budgets are essential. However, traditional HDC relies on symbolic binding and…
In modern computing units, division operations are generally slower than other arithmetic operations and require more resources, such as area and power, than multiplication. To reduce the delay, fast division algorithms use an initial…
As the demand for deep learning grows, cost reduction through quantization has become essential for both training and inference. In 2022, the Open Compute Project (OCP) consortium standardized narrow precision formats for deep learning,…
Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which employs compute-bound GEMM operations, decoding…
We present a formal bounding analysis for maximum credible interference in multicore processors under strict architectural invariants: direct-mapped L2 cache (1-way associativity), disabled Miss Status Handling Registers (MSHRs),…
In long-context Large Language Model (LLM) inference, the Time-To-First-Token (TTFT) latency incurred by the prefill stage has become the foremost bottleneck limiting interactive performance and deployment cost. KV Cache reuse offers a…
Contrast maximization (CMAX) is a direct geometric framework for event-based motion estimation, but its iterative warp-and-accumulate pipeline incurs input-dependent computation and frequent memory accesses, challenging real-time, low-power…
Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial…
Efficient arithmetic circuit design for resourceconstrained hardware involves challenging combinatorial optimization problems, among which Multiple Constant Multiplication (MCM) is a prominent example. MCM aims at implementing…
As semiconductor scaling reaches the A16 / 2 nm node, the integration of co-packaged optics (CPO) via TSMC's Co-Packaged Optics Ultra Engine (COUPE) architecture introduces critical thermal-optical coupling challenges. Micro-ring resonators…