Related papers: Bit-Parallel Vector Composability for Neural Accel…

Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds upon the algorithmic insight that bitwidth of operations in DNNs can be reduced…

Neural and Evolutionary Computing · Computer Science 2018-05-31 Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Joon Kyung Kim, Vikas Chandra, Hadi Esmaeilzadeh

Bit-balance: Model-Hardware Co-design for Accelerating NNs by Exploiting Bit-level Sparsity

Bit-serial architectures can handle Neural Networks (NNs) with different weight precisions, achieving higher resource efficiency compared with bit-parallel architectures. Besides, the weights contain abundant zero bits owing to the fault…

Hardware Architecture · Computer Science 2023-02-02 Wenhao Sun, Zhiwei Zou, Deng Liu, Wendi Sun, Song Chen, Yi Kang

Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs

Despite foreseeing tremendous speedups over conventional deep neural networks, the performance advantage of binarized neural networks (BNNs) has merely been showcased on general-purpose processors such as CPUs and GPUs. In fact, due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-16 Ang Li, Simon Su

8-Bit Approximations for Parallelism in Deep Learning

The creation of practical deep learning data-products often requires parallelization across processors and computers to make deep learning feasible on large data sets, but bottlenecks in communication bandwidth make it difficult to attain…

Neural and Evolutionary Computing · Computer Science 2016-02-22 Tim Dettmers

A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their…

Hardware Architecture · Computer Science 2021-06-25 Petar Jokic, Erfan Azarkhish, Andrea Bonetti, Marc Pons, Stephane Emery, Luca Benini

Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment

Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices. While binarized Transformers offer a promising solution by significantly reducing…

Machine Learning · Computer Science 2025-05-13 Yuhao Ji, Chao Fang, Shaobo Ma, Haikuo Shao, Zhongfeng Wang

Bit-Line Computing for CNN Accelerators Co-Design in Edge AI Inference

By supporting the access of multiple memory words at the same time, Bit-line Computing (BC) architectures allow the parallel execution of bit-wise operations in-memory. At the array periphery, arithmetic operations are then derived with…

Hardware Architecture · Computer Science 2022-09-14 Marco Rios, Flavio Ponzina, Alexandre Levisse, Giovanni Ansaloni, David Atienza

Rethinking Co-design of Neural Architectures and Hardware Accelerators

Neural architectures and hardware accelerators have been two driving forces for the progress in deep learning. Previous works typically attempt to optimize hardware given a fixed model architecture or model architecture given fixed…

Machine Learning · Computer Science 2021-02-18 Yanqi Zhou, Xuanyi Dong, Berkin Akin, Mingxing Tan, Daiyi Peng, Tianjian Meng, Amir Yazdanbakhsh, Da Huang, Ravi Narayanaswami, James Laudon

A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks

FPGA-based hardware accelerators for convolutional neural networks (CNNs) have obtained great attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-09 Yixing Li, Zichuan Liu, Kai Xu, Hao Yu, Fengbo Ren

A flexible FPGA accelerator for convolutional neural networks

Though CNNs are highly parallel workloads, in the absence of efficient on-chip memory reuse techniques, an accelerator for them quickly becomes memory bound. In this paper, we propose a CNN accelerator design for inference that is able to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-26 Kingshuk Majumder, Shubham Nema, Uday Bondhugula

Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers

Recent advancements in neural rendering technologies and their supporting devices have paved the way for immersive 3D experiences, significantly transforming human interaction with intelligent devices across diverse applications. However,…

Graphics · Computer Science 2025-04-01 Chaojian Li, Sixu Li, Linrui Jiang, Jingqun Zhang, Yingyan Celine Lin

FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI

Recent research has shown that large language models (LLMs) can utilize low-precision floating point (FP) quantization to deliver high efficiency while maintaining original model accuracy. In particular, recent works have shown the…

Hardware Architecture · Computer Science 2025-06-05 Faraz Tahmasebi, Yian Wang, Benji Y. H. Huang, Hyoukjun Kwon

Design of High-Throughput Mixed-Precision CNN Accelerators on FPGA

Convolutional Neural Networks (CNNs) reach high accuracies in various application domains, but require large amounts of computation and incur costly data movements. One method to decrease these costs while trading accuracy is weight and/or…

Hardware Architecture · Computer Science 2022-08-10 Cecilia Latotzke, Tim Ciesielski, Tobias Gemmeke

AutoNeural: Co-Designing Vision-Language Models for NPU Inference

While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary…

Computation and Language · Computer Science 2025-12-09 Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, Han Yang

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Deploying Large Language Models (LLMs) on resource-constrained edge devices faces critical bottlenecks in memory bandwidth and power consumption. While ternary quantization (e.g., BitNet b1.58) significantly reduces model size, its direct…

Hardware Architecture · Computer Science 2026-05-05 Zi-Wei Lin, Tian-Sheuan Chang

Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks

Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g. visual systems, robotics etc. However, existing software solutions are not efficient. Therefore, many hardware accelerators have been proposed…

Machine Learning · Computer Science 2021-09-08 Sasindu Wijeratne, Sandaruwan Jayaweera, Mahesh Dananjaya, Ajith Pasqual

Work-in-Progress: Real-Time Neural Network Inference on a Custom RISC-V Multicore Vector Processor

Neural networks are increasingly used in real-time systems, such as automated driving applications. This requires high-performance hardware with predictable timing behavior. State-of-the-art real-time hardware is limited in memory and…

Hardware Architecture · Computer Science 2024-10-15 Maximilian Kirschner, Konstantin Dudzik, Jürgen Becker

AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-24 Jason Cong, Peng Wei, Cody Hao Yu, Peng Zhang

Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization

Vector similarity search is a critical component of modern AI systems, but traditional CPU-based implementations face fundamental scalability bottlenecks for billion-scale corpora due to prohibitive computational overhead and memory…

Information Retrieval · Computer Science 2026-05-18 Fujun He, Chuyue Ye, Huaxiang Cai, Zetao Lv, Baolong Cui, Wenru Yan, Chao Zhan, Zigang Zhang, Hao Yi, Jie Xiang, Xiabing Li, Yuhang Gai, Ziyang Zhang, Pengfei Zheng, Yunfei Du

Booster: An Accelerator for Gradient Boosting Decision Trees

We propose Booster, a novel accelerator for gradient boosting trees based on the unique characteristics of gradient boosting models. We observe that the dominant steps of gradient boosting training (accounting for 90-98% of training time)…

Hardware Architecture · Computer Science 2020-11-06 Mingxuan He, T. N. Vijaykumar, Mithuna Thottethodi