Related papers: An Open-Source Framework for Efficient Numerically…

Scope: A Scalable Merged Pipeline Framework for Multi-Chip-Module NN Accelerators

Neural network (NN) accelerators with multi-chip-module (MCM) architectures enable integration of massive computation capability; however, they face challenges of computing resource underutilization and off-chip communication overheads.…

Hardware Architecture · Computer Science 2026-02-17 Zongle Huang, Hongyang Jia, Kaiwei Zou, Yongpan Liu

SMaLL: A Software Framework for portable Machine Learning Libraries

Interest in deploying Deep Neural Network (DNN) inference on edge devices has resulted in an explosion of the number and types of hardware platforms to use. While the high-level programming interface, such as TensorFlow, can be readily…

Mathematical Software · Computer Science 2023-03-09 Upasana Sridhar, Nicholai Tukanov, Elliott Binder, Tze Meng Low, Scott McMillan, Martin D. Schatz

SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models

Mixed-precision quantization is a promising approach for compressing large language models under tight memory budgets. However, existing mixed-precision methods typically suffer from one of two limitations: they either rely on expensive…

Machine Learning · Computer Science 2026-02-03 Xin Nie, Haicheng Zhang, Liang Dong, Beining Feng, Jinhong Weng, Guiling Sun

RMSMP: A Novel Deep Neural Network Quantization Framework with Row-wise Mixed Schemes and Multiple Precisions

This work proposes a novel Deep Neural Network (DNN) quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach. Specifically, this is the first effort to assign mixed quantization schemes and multiple…

Machine Learning · Computer Science 2021-11-02 Sung-En Chang, Yanyu Li, Mengshu Sun, Weiwen Jiang, Sijia Liu, Yanzhi Wang, Xue Lin

SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees

The advancement of Large Language Models (LLMs) has significantly boosted performance in natural language processing (NLP) tasks. However, the deployment of high-performance LLMs incurs substantial costs, primarily due to the increased…

Machine Learning · Computer Science 2024-03-22 Saehan Jo, Immanuel Trummer

NASH: Neural Architecture Search for Hardware-Optimized Machine Learning Models

As machine learning (ML) algorithms get deployed in an ever-increasing number of applications, these algorithms need to achieve better trade-offs between high accuracy, high throughput and low latency. This paper introduces NASH, a novel…

Machine Learning · Computer Science 2024-03-12 Mengfei Ji, Yuchun Chang, Baolin Zhang, Zaid Al-Ars

RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental computation in graph analytics, scientific simulation, and sparse deep learning workloads. However, the extreme irregularity of real-world sparse matrices prevents existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-11 Aiying Li, Jingwei Sun, Han Li, Wence Ji, Guangzhong Sun

Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations

Many scientific computing problems can be reduced to Matrix-Matrix Multiplications (MMM), making the General Matrix Multiply (GEMM) kernels in the Basic Linear Algebra Subprograms (BLAS) of interest to the high-performance computing…

Hardware Architecture · Computer Science 2023-05-31 Louis Ledoux, Marc Casas

AutoMM: Energy-Efficient Multi-Data-Type Matrix Multiply Design on Heterogeneous Programmable System-on-Chip

As the increasing complexity of Neural Network (NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., Versal ACAP architectures featured with programmable logic (PL), CPUs,…

Hardware Architecture · Computer Science 2023-05-31 Jinming Zhuang, Zhuoping Yang, Peipei Zhou

Structured Weight Matrices-Based Hardware Accelerators in Deep Neural Networks: FPGAs and ASICs

Both industry and academia have extensively investigated hardware accelerations. In this work, to address the increasing demands in computational capability and memory requirement, we propose structured weight matrices (SWM)-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-01 Caiwen Ding, Ao Ren, Geng Yuan, Xiaolong Ma, Jiayu Li, Ning Liu, Bo Yuan, Yanzhi Wang

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

Sparse matrix-vector and matrix-matrix multiplication (SpMV and SpMM) are fundamental in both conventional (graph analytics, scientific computing) and emerging (sparse DNN, GNN) domains. Workload-balancing and parallel-reduction are…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-15 Guyue Huang, Guohao Dai, Yu Wang, Yufei Ding, Yuan Xie

Structured Multi-Hashing for Model Compression

Despite the success of deep neural networks (DNNs), state-of-the-art models are too large to deploy on low-resource devices or common server configurations in which multiple models are held in memory. Model compression methods address this…

Machine Learning · Computer Science 2019-11-27 Elad Eban, Yair Movshovitz-Attias, Hao Wu, Mark Sandler, Andrew Poon, Yerlan Idelbayev, Miguel A. Carreira-Perpinan

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To…

Machine Learning · Computer Science 2023-06-30 Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, Moshe Wasserblat

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step, converting low-bit weights back to high precision for matrix multiplication, has become a critical bottleneck on modern AI…

Machine Learning · Statistics 2026-05-15 Lingchao Zheng, Yuwei Fan, Jun Li, Chengqiu Hu, Qichen Liao, Junyi Fan, Rui Shi, Fangzheng Miao

CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC…

Hardware Architecture · Computer Science 2023-01-09 Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou

Efficient and Mathematically Robust Operations for Certified Neural Networks Inference

In recent years, machine learning (ML) and neural networks (NNs) have gained widespread use and attention across various domains, particularly in transportation for achieving autonomy, including the emergence of flying taxis for urban air…

Machine Learning · Computer Science 2024-01-17 Fabien Geyer, Johannes Freitag, Tobias Schulz, Sascha Uhrig

SMASH: One-Shot Model Architecture Search through HyperNetworks

Designing architectures for deep neural networks requires expert knowledge and substantial computation time. We propose a technique to accelerate architecture selection by learning an auxiliary HyperNet that generates the weights of a main…

Machine Learning · Computer Science 2017-08-18 Andrew Brock, Theodore Lim, J. M. Ritchie, Nick Weston

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-21 Qiao Zhang, Rabab Alomairy, Dali Wang, Zhuowei Gu, Qinglei Cao

A proactive autoscaling and energy-efficient VM allocation framework using online multi-resource neural network for cloud data center

This work proposes an energy-efficient resource provisioning and allocation framework to meet the dynamic demands of future applications. The frequent variations in a cloud user's resource demand lead to the problem of excess power…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-06 Deepika Saxena, Ashutosh Kumar Singh

Points-to Analysis Using MDE: A Multi-level Deduplication Engine for Repetitive Data and Operations

Precise pointer analysis is a foundational component of many client analyses and optimizations. Scaling flow- and context-sensitive pointer analysis has been a long-standing challenge, suffering from combinatorial growth in both memory…

Programming Languages · Computer Science 2026-04-14 Anamitra Ghorui, Aditi Raste, Uday P. Khedker