Related papers: UNIT: Unifying Tensorized Instruction Compilation

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation

With the rapid development of deep learning models and hardware support for dense computing, the deep learning workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of…

Machine Learning · Computer Science 2024-03-12 Jianhui Li , Zhennan Qin , Yijie Mei , Jingze Cui , Yunfei Song , Ciyong Chen , Yifei Zhang , Longsheng Du , Xianhang Cheng , Baihui Jin , Yan Zhang , Jason Ye , Eric Lin , Dan Lavery

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

Deep neural networks (DNNs) are of critical use in different domains. To accelerate DNN computation, tensor compilers are proposed to generate efficient code on different domain-specific accelerators. Existing tensor compilers mainly focus…

Machine Learning · Computer Science 2023-07-12 Zixuan Ma , Haojie Wang , Jingze Xing , Liyan Zheng , Chen Zhang , Huanqi Cao , Kezhao Huang , Shizhi Tang , Penghan Wang , Jidong Zhai

Memory-Guided Unified Hardware Accelerator for Mixed-Precision Scientific Computing

Recent hardware acceleration advances have enabled powerful specialized accelerators for finite element computations, spiking neural network inference, and sparse tensor operations. However, existing approaches face fundamental limitations:…

Hardware Architecture · Computer Science 2026-01-09 Chuanzhen Wang , Leo Zhang , Eric Liu

Comprehensive Design Space Exploration for Tensorized Neural Network Hardware Accelerators

High-order tensor decomposition has been widely adopted to obtain compact deep neural networks for edge deployment. However, existing studies focus primarily on its algorithmic advantages such as accuracy and compression ratio-while…

Hardware Architecture · Computer Science 2025-11-26 Jinsong Zhang , Minghe Li , Jiayi Tian , Jinming Lu , Zheng Zhang

TensorIR: An Abstraction for Automatic Tensorized Program Optimization

Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration…

Machine Learning · Computer Science 2022-10-31 Siyuan Feng , Bohan Hou , Hongyi Jin , Wuwei Lin , Junru Shao , Ruihang Lai , Zihao Ye , Lianmin Zheng , Cody Hao Yu , Yong Yu , Tianqi Chen

Instruction-Based Coordination of Heterogeneous Processing Units for Acceleration of DNN Inference

This paper presents an instruction-based coordination architecture for Field-Programmable Gate Array (FPGA)-based systems with multiple high-performance Processing Units (PUs) for accelerating Deep Neural Network (DNN) inference. This…

Hardware Architecture · Computer Science 2026-01-06 Anastasios Petropoulos , Theodore Antonakopoulos

UnIT: Scalable Unstructured Inference-Time Pruning for MAC-efficient Neural Inference on MCUs

Existing pruning methods are typically applied during training or compile time and often rely on structured sparsity. While compatible with low-power microcontrollers (MCUs), structured pruning underutilizes the opportunity for fine-grained…

Machine Learning · Computer Science 2025-07-11 Ashe Neth , Sawinder kaur , Mohammad Nur Hossain Khan , Subrata Biswas , Asif Salekin , Bashima Islam

Combinatorial Register Allocation and Instruction Scheduling

This paper introduces a combinatorial optimization approach to register allocation and instruction scheduling, two central compiler problems. Combinatorial optimization has the potential to solve these problems optimally and to exploit…

Programming Languages · Computer Science 2019-06-21 Roberto Castañeda Lozano , Mats Carlsson , Gabriel Hjort Blindell , Christian Schulte

Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs

As deep learning models nowadays are widely adopted by both cloud services and edge devices, reducing the latency of deep learning model inferences becomes crucial to provide efficient model serving. However, it is challenging to develop…

Machine Learning · Computer Science 2023-02-16 Yaoyao Ding , Cody Hao Yu , Bojian Zheng , Yizhi Liu , Yida Wang , Gennady Pekhimenko

A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications

The uninterpretability of DNNs has led to the adoption of abstract interpretation-based certification as a practical means to establish trust in real-world systems that rely on DNNs. However, the current landscape supports only a limited…

Computation and Language · Computer Science 2025-07-29 Avaljot Singh , Yamin Chandini Sarita , Aditya Mishra , Ishaan Goyal , Gagandeep Singh , Charith Mendis

Mixed precision in Graphics Processing Unit

Modern graphics computing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In…

Hardware Architecture · Computer Science 2021-10-26 Quentin Gallouédec

Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations

Recent advancements in quantization and mixed-precision approaches offers substantial opportunities to improve the speed and energy efficiency of Neural Networks (NN). Research has shown that individual parameters with varying low…

Hardware Architecture · Computer Science 2024-08-14 Giorgos Armeniakos , Alexis Maras , Sotirios Xydis , Dimitrios Soudris

EN-T: Optimizing Tensor Computing Engines Performance via Encoder-Based Methodology

Tensor computations, with matrix multiplication being the primary operation, serve as the fundamental basis for data analysis, physics, machine learning, and deep learning. As the scale and complexity of data continue to grow rapidly, the…

Hardware Architecture · Computer Science 2024-10-24 Qizhe Wu , Yuchen Gui , Zhichen Zeng , Xiaotian Wang , Huawen Liang , Xi Jin

A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic

Within the past years, hardware vendors have started designing low precision special function units in response to the demand of the Machine Learning community and their demand for high compute power in low precision formats. Also the…

Mathematical Software · Computer Science 2020-07-15 Ahmad Abdelfattah , Hartwig Anzt , Erik G. Boman , Erin Carson , Terry Cojean , Jack Dongarra , Mark Gates , Thomas Grützmacher , Nicholas J. Higham , Sherry Li , Neil Lindquist , Yang Liu , Jennifer Loe , Piotr Luszczek , Pratik Nayak , Sri Pranesh , Siva Rajamanickam , Tobias Ribizel , Barry Smith , Kasia Swirydowicz , Stephen Thomas , Stanimire Tomov , Yaohung M. Tsai , Ichitaro Yamazaki , Urike Meier Yang

MERIT: Tensor Transform for Memory-Efficient Vision Processing on Parallel Architectures

Computationally intensive deep neural networks (DNNs) are well-suited to run on GPUs, but newly developed algorithms usually require the heavily optimized DNN routines to work efficiently, and this problem could be even more difficult for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-12 Yu-Sheng Lin , Wei-Chao Chen , Shao-Yi Chien

Pushing Tensor Accelerators Beyond MatMul in a User-Schedulable Language

Tensor accelerators now represent a growing share of compute resources in modern CPUs and GPUs. However, they are hard to program, leading developers to use vendor-provided kernel libraries that support tensor accelerators. As a result, the…

Programming Languages · Computer Science 2026-02-12 Yihong Zhang , Derek Gerstmann , Andrew Adams , Maaz Bin Safeer Ahmad

UNIT: Unifying Image and Text Recognition in One Vision Encoder

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-09 Yi Zhu , Yanpeng Zhou , Chunwei Wang , Yang Cao , Jianhua Han , Lu Hou , Hang Xu

Acceleration of Tensor-Product Operations with Tensor Cores

In this paper, we explore the acceleration of tensor product operations in finite element methods, leveraging the computational power of the NVIDIA A100 GPU Tensor Cores. We provide an accessible overview of the necessary mathematical…

Mathematical Software · Computer Science 2024-07-16 Cu Cui

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Leon Liangyu Chen , Haoyu Ma , Zhipeng Fan , Ziqi Huang , Animesh Sinha , Xiaoliang Dai , Jialiang Wang , Zecheng He , Jianwei Yang , Chunyuan Li , Junzhe Sun , Chu Wang , Serena Yeung-Levy , Felix Juefei-Xu

Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML…

Hardware Architecture · Computer Science 2024-07-12 Mohammed Elbtity , Peyton Chandarana , Ramtin Zand