Related papers: DVM: A Bytecode Virtual Machine Approach for Dynam…

VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination

With the widening gap between compute and memory operation latencies, data movement optimizations have become increasingly important for DNN compilation. Current optimizations such as layout transformations and operator fusion only target a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-14 Muyan Hu , Ahan Gupta , Jiachen Yuan , Vima Gupta , Taeksang Kim , Xin Xu , Janardhan Kulkarni , Ofer Dekel , Vikram Adve , Charith Mendis

Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization

Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting attention for their ability to handle variable input sizes in real-time applications. However, existing compilation optimization methods for such networks often rely…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-04 Yangjie Zhou , Honglin Zhu , Qian Qiu , Weihao Cui , Zihan Liu , Cong Guo , Siyuan Feng , Jintao Meng , Haidong Lan , Jingwen Leng , Wenxi Zhu , Minwen Deng

DYNAMO: Dynamic Neutral Atom Multi-programming Optimizer Towards Quantum Operating Systems

As quantum computing advances towards practical applications, quantum operating systems become inevitable, where multi-programming -- the core functionality of operating systems -- enables concurrent execution of multiple quantum programs…

Quantum Physics · Physics 2025-07-08 Wenjie Sun , Xiaoyu Li , Zhigang Wang , Geng Chen , Lianhui Yu , Guowu Yang

DISC: A Dynamic Shape Compiler for Machine Learning Workloads

Many recent machine learning models show dynamic shape characteristics. However, existing AI compiler optimization systems suffer a lot from problems brought by dynamic shape models, including compilation overhead, memory usage,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-24 Kai Zhu , Wenyi Zhao , Zhen Zheng , Tianyou Guo , Pengzhan Zhao , Feiwen Zhu , Junjie Bai , Jun Yang , Xiaoyong Liu , Lansong Diao , Wei Lin

DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic…

Computer Vision and Pattern Recognition · Computer Science 2025-05-13 Zhenhailong Wang , Senthil Purushwalkam , Caiming Xiong , Silvio Savarese , Heng Ji , Ran Xu

The Collection Virtual Machine: An Abstraction for Multi-Frontend Multi-Backend Data Analysis

Getting the best performance from the ever-increasing number of hardware platforms has been a recurring challenge for data processing systems. In recent years, the advent of data science with its increasingly numerous and complex types of…

Databases · Computer Science 2020-04-10 Ingo Müller , Renato Marroquín , Dimitrios Koutsoukos , Mike Wawrzoniak , Sabir Akhadov , Gustavo Alonso

Gensor: A Graph-based Construction Tensor Compilation Method for Deep Learning

High-performance deep learning depends on efficient tensor programs. In recent years, automatic tensor program optimization, also known as tensor compilation, has emerged as the primary approach to generating efficient tensor programs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-18 Hangda Liu , Boyu Diao , Yu Yang , Wenxin Chen , Xiaohui Peng , Yongjun Xu

Towards Practical Real-Time Neural Video Compression

We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs depends on 1) computational costs, and 2) non-computational…

Image and Video Processing · Electrical Eng. & Systems 2025-03-19 Zhaoyang Jia , Bin Li , Jiahao Li , Wenxuan Xie , Linfeng Qi , Houqiang Li , Yan Lu

StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs

Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve…

Hardware Architecture · Computer Science 2025-09-24 Hanchen Ye , Deming Chen

Support Vector Machine Implementation on MPI-CUDA and Tensorflow Framework

Support Vector Machine (SVM) algorithm requires a high computational cost (both in memory and time) to solve a complex quadratic programming (QP) optimization problem during the training process. Consequently, SVM necessitates high…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-28 Islam Elgarhy

RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators

Operator fusion, as a key performance optimization technique in the deployment of AI models, significantly improves execution efficiency and has been widely adopted in modern AI compilers. However, for cascaded reduction operations…

Hardware Architecture · Computer Science 2026-03-12 Xinsheng Tang , Yangcheng Li , Nan Wang , Zhiyi Shu , Xingyu Ling , Junna Xing , Peng Zhou , Qiang Liu

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

Deep neural networks (DNNs) are of critical use in different domains. To accelerate DNN computation, tensor compilers are proposed to generate efficient code on different domain-specific accelerators. Existing tensor compilers mainly focus…

Machine Learning · Computer Science 2023-07-12 Zixuan Ma , Haojie Wang , Jingze Xing , Liyan Zheng , Chen Zhang , Huanqi Cao , Kezhao Huang , Shizhi Tang , Penghan Wang , Jidong Zhai

A Dense Tensor Accelerator with Data Exchange Mesh for DNN and Vision Workloads

We propose a dense tensor accelerator called VectorMesh, a scalable, memory-efficient architecture that can support a wide variety of DNN and computer vision workloads. Its building block is a tile execution unit (TEU), which includes…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-29 Yu-Sheng Lin , Wei-Chao Chen , Chia-Lin Yang , Shao-Yi Chien

A Simulator for LLVM Bitcode

In this paper, we introduce an interactive simulator for programs in the form of LLVM bitcode. The main features of the simulator include precise control over thread scheduling, automatic checkpoints and reverse stepping, support for…

Software Engineering · Computer Science 2019-07-10 Petr Ročkai , Jiří Barnat

Unimem: Runtime Data Management on Non-Volatile Memory-based Heterogeneous Main Memory

Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace DRAM as main memory. However, because of relatively high latency and low bandwidth of NVM, NVM is often paired with DRAM to build a heterogeneous memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-03 Kai Wu , Yingchao Huang , Dong Li

Quantum-based Molecular Dynamics Simulations Using Tensor Cores

Tensor cores, along with tensor processing units, represent a new form of hardware acceleration specifically designed for deep neural network calculations in artificial intelligence applications. Tensor cores provide extraordinary…

Computational Physics · Physics 2021-09-14 Joshua Finkelstein , Justin S. Smith , Susan M. Mniszewski , Kipton Barros , Christian F. A. Negre , Emanuel H. Rubensson , Anders M. N. Niklasson

Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures

Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units,…

Hardware Architecture · Computer Science 2025-04-15 Marco Siracusa , Olivia Hsu , Victor Soria-Pardos , Joshua Randall , Arnaud Grasset , Eric Biscondi , Doug Joseph , Randy Allen , Fredrik Kjolstad , Miquel Moretó Planas , Adrià Armejach

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction…

Programming Languages · Computer Science 2026-04-21 Yifan Zhao , Egan Johnson , Prasanth Chatarasi , Vikram Adve , Sasa Misailovic

PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators

Crossbar-based PIM DNN accelerators can provide massively parallel in-situ operations. A specifically designed compiler is important to achieve high performance for a wide variety of DNN workloads. However, some key compilation issues such…

Hardware Architecture · Computer Science 2023-07-06 Xiaotian Sun , Xinyu Wang , Wanqian Li , Lei Wang , Yinhe Han , Xiaoming Chen

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new…

Machine Learning · Computer Science 2018-10-09 Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Yan , Meghan Cowan , Haichen Shen , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , Arvind Krishnamurthy