Related papers: Mirage: A Multi-Level Superoptimizer for Tensor Pr…

Prism: Symbolic Superoptimization of Tensor Programs

This paper presents Prism, the first symbolic superoptimizer for tensor programs. The key idea is sGraph, a symbolic, hierarchical representation that compactly encodes large classes of tensor programs by symbolically representing some…

Programming Languages · Computer Science 2026-04-17 Mengdi Wu, Xiaoyu Jiang, Oded Padon, Zhihao Jia

Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance megakernel. MPK introduces an SM-level graph representation that…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia

MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition

To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy.…

Computer Vision and Pattern Recognition · Computer Science 2026-03-04 Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Chenchen Liu, Xiang Chen

MERIT: Tensor Transform for Memory-Efficient Vision Processing on Parallel Architectures

Computationally intensive deep neural networks (DNNs) are well-suited to run on GPUs, but newly developed algorithms usually require the heavily optimized DNN routines to work efficiently, and this problem could be even more difficult for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-12 Yu-Sheng Lin, Wei-Chao Chen, Shao-Yi Chien

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Learning from distinctive candidates to optimize reduced-precision convolution program on tensor cores

Convolution is one of the fundamental operations of deep neural networks with demanding matrix computation. In a graphic processing unit (GPU), Tensor Core is a specialized matrix processing hardware equipped with reduced-precision…

Machine Learning · Computer Science 2022-02-25 Junkyeong Choi, Hyucksung Kwon, Woongkyu Lee, Jungwook Choi, Jieun Lim

MIRAGE: An Iterative MapReduce based Frequent Subgraph Mining Algorithm

Frequent subgraph mining (FSM) is an important task for exploratory data analysis on graph data. Over the years, many algorithms have been proposed to solve this task. These algorithms assume that the data structure of the mining task is…

Databases · Computer Science 2013-07-24 Mansurul A Bhuiyan, Mohammad Al Hasan

Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions…

Programming Languages · Computer Science 2018-12-21 Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, Saman Amarasinghe

Tensor Program Optimization with Probabilistic Programs

Automatic optimization for tensor programs becomes increasingly important as we deploy deep learning in various environments, and efficient optimization relies on a rich search space and effective search. Most existing efforts adopt a…

Machine Learning · Computer Science 2022-10-11 Junru Shao, Xiyou Zhou, Siyuan Feng, Bohan Hou, Ruihang Lai, Hongyi Jin, Wuwei Lin, Masahiro Masuda, Cody Hao Yu, Tianqi Chen

MIRGE: An Array-Based Computational Framework for Scientific Computing

MIRGE is a computational approach for scientific computing based on NumPy-like array computation, but using lazy evaluation to recast computation as data-flow graphs, where nodes represent immutable, multi-dimensional arrays. Evaluation of…

Mathematical Software · Computer Science 2025-12-22 Matthias Diener, Matthew J. Smith, Michael T. Campbell, Kaushik Kulkarni, Michael J. Anderson, Andreas Klöckner, William Gropp, Jonathan B. Freund, Luke N. Olson

Twinkle: A GPU-based binary-lens microlensing code with contour integration method

With the rapidly increasing rate of microlensing planet detections, microlensing modeling software faces significant challenges in computation efficiency. Here, we develop the Twinkle code, an efficient and robust binary-lens modeling…

Instrumentation and Methods for Astrophysics · Physics 2025-03-18 Suwei Wang, Lile Wang, Subo Dong

Approximate Multiparametric Mixed-integer Convex Programming

We propose an algorithm for generating explicit solutions of multiparametric mixed-integer convex programs to within a given suboptimality tolerance. The algorithm is applicable to a very general class of optimization problems, but is most…

Optimization and Control · Mathematics 2019-06-12 Danylo Malyuta, Behcet Acikmese

Analyzing GPU Tensor Core Potential for Fast Reductions

The Nvidia GPU architecture has introduced new computing elements such as the tensor cores, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate Deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-12 Roberto Carrasco, Raimundo Vega, Cristóbal A. Navarro

Mirage: An RNS-Based Photonic Accelerator for DNN Training

Photonic computing is a compelling avenue for performing highly efficient matrix multiplication, a crucial operation in Deep Neural Networks (DNNs). While this method has shown great success in DNN inference, meeting the high precision…

Hardware Architecture · Computer Science 2024-08-06 Cansu Demirkiran, Guowei Yang, Darius Bunandar, Ajay Joshi

Optimal Matrix-Mimetic Tensor Algebras via Variable Projection

Recent advances in matrix-mimetic tensor frameworks have made it possible to preserve linear algebraic properties for multilinear data analysis and, as a result, to obtain optimal representations of multiway data. Matrix mimeticity arises…

Numerical Analysis · Mathematics 2024-06-12 Elizabeth Newman, Katherine Keegan

Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution

Sparse tensor algebra is challenging to efficiently parallelize due to the irregular, data-dependent, and potentially skewed structure of sparse computation. We propose the first partitioning algorithm that provably load balances the…

Programming Languages · Computer Science 2026-04-23 Atharva Chougule, Alexander J Root, Rubens Lacouture, Bobby Yan, Rohan Yadav, Fredrik Kjolstad

TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with…

Software Engineering · Computer Science 2025-12-16 Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, Zhiyun Qian

Gensor: A Graph-based Construction Tensor Compilation Method for Deep Learning

High-performance deep learning depends on efficient tensor programs. In recent years, automatic tensor program optimization, also known as tensor compilation, has emerged as the primary approach to generating efficient tensor programs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-18 Hangda Liu, Boyu Diao, Yu Yang, Wenxin Chen, Xiaohui Peng, Yongjun Xu

TensorIR: An Abstraction for Automatic Tensorized Program Optimization

Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration…

Machine Learning · Computer Science 2022-10-31 Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, Tianqi Chen

GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra

We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-01 Maciej Besta, Zur Vonarburg-Shmaria, Yannick Schaffner, Leonardo Schwarz, Grzegorz Kwasniewski, Lukas Gianinazzi, Jakub Beranek, Kacper Janda, Tobias Holenstein, Sebastian Leisinger, Peter Tatkowski, Esref Ozdemir, Adrian Balla, Marcin Copik, Philipp Lindenberger, Pavel Kalvoda, Marek Konieczny, Onur Mutlu, Torsten Hoefler