Related papers: Auto-tuning TensorFlow Threading Model for CPU Bac…

Automatic Tuning of Tensorflow's CPU Backend using Gradient-Free Optimization Algorithms

Modern deep learning (DL) applications are built using DL libraries and frameworks such as TensorFlow and PyTorch. These frameworks have complex parameters and tuning them to obtain good training and inference performance is challenging for…

Machine Learning · Computer Science 2021-09-15 Derssie Mebratu , Niranjan Hasabnis , Pietro Mercati , Gaurit Sharma , Shamima Najnin

StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs

Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve…

Hardware Architecture · Computer Science 2025-09-24 Hanchen Ye , Deming Chen

Exploiting Parallelism Opportunities with Deep Learning Frameworks

State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using…

Machine Learning · Computer Science 2020-07-01 Yu Emma Wang , Carole-Jean Wu , Xiaodong Wang , Kim Hazelwood , David Brooks

TensorFlow: A system for large-scale machine learning

TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-01 Martín Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard , Manjunath Kudlur , Josh Levenberg , Rajat Monga , Sherry Moore , Derek G. Murray , Benoit Steiner , Paul Tucker , Vijay Vasudevan , Pete Warden , Martin Wicke , Yuan Yu , Xiaoqiang Zheng

TensorFlow Doing HPC

TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML)…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-03 Steven W. D. Chien , Stefano Markidis , Vyacheslav Olshevsky , Yaroslav Bulatov , Erwin Laure , Jeffrey S. Vetter

Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models

GPU kernels have come to the forefront of computing due to their utility in varied fields, from high-performance computing to machine learning. A typical GPU compute kernel is invoked millions, if not billions of times in a typical…

Machine Learning · Computer Science 2024-04-18 Khawir Mahmood , Jehandad Khan , Hammad Afzal

Characterizing Deep-Learning I/O Workloads in TensorFlow

The performance of Deep-Learning (DL) computing frameworks rely on the performance of data ingestion and checkpointing. In fact, during the training, a considerable high number of relatively small files are first loaded and pre-processed on…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-10 Steven W. D. Chien , Stefano Markidis , Chaitanya Prasad Sishtla , Luis Santos , Pawel Herman , Sai Narasimhamurthy , Erwin Laure

TLP: A Deep Learning-based Cost Model for Tensor Program Tuning

Tensor program tuning is a non-convex objective optimization problem, to which search-based approaches have proven to be effective. At the core of the search-based approaches lies the design of the cost model. Though deep learning-based…

Machine Learning · Computer Science 2022-11-23 Yi Zhai , Yu Zhang , Shuo Liu , Xiaomeng Chu , Jie Peng , Jianmin Ji , Yanyong Zhang

Machine Learning-driven Autotuning of Graphics Processing Unit Accelerated Computational Fluid Dynamics for Enhanced Performance

Optimizing the performance of computational fluid dynamics (CFD) applications accelerated by graphics processing units (GPUs) is crucial for efficient simulations. In this study, we employed a machine learning-based autotuning technique to…

Performance · Computer Science 2024-02-21 Weicheng Xue , Christohper John Roy

tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads

Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-05 Steven W. D. Chien , Artur Podobas , Ivy B. Peng , Stefano Markidis

Improving the Expressiveness of Deep Learning Frameworks with Recursion

Recursive neural networks have widely been used by researchers to handle applications with recursively or hierarchically structured data. However, embedded control flow deep learning frameworks such as TensorFlow, Theano, Caffe2, and MXNet…

Machine Learning · Computer Science 2018-09-05 Eunji Jeong , Joo Seong Jeong , Soojeong Kim , Gyeong-In Yu , Byung-Gon Chun

Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training

Training neural network often uses a machine learning framework such as TensorFlow and Caffe2. These frameworks employ a dataflow model where the NN training is modeled as a directed graph composed of a set of nodes. Operations in neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-20 Jiawen Liu , Dong Li , Gokcen Kestor , Jeffrey Vetter

A Learned Performance Model for Tensor Processing Units

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for…

Performance · Computer Science 2021-03-19 Samuel J. Kaufman , Phitchaya Mangpo Phothilimthana , Yanqi Zhou , Charith Mendis , Sudip Roy , Amit Sabne , Mike Burrows

Tensor Slicing and Optimization for Multicore NPUs

Although code generation for Convolution Neural Network (CNN) models has been extensively studied, performing efficient data slicing and parallelization for highly-constrai\-ned Multicore Neural Processor Units (NPUs) is still a challenging…

Performance · Computer Science 2023-04-07 Rafael Sousa , Marcio Pereira , Yongin Kwon , Taeho Kim , Namsoon Jung , Chang Soo Kim , Michael Frank , Guido Araujo

FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers

Many artificial intelligence models process input data of different lengths and resolutions, making the shape of the tensors dynamic. The performance of these models depends on the shape of the tensors, which makes it difficult to optimize…

Machine Learning · Computer Science 2024-08-01 Pengyu Mu , Linquan Wei , Yi Liu , Rui Wang

TensorSocket: Shared Data Loading for Deep Learning Training

Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on a set of parameters (e.g., hyper-parameter tuning) and model architecture (e.g., neural architecture…

Machine Learning · Computer Science 2025-08-04 Ties Robroek , Neil Kim Nielsen , Pınar Tözün

Towards Self-Tuning Parameter Servers

Recent years, many applications have been driven advances by the use of Machine Learning (ML). Nowadays, it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data,…

Databases · Computer Science 2020-08-05 Chris Liu , Pengfei Zhang , Bo Tang , Hang Shen , Lei Zhu , Ziliang Lai , Eric Lo

Transfer-Tuning: Reusing Auto-Schedules for Efficient Tensor Program Code Generation

Auto-scheduling for tensor programs is a process where a search algorithm automatically explores candidate schedules (program transformations) for a given program on a target hardware platform to improve its performance. However this can be…

Machine Learning · Computer Science 2022-09-08 Perry Gibson , José Cano

A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

Autotuning of performance-relevant source-code parameters allows to automatically tune applications without hard coding optimizations and thus helps with keeping the performance portable. In this paper, we introduce a benchmark set of ten…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-02 Filip Petrovič , David Střelák , Jana Hozzová , Jaroslav Oľha , Richard Trembecký , Siegfried Benkner , Jiří Filipovič

Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences

Remote procedure call (RPC) is the backbone of many modern distributed systems. Google's gRPC is one of the most popular open source RPC frameworks available in the community. gRPC is the main communication engine for Google's Deep Learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-05 Rajarshi Biswas , Xiaoyi Lu , Dhabaleswar K. Panda