Related papers: DELTA: Dynamically Optimizing GPU Memory beyond Te…

TENSILE: A Tensor granularity dynamic GPU memory scheduling method toward multiple dynamic workloads system

Recently, deep learning has been an area of intense research. However, as a kind of computing-intensive task, deep learning highly relies on the scale of GPU memory, which is usually prohibitive and scarce. Although some extensive works…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-08 Kaixin Zhang, Hongzhi Wang, Han Hu, Songling Zou, Jiye Qiu, Tongxin Li, Zhishun Wang

SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

Going deeper and wider in neural architectures improves the accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-17 Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, Tim Kraska

Combined Scheduling, Memory Allocation and Tensor Replacement for Minimizing Off-Chip Data Accesses of DNN Accelerators

Specialized hardware accelerators have been extensively used for Deep Neural Networks (DNNs) to provide power/performance benefits. These accelerators contain specialized hardware that supports DNN operators, and scratchpad memory for…

Machine Learning · Computer Science 2023-12-01 Yi Li, Aarti Gupta, Sharad Malik

Efficient Memory Management for GPU-based Deep Learning Systems

GPU (graphics processing unit) has been used for many data-intensive applications. Among them, deep learning systems are one of the most important consumer systems for GPU nowadays. As deep learning applications impose deeper and larger…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-18 Junzhe Zhang, Sai Ho Yeung, Yao Shu, Bingsheng He, Wei Wang

FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training

The increasing demand for on-device training of deep neural networks (DNNs) aims to leverage personal data for high-performance applications while addressing privacy concerns and reducing communication latency. However, resource-constrained…

Hardware Architecture · Computer Science 2026-03-31 Jinming Lu, Jiayi Tian, Hai Li, Ian Young, Zheng Zhang

Dynamic Space-Time Scheduling for GPU Inference

Serving deep neural networks in latency-critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference result in poor GPU utilization, a potential performance gap which GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-03 Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehan Sohail Durrani, Alexey Tumanov, Joseph Gonzalez, Ion Stoica

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs

We introduce the CUDA Tensor Transpose (cuTT) library that implements high-performance tensor transposes for NVIDIA GPUs with Kepler and above architectures. cuTT achieves high performance by (a) utilizing two GPU-optimized transpose…

Mathematical Software · Computer Science 2017-05-05 Antti-Pekka Hynninen, Dmitry I. Lyakh

GPU Memory Usage Optimization for Backward Propagation in Deep Network Training

In modern Deep Learning, it has been a trend to design larger Deep Neural Networks (DNNs) for the execution of more complex tasks and better accuracy. On the other hand, Convolutional Neural Networks (CNNs) have become the standard method…

Machine Learning · Computer Science 2025-02-19 Ding-Yong Hong, Tzu-Hsien Tsai, Ning Wang, Pangfeng Liu, Jan-Jan Wu

Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers

Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-06 Thomas Bouvier, Bogdan Nicolae, Hugo Chaugier, Alexandru Costan, Ian Foster, Gabriel Antoniu

XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments

Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they…

Machine Learning · Computer Science 2022-12-22 Manuela Schuler, Richard Membarth, Philipp Slusallek

OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks

The size of deep neural networks has grown exponentially in recent years. Unfortunately, hardware devices have not kept pace with the rapidly increasing memory requirements. To cope with this, researchers have turned to techniques such as…

Machine Learning · Computer Science 2022-11-04 Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, James Hegarty

Delta Networks for Optimized Recurrent Network Computation

Many neural networks exhibit stability in their activation patterns over time in response to inputs from sensors operating under real-world conditions. By capitalizing on this property of natural signals, we propose a Recurrent Neural…

Neural and Evolutionary Computing · Computer Science 2016-12-19 Daniel Neil, Jun Haeng Lee, Tobi Delbruck, Shih-Chii Liu

GPU Cluster Scheduling for Network-Sensitive Deep Learning

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler…

Performance · Computer Science 2025-11-11 Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire…

Computation and Language · Computer Science 2026-05-05 Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

From Simulation to Practice: Generalizable Deep Reinforcement Learning for Cellular Schedulers

Efficient radio packet scheduling remains one of the most challenging tasks in cellular networks, and while heuristic methods exist, practical deep learning-based schedulers that are 3GPP-compliant and capable of real-time operation in 5G…

Signal Processing · Electrical Eng. & Systems 2025-10-10 Petteri Kela, Bryan Liu, Alvaro Valcarce

DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis

Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. Especially, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-28 Sangkug Lym, Donghyuk Lee, Mike O'Connor, Niladrish Chatterjee, Mattan Erez

DELTA: DEep Learning Transfer using Feature Map with Attention for Convolutional Networks

Transfer learning through fine-tuning a pre-trained neural network with an extremely large dataset, such as ImageNet, can significantly accelerate training while the accuracy is frequently bottlenecked by the limited dataset size of the new…

Machine Learning · Computer Science 2020-05-14 Xingjian Li, Haoyi Xiong, Hanchao Wang, Yuxuan Rao, Liping Liu, Zeyu Chen, Jun Huan

CoSA: Scheduling by Constrained Optimization for Spatial Accelerators

Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and…

Machine Learning · Computer Science 2021-05-06 Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, Yakun Sophia Shao