Related papers: DELTA: Dynamically Optimizing GPU Memory beyond Te…

200 papers

Recently, deep learning has been an area of intense research. However, as a kind of computing-intensive task, deep learning highly relies on the scale of GPU memory, which is usually prohibitive and scarce. Although some extensive works…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-08 Kaixin Zhang, Hongzhi Wang, Han Hu, Songling Zou, Jiye Qiu, Tongxin Li, Zhishun Wang

Going deeper and wider in neural architectures improves the accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-17 Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, Tim Kraska

Specialized hardware accelerators have been extensively used for Deep Neural Networks (DNNs) to provide power/performance benefits. These accelerators contain specialized hardware that supports DNN operators, and scratchpad memory for…

Machine Learning · Computer Science 2023-12-01 Yi Li, Aarti Gupta, Sharad Malik

GPU (graphics processing unit) has been used for many data-intensive applications. Among them, deep learning systems are one of the most important consumer systems for GPU nowadays. As deep learning applications impose deeper and larger…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-18 Junzhe Zhang, Sai Ho Yeung, Yao Shu, Bingsheng He, Wei Wang

The increasing demand for on-device training of deep neural networks (DNNs) aims to leverage personal data for high-performance applications while addressing privacy concerns and reducing communication latency. However, resource-constrained…

Hardware Architecture · Computer Science 2026-03-31 Jinming Lu, Jiayi Tian, Hai Li, Ian Young, Zheng Zhang

Serving deep neural networks in latency-critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference result in poor GPU utilization, a potential performance gap which GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-03 Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehan Sohail Durrani, Alexey Tumanov, Joseph Gonzalez, Ion Stoica

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

We introduce the CUDA Tensor Transpose (cuTT) library that implements high-performance tensor transposes for NVIDIA GPUs with Kepler and above architectures. cuTT achieves high performance by (a) utilizing two GPU-optimized transpose…

Mathematical Software · Computer Science 2017-05-05 Antti-Pekka Hynninen, Dmitry I. Lyakh

In modern Deep Learning, it has been a trend to design larger Deep Neural Networks (DNNs) for the execution of more complex tasks and better accuracy. On the other hand, Convolutional Neural Networks (CNNs) have become the standard method…

Machine Learning · Computer Science 2025-02-19 Ding-Yong Hong, Tzu-Hsien Tsai, Ning Wang, Pangfeng Liu, Jan-Jan Wu

Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-06 Thomas Bouvier, Bogdan Nicolae, Hugo Chaugier, Alexandru Costan, Ian Foster, Gabriel Antoniu

Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they…

Machine Learning · Computer Science 2022-12-22 Manuela Schuler, Richard Membarth, Philipp Slusallek

The size of deep neural networks has grown exponentially in recent years. Unfortunately, hardware devices have not kept pace with the rapidly increasing memory requirements. To cope with this, researchers have turned to techniques such as…

Machine Learning · Computer Science 2022-11-04 Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, James Hegarty

Many neural networks exhibit stability in their activation patterns over time in response to inputs from sensors operating under real-world conditions. By capitalizing on this property of natural signals, we propose a Recurrent Neural…

Neural and Evolutionary Computing · Computer Science 2016-12-19 Daniel Neil, Jun Haeng Lee, Tobi Delbruck, Shih-Chii Liu

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler…

Performance · Computer Science 2025-11-11 Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire…

Computation and Language · Computer Science 2026-05-05 Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Efficient radio packet scheduling remains one of the most challenging tasks in cellular networks, and while heuristic methods exist, practical deep learning-based schedulers that are 3GPP-compliant and capable of real-time operation in 5G…

Signal Processing · Electrical Eng. & Systems 2025-10-10 Petteri Kela, Bryan Liu, Alvaro Valcarce

Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. Especially, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-28 Sangkug Lym, Donghyuk Lee, Mike O'Connor, Niladrish Chatterjee, Mattan Erez

Transfer learning through fine-tuning a pre-trained neural network with an extremely large dataset, such as ImageNet, can significantly accelerate training while the accuracy is frequently bottlenecked by the limited dataset size of the new…

Machine Learning · Computer Science 2020-05-14 Xingjian Li, Haoyi Xiong, Hanchao Wang, Yuxuan Rao, Liping Liu, Zeyu Chen, Jun Huan

Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and…

Machine Learning · Computer Science 2021-05-06 Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, Yakun Sophia Shao