Related papers: Dynamic GPU Energy Optimization for Machine Learni…
Modern machine learning training is increasingly bottlenecked by data I/O rather than compute. GPUs often sit idle at below 50% utilization waiting for data. This paper presents a machine learning approach to predict I/O performance and…
Deep Learning Training (DLT) is a growing workload in shared GPU/CPU clusters due to its high computational cost and increasing number of jobs. This contributes to significant energy consumption in GPU clusters, further exacerbated by GPU…
Due to their highly parallel multi-cores architecture, GPUs are being increasingly used in a wide range of computationally intensive applications. Compared to CPUs, GPUs can achieve higher performances at accelerating the programs'…
The use of GPUs has proliferated for machine learning workflows and is now considered mainstream for many deep learning models. Meanwhile, when training state-of-the-art personal recommendation models, which consume the highest number of…
Training deep neural networks (DNNs) is a major workload in datacenters today, resulting in a tremendously fast growth of energy consumption. It is important to reduce the energy consumption while completing the DL training jobs early in…
Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In…
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel…
Increased reliance on graphics processing units (GPUs) for high-intensity computing tasks raises challenges regarding energy consumption. To address this issue, dynamic voltage and frequency scaling (DVFS) has emerged as a promising…
Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the…
Over the past years, great progress has been made in improving the computing power of general-purpose graphics processing units (GPGPUs), which facilitates the prosperity of deep neural networks (DNNs) in multiple fields like computer…
The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and…
To raise awareness of the environmental impact of deep learning (DL), many studies estimate the energy use of DL systems. However, energy estimates during DL training often rely on unverified assumptions. This work addresses that gap by…
Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and…
Training on edge devices poses several challenges as these devices are generally resource-constrained, especially in terms of power. State-of-the-art techniques at the device level reduce the GPU frequency to enforce power constraints,…
As AI workloads drive increases in datacenter power consumption, accurate GPU power estimation is critical for proactive power management. However, existing power models face a scalability bottleneck not in the modeling techniques…
The effectiveness and efficiency of machine learning methodologies are crucial, especially with respect to the quality of results and computational cost. This paper discusses different model optimization techniques, providing a…
Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. Neural architecture searches, hyperparameter sweeps, and rapid prototyping consume immense resources that can prevent…
Approximately 18 percent of the 3.2 million smartphone applications rely on integrated graphics processing units (GPUs) to achieve competitive performance. Graphics performance, typically measured in frames per second, is a strong function…
With the rise of AI in recent years and the increase in complexity of the models, the growing demand in computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly…
Modern computing paradigms, such as cloud computing, are increasingly adopting GPUs to boost their computing capabilities primarily due to the heterogeneous nature of AI/ML/deep learning workloads. However, the energy consumption of GPUs is…