Related papers: A Learned Performance Model for Tensor Processing …
Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the…
We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…
GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…
Many artificial intelligence (AI) devices have been developed to accelerate the training and inference of neural networks models. The most common ones are the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU). They are highly…
We propose a generic algorithmic building block to accelerate training of machine learning models on heterogeneous compute systems. Our scheme allows to efficiently employ compute accelerators such as GPUs and FPGAs for the training of…
Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML…
Tensor Processing Units (TPUs) are specialized hardware accelerators for deep learning developed by Google. This paper aims to explore TPUs in cloud and edge computing focusing on its applications in AI. We provide an overview of TPUs,…
High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging.…
The use of GPUs has proliferated for machine learning workflows and is now considered mainstream for many deep learning models. Meanwhile, when training state-of-the-art personal recommendation models, which consume the highest number of…
Machine learning-based performance models are increasingly being used to build critical job scheduling and application optimization decisions. Traditionally, these models assume that data distribution does not change as more samples are…
To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…
The deployment of convolutional neural networks is often hindered by high computational and storage requirements. Structured model pruning is a promising approach to alleviate these requirements. Using the VGG-16 model as an example, we…
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL…
Optimizing the performance of computational fluid dynamics (CFD) applications accelerated by graphics processing units (GPUs) is crucial for efficient simulations. In this study, we employed a machine learning-based autotuning technique to…
Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the…
The increasing complexity and scale of Deep Neural Networks (DNNs) necessitate specialized tensor accelerators, such as Tensor Processing Units (TPUs), to meet various computational and energy efficiency requirements. Nevertheless,…
Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…
Deep learning models are yielding increasingly better performances thanks to multiple factors. To be successful, model may have large number of parameters or complex architectures and be trained on large dataset. This leads to large…
Performance models can be very useful for understanding the behavior of applications and hence can help guide design and optimization decisions. Unfortunately, performance modeling of nontrivial computations typically requires significant…
Language models are now prevalent in software engineering with many developers using them to automate tasks and accelerate their development. While language models have been tremendous at accomplishing complex software engineering tasks,…