English
Related papers

Related papers: Exploiting Parallelism Opportunities with Deep Lea…

200 papers

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science 2022-07-04 Daniel Nichols , Siddharth Singh , Shu-Huai Lin , Abhinav Bhatele

With the rapid adoption of large language models (LLMs) in recommendation systems, the computational and communication bottlenecks caused by their massive parameter sizes and large data volumes have become increasingly prominent. This paper…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-25 Haowei Yang , Yu Tian , Zhongheng Yang , Zhao Wang , Chengrui Zhou , Dannier Li

According to the increasing complexity of network application and internet traffic, network processor as a subset of embedded processors have to process more computation intensive tasks. By scaling down the feature size and emersion of chip…

Hardware Architecture · Computer Science 2012-04-13 Mehdi Alipour , Hojjat Taghdisi

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-07 Xin Zhang , Quanyu Zhu , Liangbei Xu , Zain Huda , Wang Zhou , Jin Fang , Dennis van der Staay , Yuxi Hu , Jade Nie , Jiyan Yang , Chunzhi Yang

The computational requirements for training deep neural networks (DNNs) have grown to the point that it is now standard practice to parallelize training. Existing deep learning systems commonly use data or model parallelism, but…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-23 Zhihao Jia , Matei Zaharia , Alex Aiken

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-22 Seokjin Go , Joongun Park , Spandan More , Hanjiang Wu , Irene Wang , Aaron Jezghani , Tushar Krishna , Divya Mahajan

In this paper we analyze, evaluate, and improve the performance of training generalized linear models on modern CPUs. We start with a state-of-the-art asynchronous parallel training algorithm, identify system-level performance bottlenecks,…

Machine Learning · Computer Science 2018-12-20 Nikolas Ioannou , Celestine Dünner , Kornilios Kourtis , Thomas Parnell

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington , Leo Phan , Chris Pierre Paul , Evan Shoemaker , Priyanka Ranade , Torstein Collett , Grant Hodgson Perez , Christopher Krieger

Big data powered Deep Learning (DL) and its applications have blossomed in recent years, fueled by three technological trends: a large amount of digitized data openly accessible, a growing number of DL software frameworks in open source and…

Performance · Computer Science 2019-08-20 Yanzhao Wu , Ling Liu , Calton Pu , Wenqi Cao , Semih Sahin , Wenqi Wei , Qi Zhang

The transition from standard generative AI to \emph{reasoning-centric architectures}, exemplified by models capable of extensive Chain-of-Thought~(CoT) processing, marks a fundamental paradigm shift in system requirements. Unlike…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-20 Moiz Arif , Avinash Maurya , Sudharshan Vazhkudai , Bogdan Nicolae

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen , Lianmin Zheng , Eddie Yan , Ziheng Jiang , Thierry Moreau , Luis Ceze , Carlos Guestrin , Arvind Krishnamurthy

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

We present a novel characterization of the mapping of multiple parallelism forms (e.g. data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mapping.…

Programming Languages · Computer Science 2021-11-17 Ningning Xie , Tamara Norman , Dominik Grewe , Dimitrios Vytiniotis

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She , Bowen Pang , Kai Li , Zehua Liu , Tao Zhong

While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the highest possible extent demands advanced knowledge of various hardware architectures and parallel programming models.…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-03 Suejb Memeti , Sabri Pllana , Alecio Binotto , Joanna Kolodziej , Ivona Brandic

Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-05 Steven W. D. Chien , Artur Podobas , Ivy B. Peng , Stefano Markidis

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-10 Peng Zhang , Jianbin Fang , Canqun Yang , Chun Huang , Tao Tang , Zheng Wang

The employment of high-performance servers and GPU accelerators for training deep neural network models have greatly accelerated recent advances in deep learning (DL). DL frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-11 Soojeong Kim , Gyeong-In Yu , Hojin Park , Sungwoo Cho , Eunji Jeong , Hyeonmin Ha , Sanha Lee , Joo Seong Jeong , Byung-Gon Chun

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal , Eiman Ebrahimi , Arslan Zulfiqar , Yaosheng Fu , Victor Zhang , Szymon Migacz , David Nellans , Puneet Gupta

While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-01 Quentin Anthony , Jacob Hatef , Deepak Narayanan , Stella Biderman , Stas Bekman , Junqi Yin , Aamir Shafi , Hari Subramoni , Dhabaleswar Panda
‹ Prev 1 2 3 10 Next ›