English
Related papers

Related papers: Mesh-TensorFlow: Deep Learning for Supercomputers

200 papers

Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-09 Yuan Yu , Martín Abadi , Paul Barham , Eugene Brevdo , Mike Burrows , Andy Davis , Jeff Dean , Sanjay Ghemawat , Tim Harley , Peter Hawkins , Michael Isard , Manjunath Kudlur , Rajat Monga , Derek Murray , Xiaoqiang Zheng

TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of…

Most research on novel techniques for 3D Medical Image Segmentation (MIS) is currently done using Deep Learning with GPU accelerators. The principal challenge of such technique is that a single input can easily cope computing resources, and…

Machine Learning · Computer Science 2021-11-01 Josep Lluis Berral , Oriol Aranda , Juan Luis Dominguez , Jordi Torres

Deep learning applications are drastically progressing in seismic processing and interpretation tasks. However, the majority of approaches subsample data volumes and restrict model sizes to minimise computational requirements. Subsampling…

Geophysics · Physics 2021-02-26 Claire Birnie , Haithem Jarraya , Fredrik Hansteen

Tensor parallelism is an essential technique for distributed training of large neural networks. However, automatically determining an optimal tensor parallel strategy is challenging due to the gigantic search space, which grows…

Machine Learning · Computer Science 2025-08-06 Ziji Shi , Le Jiang , Ang Wang , Jie Zhang , Chencan Wu , Yong Li , Xiaokui Xiao , Wei Lin , Jialin Li

To reduce training time of large-scale DNNs, scientists have started to explore parallelization strategies like data-parallelism, model-parallelism, and hybrid-parallelism. While data-parallelism has been extensively studied and developed,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-20 Ammar Ahmad Awan , Arpan Jain , Quentin Anthony , Hari Subramoni , Dhabaleswar K. Panda

In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution…

Machine Learning · Computer Science 2019-12-03 Alexey Svyatkovskiy , Julian Kates-Harbeck , William Tang

We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator…

Machine Learning · Computer Science 2024-08-20 Yonghao Zhuang , Hexu Zhao , Lianmin Zheng , Zhuohan Li , Eric P. Xing , Qirong Ho , Joseph E. Gonzalez , Ion Stoica , Hao Zhang

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity. Local memory and processing limitations require robust data and model parallelism for crossing…

Machine Learning · Computer Science 2020-06-08 Russell J. Hewett , Thomas J. Grady

Single-Program-Multiple-Data (SPMD) parallelism has recently been adopted to train large deep neural networks (DNNs). Few studies have explored its applicability on heterogeneous clusters, to fully exploit available resources for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-12 Shiwei Zhang , Lansong Diao , Chuan Wu , Zongyan Cao , Siyu Wang , Wei Lin

Deep Learning (DL) algorithms have become the {\em de facto} choice for data analysis. Several DL implementations -- primarily limited to a single compute node -- such as Caffe, TensorFlow, Theano and Torch have become readily available.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-18 Abhinav Vishnu , Joseph Manzano , Charles Siegel , Jeff Daily

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Ji Liu , Zhihua Wu , Dianhai Yu , Yanjun Ma , Danlei Feng , Minxu Zhang , Xinxuan Wu , Xuefeng Yao , Dejing Dou

Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-20 Jinhui Yuan , Xinqi Li , Cheng Cheng , Juncheng Liu , Ran Guo , Shenghang Cai , Chi Yao , Fei Yang , Xiaodong Yi , Chuan Wu , Haoran Zhang , Jie Zhao

The employment of high-performance servers and GPU accelerators for training deep neural network models have greatly accelerated recent advances in deep learning (DL). DL frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-11 Soojeong Kim , Gyeong-In Yu , Hojin Park , Sungwoo Cho , Eunji Jeong , Hyeonmin Ha , Sanha Lee , Joo Seong Jeong , Byung-Gon Chun

Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique…

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous…

Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-10 Kabir Nagrecha

We propose a new integrated method of exploiting model, batch and domain parallelism for the training of deep neural networks (DNNs) on large distributed-memory computers using minibatch stochastic gradient descent (SGD). Our goal is to…

Machine Learning · Computer Science 2018-05-17 Amir Gholami , Ariful Azad , Peter Jin , Kurt Keutzer , Aydin Buluc

The deep neural networks (DNNs) have been enormously successful in tasks that were hitherto in the human-only realm such as image recognition, and language translation. Owing to their success the DNNs are being explored for use in ever more…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-20 Sanket Tavarageri , Srinivas Sridharan , Bharat Kaul
‹ Prev 1 2 3 10 Next ›