Related papers: A Distributed Multi-GPU System for Large-Scale Nod…

GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding

Learning continuous representations of nodes is attracting growing interest in both academia and industry recently, due to their simplicity and effectiveness in a variety of applications. Most of existing node embedding algorithms and…

Machine Learning · Computer Science 2019-03-05 Zhaocheng Zhu , Shizhen Xu , Meng Qu , Jian Tang

Large-Scale Stochastic Learning using GPUs

In this work we propose an accelerated stochastic learning system for very large-scale applications. Acceleration is achieved by mapping the training algorithm onto massively parallel processors: we demonstrate a parallel, asynchronous GPU…

Machine Learning · Computer Science 2017-02-24 Thomas Parnell , Celestine Dünner , Kubilay Atasu , Manolis Sifalakis , Haris Pozidis

Scalable Graph Embedding LearningOn A Single GPU

Graph embedding techniques have attracted growing interest since they convert the graph data into continuous and low-dimensional space. Effective graph analytic provides users a deeper understanding of what is behind the data and thus can…

Machine Learning · Computer Science 2022-01-21 Azita Nouri , Philip E. Davis , Pradeep Subedi , Manish Parashar

HUGE: Huge Unsupervised Graph Embeddings with TPUs

Graphs are a representation of structured data that captures the relationships between sets of objects. With the ubiquity of available network data, there is increasing industrial and academic need to quickly analyze graphs with billions of…

Machine Learning · Computer Science 2023-07-28 Brandon Mayer , Anton Tsitsulin , Hendrik Fichtenberger , Jonathan Halcrow , Bryan Perozzi

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-21 Shaohuai Shi , Xianhao Zhou , Shutao Song , Xingyao Wang , Zilin Zhu , Xue Huang , Xinan Jiang , Feihu Zhou , Zhenyu Guo , Liqiang Xie , Rui Lan , Xianbin Ouyang , Yan Zhang , Jieqian Wei , Jing Gong , Weiliang Lin , Ping Gao , Peng Meng , Xiaomin Xu , Chenyang Guo , Bo Yang , Zhibo Chen , Yongjian Wu , Xiaowen Chu

Accurate, Efficient and Scalable Graph Embedding

The Graph Convolutional Network (GCN) model and its variants are powerful graph embedding tools for facilitating classification and clustering on graphs. However, a major challenge is to reduce the complexity of layered GCNs and make them…

Machine Learning · Computer Science 2020-08-06 Hanqing Zeng , Hongkuan Zhou , Ajitesh Srivastava , Rajgopal Kannan , Viktor Prasanna

Parallel Computation of Graph Embeddings

Graph embedding aims at learning a vector-based representation of vertices that incorporates the structure of the graph. This representation then enables inference of graph properties. Existing graph embedding techniques, however, do not…

Machine Learning · Computer Science 2019-09-09 Chi Thang Duong , Hongzhi Yin , Thanh Dat Hoang , Truong Giang Le Ba , Matthias Weidlich , Quoc Viet Hung Nguyen , Karl Aberer

Scalable Hash Table for NUMA Systems

Hash tables are used in a plethora of applications, including database operations, DNA sequencing, string searching, and many more. As such, there are many parallelized hash tables targeting multicore, distributed, and accelerator-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-05 Alok Tripathy , Oded Green

Heterogeneous FPGA+GPU Embedded Systems: Challenges and Opportunities

The edge computing paradigm has emerged to handle cloud computing issues such as scalability, security and low response time among others. This new computing trend heavily relies on ubiquitous embedded systems on the edge. Performance and…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-28 Mohammad Hosseinabady , Mohd Amiruddin Bin Zainol , Jose Nunez-Yanez

HyScale-GNN: A Scalable Hybrid GNN Training System on Single-Node Heterogeneous Architecture

Graph Neural Networks (GNNs) have shown success in many real-world applications that involve graph-structured data. Most of the existing single-node GNN training systems are capable of training medium-scale graphs with tens of millions of…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-02 Yi-Chien Lin , Viktor Prasanna

Decentralized Diffusion Models

Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up…

Computer Vision and Pattern Recognition · Computer Science 2025-01-13 David McAllister , Matthew Tancik , Jiaming Song , Angjoo Kanazawa

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Muyang Li , Tianle Cai , Jiaxin Cao , Qinsheng Zhang , Han Cai , Junjie Bai , Yangqing Jia , Ming-Yu Liu , Kai Li , Song Han

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Training large-scale models relies on a vast number of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25000 A100 GPUs . It is a challenge to build a large-scale cluster with one type of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-12 Si Xu , Zixiao Huang , Yan Zeng , Shengen Yan , Xuefei Ning , Quanlu Zhang , Haolin Ye , Sipei Gu , Chunsheng Shui , Zhezheng Lin , Hao Zhang , Sheng Wang , Guohao Dai , Yu Wang

SPEED: Streaming Partition and Parallel Acceleration for Temporal Interaction Graph Embedding

Temporal Interaction Graphs (TIGs) are widely employed to model intricate real-world systems such as financial systems and social networks. To capture the dynamism and interdependencies of nodes, existing TIG embedding models need to…

Machine Learning · Computer Science 2023-09-12 Xi Chen , Yongxiang Liao , Yun Xiong , Yao Zhang , Siwei Zhang , Jiawei Zhang , Yiheng Sun

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and…

Machine Learning · Computer Science 2026-05-22 Jiachang Liu , Andrea Lodi

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Graph embeddings map graph nodes to continuous vectors and are foundational to community detection, recommendation, and many scientific applications. At billion-scale, however, existing graph embedding systems face a trade-off: they either…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-13 Zhonggen Li , Xiangyu Ke , Yifan Zhu , Yunjun Gao , Feifei Li

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan , Mohammad Shoeybi , Jared Casper , Patrick LeGresley , Mostofa Patwary , Vijay Anand Korthikanti , Dmitri Vainbrand , Prethvi Kashinkunti , Julie Bernauer , Bryan Catanzaro , Amar Phanishayee , Matei Zaharia

Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters

In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution…

Machine Learning · Computer Science 2019-12-03 Alexey Svyatkovskiy , Julian Kates-Harbeck , William Tang

Graph Embeddings at Scale

Graph embedding is a popular algorithmic approach for creating vector representations for individual vertices in networks. Training these algorithms at scale is important for creating embeddings that can be used for classification, ranking,…

Machine Learning · Computer Science 2019-07-04 C. Bayan Bruss , Anish Khazane , Jonathan Rider , Richard Serpe , Saurabh Nagrecha , Keegan E. Hines

Scalable Edge Partitioning

Edge-centric distributed computations have appeared as a recent technique to improve the shortcomings of think-like-a-vertex algorithms on large scale-free networks. In order to increase parallelism on this model, edge partitioning -…

Data Structures and Algorithms · Computer Science 2018-10-12 Sebastian Schlag , Christian Schulz , Daniel Seemaier , Darren Strash