Related papers: SparsePipe: Parallel Deep Learning for 3D Point Cl…

PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism

With the rapid innovation of GPUs, heterogeneous GPU clusters in both public clouds and on-premise data centers have become increasingly commonplace. In this paper, we demonstrate how pipeline parallelism, a technique well-studied for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-28 · Z. Jonny Kong, Qiang Xu, Y. Charlie Hu

Pipeline Parallelism for Inference on Heterogeneous Edge Computing

Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision (CV) and natural language processing (NLP). However, these large-scale models are too compute- or memory-intensive for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-10-29 · Yang Hu, Connor Imes, Xuanang Zhao, Souvik Kundu, Peter A. Beerel, Stephen P. Crago, John Paul N. Walters

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-03 · Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-08-07 · Xin Zhang, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang, Chunzhi Yang

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

With the increasing scale of models, the need for efficient distributed training has become increasingly urgent. Recently, many synchronous pipeline parallelism approaches have been proposed to improve training throughput. However, these…

Machine Learning · Computer Science · 2024-10-28 · Houming Wu, Ling Chen, Wenjie Yu

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to use multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU…

Machine Learning · Computer Science · 2020-11-10 · Lei Guan, Wotao Yin, Dongsheng Li, Xicheng Lu

3DPipe: A Pipelined GPU Framework for Scalable Generalized Spatial Join over Polyhedral Objects

Spatial join is a fundamental operation in spatial databases. With the rapid growth of 3D data in applications such as LiDAR-based object detection and 3D digital pathology, there is an increasing need to support spatial join over 3D…

Databases · Computer Science · 2026-04-23 · Lyuheng Yuan, Da Yan, Akhlaque Ahmad, Fusheng Wang

GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism

Communication is a key bottleneck for distributed graph neural network (GNN) training. This paper proposes GNNPipe, a new approach that scales the distributed full-graph deep GNN training. Being the first to use layer-level model…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-09-26 · Jingji Chen, Zhuoming Chen, Xuehai Qian

Optimizing Sparse Convolution on GPUs with CUDA for 3D Point Cloud Processing in Embedded Systems

In recent years, there has been a significant increase in the utilization of deep learning methods, particularly convolutional neural networks (CNNs), which have emerged as the dominant approach in various domains that involve structured…

Machine Learning · Computer Science · 2024-04-09 · Chester Luo, Kevin Lai

SPLATNet: Sparse Lattice Networks for Point Cloud Processing

We present a network architecture for processing point clouds that directly operates on a collection of points represented as a sparse set of samples in a high-dimensional lattice. Naively applying convolutions on this lattice scales…

Computer Vision and Pattern Recognition · Computer Science · 2018-05-10 · Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, Jan Kautz

DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines

Diffusion models have emerged as dominant performers for image generation. To support training large diffusion models, this paper studies pipeline parallel training of diffusion models and proposes DiffusionPipe, a synchronous pipeline…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-03 · Ye Tian, Zhen Jia, Ziyue Luo, Yida Wang, Chuan Wu

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Deep Neural Network (DNN) models have continuously been growing in size in order to improve the accuracy and quality of the models. Moreover, for training of large DNN models, the use of heterogeneous GPUs is inevitable due to the short…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-05-29 · Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, Young-ri Choi

Sparse Point Clouds Assisted Learned Image Compression

In the field of autonomous driving, a variety of sensor data types exist, each representing different modalities of the same scene. Therefore, it is feasible to utilize data from other sensors to facilitate image compression. However, few…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-23 · Yiheng Jiang, Haotian Zhang, Li Li, Dong Liu, Zhu Li

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-01-15 · Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu, Chio-in Ieong, Xi Jin

SSPU-Net: Self-Supervised Point Cloud Upsampling via Differentiable Rendering

Point clouds obtained from 3D sensors are usually sparse. Existing methods mainly focus on upsampling sparse point clouds in a supervised manner by using dense ground truth point clouds. In this paper, we propose a self-supervised point…

Computer Vision and Pattern Recognition · Computer Science · 2021-08-04 · Yifan Zhao, Le Hui, Jin Xie

PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformers (DiTs) models. PipeFusion partitions images into patches and…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-05 · Jiarui Fang, Jinzhe Pan, Aoyu Li, Xibo Sun, Jiannan Wang

Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity

Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracies, sparse models often carry randomly-distributed weights, leading to irregular computations. Consequently, sparse…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-09-01 · Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, Yuhao Zhu

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-09 · Zhida Jiang, Zhaolong Xing, Huichao Chai, Tianxing Sun, Qiang Peng, Baopeng Yuan, Jiaxing Wang, Hua Du, Zhixin Wu, Xuemiao Li, Yikui Cao, Xinyu Liu, Yongxiang Feng, Zhen Chen, Ke Zhang

An Efficient FPGA Accelerator for Point Cloud

Deep learning-based point cloud processing plays an important role in various vision tasks, such as autonomous driving, virtual reality (VR), and augmented reality (AR). The submanifold sparse convolutional network (SSCN) has been widely…

Signal Processing · Electrical Eng. & Systems · 2022-10-17 · Zilun Wang, Wendong Mao, Peixiang Yang, Zhongfeng Wang, Jun Lin

SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception

Multi-modal 3D object detection has exhibited significant progress in recent years. However, most existing methods can hardly scale to long-range scenarios due to their reliance on dense 3D features, which substantially escalate…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-18 · Yiheng Li, Hongyang Li, Zehao Huang, Hong Chang, Naiyan Wang