Related papers: HeAT -- a Distributed and GPU-accelerated Tensor F…

HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-04 Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jonathan Soifer, Xiaodong Yu, Shuaiwen Leon Song, Yuxiong He, Dingwen Tao

HEATS: Heterogeneity- and Energy-Aware Task-based Scheduling

Cloud providers usually offer diverse types of hardware for their users. Customers exploit this option to deploy cloud instances featuring GPUs, FPGAs, architectures other than x86 (e.g., ARM, IBM Power8), or featuring certain specific…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-28 Isabelly Rocha, Christian Göttel, Pascal Felber, Marcelo Pasin, Romain Rouvoy, Valerio Schiavoni

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-30 Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala

HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks

Data-intensive applications impact many domains, and their steadily increasing size and complexity demands high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-02 Supun Kamburugamuve, Chathura Widanage, Niranda Perera, Vibhatha Abeykoon, Ahmet Uyar, Thejaka Amila Kanewala, Gregor von Laszewski, Geoffrey Fox

HPAT: High Performance Analytics with Scripting Ease-of-Use

Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-12 Ehsan Totoni, Todd A. Anderson, Tatiana Shpeisman

TorchOpt: An Efficient Library for Differentiable Optimization

Recent years have witnessed the booming of various differentiable optimization algorithms. These algorithms exhibit different execution patterns, and their execution needs massive computational resources that go beyond a single CPU and GPU.…

Mathematical Software · Computer Science 2022-11-15 Jie Ren, Xidong Feng, Bo Liu, Xuehai Pan, Yao Fu, Luo Mai, Yaodong Yang

HetSeq: Distributed GPU Training on Heterogeneous Infrastructure

Modern deep learning systems like PyTorch and Tensorflow are able to train enormous models with billions (or trillions) of parameters on a distributed infrastructure. These systems require that the internal nodes have the same memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-01 Yifan Ding, Nicholas Botzer, Tim Weninger

Hot PATE: Private Aggregation of Distributions for Diverse Task

The Private Aggregation of Teacher Ensembles (PATE) framework enables privacy-preserving machine learning by aggregating responses from disjoint subsets of sensitive data. Adaptations of PATE to tasks with inherent output diversity such as…

Machine Learning · Computer Science 2025-10-02 Edith Cohen, Benjamin Cohen-Wang, Xin Lyu, Jelani Nelson, Tamas Sarlos, Uri Stemmer

pyFAST: A Modular PyTorch Framework for Time Series Modeling with Multi-source and Sparse Data

Modern time series analysis demands frameworks that are flexible, efficient, and extensible. However, many existing Python libraries exhibit limitations in modularity and in their native support for irregular, multi-source, or sparse data.…

Machine Learning · Computer Science 2025-08-27 Zhijin Wang, Senzhen Wu, Yue Hu, Xiufeng Liu

Optimal Integrative Estimation for Distributed Precision Matrices with Heterogeneity Adjustment

Distributed learning offers a practical solution for the integrative analysis of multi-source datasets, especially under privacy or communication constraints. However, addressing prospective distributional heterogeneity and ensuring…

Methodology · Statistics 2025-11-27 Yinrui Sun, Yin Xia

A Multi-Layered Distributed Computing Framework for Enhanced Edge Computing

The rise of the Internet of Things and edge computing has shifted computing resources closer to end-users, benefiting numerous delay-sensitive, computation-intensive applications. To speed up computation, distributed computing is a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-10 Ke Ma, Junfei Xie

Easy Acceleration with Distributed Arrays

High level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-21 Jeremy Kepner, Chansup Byun, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel Burrill, Vijay Gadepally, Ryan Haney, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Lauren Milechin, Guillermo Morales, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Peter Michaleas

GPRat: Gaussian Process Regression with Asynchronous Tasks

Python is the de-facto language for software development in artificial intelligence (AI). Commonly used libraries, such as PyTorch and TensorFlow, rely on parallelization built into their BLAS backends to achieve speedup on CPUs. However,…

Machine Learning · Computer Science 2025-05-02 Maksim Helmann, Alexander Strack, Dirk Pflüger

PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backed by high-bandwidth NVMe storage enable…

Databases · Computer Science 2026-05-21 Jigao Luo, Nils Boeschen, Muhammad El-Hindi, Carsten Binnig

CAT: A GPU-Accelerated FHE Framework with Its Application to High-Precision Private Dataset Query

We introduce an open-source GPU-accelerated fully homomorphic encryption (FHE) framework CAT, which surpasses existing solutions in functionality and efficiency. CAT features a three-layer architecture: a foundation of core math, a…

Cryptography and Security · Computer Science 2025-03-31 Qirui Li, Rui Zong

DeepOHeat: Operator Learning-based Ultra-fast Thermal Simulation in 3D-IC Design

Thermal issue is a major concern in 3D integrated circuit (IC) design. Thermal optimization of 3D IC often requires massive expensive PDE simulations. Neural network-based thermal prediction models can perform real-time prediction for many…

Machine Learning · Computer Science 2023-02-28 Ziyue Liu, Yixing Li, Jing Hu, Xinling Yu, Shinyu Shiau, Xin Ai, Zhiyu Zeng, Zheng Zhang

Online Heatmap Generation with Both High and Low Weights

Heatmap is a common geovisualization method that interpolates and visualizes a set of point observations on a map surface. Most of online web mapping libraries implement a one-pass heatmap algorithm using HTML5 canvas or WebGL for efficient…

Graphics · Computer Science 2022-12-16 Yan Y. Liu, Melissa Allen-Dumas

CHOPT : Automated Hyperparameter Optimization Framework for Cloud-Based Machine Learning Platforms

Many hyperparameter optimization (HyperOpt) methods assume restricted computing resources and mainly focus on enhancing performance. Here we propose a novel cloud-based HyperOpt (CHOPT) framework which can efficiently utilize shared…

Machine Learning · Computer Science 2018-10-17 Jinwoong Kim, Minkyu Kim, Heungseok Park, Ernar Kusdavletov, Dongjun Lee, Adrian Kim, Ji-Hoon Kim, Jung-Woo Ha, Nako Sung

TensorNEAT: A GPU-accelerated Library for NeuroEvolution of Augmenting Topologies

The NeuroEvolution of Augmenting Topologies (NEAT) algorithm has received considerable recognition in the field of neuroevolution. Its effectiveness is derived from initiating with simple networks and incrementally evolving both their…

Neural and Evolutionary Computing · Computer Science 2025-04-14 Lishuang Wang, Mengfei Zhao, Enyu Liu, Kebin Sun, Ran Cheng

WgPy: GPU-accelerated NumPy-like array library for web browsers

To execute scientific computing programs such as deep learning at high speed, GPU acceleration is a powerful option. With the recent advancements in web technologies, interfaces like WebGL and WebGPU, which utilize GPUs on the client side…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-04 Masatoshi Hidaka, Tatsuya Harada