Related papers: Locality Optimization for Data Parallel Programs

Parallel Training of Deep Networks with Local Updates

Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count so have the compute budgets and times…

Machine Learning · Computer Science 2021-06-16 Michael Laskin , Luke Metz , Seth Nabarro , Mark Saroufim , Badreddine Noune , Carlo Luschi , Jascha Sohl-Dickstein , Pieter Abbeel

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-19 Stephen Mell , David Mell , Konstantinos Kallas , Steve Zdancewic , Osbert Bastani

Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications

Applications with low data reuse and frequent irregular memory accesses, such as graph or sparse linear algebra workloads, fail to scale well due to memory bottlenecks and poor core utilization. While prior work with prefetching,…

Hardware Architecture · Computer Science 2023-05-05 Marcelo Orenes-Vera , Esin Tureci , David Wentzlaff , Margaret Martonosi

Efficient Tree-Traversals: Reconciling Parallelism and Dense Data Representations

Recent work showed that compiling functional programs to use dense, serialized memory representations for recursive algebraic datatypes can yield significant constant-factor speedups for sequential programs. But serializing data in a…

Programming Languages · Computer Science 2021-07-02 Chaitanya Koparkar , Mike Rainey , Michael Vollmer , Milind Kulkarni , Ryan R. Newton

Parallel Information Algorithm with Local Tuning for Solving Multidimensional GO Problems

In this paper we propose a new parallel algorithm for solving global optimization (GO) multidimensional problems. The method unifies two powerful approaches for accelerating the search: parallel computations and local tuning on the behavior…

Optimization and Control · Mathematics 2011-03-31 Yaroslav D. Sergeyev

Generating Configurable Hardware from Parallel Patterns

In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising as they offer significant performance and energy improvements…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-24 Raghu Prabhakar , David Koeplinger , Kevin Brown , HyoukJoong Lee , Christopher De Sa , Christos Kozyrakis , Kunle Olukotun

Improving Locality in Sparse and Dense Matrix Multiplications

Consecutive matrix multiplications are commonly used in graph neural networks and sparse linear solvers. These operations frequently access the same matrices for both reading and writing. While reusing these matrices improves data locality,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-02 Mohammad Mahdi Salehi Dezfuli , Kazem Cheshmi

Automated Parallel Kernel Extraction from Dynamic Application Traces

Modern program runtime is dominated by segments of repeating code called kernels. Kernels are accelerated by increasing memory locality, increasing data-parallelism, and exploiting producer-consumer parallelism among kernels - which…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-31 Richard Uhrie , Chaitali Chakrabarti , John Brunhaver

AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their…

Machine Learning · Computer Science 2026-02-02 Thalaiyasingam Ajanthan , Sameera Ramasinghe , Gil Avraham , Hadi Mohaghegh Dolatabadi , Chamin P Hewa Koneputugodage , Violetta Shevchenko , Yan Zuo , Alexander Long

LOCAL: Low-Complex Mapping Algorithm for Spatial DNN Accelerators

Deep neural networks are a promising solution for applications that solve problems based on learning data sets. DNN accelerators solve the processing bottleneck as a domain-specific processor. Like other hardware solutions, there must be…

Hardware Architecture · Computer Science 2022-11-08 Midia Reshadi , David Gregg

Parallel Scheduling Self-attention Mechanism: Generalization and Optimization

Over the past few years, self-attention is shining in the field of deep learning, especially in the domain of natural language processing (NLP). Its impressive effectiveness, along with ubiquitous implementations, have aroused our interest…

Machine Learning · Computer Science 2020-12-03 Mingfei Yu , Masahiro Fujita

A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-29 Demetrios Coutinho , Felipe O. Lins e Silva , Daniel Aloise , Samuel Xavier-de-Souza

Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning

We present a novel characterization of the mapping of multiple parallelism forms (e.g. data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mapping.…

Programming Languages · Computer Science 2021-11-17 Ningning Xie , Tamara Norman , Dominik Grewe , Dimitrios Vytiniotis

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan , Yi Rong , Chen Meng , Zongyan Cao , Siyu Wang , Zhen Zheng , Chuan Wu , Guoping Long , Jun Yang , Lixue Xia , Lansong Diao , Xiaoyong Liu , Wei Lin

AutoParallel: A Python module for automatic parallelization and distributed execution of affine loop nests

The last improvements in programming languages, programming models, and frameworks have focused on abstracting the users from many programming issues. Among others, recent programming frameworks include simpler syntax, automatic memory…

Programming Languages · Computer Science 2018-10-29 Cristian Ramon-Cortes , Ramon Amela , Jorge Ejarque , Philippe Clauss , Rosa M. Badia

Optimizer Fusion: Efficient Training with Better Locality and Parallelism

Machine learning frameworks adopt iterative optimizers to train neural networks. Conventional eager execution separates the updating of trainable parameters from forward and backward computations. However, this approach introduces…

Machine Learning · Computer Science 2021-04-02 Zixuan Jiang , Jiaqi Gu , Mingjie Liu , Keren Zhu , David Z. Pan

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She , Bowen Pang , Kai Li , Zehua Liu , Tao Zhong

Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing

This paper introduces a novel approach to automatic ahead-of-time (AOT) parallelization and optimization of sequential Python programs for execution on distributed heterogeneous platforms. Our approach enables AOT source-to-source…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-15 Jun Shirako , Akihiro Hayashi , Sri Raj Paul , Alexey Tumanov , Vivek Sarkar

Parallel Local Search: Experiments with a PGAS-based programming model

Local search is a successful approach for solving combinatorial optimization and constraint satisfaction problems. With the progressing move toward multi and many-core systems, GPUs and the quest for Exascale systems, parallelism has become…

Programming Languages · Computer Science 2013-05-13 Rui Machado , Salvador Abreu , Daniel Diaz

TurboMap: GPU-Accelerated Local Mapping for Visual SLAM

In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However,…

Robotics · Computer Science 2026-03-19 Parsa Hosseininejad , Kimia Khabiri , Shishir Gopinath , Soudabeh Mohammadhashemi , Karthik Dantu , Steven Y. Ko