Related papers: NB-FEB: An Easy-to-Use and Scalable Universal Sync…

Low-Depth Parallel Algorithms for the Binary-Forking Model without Atomics

The binary-forking model is a parallel computation model, formally defined by Blelloch et al. very recently, in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of $\Theta(\log n)$…

Data Structures and Algorithms · Computer Science 2020-09-04 Zafar Ahmad, Rezaul Chowdhury, Rathish Das, Pramod Ganapathi, Aaron Gregory, Mohammad Mahdi Javanmard

N2N: A Parallel Framework for Large-Scale MILP under Distributed Memory

Parallelization has emerged as a promising approach for accelerating MILP solving. However, the complexity of the branch-and-bound (B&B) framework and the numerous effective algorithm components in MILP solvers make it difficult to…

Artificial Intelligence · Computer Science 2025-12-19 Longfei Wang, Junyan Liu, Fan Zhang, Jiangwen Wei, Yuanhua Tang, Jie Sun, Xiaodong Luo

A Case for Stale Synchronous Distributed Model for Declarative Recursive Computation

A large class of traditional graph and data mining algorithms can be concisely expressed in Datalog, and other Logic-based languages, once aggregates are allowed in recursion. In fact, for most BigData algorithms, the difficult semantic…

Programming Languages · Computer Science 2019-07-25 Ariyam Das, Carlo Zaniolo

Automatic Parallelization of Sequential Programs

Prior work on Automatically Scalable Computation (ASC) suggests that it is possible to parallelize sequential computation by building a model of whole-program execution, using that model to predict future computations, and then…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-21 Peter Kraft, Amos Waterland, Daniel Y Fu, Anitha Gollamudi, Shai Szulanski, Margo Seltzer

Efficient Synchronization Primitives for GPUs

In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes, and semaphores---and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and…

Operating Systems · Computer Science 2011-10-21 Jeff A. Stuart, John D. Owens

Emulating a large memory with a collection of small ones

Sequential computation is well understood but does not scale well with current technology. Within the next decade, systems will contain large numbers of processors with potentially thousands of processors per chip. Despite this, many…

Hardware Architecture · Computer Science 2015-11-17 James Hanlon

An Optimal Level-synchronous Shared-memory Parallel BFS Algorithm with Optimal parallel Prefix-sum Algorithm and its Implications for Energy Consumption

We present a work-efficient parallel level-synchronous Breadth First Search (BFS) algorithm for shared-memory architectures which achieves the theoretical lower bound on parallel running time. The optimality holds regardless of the shape of…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-20 Jesmin Jahan Tithi, Yonatan Fogel, Rezaul Chowdhury

A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-29 Demetrios Coutinho, Felipe O. Lins e Silva, Daniel Aloise, Samuel Xavier-de-Souza

Scalable and Efficient Virtual Memory Sharing in Heterogeneous SoCs with TLB Prefetching and MMU-Aware DMA Engine

Shared virtual memory (SVM) is key in heterogeneous systems on chip (SoCs), which combine a general-purpose host processor with a many-core accelerator, both for programmability and to avoid data duplication. However, SVM can bring a…

Hardware Architecture · Computer Science 2018-08-30 Andreas Kurth, Pirmin Vogel, Andrea Marongiu, Luca Benini

Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors

B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-10 Amrita Mathuriya, Ye Luo, Anouar Benali, Luke Shulenburger, Jeongnim Kim

BSF: a parallel computation model for scalability estimation of iterative numerical algorithms on cluster computing systems

This paper examines a new parallel computation model called bulk synchronous farm (BSF) that focuses on estimating the scalability of compute-intensive iterative algorithms aimed at cluster computing systems. In the BSF model, a computer is…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-05 Leonid B. Sokolinsky

BSF-skeleton: A Template for Parallelization of Iterative Numerical Algorithms on Cluster Computing Systems

This article describes a method for creating applications for cluster computing systems using the parallel BSF skeleton based on the original BSF (Bulk Synchronous Farm) model of parallel computations developed by the author earlier. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-06 Leonid B. Sokolinsky

Massively parallelizable proximal algorithms for large-scale stochastic optimal control problems

Scenario-based stochastic optimal control problems suffer from the curse of dimensionality as they can easily grow to six and seven figure sizes. First-order methods are suitable as they can deal with such large-scale problems, but may fail…

Optimization and Control · Mathematics 2021-07-06 Ajay K. Sampathirao, Panagiotis Patrinos, Alberto Bemporad, Pantelis Sopasakis

A simple and efficient explicit parallelization of logic programs using low-level threading primitives

In this work, we present an automatic way to parallelize logic programs for finding all the answers to queries using a transformation to low level threading primitives. Although much work has been done in parallelization of logic…

Programming Languages · Computer Science 2009-12-28 Diptikalyan Saha, Paul Fodor

Cerberus: Minimalistic Multi-shard Byzantine-resilient Transaction Processing

To enable high-performance and scalable blockchains, we need to step away from traditional consensus-based fully-replicated designs. One direction is to explore the usage of sharding in which we partition the managed dataset over many…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-12 Jelle Hellings, Daniel P. Hughes, Joshua Primero, Mohammad Sadoghi

Efficient Hybrid Execution of C++ Applications using Intel(R) Xeon Phi(TM) Coprocessor

The introduction of Intel(R) Xeon Phi(TM) coprocessors opened up new possibilities in development of highly parallel applications. The familiarity and flexibility of the architecture together with compiler support integrated into the Intel…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-11-26 Jiri Dokulil, Enes Bajrovic, Siegfried Benkner, Sabri Pllana, Martin Sandrieser, Beverly Bachmayer

FnF-BFT: Exploring Performance Limits of BFT Protocols

We introduce FnF-BFT, a parallel-leader byzantine fault-tolerant state-machine replication protocol for the partially synchronous model with theoretical performance bounds during synchrony. By allowing all replicas to act as leaders and…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-11 Zeta Avarikioti, Lioba Heimbach, Roland Schmid, Laurent Vanbever, Roger Wattenhofer, Patrick Wintermeyer

Performance Evaluation of Parallel Message Passing and Thread Programming Model on Multicore Architectures

The current trend of multicore architectures on shared memory systems underscores the need for parallelism. While there are some programming models to express parallelism, the thread programming model has become a standard to support these system…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-12-13 D. T. Hasta, A. B. Mutiara

Performance-Driven Optimization of Parallel Breadth-First Search

Breadth-first search (BFS) is a fundamental graph algorithm that presents significant challenges for parallel implementation due to irregular memory access patterns, load imbalance and synchronization overhead. In this paper, we introduce a…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-04 Marati Bhaskar, Raghavendra Kanakagiri

Flat-Combining-Based Persistent Data Structures for Non-Volatile Memory

Flat combining (FC) is a synchronization paradigm in which a single thread, holding a global lock, collects requests by multiple threads for accessing a concurrent data structure and applies their combined requests to it. Although FC is…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-10 Matan Rusanovsky, Hagit Attiya, Ohad Ben-Baruch, Tom Gerby, Danny Hendler, Pedro Ramalhete