Related papers: Asynchronous Complex Analytics in a Distributed Da…

ASAP: Asynchronous Approximate Data-Parallel Computation

Emerging workloads, such as graph processing and machine learning, are approximate because of the scale of data involved and the stochastic nature of the underlying algorithms. These algorithms are often distributed over multiple machines…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-12-28 · Asim Kadav, Erik Kruus

Lightweight Asynchronous Snapshots for Distributed Dataflows

Distributed stateful stream processing enables the deployment and execution of large scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is…

Distributed, Parallel, and Cluster Computing · Computer Science · 2015-06-30 · Paris Carbone, Gyula Fóra, Stephan Ewen, Seif Haridi, Kostas Tzoumas

ASYNC: A Cloud Engine with Asynchrony and History for Distributed Machine Learning

ASYNC is a framework that supports the implementation of asynchrony and history for optimization methods on distributed computing platforms. The popularity of asynchronous optimization methods has increased in distributed machine learning.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-02-24 · Saeed Soori, Bugra Can, Mert Gurbuzbalaban, Maryam Mehri Dehnavi

Distributed Programming via Safe Closure Passing

Programming systems incorporating aspects of functional programming, e.g., higher-order functions, are becoming increasingly popular for large-scale distributed programming. New frameworks such as Apache Spark leverage functional techniques…

Programming Languages · Computer Science · 2016-02-12 · Philipp Haller, Heather Miller

AAFLOW: Scalable Patterns for Agentic AI Workflows

Agentic workflows in large language model systems integrate retrieval, reasoning, and memory, but existing frameworks suffer from scalability and reproducibility limitations due to fragmented data orchestration, serialization overhead, and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-05 · Arup Kumar Sarker, Mills Staylor, Aymen Alsaadi, Gregor von Laszewski, Shantenu Jha, Geoffrey Fox

AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing

Distributed Stream Processing Systems (DSPSs) are among the currently most emerging topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. The…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-01-06 · Vinu E. Venugopal, Martin Theobald, Samira Chaychi, Amal Tawakuli

Spinning Fast Iterative Data Flows

Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk…

Databases · Computer Science · 2012-08-02 · Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, Volker Markl

Collaborative Reuse of Streaming Dataflows in IoT Applications

Distributed Stream Processing Systems (DSPS) like Apache Storm and Spark Streaming enable composition of continuous dataflows that execute persistently over data streams. They are used by Internet of Things (IoT) applications to analyze…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-05-10 · Shilpa Chaturvedi, Sahil Tyagi, Yogesh Simmhan

Sparkle: Optimizing Spark for Large Memory Machines and Analytics

Spark is an in-memory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-08-22 · Mijung Kim, Jun Li, Haris Volos, Manish Marwah, Alexander Ulanov, Kimberly Keeton, Joseph Tucek, Lucy Cherkasova, Le Xu, Pradeep Fernando

Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms

Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems.…

Machine Learning · Computer Science · 2015-09-24 · Yuchen Zhang, Michael I. Jordan

Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors

Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-02-13 · Juan Carlos Saez, Fernando Castro, Manuel Prieto-Matias

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-01-11 · Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from logs analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-11-07 · Ubaid Ullah Hafeez, Martin Maas, Mustafa Uysal, Richard McDougall

Asynchronous ADMM for Distributed Non-Convex Optimization in Power Systems

Large scale, non-convex optimization problems arising in many complex networks such as the power system call for efficient and scalable distributed optimization algorithms. Existing distributed methods are usually iterative and require…

Optimization and Control · Mathematics · 2017-10-26 · Junyao Guo, Gabriela Hug, Ozan Tonguz

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science · 2015-07-10 · Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

Graph Sampling with Distributed In-Memory Dataflow Systems

Given a large graph, a graph sample determines a subgraph with similar characteristics for certain metrics of the original graph. The samples are much smaller thereby accelerating and simplifying the analysis and visualization of large…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-10-11 · Kevin Gomez, Matthias Täschner, M. Ali Rostami, Christopher Rost, Erhard Rahm

Experimentally Evaluating the Resource Efficiency of Big Data Autoscaling

Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-01-27 · Jonathan Will, Nico Treide, Lauritz Thamsen, Odej Kao

Renewable Energy Integration in Distribution System -- Synchrophasor Sensor based Big Data Analysis, Visualization, and System Operation

Due to the large volume of heterogeneous data provided by both the customer and the grid side, a big data visualization platform is built to discover the hidden useful knowledge for smart grid (SG) operation, control and situation…

Systems and Control · Computer Science · 2018-03-19 · Yi Gu

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-03-30 · Bilal Akil, Ying Zhou, Uwe Röhm

Deep Learning with Apache SystemML

Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different…

Machine Learning · Computer Science · 2018-02-14 · Niketan Pansare, Michael Dusenberry, Nakul Jindal, Matthias Boehm, Berthold Reinwald, Prithviraj Sen