Related papers: C3O: Collaborative Cluster Configuration Optimizat…

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs

Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. However, picking the appropriate resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-28 Jonathan Will , Jonathan Bader , Lauritz Thamsen

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen , Dominik Scheinert , Jonathan Will , Jonathan Bader , Odej Kao

Flora: Efficient Cloud Resource Selection for Big Data Processing via Job Classification

Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters of cloud resources. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-03 Jonathan Will , Lauritz Thamsen , Jonathan Bader , Odej Kao

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-14 Jonathan Will , Onur Arslan , Jonathan Bader , Dominik Scheinert , Lauritz Thamsen

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-11 Jonathan Will , Lauritz Thamsen , Jonathan Bader , Dominik Scheinert , Odej Kao

Co-Tuning of Cloud Infrastructure and Distributed Data Processing Platforms

Distributed Data Processing Platforms (e.g., Hadoop, Spark, and Flink) are widely used to store and process data in a cloud environment. These platforms distribute the storage and processing of data among the computing nodes of a cloud. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-08 Isuru Dharmadasa , Faheem Ullah

Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-16 Lauritz Thamsen , Ilya Verbitskiy , Sasho Nedelkoski , Vinh Thuy Tran , Vinicius Meyer , Miguel G. Xavier , Odej Kao , Cesar A. F. De Rose

On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds

With the growing amount of data, data processing workloads and the management of their resource usage become increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-19 Dominik Scheinert , Alireza Alamgiralem , Jonathan Bader , Jonathan Will , Thorsten Wittkopp , Lauritz Thamsen

Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts

Distributed dataflow systems enable the use of clusters for scalable data analytics. However, selecting appropriate cluster resources for a processing job is often not straightforward. Performance models trained on historical executions of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-19 Dominik Scheinert , Lauritz Thamsen , Houkun Zhu , Jonathan Will , Alexander Acker , Thorsten Wittkopp , Odej Kao

Towards a Peer-to-Peer Data Distribution Layer for Efficient and Collaborative Resource Optimization of Distributed Dataflow Applications

Performance modeling can help to improve the resource efficiency of clusters and distributed dataflow applications, yet the available modeling data is often limited. Collaborative approaches to performance modeling, characterized by the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-24 Dominik Scheinert , Soeren Becker , Jonathan Will , Luis Englaender , Lauritz Thamsen

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Jonathan Will , Lauritz Thamsen , Dominik Scheinert , Odej Kao

Evaluation of Distributed Data Processing Frameworks in Hybrid Clouds

Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among computing nodes of a cloud. Recently, there have been increasing efforts aimed at evaluating the performance of distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-07 Faheem Ullah , Shagun Dhingra , Xiaoyu Xia , M. Ali Babar

A Comparison of Big Data Frameworks on a Layered Dataflow Model

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-17 Claudia Misale , Maurizio Drocco , Marco Aldinucci , Guy Tremblay

Multi-objective Optimization of Clustering-based Scheduling for Multi-workflow On Clouds Considering Fairness

Distributed computing, such as cloud computing, provides promising platforms to execute multiple workflows. Workflow scheduling plays an important role in multi-workflow execution with multi-objective requirements. Although there exist many…

Artificial Intelligence · Computer Science 2022-05-24 Feng Li , Wen Jun Tan , Wentong Cai

Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-27 Dominik Scheinert , Philipp Wiesner , Thorsten Wittkopp , Lauritz Thamsen , Jonathan Will , Odej Kao

Data Sharing Options for Scientific Workflows on Amazon EC2

Efficient data management is a key component in achieving good performance for scientific workflows in distributed environments. Workflow applications typically communicate data between tasks using files. When tasks are distributed, these…

Instrumentation and Methods for Astrophysics · Physics 2015-03-17 Gideon Juve , Ewa Deelman , Karan Vahi , Gaurang Mehta , Bruce Berriman , Benjamin P. Berman , Phil Maechling

Runtime Variation in Big Data Analytics

The dynamic nature of resource allocation and runtime conditions on Cloud can result in high variability in a job's runtime across multiple iterations, leading to a poor experience. Identifying the sources of such variation and being able…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-04-10 Yiwen Zhu , Rathijit Sen , Robert Horton , John Mark Agosta

Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from logs analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-07 Ubaid Ullah Hafeez , Martin Maas , Mustafa Uysal , Richard McDougall

Optimal Data Placement for Data-Sharing Scientific Workflows in Heterogeneous Edge-Cloud Computing Environments

The heterogeneous edge-cloud computing paradigm can provide a more optimal direction to deploy scientific workflows than traditional distributed computing or cloud computing environments. Due to the different sizes of scientific datasets…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-14 Xin Du , Songtao Tang , Zhihui Lu , Keke Gai , Jie Wu , Patrick C. K. Hung

Scheduling Coflows in Multi-Core OCS Networks with Performance Guarantee

Coflow provides a key application-layer abstraction for capturing communication patterns, enabling the efficient coordination of parallel data flows to reduce job completion times in distributed systems. Modern data center networks (DCNs)…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-10 Xin Wang , Hong Shen , Hui Tian , Dong Wang