Related papers: Flora: Efficient Cloud Resource Selection for Big …

C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds

Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet, selecting appropriate cloud…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-03 Jonathan Will , Lauritz Thamsen , Dominik Scheinert , Jonathan Bader , Odej Kao

Experimentally Evaluating the Resource Efficiency of Big Data Autoscaling

Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-27 Jonathan Will , Nico Treide , Lauritz Thamsen , Odej Kao

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-11 Jonathan Will , Lauritz Thamsen , Jonathan Bader , Dominik Scheinert , Odej Kao

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs

Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. However, picking the appropriate resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-28 Jonathan Will , Jonathan Bader , Lauritz Thamsen

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Jonathan Will , Lauritz Thamsen , Dominik Scheinert , Odej Kao

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-14 Jonathan Will , Onur Arslan , Jonathan Bader , Dominik Scheinert , Lauritz Thamsen

Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-16 Lauritz Thamsen , Ilya Verbitskiy , Sasho Nedelkoski , Vinh Thuy Tran , Vinicius Meyer , Miguel G. Xavier , Odej Kao , Cesar A. F. De Rose

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen , Dominik Scheinert , Jonathan Will , Jonathan Bader , Odej Kao

Co-Tuning of Cloud Infrastructure and Distributed Data Processing Platforms

Distributed Data Processing Platforms (e.g., Hadoop, Spark, and Flink) are widely used to store and process data in a cloud environment. These platforms distribute the storage and processing of data among the computing nodes of a cloud. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-08 Isuru Dharmadasa , Faheem Ullah

Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-06 Jonathan Will , Lauritz Thamsen , Jonathan Bader , Dominik Scheinert , Odej Kao

Artificial Intelligence for Cost-Aware Resource Prediction in Big Data Pipelines

Efficient resource allocation is a key challenge in modern cloud computing. Over-provisioning leads to unnecessary costs, while under-provisioning risks performance degradation and SLA violations. This work presents an artificial…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-08 Harshit Goyal

Evaluation of Distributed Data Processing Frameworks in Hybrid Clouds

Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among computing nodes of a cloud. Recently, there have been increasing efforts aimed at evaluating the performance of distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-07 Faheem Ullah , Shagun Dhingra , Xiaoyu Xia , M. Ali Babar

Multi-objective Optimization of Clustering-based Scheduling for Multi-workflow On Clouds Considering Fairness

Distributed computing, such as cloud computing, provides promising platforms to execute multiple workflows. Workflow scheduling plays an important role in multi-workflow execution with multi-objective requirements. Although there exist many…

Artificial Intelligence · Computer Science 2022-05-24 Feng Li , Wen Jun , Tan , Wentong , Cai

Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-27 Dominik Scheinert , Philipp Wiesner , Thorsten Wittkopp , Lauritz Thamsen , Jonathan Will , Odej Kao

Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-17 Davit Buniatyan

Scheduling Algorithms for Efficient Execution of Stream Workflow Applications in Multicloud Environments

Big data processing applications are becoming more and more complex. They are no more monolithic in nature but instead they are composed of decoupled analytical processes in the form of a workflow. One type of such workflow applications is…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-19 Mutaz Barika , Saurabh Garg , Andrew Chan , Rodrigo N. Calheiros

Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

Distributed in-memory data processing engines accelerate iterative applications by caching substantial datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-07 Hani Al-Sayeh , Muhammad Attahir Jibril , Bunjamin Memishi , Kai-Uwe Sattler

A Framework for Energy-aware Evaluation of Distributed Data Processing Platforms in Edge-Cloud Environment

Distributed data processing platforms (e.g., Hadoop, Spark, and Flink) are widely used to distribute the storage and processing of data among computing nodes of a cloud. The centralization of cloud resources has given birth to edge…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-07 Faheem Ullah , Imaduddin Mohammed , M. Ali Babar

DV-ARPA: Data Variety Aware Resource Provisioning for Big Data Processing in Accumulative Applications

In Cloud Computing, the resource provisioning approach used has a great impact on the processing cost, especially when it is used for Big Data processing. Due to data variety, the performance of virtual machines (VM) may differ based on the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-12 Hossein Ahmadvand , Fouzhan Foroutan

A Comparison of Big Data Frameworks on a Layered Dataflow Model

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-17 Claudia Misale , Maurizio Drocco , Marco Aldinucci , Guy Tremblay