Related papers: Runtime Variation in Big Data Analytics

Task Runtime Prediction in Scientific Workflows Using an Online Incremental Learning Approach

Many algorithms in workflow scheduling and resource provisioning rely on the performance estimation of tasks to produce a scheduling plan. A profiler that is capable of modeling the execution of tasks and predicting their runtime…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-01 Muhammad H. Hilman, Maria A. Rodriguez, Rajkumar Buyya

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Jonathan Bader, Odej Kao

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs

Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. However, picking the appropriate resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-28 Jonathan Will, Jonathan Bader, Lauritz Thamsen

Cloud Workload Prediction based on Workflow Execution Time Discrepancies

Infrastructure as a service clouds hide the complexity of maintaining the physical infrastructure with a slight disadvantage: they also hide their internal working details. Should users need knowledge about these details e.g., to increase…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-20 Gabor Kecskemeti, Zsolt Nemeth, Attila Kertesz, Rajiv Ranjan

Formal and Empirical Study of Metadata-Based Profiling for Resource Management in the Computing Continuum

We present and formalize a general approach for profiling workload by leveraging only a priori available static metadata to supply appropriate resource needs. Understanding the requirements and characteristics of a workload's runtime is…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-30 Andrea Morichetta, Stefan Nastic, Victor Casamayor Pujol, Schahram Dustdar

Performance-Aware Management of Cloud Resources: A Taxonomy and Future Directions

Dynamic nature of the cloud environment has made distributed resource management process a challenge for cloud service providers. The importance of maintaining the quality of service in accordance with customer expectations as well as the…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-08 Sara Kardani-Moghaddam, Rajkumar Buyya, Kotagiri Ramamohanarao

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-14 Jonathan Will, Onur Arslan, Jonathan Bader, Dominik Scheinert, Lauritz Thamsen

The Case for Task Sampling based Learning for Cluster Job Scheduling

The ability to accurately estimate job runtime properties allows a scheduler to effectively schedule jobs. State-of-the-art online cluster job schedulers use history-based learning, which uses past job execution information to estimate the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-17 Akshay Jajoo, Y. Charlie Hu, Xiaojun Lin, Nan Deng

Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures

Many resource management techniques for task scheduling, energy and carbon efficiency, and cost optimization in workflows rely on a-priori task runtime knowledge. Building runtime prediction models on historical data is often not feasible…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-14 Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Ulf Leser, Odej Kao

Workload Forecasting and Resource Management Models Based on Machine Learning for Cloud Computing Environments

The workload prediction and resource allocation significantly play an inevitable role in production of an efficient cloud environment. The proactive estimation of future workload followed by decision of resource allocation have become a…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-30 Deepika Saxena, Ashutosh Kumar Singh

Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study

Cloud performance fluctuates due to factors such as resource contention and workload changes. These factors can be short-term, seasonal, or long-term. Their effects are often intertwined in performance traces, making performance management…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-12 Shimul Debnath, William Hart, Lori Pollock, Donald Lien, Wei Wang

Online Job Scheduling with Redundancy and Opportunistic Checkpointing: A Speedup-Function-Based Analysis

In a large-scale computing cluster, the job completions can be substantially delayed due to two sources of variability, namely, variability in the job size and that in the machine service capacity. To tackle this issue, existing works have…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-07 Huanle Xu, Gustavo de Veciana, Wing Cheong Lau, Kunxiao Zhou

A Direct Approach for Solving Cloud Computing Task Assignment with Soft Deadlines

Job scheduling in cloud computing environments is a critical yet complex problem. Cloud computing user job requirements are highly dynamic and uncertain, while cloud computing resources are heterogeneous and constrained. This paper studies…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-25 Guang Fang, Yuxiang Zhao

Performance Cost Tradeoffs in Intelligent Load Balancing for Multi Data Center Cloud Systems: From Static Policies to Adaptive Resource Distribution

Cloud computing infrastructures increasingly rely on geographically distributed data centers to meet the growing demand for low latency, high availability, and cost-efficient service delivery. In this context, load balancing plays a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-12 Saeid Aghasoleymani Najafabadi, Elaheh Nabavi Nia

Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

Many scientific workflow scheduling algorithms need to be informed about task runtimes a-priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-24 Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Jonathan Will, Ulf Leser, Odej Kao

Is Big Data Performance Reproducible in Modern Cloud Networks?

Performance variability has been acknowledged as a problem for over a decade by cloud practitioners and performance engineers. Yet, our survey of top systems conferences reveals that the research community regularly disregards variability…

Performance · Computer Science 2019-12-20 Alexandru Uta, Alexandru Custura, Dmitry Duplyakin, Ivo Jimenez, Jan Rellermeyer, Carlos Maltzahn, Robert Ricci, Alexandru Iosup

Predicting the Performance of Scientific Workflow Tasks for Cluster Resource Management: An Overview of the State of the Art

Scientific workflow management systems support large-scale data analysis on cluster infrastructures. For this, they interact with resource managers which schedule workflow tasks onto cluster nodes. In addition to workflow task descriptions,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-30 Jonathan Bader, Kathleen West, Soeren Becker, Svetlana Kulagina, Fabian Lehmann, Lauritz Thamsen, Henning Meyerhenke, Odej Kao

A Self-adaptive Auto-scaling Method for Scientific Applications on HPC Environments and Clouds

High intensive computation applications can usually take days to months to finish an execution. During this time, it is common to have variations of the available resources when considering that such hardware is usually shared among a…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-01-27 Kiran Mantripragada, Alecio Binotto, Leonardo P. Tizzei

Efficient and Robust Allocation Algorithms in Clouds under Memory Constraints

We consider robust resource allocation of services in Clouds. More specifically, we consider the case of a large public or private Cloud platform that runs a relatively small set of large and independent services. These services are…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-22 Olivier Beaumont, Lionel Eyraud-Dubois, Paul Renaud-Goud

When Should I Run My Application Benchmark?: Studying Cloud Performance Variability for the Case of Stream Processing Applications

Performance benchmarking is a common practice in software engineering, particularly when building large-scale, distributed, and data-intensive systems. While cloud environments offer several advantages for running benchmarks, it is often…

Software Engineering · Computer Science 2025-04-17 Sören Henning, Adriano Vogel, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser