Related papers: Flora: Efficient Cloud Resource Selection for Big …
Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet, selecting appropriate cloud…
Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual…
Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to…
Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. However, picking the appropriate resources…
Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial…
Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and…
Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage…
Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…
Distributed Data Processing Platforms (e.g., Hadoop, Spark, and Flink) are widely used to store and process data in a cloud environment. These platforms distribute the storage and processing of data among the computing nodes of a cloud. The…
Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving…
Efficient resource allocation is a key challenge in modern cloud computing. Over-provisioning leads to unnecessary costs, while under-provisioning risks performance degradation and SLA violations. This work presents an artificial…
Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among computing nodes of a cloud. Recently, there have been increasing efforts aimed at evaluating the performance of distributed…
Distributed computing, such as cloud computing, provides promising platforms to execute multiple workflows. Workflow scheduling plays an important role in multi-workflow execution with multi-objective requirements. Although there exist many…
Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and…
Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid…
Big data processing applications are becoming more and more complex. They are no more monolithic in nature but instead they are composed of decoupled analytical processes in the form of a workflow. One type of such workflow applications is…
Distributed in-memory data processing engines accelerate iterative applications by caching substantial datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an…
Distributed data processing platforms (e.g., Hadoop, Spark, and Flink) are widely used to distribute the storage and processing of data among computing nodes of a cloud. The centralization of cloud resources has given birth to edge…
In Cloud Computing, the resource provisioning approach used has a great impact on the processing cost, especially when it is used for Big Data processing. Due to data variety, the performance of virtual machines (VM) may differ based on the…
In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only…