Related papers: An Efficient and Balanced Platform for Data-Parall…

Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in the implementation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-08 Tsung-Wei Huang , Dian-Lun Lin , Chun-Xun Lin , Yibo Lin

The Tiny-Tasks Granularity Trade-Off: Balancing overhead vs. performance in parallel systems

Models of parallel processing systems typically assume that one has $l$ workers and jobs are split into an equal number of $k=l$ tasks. Splitting jobs into $k > l$ smaller tasks, i.e. using ``tiny tasks'', can yield performance and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-24 Stefan Bora , Brenton Walker , Markus Fidler

Comparisons of Algorithms in Big Data Processing

Parallel computing is the fundamental base for MapReduce framework in Hadoop. Each data chunk is replicated over 3 servers for increasing availability of data and decreasing probability of data loss. Hence, the 3 servers that have Map task…

Performance · Computer Science 2020-07-21 Amirali Daghighi , Jim Q. Chen

A parallel sampling based clustering

The problem of automatically clustering data is an age old problem. People have created numerous algorithms to tackle this problem. The execution time of any of this algorithm grows with the number of input points and the number of cluster…

Machine Learning · Computer Science 2014-12-08 Aditya AV Sastry , Kalyan Netti

heSRPT: Parallel Scheduling to Minimize Mean Slowdown

Modern data centers serve workloads which are capable of exploiting parallelism. When a job parallelizes across multiple servers it will complete more quickly, but jobs receive diminishing returns from being allocated additional servers.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-20 Benjamin Berg , Rein Vesilo , Mor Harchol-Balter

Detecting Straggler MapReduce Tasks in Big Data Processing Infrastructure by Neural Network

Straggler task detection is one of the main challenges in applying MapReduce for parallelizing and distributing large-scale data processing. It is defined as detecting running tasks on weak nodes. Considering two stages in the Map phase…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-14 Amir Javadpour , Guojun Wang , Samira Rezaei , Kuan Ching Li

Don't cry over spilled records: Memory elasticity of data-parallel applications and its application to cluster scheduling

Understanding the performance of data-parallel workloads when resource-constrained has significant practical importance but unfortunately has received only limited attention. This paper identifies, quantifies and demonstrates memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-15 Calin Iorgulescu , Florin Dinu , Aunn Raza , Wajih Ul Hassan , Willy Zwaenepoel

Supporting Parallelism in Server-based Multiprocessor Systems

Developing an efficient server-based real-time scheduling solution that supports dynamic task-level parallelism is now relevant to even the desktop and embedded domains and no longer only to the high performance computing market niche. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-15 Luís Nogueira , Luís Miguel Pinho

Near-Data Scheduling for Data Centers with Multiple Levels of Data Locality

Data locality is a fundamental issue for data-parallel applications. Considering MapReduce in Hadoop, the map task scheduling part requires an efficient algorithm which takes data locality into consideration; otherwise, the system may…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-14 Ali Yekkehkhany

Heterogeneous MacroTasking (HeMT) for Parallel Processing in the Public Cloud

Using tiny, equal-sized tasks (Homogeneous microTasking, HomT) has long been regarded an effective way of load balancing in parallel computing systems. When combined with nodes pulling in work upon becoming idle, HomT has the desirable…

Performance · Computer Science 2018-10-03 Yuquan Shan , George Kesidis , Bhuvan Urgaonkar , Jorg Schad , Jalal Khamse-Ashari , Ioannis Lambadaris

heSRPT: Optimal Parallel Scheduling of Jobs With Known Sizes

When parallelizing a set of jobs across many servers, one must balance a trade-off between granting priority to short jobs and maintaining the overall efficiency of the system. When the goal is to minimize the mean flow time of a set of…

Performance · Computer Science 2020-11-24 Benjamin Berg , Rein Vesilo , Mor Harchol-Balter

Overview of Caching Mechanisms to Improve Hadoop Performance

Nowadays distributed computing environments, large amounts of data are generated from different resources with a high velocity, rendering the data difficult to capture, manage, and process within existing relational databases. Hadoop is a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-24 Rana Ghazali , Douglas G. Down

Beyond Batch Processing: Towards Real-Time and Streaming Big Data

Today, big data is generated from many sources and there is a huge demand for storing, managing, processing, and querying on big data. The MapReduce model and its counterpart open source implementation Hadoop, has proven itself as the de…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-08-04 Saeed Shahrivari , Saeed Jalili

Scheduling of Graph Queries: Controlling Intra- and Inter-query Parallelism for a High System Throughput

The vast amounts of data used in social, business or traffic networks, biology and other natural sciences are often managed in graph-based data sets, consisting of a few thousand up to billions and trillions of vertices and edges,…

Databases · Computer Science 2021-10-22 Matthias Hauck , Ismail Oukid , Holger Fröning

Leveraging Coding Techniques for Speeding up Distributed Computing

Large scale clusters leveraging distributed computing frameworks such as MapReduce routinely process data that are on the orders of petabytes or more. The sheer size of the data precludes the processing of the data on a single computer. The…

Information Theory · Computer Science 2018-02-12 Konstantinos Konstantinidis , Aditya Ramamoorthy

Predicting Scheduling Failures in the Cloud

Cloud Computing has emerged as a key technology to deliver and manage computing, platform, and software services over the Internet. Task scheduling algorithms play an important role in the efficiency of cloud computing services as they aim…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-07-14 Mbarka Soualhia , Foutse Khomh , Sofiene Tahar

An Experimental Evaluation of Performance of A Hadoop Cluster on Replica Management

Hadoop is an open source implementation of the MapReduce Framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large data sets across cluster of…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-11-10 Muralikrishnan Ramane , Sharmila Krishnamoorthy , Sasikala Gowtham

Efficient Task Grouping Through Samplewise Optimisation Landscape Analysis

Shared training approaches, such as multi-task learning (MTL) and gradient-based meta-learning, are widely used in various machine learning applications, but they often suffer from negative transfer, leading to performance degradation in…

Machine Learning · Computer Science 2024-12-10 Anshul Thakur , Yichen Huang , Soheila Molaei , Yujiang Wang , David A. Clifton

Parallel Processing of Large Graphs

More and more large data collections are gathered worldwide in various IT systems. Many of them possess the networked nature and need to be processed and analysed as graph structures. Due to their size they require very often usage of…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-06-04 Tomasz Kajdanowicz , Przemyslaw Kazienko , Wojciech Indyk

Streaming Task Graph Scheduling for Dataflow Architectures

Dataflow devices represent an avenue towards saving the control and data movement overhead of Load-Store Architectures. Various dataflow accelerators have been proposed, but how to efficiently schedule applications on such devices remains…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-06 Tiziano De Matteis , Lukas Gianinazzi , Johannes de Fine Licht , Torsten Hoefler