Related papers: Coded TeraSort

Leveraging Coding Techniques for Speeding up Distributed Computing

Large scale clusters leveraging distributed computing frameworks such as MapReduce routinely process data that are on the orders of petabytes or more. The sheer size of the data precludes the processing of the data on a single computer. The…

Information Theory · Computer Science 2018-02-12 Konstantinos Konstantinidis, Aditya Ramamoorthy

Defeating duplicates: A re-design of the LearnedSort algorithm

LearnedSort is a novel sorting algorithm that, unlike traditional methods, uses fast ML models to boost the sorting speed. The models learn to estimate the input's distribution and arrange the keys in sorted order by predicting their…

Data Structures and Algorithms · Computer Science 2021-07-08 Ani Kristo, Kapil Vaidya, Tim Kraska

Optimization and analysis of large scale data sorting algorithm based on Hadoop

When dealing with massive data sorting, we usually use Hadoop which is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach in implement of…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-02 Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang

A Fundamental Tradeoff between Computation and Communication in Distributed Computing

How can we optimally trade extra computing power to reduce the communication load in distributed computing? We answer this question by characterizing a fundamental tradeoff between computation and communication in distributed computing,…

Information Theory · Computer Science 2017-09-26 Songze Li, Mohammad Ali Maddah-Ali, Qian Yu, A. Salman Avestimehr

Coded Computation over Heterogeneous Clusters

In large-scale distributed computing clusters, such as Amazon EC2, there are several types of "system noise" that can result in major degradation of performance: bottlenecks due to limited communication bandwidth, latency due to straggler…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-21 Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, Amir Salman Avestimehr

Coded MapReduce

MapReduce is a commonly used framework for executing data-intensive jobs on distributed server clusters. We introduce a variant implementation of MapReduce, namely "Coded MapReduce", to substantially reduce the inter-server communication…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-12-08 Songze Li, Mohammad Ali Maddah-Ali, A. Salman Avestimehr

Erasure Coding for Distributed Storage: An Overview

In a distributed storage system, code symbols are dispersed across space in nodes or storage units as opposed to time. In settings such as that of a large data center, an important consideration is the efficient repair of a failed node.…

Information Theory · Computer Science 2018-06-13 S. B. Balaji, M. Nikhil Krishnan, Myna Vajha, Vinayak Ramkumar, Birenjith Sasidharan, P. Vijay Kumar

An Experimental Evaluation of Performance of A Hadoop Cluster on Replica Management

Hadoop is an open source implementation of the MapReduce Framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large data sets across cluster of…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-11-10 Muralikrishnan Ramane, Sharmila Krishnamoorthy, Sasikala Gowtham

Timely-Throughput Optimal Coded Computing over Cloud Networks

In modern distributed computing systems, unpredictable and unreliable infrastructures result in high variability of computing resources. Meanwhile, there is significantly increasing demand for timely and event-driven services with deadline…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-12 Chien-Sheng Yang, Ramtin Pedarsani, A. Salman Avestimehr

Hierarchical Coded Matrix Multiplication

Slow working nodes, known as stragglers, can greatly reduce the speed of distributed computation. Coded matrix multiplication is a recently introduced technique that enables straggler-resistant distributed multiplication of large matrices.…

Information Theory · Computer Science 2019-07-23 Shahrzad Kiani, Nuwan Ferdinand, Stark C. Draper

zSort: Stable Distribution Sort using Z-Score Partitioning

Sorting is a foundational primitive in modern data processing, influencing the execution speed of high-performance data pipelines. However, the algorithmic landscape is currently bifurcated by a pervasive "Stability Tax": practitioners must…

Data Structures and Algorithms · Computer Science 2026-05-15 Hriday Jain, Ketan Sabale, Aditya Shastri, Hiren Kumar Thakkar, Ashutosh Londhe

Evaluation of Codes with Inherent Double Replication for Hadoop

In this paper, we evaluate the efficacy, in a Hadoop setting, of two coding schemes, both possessing an inherent double replication of data. The two coding schemes belong to the class of regenerating and locally regenerating codes…

Information Theory · Computer Science 2014-06-27 M. Nikhil Krishnan, N. Prakash, V. Lalitha, Birenjith Sasidharan, P. Vijay Kumar, Srinivasan Narayanamurthy, Ranjit Kumar, Siddhartha Nandi

Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

Mining frequent itemsets from massive datasets is always being a most important problem of data mining. Apriori is the most popular and simplest algorithm for frequent itemset mining. To enhance the efficiency and scalability of Apriori, a…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-24 Sudhakar Singh, Rakhi Garg, P. K. Mishra

A novel approach for fast mining frequent itemsets use N-list structure based on MapReduce

Frequent Pattern Mining is a one field of the most significant topics in data mining. In recent years, many algorithms have been proposed for mining frequent itemsets. A new algorithm has been presented for mining frequent itemsets based on…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-23 Arkan A. G. Al-Hamodi, Songfeng Lu

Column-Oriented Storage Techniques for MapReduce

Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However,…

Databases · Computer Science 2011-05-24 Avrilia Floratou, Jignesh Patel, Eugene Shekita, Sandeep Tata

Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster

Designing fast and scalable algorithm for mining frequent itemsets is always being a most eminent and promising problem of data mining. Apriori is one of the most broadly used and popular algorithm of frequent itemset mining. Designing…

Databases · Computer Science 2017-01-24 Sudhakar Singh, Rakhi Garg, P. K. Mishra

List Sort: A New Approach for Sorting List to Reduce Execution Time

In this paper we are proposing a new sorting algorithm, List Sort algorithm, is based on the dynamic memory allocation. In this research study we have also shown the comparison of various efficient sorting techniques with List sort. Due the…

Data Structures and Algorithms · Computer Science 2013-10-30 Adarsh Kumar Verma, Prashant Kumar

RapidRAID: Pipelined Erasure Codes for Fast Data Archival in Distributed Storage Systems

To achieve reliability in distributed storage systems, data has usually been replicated across different nodes. However the increasing volume of data to be stored has motivated the introduction of erasure codes, a storage efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-08-06 Lluis Pamies-Juarez, Anwitaman Datta, Frederique Oggier

Hash sort: A linear time complexity multiple-dimensional sort algorithm

Sorting and hashing are two completely different concepts in computer science, and appear mutually exclusive to one another. Hashing is a search method using the data as a key to map to the location within memory, and is used for rapid…

Data Structures and Algorithms · Computer Science 2007-05-23 William F. Gilreath

A Game-Theoretic Approach for Runtime Capacity Allocation in MapReduce

Nowadays many companies have available large amounts of raw, unstructured data. Among Big Data enabling technologies, a central place is held by the MapReduce framework and, in particular, by its open source implementation, Apache Hadoop.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-01-18 Eugenio Gianniti, Danilo Ardagna, Michele Ciavotta, Mauro Passacantando