Related papers: Coded TeraSort

200 papers

Large-scale clusters leveraging distributed computing frameworks such as MapReduce routinely process data on the order of petabytes or more. The sheer size of the data precludes processing it on a single computer. The…

Information Theory · Computer Science 2018-02-12 Konstantinos Konstantinidis, Aditya Ramamoorthy

LearnedSort is a novel sorting algorithm that, unlike traditional methods, uses fast ML models to boost the sorting speed. The models learn to estimate the input's distribution and arrange the keys in sorted order by predicting their…

Data Structures and Algorithms · Computer Science 2021-07-08 Ani Kristo, Kapil Vaidya, Tim Kraska

When sorting massive data, we usually use Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach in implement of…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-02 Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang

How can we optimally trade extra computing power to reduce the communication load in distributed computing? We answer this question by characterizing a fundamental tradeoff between computation and communication in distributed computing,…

Information Theory · Computer Science 2017-09-26 Songze Li, Mohammad Ali Maddah-Ali, Qian Yu, A. Salman Avestimehr

In large-scale distributed computing clusters, such as Amazon EC2, there are several types of "system noise" that can result in major degradation of performance: bottlenecks due to limited communication bandwidth, latency due to straggler…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-21 Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, Amir Salman Avestimehr

MapReduce is a commonly used framework for executing data-intensive jobs on distributed server clusters. We introduce a variant implementation of MapReduce, namely "Coded MapReduce", to substantially reduce the inter-server communication…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-12-08 Songze Li, Mohammad Ali Maddah-Ali, A. Salman Avestimehr

In a distributed storage system, code symbols are dispersed across space in nodes or storage units as opposed to time. In settings such as that of a large data center, an important consideration is the efficient repair of a failed node.…

Information Theory · Computer Science 2018-06-13 S. B. Balaji, M. Nikhil Krishnan, Myna Vajha, Vinayak Ramkumar, Birenjith Sasidharan, P. Vijay Kumar

Hadoop is an open source implementation of the MapReduce Framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large data sets across cluster of…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-11-10 Muralikrishnan Ramane, Sharmila Krishnamoorthy, Sasikala Gowtham

In modern distributed computing systems, unpredictable and unreliable infrastructures result in high variability of computing resources. Meanwhile, there is significantly increasing demand for timely and event-driven services with deadline…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-12 Chien-Sheng Yang, Ramtin Pedarsani, A. Salman Avestimehr

Slow working nodes, known as stragglers, can greatly reduce the speed of distributed computation. Coded matrix multiplication is a recently introduced technique that enables straggler-resistant distributed multiplication of large matrices.…

Information Theory · Computer Science 2019-07-23 Shahrzad Kiani, Nuwan Ferdinand, Stark C. Draper

Sorting is a foundational primitive in modern data processing, influencing the execution speed of high-performance data pipelines. However, the algorithmic landscape is currently bifurcated by a pervasive "Stability Tax": practitioners must…

Data Structures and Algorithms · Computer Science 2026-05-15 Hriday Jain, Ketan Sabale, Aditya Shastri, Hiren Kumar Thakkar, Ashutosh Londhe

In this paper, we evaluate the efficacy, in a Hadoop setting, of two coding schemes, both possessing an inherent double replication of data. The two coding schemes belong to the class of regenerating and locally regenerating codes…

Mining frequent itemsets from massive datasets has always been one of the most important problems in data mining. Apriori is the most popular and simplest algorithm for frequent itemset mining. To enhance the efficiency and scalability of Apriori, a…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-24 Sudhakar Singh, Rakhi Garg, P. K. Mishra

Frequent pattern mining is one of the most significant topics in data mining. In recent years, many algorithms have been proposed for mining frequent itemsets. A new algorithm has been presented for mining frequent itemsets based on…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-23 Arkan A. G. Al-Hamodi, Songfeng Lu

Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However,…

Databases · Computer Science 2011-05-24 Avrilia Floratou, Jignesh Patel, Eugene Shekita, Sandeep Tata

Designing fast and scalable algorithms for mining frequent itemsets has always been one of the most prominent and promising problems in data mining. Apriori is one of the most broadly used and popular algorithms for frequent itemset mining. Designing…

Databases · Computer Science 2017-01-24 Sudhakar Singh, Rakhi Garg, P. K. Mishra

In this paper we propose a new sorting algorithm, List Sort, which is based on dynamic memory allocation. We also compare various efficient sorting techniques with List Sort. Due the…

Data Structures and Algorithms · Computer Science 2013-10-30 Adarsh Kumar Verma, Prashant Kumar

To achieve reliability in distributed storage systems, data has usually been replicated across different nodes. However, the increasing volume of data to be stored has motivated the introduction of erasure codes, a storage efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-08-06 Lluis Pamies-Juarez, Anwitaman Datta, Frederique Oggier

Sorting and hashing are two completely different concepts in computer science, and appear mutually exclusive to one another. Hashing is a search method using the data as a key to map to the location within memory, and is used for rapid…

Data Structures and Algorithms · Computer Science 2007-05-23 William F. Gilreath

Nowadays many companies have large amounts of raw, unstructured data available. Among Big Data enabling technologies, a central place is held by the MapReduce framework and, in particular, by its open source implementation, Apache Hadoop.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-01-18 Eugenio Gianniti, Danilo Ardagna, Michele Ciavotta, Mauro Passacantando