Related papers: Hadoop Performance Models

Optimizing MapReduce for Highly Distributed Environments

MapReduce, the popular programming paradigm for large-scale data processing, has traditionally been deployed over tightly-coupled clusters where the data is already locally available. The assumption that the data and compute resources are…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-07-31 Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman

Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment

MapReduce has been widely applied in various data- and compute-intensive applications, and it is also an important programming model for cloud computing. Hadoop is an open-source implementation of MapReduce which operates on terabytes…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-12-01 Sayalee Narkhede, Trupti Baraskar, Debajyoti Mukhopadhyay

An Alternative C++ based HPC system for Hadoop MapReduce

MapReduce is a technique used to vastly improve distributed processing of data and can massively speed up computation. Hadoop and its MapReduce rely on the JVM and Java, which are expensive on memory. High Performance Computing based MapReduce…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-29 Vignesh S., Muthumanikandan V., Siddarth S., Sainath G

Benchmarking and Performance Modelling of MapReduce Communication Pattern

Understanding and predicting the performance of big data applications running in the cloud or on-premises could help minimise the overall cost of operations and provide opportunities in efforts to identify performance bottlenecks. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-26 Sheriffo Ceesay, Adam Barker, Yuhui Lin

Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments

Cloud Computing is emerging as a new computational paradigm shift. Hadoop-MapReduce has become a powerful Computation Model for processing large data on distributed commodity hardware clusters such as Clouds. In all Hadoop implementations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-07-04 B. Thirumala Rao, L. S. S. Reddy

An Open-Source Project for MapReduce Performance Self-Tuning

Many Hadoop configuration parameters have significant influence on the performance of running MapReduce jobs on Hadoop. It is time-consuming and tedious for general users to manually tune the parameters for optimal MapReduce performance.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-01 Donghua Chen

Beyond Batch Processing: Towards Real-Time and Streaming Big Data

Today, big data is generated from many sources and there is a huge demand for storing, managing, processing, and querying on big data. The MapReduce model and its counterpart open source implementation Hadoop, has proven itself as the de…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-08-04 Saeed Shahrivari, Saeed Jalili

A Survey of Big Data Machine Learning Applications Optimization in Cloud Data Centers and Networks

This survey article reviews the challenges associated with deploying and optimizing big data applications and machine learning algorithms in cloud data centers and networks. The MapReduce programming model and its widely-used open-source…

Networking and Internet Architecture · Computer Science 2019-10-03 Sanaa Hamid Mohamed, Taisir E. H. El-Gorashi, Jaafar M. H. Elmirghani

High Performance Risk Aggregation: Addressing the Data Processing Challenge the Hadoop MapReduce Way

Monte Carlo simulations employed for the analysis of portfolios of catastrophic risk process large volumes of data. Oftentimes these simulations are not performed in real-time scenarios as they are slow and consume large data. Such…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-11-25 Zhimin Yao, Blesson Varghese, Andrew Rau-Chaplin

Performance Tuning of Hadoop MapReduce: A Noisy Gradient Approach

Hadoop MapReduce is a framework for distributed storage and processing of large datasets that is quite popular in big data analytics. It has various configuration parameters (knobs) which play an important role in deciding the performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-28 Sandeep Kumar, Sindhu Padakandla, Chandrashekar L, Priyank Parihar, K Gopinath, Shalabh Bhatnagar

An Experimental Evaluation of Performance of A Hadoop Cluster on Replica Management

Hadoop is an open source implementation of the MapReduce Framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large data sets across cluster of…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-11-10 Muralikrishnan Ramane, Sharmila Krishnamoorthy, Sasikala Gowtham

Iterative MapReduce for Large Scale Machine Learning

Large datasets ("Big Data") are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. In particular, machine learning - one…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-03-15 Joshua Rosen, Neoklis Polyzotis, Vinayak Borkar, Yingyi Bu, Michael J. Carey, Markus Weimer, Tyson Condie, Raghu Ramakrishnan

Building Wavelet Histograms on Large Data in MapReduce

MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact accurate summary of data is essential.…

Databases · Computer Science 2011-11-01 Jeffrey Jestes, Ke Yi, Feifei Li

Column-Oriented Storage Techniques for MapReduce

Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However,…

Databases · Computer Science 2011-05-24 Avrilia Floratou, Jignesh Patel, Eugene Shekita, Sandeep Tata

On Modelling and Prediction of Total CPU Usage for Applications in MapReduce Environments

Recently, businesses have started using MapReduce as a popular computation framework for processing large amounts of data, such as spam detection and different data mining tasks, in both public and private clouds. Two of the challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-07-30 Nikzad Babaii Rizvandi, Javid Taheri, Reza Moraveji, Albert Y. Zomaya

Analyzing Large-Scale, Distributed and Uncertain Data

The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…

Databases · Computer Science 2017-12-06 Yaron Gonen

Optimization and analysis of large scale data sorting algorithm based on Hadoop

When dealing with massive data sorting, we usually use Hadoop which is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach in implement of…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-02 Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang

Running genetic algorithms on Hadoop for solving high dimensional optimization problems

Hadoop is a popular MapReduce framework for developing parallel applications in distributed environments. Several advantages of MapReduce such as programming ease and ability to use commodity hardware make the applicability of soft…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-13 Güngör Yildirim, İbrahim R Hallac, Galip Aydin, Yetkin Tatar

Measuring the Optimality of Hadoop Optimization

In recent years, much research has focused on how to optimize Hadoop jobs. The approaches are diverse, ranging from improving HDFS and the Hadoop job scheduler to optimizing parameters in Hadoop configurations. Despite their success in…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-07-12 Woo-Cheol Kim, Changryong Baek, Dongwon Lee

ReStore: Reusing Results of MapReduce Jobs

Analyzing large scale data has emerged as an important activity for many organizations in the past few years. This large scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most…

Databases · Computer Science 2012-03-02 Iman Elghandour, Ashraf Aboulnaga