Related papers: Column-Oriented Storage Techniques for MapReduce

An Alternative C++ based HPC system for Hadoop MapReduce

MapReduce is a technique used to vastly improve distributed processing of data and can massively speed up computation. Hadoop and its MapReduce rely on the JVM and Java, which is expensive on memory. High Performance Computing based MapReduce…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-29 Vignesh S., Muthumanikandan V., Siddarth S., Sainath G

An Experimental Evaluation of Performance of A Hadoop Cluster on Replica Management

Hadoop is an open source implementation of the MapReduce Framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large data sets across cluster of…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-11-10 Muralikrishnan Ramane, Sharmila Krishnamoorthy, Sasikala Gowtham

Overview of Caching Mechanisms to Improve Hadoop Performance

Nowadays distributed computing environments, large amounts of data are generated from different resources with a high velocity, rendering the data difficult to capture, manage, and process within existing relational databases. Hadoop is a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-24 Rana Ghazali, Douglas G. Down

Finding a Second Wind: Speeding Up Graph Traversal Queries in RDBMSs Using Column-Oriented Processing

Recursive queries and recursive derived tables constitute an important part of the SQL standard. Their efficient processing is important for many real-life applications that rely on graph or hierarchy traversal. Position-enabled…

Databases · Computer Science 2023-08-21 Mikhail Firsov, Michael Polyntsov, Kirill Smirnov, George Chernishev

Analyzing Large-Scale, Distributed and Uncertain Data

The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…

Databases · Computer Science 2017-12-06 Yaron Gonen

Hadoop Performance Models

Hadoop MapReduce is now a popular choice for performing large-scale data analytics. This technical report describes a detailed set of mathematical performance models for describing the execution of a MapReduce job on Hadoop. The models…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-07 Herodotos Herodotou

Evaluation of Codes with Inherent Double Replication for Hadoop

In this paper, we evaluate the efficacy, in a Hadoop setting, of two coding schemes, both possessing an inherent double replication of data. The two coding schemes belong to the class of regenerating and locally regenerating codes…

Information Theory · Computer Science 2014-06-27 M. Nikhil Krishnan, N. Prakash, V. Lalitha, Birenjith Sasidharan, P. Vijay Kumar, Srinivasan Narayanamurthy, Ranjit Kumar, Siddhartha Nandi

Heterogeneous Multi core processors for improving the efficiency of Market basket analysis algorithm in data mining

Heterogeneous multi core processors can offer diverse computing capabilities. The efficiency of Market Basket Analysis Algorithm can be improved with heterogeneous multi core processors. Market basket analysis algorithm utilises apriori…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-09-24 Aashiha Priyadarshni. L

Optimizing MapReduce for Highly Distributed Environments

MapReduce, the popular programming paradigm for large-scale data processing, has traditionally been deployed over tightly-coupled clusters where the data is already locally available. The assumption that the data and compute resources are…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-07-31 Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman

Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment

MapReduce has been widely applied in various fields of data and compute intensive applications, and it is also an important programming model for cloud computing. Hadoop is an open-source implementation of MapReduce which operates on terabytes…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-12-01 Sayalee Narkhede, Trupti Baraskar, Debajyoti Mukhopadhyay

Hybrid Materialization in a Disk-Based Column-Store

In column-oriented query processing, a materialization strategy determines when lightweight positions (row IDs) are translated into tuples. It is an important part of column-store architecture, since it defines the class of supported query…

Databases · Computer Science 2023-04-19 Evgeniy Klyuchikov, Elena Mikhailova, George Chernishev

Hadoop Mapreduce Performance Enhancement Using In-node Combiners

While advanced analysis of large dataset is in high demand, data sizes have surpassed capabilities of conventional software and hardware. Hadoop framework distributes large datasets over multiple commodity servers and performs parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-17 Woo-Hyun Lee, Hee-Gook Jun, Hyoung-Joo Kim

Beyond Batch Processing: Towards Real-Time and Streaming Big Data

Today, big data is generated from many sources and there is a huge demand for storing, managing, processing, and querying on big data. The MapReduce model and its counterpart open source implementation Hadoop, has proven itself as the de…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-08-04 Saeed Shahrivari, Saeed Jalili

Optimization and analysis of large scale data sorting algorithm based on Hadoop

When dealing with massive data sorting, we usually use Hadoop which is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach in implement of…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-02 Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang

Hadoop-Oriented SVM-LRU (H-SVM-LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance

Modern applications can generate a large amount of data from different sources with high velocity, a combination that is difficult to store and process via traditional tools. Hadoop is one framework that is used for the parallel processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-29 Rana Ghazali, Sahar Adabi, Ali Rezaee, Douglas G. Down, Ali Movaghar

Past, Present and Future of Hadoop: A Survey

In this paper, a technology for massive data storage and computing named Hadoop is surveyed. Hadoop consists of heterogeneous computing devices like regular PCs abstracting away the details of parallel processing and developers can just…

Networking and Internet Architecture · Computer Science 2022-03-01 Ameneh Zarei, Shahla Safari, Mahmood Ahmadi, Farhad Mardukhi

A Game-Theoretic Approach for Runtime Capacity Allocation in MapReduce

Nowadays many companies have available large amounts of raw, unstructured data. Among Big Data enabling technologies, a central place is held by the MapReduce framework and, in particular, by its open source implementation, Apache Hadoop.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-01-18 Eugenio Gianniti, Danilo Ardagna, Michele Ciavotta, Mauro Passacantando

Testing MapReduce-Based Systems

MapReduce (MR) is the most popular solution to build applications for large-scale data processing. These applications are often deployed on large clusters of commodity machines, where failures happen constantly due to bugs, hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-02-11 João Eugenio Marynowski, Michel Albonico, Eduardo Cunha de Almeida, Gerson Sunyé

Revisiting Data Compression in Column-Stores

Data compression is widely used in contemporary column-oriented DBMSes to lower space usage and to speed up query processing. Pioneering systems have introduced compression to tackle the disk bandwidth bottleneck by trading CPU processing…

Databases · Computer Science 2021-05-20 Alexander Slesarev, Evgeniy Klyuchikov, Kirill Smirnov, George Chernishev

ReStore: Reusing Results of MapReduce Jobs

Analyzing large scale data has emerged as an important activity for many organizations in the past few years. This large scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most…

Databases · Computer Science 2012-03-02 Iman Elghandour, Ashraf Aboulnaga