Related papers: A Survey on Geographically Distributed Big-Data Pr…
Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…
Today, big data is generated from many sources and there is a huge demand for storing, managing, processing, and querying on big data. The MapReduce model and its counterpart open source implementation Hadoop, has proven itself as the de…
Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from…
The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…
Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of…
MapReduce, the popular programming paradigm for large-scale data processing, has traditionally been deployed over tightly-coupled clusters where the data is already locally available. The assumption that the data and compute resources are…
Recently, due to rapid development of information and communication technologies, the data are created and consumed in the avalanche way. Distributed computing create preconditions for analyzing and processing such Big Data by distributing…
This survey article reviews the challenges associated with deploying and optimizing big data applications and machine learning algorithms in cloud data centers and networks. The MapReduce programming model and its widely-used open-source…
Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among computing nodes of a cloud. Recently, there have been increasing efforts aimed at evaluating the performance of distributed…
In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a…
Recently we create so much data (2.5 quintillion bytes every day) that 90% of the data in the world today has been created in the last two years alone [1]. This data comes from sensors used to gather traffic or climate information, posts to…
Huge amounts of data being generated continuously by digitally interconnected systems of humans, organizations and machines. Data comes in variety of formats including structured, unstructured and semi-structured, what makes it impossible…
Distributed processing frameworks, such as MapReduce, Hadoop, and Spark are popular systems for processing large amounts of data. The design of efficient algorithms in these frameworks is a challenging problem, as the systems both require…
When dealing with massive data sorting, we usually use Hadoop which is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach in implement of…
The objective of this work was to utilize BigBench [1] as a Big Data benchmark and evaluate and compare two processing engines: MapReduce [2] and Spark [3]. MapReduce is the established engine for processing data on Hadoop. Spark is a…
Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark's streaming library is increasingly…
With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analysis Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…
During the recent years, a number of efficient and scalable frequent itemset mining algorithms for big data analytics have been proposed by many researchers. Initially, MapReduce-based frequent itemset mining algorithms on Hadoop cluster…
This work explores the use of big data technologies deployed in the cloud for processing of astronomical data. We have applied Hadoop and Spark to the task of co-adding astronomical images. We compared the overhead and execution time of…
In this digitalised world where every information is stored, the data a are growing exponentially. It is estimated that data are doubles itself every two years. Geospatial data are one of the prime contributors to the big data scenario.…