Related papers: Big Data analytics. Three use cases with R, Python…
With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…
Analyzing and working with big data could be very diffi cult using classical means like relational database management systems or desktop software packages for statistics and visualization. Instead, big data requires large clusters with…
Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from…
Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…
Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many…
Recently, due to rapid development of information and communication technologies, the data are created and consumed in the avalanche way. Distributed computing create preconditions for analyzing and processing such Big Data by distributing…
With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analysis Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…
The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving at the systems with great velocity, in a variety of formats. Spark is a widely used big data processing system that can be…
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib…
This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning…
Apache Hadoop and Spark are gaining prominence in Big Data processing and analytics. Both of them are widely deployed on Internet companies. On the other hand, high-performance data analysis requirements are causing academical and…
Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among computing nodes of a cloud. Recently, there have been increasing efforts aimed at evaluating the performance of distributed…
The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to…
Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of…
With the overwhelming amount of complex and heterogeneous data pouring from any-where, any-time, and any-device, there is undeniably an era of Big Data. The emergence of the Big Data as a disruptive technology for next generation of…
Extra-large datasets are becoming increasingly accessible, and computing tools designed to handle huge amount of data efficiently are democratizing rapidly. However, conventional statistical and econometric tools are still lacking fluency…
As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine…
During the recent years, a number of efficient and scalable frequent itemset mining algorithms for big data analytics have been proposed by many researchers. Initially, MapReduce-based frequent itemset mining algorithms on Hadoop cluster…
The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…
Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different…