A Simple and Efficient MapReduce Algorithm for Data Cube Materialization

Databases 2017-09-29 v1

Abstract

Data cube materialization is a classical database operator introduced by Gray et al. (Data Mining and Knowledge Discovery, Vol. 1) and is critical for many analysis tasks. Nandi et al. (Transactions on Knowledge and Data Engineering, Vol. 6) were the first to study cube materialization for large-scale datasets using the MapReduce framework. They proposed a sophisticated modification of a simple broadcast algorithm that, in 2012, handled a dataset with a 216GB cube size within 25 minutes on 2k machines. We take a different approach and propose a simple MapReduce algorithm that (1) minimizes the total number of copy-add operations, (2) leverages locality of computation, and (3) balances work evenly across machines. As a result, the algorithm shows excellent performance: in 2014 it materialized a real dataset with a cube size of 35.0G tuples and 1.75T bytes in 54 minutes on 0.4k machines.
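
For readers unfamiliar with the operator, the following is a minimal single-machine sketch of what cube materialization computes: one aggregate for every combination of dimension values, with any subset of dimensions rolled up. It is not the MapReduce algorithm proposed in the paper. The function name `materialize_cube`, the `ALL` placeholder, the sample rows, and the choice of SUM as the aggregate are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # hypothetical placeholder for a dimension that is aggregated out


def materialize_cube(rows, num_dims):
    """Naively compute SUM aggregates for every group-by of the dimensions.

    rows: iterable of (dims, measure), where dims is a tuple of length num_dims.
    Returns a dict mapping each cube key (a tuple with ALL in rolled-up
    positions) to the summed measure.
    """
    cube = defaultdict(int)
    # Each subset of dimension positions defines one group-by (cuboid).
    subsets = [
        set(kept)
        for size in range(num_dims + 1)
        for kept in combinations(range(num_dims), size)
    ]
    for dims, measure in rows:
        # Every input row contributes to 2**num_dims cube cells.
        for kept in subsets:
            key = tuple(v if i in kept else ALL for i, v in enumerate(dims))
            cube[key] += measure
    return dict(cube)


# Illustrative data: (country, browser) -> count
rows = [
    (("US", "chrome"), 3),
    (("US", "firefox"), 2),
    (("IN", "chrome"), 5),
]
cube = materialize_cube(rows, num_dims=2)
```

This naive version touches 2^d cube cells per input row for d dimensions, which illustrates why the cube can be far larger than the input and why the way work is distributed matters at scale.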

Cite

@article{arxiv.1709.10072,
  title  = {A Simple and Efficient MapReduce Algorithm for Data Cube Materialization},
  author = {Mukund Sundararajan and Qiqi Yan},
  journal= {arXiv preprint arXiv:1709.10072},
  year   = {2017}
}