i2MapReduce: Incremental MapReduce for Mining Evolving Big Data

Yanfeng Zhang; Shimin Chen; Qiang Wang; Ge Yu

Distributed, Parallel, and Cluster Computing 2015-01-21 v1

Abstract

As new data and updates are constantly arriving, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refreshing mining results. It utilizes previously saved states to avoid the expense of re-computation from scratch. In this paper, we propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs key-value pair level incremental processing rather than task level re-computation, (ii) supports not only one-step computation but also more sophisticated iterative computation, which is widely used in data mining applications, and (iii) incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain computation states. We evaluate i2MapReduce using a one-step algorithm and three iterative algorithms with diverse computation characteristics. Experimental results on Amazon EC2 show significant performance improvements of i2MapReduce compared to both plain and iterative MapReduce performing re-computation.

Keywords

distributed systems; data processing; information retrieval

Cite

@article{arxiv.1501.04854,
  title  = {i2MapReduce: Incremental MapReduce for Mining Evolving Big Data},
  author = {Yanfeng Zhang and Shimin Chen and Qiang Wang and Ge Yu},
  journal= {arXiv preprint arXiv:1501.04854},
  year   = {2015}
}
