
Efficient Multi-way Theta-Join Processing Using MapReduce

Databases 2012-08-02 v1

Abstract

Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely used in practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which has been shown to support OLAP applications over immense data volumes. In this work, we study the problem of efficiently processing multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although some prior works use the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The main challenge is, given a number of processing units (each of which can run Map or Reduce tasks), to map a multi-way Theta-join query to a number of MapReduce jobs and execute them in a well-scheduled sequence such that the total processing time span is minimized. Our solution consists of two main parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method significantly improves join processing efficiency.


Cite

@article{arxiv.1208.0081,
  title  = {Efficient Multi-way Theta-Join Processing Using MapReduce},
  author = {Xiaofei Zhang and Lei Chen and Min Wang},
  journal= {arXiv preprint arXiv:1208.0081},
  year   = {2012}
}

Comments

VLDB2012
