Related papers: Efficient sorting, duplicate removal, grouping, an…
We engineer algorithms for sorting huge data sets on massively parallel machines. The algorithms are based on the multiway merging paradigm. We first outline an algorithm whose I/O requirement is close to a lower bound. Thus, in contrast to…
Sorting and hashing are two completely different concepts in computer science, and appear mutually exclusive to one another. Hashing is a search method using the data as a key to map to the location within memory, and is used for rapid…
Group-by-aggregate (GBA) queries are integral to data analysis, allowing users to group data by specific attributes and apply aggregate functions such as sum, average, and count. Database Management Systems (DBMSs) typically execute GBA…
Efficiently computing group aggregations (i.e., GROUP BY) on modern architectures is critical for analytic database systems. Hash-based approaches in today's engines predominantly use a partitioned approach, in which incoming data is…
Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search…
Recent work shows how offset-value coding speeds up database query execution, not only sorting but also duplicate removal and grouping (aggregation) in sorted streams, order-preserving exchange (shuffle), merge join, and more. It already…
Distributed data aggregation is an important task, allowing the decentralized determination of meaningful global properties, that can then be used to direct the execution of other applications. The resulting values result from the…
In the age of big data, sorting is an indispensable operation for DBMSes and similar systems. Having data sorted can help produce query plans with significantly lower run times. It also can provide other benefits like having non-blocking…
Various decision support systems are available that implement Data Mining and Data Warehousing techniques for diving into the sea of data for getting useful patterns of knowledge (pearls). Classification, regression, clustering, and many…
We propose a clustering-based iterative algorithm to solve certain optimization problems in machine learning, where we start the algorithm by aggregating the original data, solving the problem on aggregated data, and then in subsequent…
Sorting is a fundamental and well studied problem that has been studied extensively. Sorting plays an important role in the area of databases, as many queries can be served much faster if the relations are first sorted. One of the most…
Motivated by the development of computer theory, the sorting algorithm is emerging in an endless stream. Inspired by decrease and conquer method, we propose a brand new sorting algorithmUltimately Heapsort. The algorithm consists of two…
This is paper introduces a new single-pass reservoir weighted-sampling stream aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling is a powerful and e cient method for weighted sampling from a stream of uniquely…
Priority queues are fundamental abstract data structures, often used to manage limited resources in parallel programming. Several proposed parallel priority queue implementations are based on skiplists, harnessing the potential for…
To minimize data movement, state-of-the-art parallel sorting algorithms use techniques based on sampling and histogramming to partition keys prior to redistribution. Sampling enables partitioning to be done using a representative subset of…
Sorting is one of the oldest computing problems and is still very important in the age of big data. Various algorithms and implementation techniques have been proposed. In this study, we focus on comparison based, internal sorting…
Aggregate computation in relational databases has long been done using the standard unary aggregation and binary join operators. These implement the classical model of computing joins between relations two at a time, materializing the…
Clustering is one of the major tasks in data mining. In the last few years, Clustering of spatial data has received a lot of research attention. Spatial databases are components of many advanced information systems like geographic…
A common approach to data analysis involves understanding and manipulating succinct representations of data. In earlier work, we put forward a succinct representation system for relational data called factorised databases and reported on…
There has been surprisingly little work on algorithms for sorting strings on distributed-memory parallel machines. We develop efficient algorithms for this problem based on the multi-way merging principle. These algorithms inspect only…