Related papers: Scalable Blocking for Very Large Databases
Clustering analysis is of substantial significance for data mining. The properties of big data raise higher demand for more efficient and economical distributed clustering methods. However, existing distributed clustering methods mainly…
Active search is the process of identifying high-value data points in a large and often high-dimensional parameter space that can be expensive to evaluate. Traditional active search techniques like Bayesian optimization trade off…
Modern high load applications store data using multiple database instances. Such an architecture requires data consistency, and it is important to ensure even distribution of data among nodes. Load balancing is used to achieve these goals.…
To accommodate the needs of large-scale distributed P2P systems, scalable data management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This paper addresses the…
Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming…
Current deep learning architectures are growing larger in order to learn from complex datasets. These architectures require giant matrix multiplication operations to train millions of parameters. Conversely, there is another growing trend…
De-duplication, the identification of distinct records referring to the same real-world entity, is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been…
Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record…
Modern cloud databases present scaling as a binary decision: scale-out by adding nodes or scale-up by increasing per-node resources. This one-dimensional view is limiting because database performance, cost, and coordination overhead emerge…
High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of checkpoint data also…
Most cloud services and distributed applications rely on hashing algorithms that allow dynamic scaling of a robust and efficient hash table. Examples include AWS, Google Cloud and BitTorrent. Consistent and rendezvous hashing are algorithms…
Duplication, whether exact or partial, is a common issue in many datasets. In clinical notes data, duplication (and near duplication) can arise for many reasons, such as the pervasive use of templates, copy-pasting, or notes being generated…
We present a new algorithm for the widely used density-based clustering method DBscan. Our algorithm computes the DBscan-clustering in $O(n\log n)$ time in $\mathbb{R}^2$, irrespective of the scale parameter $\varepsilon$ (and assuming the…
Many real-world applications operate on dynamic graphs that undergo rapid changes in their topological structure over time. However, it is challenging to design dynamic algorithms that are capable of supporting such graph changes…
In this paper, we introduce DLB, a Deep Learning based load Balancing mechanism, to effectively address the data skew problem. The key idea of DLB is to replace hash functions in the load balancing mechanisms with deep learning models,…
Hashing methods map similar data to binary hash codes with small Hamming distance, and they have received broad attention due to their low storage cost and fast retrieval speed. However, the existing limitations make the present algorithms…
In the world of deep learning, Transformer models have become very significant, leading to improvements in many areas from understanding language to recognizing images, covering a wide range of applications. Despite their success, the…
Case-based Reasoning (CBR) on high-dimensional and heterogeneous data is a trending yet challenging and computationally expensive task in the real world. A promising approach is to obtain low-dimensional hash codes representing cases and…
Supervised hashing methods are widely used for nearest neighbor search in computer vision applications. Most state-of-the-art supervised hashing approaches employ batch-learners. Unfortunately, batch-learning strategies can be inefficient…
Blockchain uses the idea of storing transaction data in the form of a distributed ledger wherein each node in the network stores a current copy of the sequence of transactions in the form of a hash chain. This requirement of storing the…