Related papers: Scalable Blocking for Very Large Databases

Hashing-Based Distributed Clustering for Massive High-Dimensional Data

Clustering analysis is of substantial significance for data mining. The properties of big data raise higher demand for more efficient and economical distributed clustering methods. However, existing distributed clustering methods mainly…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-03 Yifeng Xiao , Jiang Xue , Deyu Meng

A Hierarchical Approach to Scaling Batch Active Search Over Structured Data

Active search is the process of identifying high-value data points in a large and often high-dimensional parameter space that can be expensive to evaluate. Traditional active search techniques like Bayesian optimization trade off…

Machine Learning · Computer Science 2020-07-21 Vivek Myers , Peyton Greenside

Benchmarking Hashing Algorithms for Load Balancing in a Distributed Database Environment

Modern high load applications store data using multiple database instances. Such an architecture requires data consistency, and it is important to ensure even distribution of data among nodes. Load balancing is used to achieve these goals.…

Databases · Computer Science 2022-11-03 Alexander Slesarev , Mikhail Mikhailov , George Chernishev

BlobSeer: How to Enable Efficient Versioning for Large Object Storage under Heavy Access Concurrency

To accommodate the needs of large-scale distributed P2P systems, scalable data management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This paper addresses the…

Distributed, Parallel, and Cluster Computing · Computer Science 2009-09-30 Bogdan Nicolae , Gabriel Antoniu , Luc Bougé

Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-16 Thiago S. F. X. Teixeira , George Teodoro , Eduardo Valle , Joel H. Saltz

Scalable and Sustainable Deep Learning via Randomized Hashing

Current deep learning architectures are growing larger in order to learn from complex datasets. These architectures require giant matrix multiplication operations to train millions of parameters. Conversely, there is another growing trend…

Machine Learning · Statistics 2016-12-06 Ryan Spring , Anshumali Shrivastava

CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks

De-duplication (identification of distinct records referring to the same real-world entity) is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been…

Databases · Computer Science 2011-11-17 Anish Das Sarma , Ankur Jain , Ashwin Machanavajjhala , Philip Bohannon

AutoBlock: A Hands-off Blocking Framework for Entity Matching

Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record…

Databases · Computer Science 2019-12-10 Wei Zhang , Hao Wei , Bunyamin Sisman , Xin Luna Dong , Christos Faloutsos , David Page

Diagonal Scaling: A Multi-Dimensional Resource Model and Optimization Framework for Distributed Databases

Modern cloud databases present scaling as a binary decision: scale-out by adding nodes or scale-up by increasing per-node resources. This one-dimensional view is limiting because database performance, cost, and coordination overhead emerge…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-05 Shahir Abdullah , Syed Rohit Zaman

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of checkpoint data also…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-13 Kai Keller , Leonardo Bautista Gomez

Hyperdimensional Hashing: A Robust and Efficient Dynamic Hash Table

Most cloud services and distributed applications rely on hashing algorithms that allow dynamic scaling of a robust and efficient hash table. Examples include AWS, Google Cloud and BitTorrent. Consistent and rendezvous hashing are algorithms…

Data Structures and Algorithms · Computer Science 2022-05-17 Mike Heddes , Igor Nunes , Tony Givargis , Alexandru Nicolau , Alex Veidenbaum

Deduplication in a massive clinical note dataset

Duplication, whether exact or partial, is a common issue in many datasets. In clinical notes data, duplication (and near duplication) can arise for many reasons, such as the pervasive use of templates, copy-pasting, or notes being generated…

Databases · Computer Science 2017-04-20 Sanjeev Shenoy , Tsung-Ting Kuo , Rodney Gabriel , Julian McAuley , Chun-Nan Hsu

Faster DB-scan and HDB-scan in Low-Dimensional Euclidean Spaces

We present a new algorithm for the widely used density-based clustering method DBscan. Our algorithm computes the DBscan-clustering in $O(n\log n)$ time in $\mathbb{R}^2$, irrespective of the scale parameter $\varepsilon$ (and assuming the…

Computational Geometry · Computer Science 2017-03-01 Mark de Berg , Ade Gunawan , Marcel Roeloffzen

BatchHL: Answering Distance Queries on Batch-Dynamic Networks at Scale

Many real-world applications operate on dynamic graphs that undergo rapid changes in their topological structure over time. However, it is challenging to design dynamic algorithms that are capable of supporting such graph changes…

Databases · Computer Science 2022-04-26 Muhammad Farhan , Qing Wang , Henning Koehler

DLB: Deep Learning Based Load Balancing

In this paper, we introduce DLB, a Deep Learning based load Balancing mechanism, to effectively address the data skew problem. The key idea of DLB is to replace hash functions in the load balancing mechanisms with deep learning models,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-14 Xiaoke Zhu , Qi Zhang , Taining Cheng , Ling Liu , Wei Zhou , Jing He

Scalable Discrete Supervised Hash Learning with Asymmetric Matrix Factorization

Hashing methods map similar data to binary hash codes with smaller Hamming distance, and they have received broad attention due to their low storage cost and fast retrieval speed. However, the existing limitations make the present algorithms…

Computer Vision and Pattern Recognition · Computer Science 2016-09-29 Shifeng Zhang , Jianmin Li , Jinma Guo , Bo Zhang

Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference

In the world of deep learning, Transformer models have become very significant, leading to improvements in many areas from understanding language to recognizing images, covering a wide range of applications. Despite their success, the…

Machine Learning · Computer Science 2024-07-19 Ghadeer Jaradat , Mohammed Tolba , Ghada Alsuhli , Hani Saleh , Mahmoud Al-Qutayri , Thanos Stouraitis , Baker Mohammad

Supervised Deep Hashing for High-dimensional and Heterogeneous Case-based Reasoning

Case-based Reasoning (CBR) on high-dimensional and heterogeneous data is a trending yet challenging and computationally expensive task in the real world. A promising approach is to obtain low-dimensional hash codes representing cases and…

Information Retrieval · Computer Science 2022-06-30 Qi Zhang , Liang Hu , Chongyang Shi , Ke Liu , Longbing Cao

Online Supervised Hashing for Ever-Growing Datasets

Supervised hashing methods are widely-used for nearest neighbor search in computer vision applications. Most state-of-the-art supervised hashing approaches employ batch-learners. Unfortunately, batch-learning strategies can be inefficient…

Computer Vision and Pattern Recognition · Computer Science 2015-11-11 Fatih Cakir , Sarah Adel Bargal , Stan Sclaroff

Dynamic Distributed Storage for Scaling Blockchains

Blockchain uses the idea of storing transaction data in the form of a distributed ledger wherein each node in the network stores a current copy of the sequence of transactions in the form of a hash chain. This requirement of storing the…

Information Theory · Computer Science 2018-01-09 Ravi Kiran Raman , Lav R. Varshney