Related papers: Work Sharing and Offloading for Efficient Approxim…

Efficient Approximate Search for Sets of Vectors

We consider a similarity measure between two sets $A$ and $B$ of vectors that balances the average and maximum cosine distance between pairs of vectors, one from set $A$ and one from set $B$. As a motivation for this measure, we present…

Data Structures and Algorithms · Computer Science 2021-08-31 Michael Leybovich, Oded Shmueli

DiskJoin: Large-scale Vector Similarity Join with SSD

Similarity join--a widely used operation in data science--finds all pairs of items that have distance smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes but these…

Databases · Computer Science 2025-10-13 Yanqi Chen, Xiao Yan, Alexandra Meliou, Eric Lo

Efficient Data Access Paths for Mixed Vector-Relational Search

The rapid growth of machine learning capabilities and the adoption of data processing methods using vector embeddings sparked a great interest in creating systems for vector data management. While the predominant approach of vector data…

Databases · Computer Science 2024-03-26 Viktor Sanca, Anastasia Ailamaki

Towards Output-Optimal Uniform Sampling and Approximate Counting for Join-Project Queries

Uniform sampling and approximate counting are fundamental primitives for modern database applications, ranging from query optimization to approximate query processing. While recent breakthroughs have established optimal sampling and…

Databases · Computer Science 2026-05-13 Xiao Hu, Jinchao Huang

Fast Join Project Query Evaluation using Matrix Multiplication

In the last few years, much effort has been devoted to developing join algorithms in order to achieve worst-case optimality for join queries over relational databases. Towards this end, the database community has had considerable success in…

Databases · Computer Science 2020-03-02 Shaleen Deep, Xiao Hu, Paraschos Koutris

Toward Efficient and Scalable Design of In-Memory Graph-Based Vector Search

Vector data is prevalent across business and scientific applications, and its popularity is growing with the proliferation of learned embeddings. Vector data collections often reach billions of vectors with thousands of dimensions, thus,…

Information Retrieval · Computer Science 2025-09-09 Ilias Azizi, Karima Echihab, Themis Palpanas, Vassilis Christophides

Multi-Agent Join

It is crucial to provide real-time performance in many applications, such as interactive and exploratory data analysis. In these settings, users often need to view subsets of query results quickly. It is challenging to deliver such results…

Databases · Computer Science 2023-12-25 Vahid Ghadakchi, Mian Xie, Arash Termehchy, Bakhtiyar Doskenov, Bharghav Srikhakollu, Summit Haque, Huazheng Wang

HARMONY: A Scalable Distributed Vector Database for High-Throughput Approximate Nearest Neighbor Search

Approximate Nearest Neighbor Search (ANNS) is essential for various data-intensive applications, including recommendation systems, image retrieval, and machine learning. Scaling ANNS to handle billions of high-dimensional vectors on a…

Databases · Computer Science 2025-06-18 Qian Xu, Feng Zhang, Chengxi Li, Lei Cao, Zheng Chen, Jidong Zhai, Xiaoyong Du

Fusion vectors: Embedding Graph Fusions for Efficient Unsupervised Rank Aggregation

The vast increase in amount and complexity of digital content led to a wide interest in ad-hoc retrieval systems in recent years. Complementary, the existence of heterogeneous data sources and retrieval models stimulated the proliferation…

Computer Vision and Pattern Recognition · Computer Science 2019-07-03 Icaro Cavalcante Dourado, Ricardo da Silva Torres

Making Fast Graph-based Algorithms with Graph Metric Embeddings

The computation of distance measures between nodes in graphs is inefficient and does not scale to large graphs. We explore dense vector representations as an effective way to approximate the same information: we introduce a simple yet…

Computation and Language · Computer Science 2019-06-18 Andrey Kutuzov, Mohammad Dorgham, Oleksiy Oliynyk, Chris Biemann, Alexander Panchenko

Efficient Taxonomic Similarity Joins with Adaptive Overlap Constraint

A similarity join aims to find all similar pairs between two collections of records. Established approaches usually deal with synthetic differences like typos and abbreviations, but neglect the semantic relations between words. Such…

Information Retrieval · Computer Science 2018-10-30 Pengfei Xu, Jiaheng Lu

PASS-JOIN: A Partition-based Method for Similarity Joins

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string…

Databases · Computer Science 2011-12-01 Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng

Approximate Distributed Joins in Apache Spark

The join operation is a fundamental building block of parallel data processing. Unfortunately, it is very resource-intensive to compute an equi-join across massive datasets. The approximate computing paradigm allows users to trade accuracy…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-16 Do Le Quoc, Istemi Ekin Akkus, Pramod Bhatotia, Spyros Blanas, Ruichuan Chen, Christof Fetzer, Thorsten Strufe

Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search

ANNS for embedded vector representations of texts is commonly used in information retrieval, with two important information representations being sparse and dense vectors. While it has been shown that combining these representations…

Information Retrieval · Computer Science 2024-10-29 Haoyu Zhang, Jun Liu, Zhenhua Zhu, Shulin Zeng, Maojia Sheng, Tao Yang, Guohao Dai, Yu Wang

SIEVE: Effective Filtered Vector Search with Collection of Indexes

Many real-world tasks such as recommending videos with the kids tag can be reduced to finding most similar vectors associated with hard predicates. This task, filtered vector search, is challenging as prior state-of-the-art graph-based…

Databases · Computer Science 2025-07-22 Zhaoheng Li, Silu Huang, Wei Ding, Yongjoo Park, Jianjun Chen

Elastic Index Selection for Label-Hybrid AKNN Search

Real-world vector embeddings are usually associated with extra labels, such as attributes and keywords. Many applications require the nearest neighbor search that contains specific labels, such as searching for product image embeddings…

Databases · Computer Science 2025-12-12 Mingyu Yang, Wenxuan Xia, Wentao Li, Raymond Chi-Wing Wong, Wei Wang

Scalable and robust set similarity join

Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard…

Databases · Computer Science 2018-03-05 Tobias Christiani, Rasmus Pagh, Johan Sivertsen

VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and Optimized Search

Traditional retrieval methods have been essential for assessing document similarity but struggle with capturing semantic nuances. Despite advancements in latent semantic analysis (LSA) and deep learning, achieving comprehensive semantic…

Information Retrieval · Computer Science 2024-09-27 Solmaz Seyed Monir, Irene Lau, Shubing Yang, Dongfang Zhao

Towards Efficient and Scalable Distributed Vector Search with RDMA

Similarity-based vector search facilitates many important applications such as search and recommendation but is limited by the memory capacity and bandwidth of a single machine due to large datasets and intensive data read. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-10 Xiangyu Zhi, Meng Chen, Xiao Yan, Baotong Lu, Hui Li, Qianxi Zhang, Qi Chen, James Cheng

Multiple Index Merge for Approximate Nearest Neighbor Search

Approximate $k$ nearest neighbor (AKNN) search in high-dimensional space is a foundational problem in vector databases with widespread applications. Among the numerous AKNN indexes, Proximity Graph-based indexes achieve state-of-the-art…

Databases · Computer Science 2026-02-20 Liuchang Jing, Mingyu Yang, Lei Li, Jianbin Qin, Wei Wang