Related papers: Efficient Document Indexing Using Pivot Tree

Fast k-NN search

Efficient index structures for fast approximate nearest neighbor queries are required in many applications such as recommendation systems. In high-dimensional spaces, many conventional methods suffer from excessive usage of memory and slow…

Machine Learning · Statistics 2019-04-24 Ville Hyvönen , Teemu Pitkänen , Sotiris Tasoulis , Elias Jääsaari , Risto Tuomainen , Liang Wang , Jukka Corander , Teemu Roos

One of the important factors that make a search engine fast and accurate is a concise and duplicate free index. In order to remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate…

Information Retrieval · Computer Science 2019-09-26 Hamid Mohammadi , Seyed Hossein Khasteh

A Triangle Inequality for Cosine Similarity

Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on…

Machine Learning · Computer Science 2021-11-02 Erich Schubert

Search Efficiency in Indexing Structures for Similarity Searching

Similarity searching finds application in a wide variety of domains including multilingual databases, computational biology, pattern recognition and text retrieval. Similarity is measured in terms of a distance function, edit distance, in…

Databases · Computer Science 2007-05-23 Girish Motwani , Sandhya G. Nair

GTS: GPU-based Tree Index for Fast Similarity Search

Similarity search, the task of identifying objects most similar to a given query object under a specific metric, has gathered significant attention due to its practical applications. However, the absence of coordinate information to…

Databases · Computer Science 2024-05-14 Yifan Zhu , Ruiyao Ma , Baihua Zheng , Xiangyu Ke , Lu Chen , Yunjun Gao

Efficient indexing and searching of high dimensional data has been an area of active research due to the growing exploitation of high dimensional data and the vulnerability of traditional search methods to the curse of dimensionality. This…

Information Retrieval · Computer Science 2015-05-13 Yu Zhong

Textual Spatial Cosine Similarity

When dealing with document similarity many methods exist today, like cosine similarity. More complex methods are also available based on the semantic analysis of textual information, which are computationally expensive and rarely used in…

Information Retrieval · Computer Science 2015-05-18 Giancarlo Crocetti

Enhancing Retrieval Systems with Inference-Time Logical Reasoning

Traditional retrieval methods rely on transforming user queries into vector representations and retrieving documents based on cosine similarity within an embedding space. While efficient and scalable, this approach often fails to handle…

Computation and Language · Computer Science 2025-03-25 Felix Faltings , Wei Wei , Yujia Bao

Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity

We present a new scientific document similarity model based on matching fine-grained aspects of texts. To train our model, we exploit a naturally-occurring source of supervision: sentences in the full-text of papers that cite multiple…

Computation and Language · Computer Science 2022-05-05 Sheshera Mysore , Arman Cohan , Tom Hope

A Comparison of Semantic Similarity Methods for Maximum Human Interpretability

The inclusion of semantic information in any similarity measures improves the efficiency of the similarity measure and provides human interpretable results for further analysis. The similarity calculation method that focuses on features…

Information Retrieval · Computer Science 2019-11-01 Pinky Sitikhu , Kritish Pahi , Pujan Thapa , Subarna Shakya

Maximum Inner-Product Search using Tree Data-structures

The problem of efficiently finding the best match for a query in a given set with respect to the Euclidean distance or the cosine similarity has been extensively studied in literature. However, a closely related problem of efficiently…

Computational Geometry · Computer Science 2021-06-24 Parikshit Ram , Alexander G. Gray

K-tree: Large Scale Document Clustering

We introduce K-tree in an information retrieval context. It is an efficient approximation of the k-means clustering algorithm. Unlike k-means it forms a hierarchy of clusters. It has been extended to address issues with sparse…

Information Retrieval · Computer Science 2010-01-07 Christopher M. De Vries , Shlomo Geva

Document Retrieval on Repetitive String Collections

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their…

Information Retrieval · Computer Science 2017-05-22 Travis Gagie , Aleksi Hartikainen , Kalle Karhu , Juha Kärkkäinen , Gonzalo Navarro , Simon J. Puglisi , Jouni Sirén

Multi-reference Cosine: A New Approach to Text Similarity Measurement in Large Collections

The importance of an efficient and scalable document similarity detection system is undeniable nowadays. Search engines need batch text similarity measures to detect duplicated and near-duplicated web pages in their indexes in order to…

Information Retrieval · Computer Science 2018-10-09 Hamid Mohammadi , Amin Nikoukaran

Efficient Clustering from Distributions over Topics

There are many scenarios where we may want to find pairs of textually similar documents in a large corpus (e.g. a researcher doing literature review, or an R&D project manager analyzing project proposals). To programmatically discover those…

Computation and Language · Computer Science 2020-12-16 Carlos Badenes-Olmedo , Jose-Luis Redondo García , Oscar Corcho

A Practical Index Structure Supporting Fréchet Proximity Queries Among Trajectories

We present a scalable approach for range and k nearest neighbor queries under computationally expensive metrics, like the continuous Fréchet distance on trajectory data. Based on clustering for metric indexes, we obtain a dynamic tree…

Computational Geometry · Computer Science 2021-12-14 Joachim Gudmundsson , Michael Horton , John Pfeifer , Martin P. Seybold

A new simple and effective measure for bag-of-word inter-document similarity measurement

To measure the similarity of two documents in the bag-of-words (BoW) vector representation, different term weighting schemes are used to improve the performance of cosine similarity, the most widely used inter-document similarity measure…

Information Retrieval · Computer Science 2019-02-12 Sunil Aryal , Kai Ming Ting , Takashi Washio , Gholamreza Haffari

Given a large dataset of binary codes and a binary query point, we address how to efficiently find K codes in the dataset that yield the largest cosine similarities to the query. The straightforward answer to this problem is to compare…

Databases · Computer Science 2018-04-19 Sepehr Eghbali , Ladan Tahvildari

Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and reusable; however, searching for computational notebooks manually is a tedious task,…

Information Retrieval · Computer Science 2022-02-01 Misato Horiuchi , Yuya Sasaki , Chuan Xiao , Makoto Onizuka

A Learned Index for Exact Similarity Search in Metric Spaces

Indexing is an effective way to support efficient query processing in large databases. Recently the concept of learned index, which replaces or complements traditional index structures with machine learning models, has been actively…

Databases · Computer Science 2022-08-01 Yao Tian , Tingyun Yan , Xi Zhao , Kai Huang , Xiaofang Zhou