Related papers: Engineering a Simplified 0-Bit Consistent Weighted…

200 papers

Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch weighted sets and has drawn increasing interest from the…

Data Structures and Algorithms · Computer Science 2017-06-06 Wei Wu, Bin Li, Ling Chen, Chengqi Zhang, Philip S. Yu

Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq 0}^{d}$ and…

Data Structures and Algorithms · Computer Science 2020-05-26 Tobias Christiani

Document sketching using Jaccard similarity has been an effective technique for reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression and learning…

Data Structures and Algorithms · Computer Science 2014-10-17 Bernhard Haeupler, Mark Manasse, Kunal Talwar

We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees…

Minwise hashing is fundamental and one of the most successful hashing algorithms in the literature. Recent advances based on the idea of densification~\cite{Proc:OneHashLSH_ICML14,Proc:Shrivastava_UAI14} have shown that it is possible to…

Data Structures and Algorithms · Computer Science 2017-03-16 Anshumali Shrivastava

Sketching is a probabilistic data compression technique that has been largely developed in the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a…

Methodology · Statistics 2019-04-04 Daniel Ahfock, William J. Astle, Sylvia Richardson

Scalable algorithms that solve optimization and regression tasks, even approximately, are needed to work with large datasets. In this paper we study efficient techniques from matrix sketching to solve a variety of convex constrained regression…

Machine Learning · Computer Science 2019-11-01 Graham Cormode, Charlie Dickens

We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) =…

Data Structures and Algorithms · Computer Science 2024-05-07 Søren Dahlgaard, Mathias Bæk Tejs Langhede, Jakob Bæk Tejs Houen, Mikkel Thorup

Minwise hashing is the standard technique in the context of search and databases for efficiently estimating set (e.g., high-dimensional 0/1 vector) similarities. Recently, b-bit minwise hashing was proposed, which significantly improves upon…

Machine Learning · Statistics 2011-08-04 Ping Li, Christian Konig

In sketched clustering, a dataset of $T$ samples is first sketched down to a vector of modest size, from which the centroids are subsequently extracted. Advantages include i) reduced storage complexity and ii) centroid extraction complexity…

Information Theory · Computer Science 2019-05-21 Evan Byrne, Antoine Chatalic, Remi Gribonval, Philip Schniter

Iterative Hessian sketch (IHS) is an effective sketching method for modeling large-scale data. It was originally proposed by Pilanci and Wainwright (2016; JMLR) based on randomized sketching matrices. However, it is computationally…

Machine Learning · Statistics 2020-03-10 Aijun Zhang, Hengtao Zhang, Guosheng Yin

Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via…

Information Theory · Computer Science 2021-07-12 Ilan Shomorony, Govinda M. Kamath

Weighted minwise hashing (WMH) is one of the fundamental subroutines required by many celebrated approximation algorithms and is commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck of the…

Data Structures and Algorithms · Computer Science 2016-02-29 Anshumali Shrivastava

Recent advances in the WWW, IoT, social networks, e-commerce, etc. have generated a large volume of data. These datasets are mostly high-dimensional and sparse. Many fundamental subroutines of common data analytic…

Information Retrieval · Computer Science 2019-10-11 Rameshwar Pratap, Debajyoti Bera, Karthik Revanuru

Matrix sketching is a recently developed data compression technique. An input matrix A is efficiently approximated with a smaller matrix B, so that B preserves most of the properties of A up to some guaranteed approximation ratio. In so…

Machine Learning · Statistics 2019-12-03 Roberta Falcone, Angela Montanari, Laura Anderlucci

Matrix sketching is a powerful tool for reducing the size of large data matrices. Yet there are fundamental limitations to this size reduction when we want to recover an accurate estimator for a task such as least squares regression. We show…

Data Structures and Algorithms · Computer Science 2024-05-10 Sachin Garg, Kevin Tan, Michał Dereziński

Estimating cardinality, i.e., the number of distinct elements, of a data stream is a fundamental problem in areas like databases, computer networks, and information retrieval. This study delves into a broader scenario where each element…

Databases · Computer Science 2024-06-28 Yiyan Qi, Rundong Li, Pinghui Wang, Yufang Sun, Rui Xing

Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets…

Data Structures and Algorithms · Computer Science 2018-07-24 Otmar Ertl

In their seminal work, Broder \textit{et al.}~\citep{BroderCFM98} introduced the $\mathrm{minHash}$ algorithm, which computes a low-dimensional sketch of high-dimensional binary data that closely approximates pairwise Jaccard similarity.…

Machine Learning · Computer Science 2023-08-23 Rameshwar Pratap, Raghav Kulkarni

We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$,…

Machine Learning · Statistics 2015-08-26 Garvesh Raskutti, Michael Mahoney