Related papers: Efficient Sketching Algorithm for Sparse Binary Da…

Efficient Binary Embedding of Categorical Data using BinSketch

In this work, we present a dimensionality reduction algorithm, a.k.a. sketching, for categorical datasets. Our proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our…

Machine Learning · Computer Science 2021-11-16 Bhisham Dev Verma, Rameshwar Pratap, Debajyoti Bera

Streaming Binary Sketching based on Subspace Tracking and Diagonal Uniformization

In this paper, we address the problem of learning compact similarity-preserving embeddings for massive high-dimensional streams of data in order to perform efficient similarity search. We present a new online method for computing binary…

Machine Learning · Computer Science 2018-02-12 Anne Morvan, Antoine Souloumiac, Cédric Gouy-Pailler, Jamal Atif

The rise of the internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high dimensional and sparse. The need to perform an efficient search for similar objects in such…

Data Structures and Algorithms · Computer Science 2016-12-20 Raghav Kulkarni, Rameshwar Pratap

Sketching and Neural Networks

High-dimensional sparse data present computational and statistical challenges for supervised learning. We propose compact linear sketches for reducing the dimensionality of the input, followed by a single layer neural network. We show that…

Machine Learning · Computer Science 2016-04-21 Amit Daniely, Nevena Lazic, Yoram Singer, Kunal Talwar

Effective and Sparse Count-Sketch via k-means clustering

Count-sketch is a popular matrix sketching algorithm that can produce a sketch of an input data matrix X in O(nnz(X)) time, where nnz(X) denotes the number of non-zero entries in X. The sketched matrix will be much smaller than X while…

Machine Learning · Computer Science 2020-11-30 Yuhan Wang, Zijian Lei, Liang Lan

Dimensionality Reduction for Categorical Data

Categorical attributes are those that can take a discrete set of values, e.g., colours. This work is about compressing vectors over categorical attributes to low-dimension discrete vectors. The current hash-based methods compressing vectors…

Machine Learning · Computer Science 2021-12-08 Debajyoti Bera, Rameshwar Pratap, Bhisham Dev Verma

$b$-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches

Recently, randomly mapping vectorial data to strings of discrete symbols (i.e., sketches) for fast and space-efficient similarity searches has become popular. Such random mapping is called similarity-preserving hashing and approximates a…

Machine Learning · Computer Science 2019-10-21 Shunsuke Kanda, Yasuo Tabei

Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation

We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees…

Databases · Computer Science 2025-03-06 Aline Bessa, Majid Daliri, Juliana Freire, Cameron Musco, Christopher Musco, Aécio Santos, Haoxiang Zhang

Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data

We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. Our online sketching algorithm compresses an N element dataset to a sketch of size $O(N^b \log^3 N)$ in $O(N^{(b+1)} \log^3 N)$…

Data Structures and Algorithms · Computer Science 2020-09-15 Benjamin Coleman, Richard G. Baraniuk, Anshumali Shrivastava

Fast Similarity Sketching

We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) =…

Data Structures and Algorithms · Computer Science 2024-05-07 Søren Dahlgaard, Mathias Bæk Tejs Langhede, Jakob Bæk Tejs Houen, Mikkel Thorup

Efficient Sketching and Nearest Neighbor Search Algorithms for Sparse Vector Sets

Sparse embeddings of data form an attractive class due to their inherent interpretability: Every dimension is tied to a term in some vocabulary, making it easy to visually decipher the latent space. Sparsity, however, poses unique…

Data Structures and Algorithms · Computer Science 2025-09-30 Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini

Distributed Least Squares in Small Space via Sketching and Bias Reduction

Matrix sketching is a powerful tool for reducing the size of large data matrices. Yet there are fundamental limitations to this size reduction when we want to recover an accurate estimator for a task such as least squares regression. We show…

Data Structures and Algorithms · Computer Science 2024-05-10 Sachin Garg, Kevin Tan, Michał Dereziński

Iterative Hessian Sketch in Input Sparsity Time

Scalable algorithms to solve optimization and regression tasks, even approximately, are needed to work with large datasets. In this paper we study efficient techniques from matrix sketching to solve a variety of convex constrained regression…

Machine Learning · Computer Science 2019-11-01 Graham Cormode, Charlie Dickens

Similarity-preserving hashing is a core technique for fast similarity searches, and it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques…

Data Structures and Algorithms · Computer Science 2020-09-25 Shunsuke Kanda, Yasuo Tabei

Implicit Sparse Code Hashing

We address the problem of converting large-scale high-dimensional image data into binary codes so that approximate nearest-neighbor search over them can be efficiently performed. Different from most of the existing unsupervised approaches…

Computer Vision and Pattern Recognition · Computer Science 2015-12-02 Tsung-Yu Lin, Tsung-Wei Ke, Tyng-Luh Liu

Sublinear Sketches for Approximate Nearest Neighbor and Kernel Density Estimation

Approximate Nearest Neighbor (ANN) search and Approximate Kernel Density Estimation (A-KDE) are fundamental problems at the core of modern machine learning, with broad applications in data analysis, information systems, and large-scale…

Machine Learning · Computer Science 2025-10-28 Ved Danait, Srijan Das, Sujoy Bhore

Sparse Subspace Clustering: Algorithm, Theory, and Applications

In many real-world problems, we are dealing with collections of high-dimensional data, such as images, videos, text and web documents, DNA microarray data, and more. Often, high-dimensional data lie close to low-dimensional structures…

Computer Vision and Pattern Recognition · Computer Science 2013-02-06 Ehsan Elhamifar, Rene Vidal

An Econometric Perspective on Algorithmic Subsampling

Datasets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data. While more data are better than less, diminishing returns suggest that we may not need terabytes of data…

Econometrics · Economics 2020-05-01 Sokbae Lee, Serena Ng

Sketched Subspace Clustering

The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data.…

Machine Learning · Statistics 2018-02-08 Panagiotis A. Traganitis, Georgios B. Giannakis

Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices

Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-24 Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse