Related papers: Efficient Sketching Algorithm for Sparse Binary Da…

200 papers

In this work, we present a dimensionality reduction algorithm, a.k.a. sketching, for categorical datasets. Our proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our…

Machine Learning · Computer Science 2021-11-16 Bhisham Dev Verma, Rameshwar Pratap, Debajyoti Bera

In this paper, we address the problem of learning compact similarity-preserving embeddings for massive high-dimensional streams of data in order to perform efficient similarity search. We present a new online method for computing binary…

Machine Learning · Computer Science 2018-02-12 Anne Morvan, Antoine Souloumiac, Cédric Gouy-Pailler, Jamal Atif

The rise of the internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high dimensional and sparse. The need to perform an efficient search for similar objects in such…

Data Structures and Algorithms · Computer Science 2016-12-20 Raghav Kulkarni, Rameshwar Pratap

High-dimensional sparse data present computational and statistical challenges for supervised learning. We propose compact linear sketches for reducing the dimensionality of the input, followed by a single layer neural network. We show that…

Machine Learning · Computer Science 2016-04-21 Amit Daniely, Nevena Lazic, Yoram Singer, Kunal Talwar

Count-sketch is a popular matrix sketching algorithm that can produce a sketch of an input data matrix X in O(nnz(X)) time, where nnz(X) denotes the number of non-zero entries in X. The sketched matrix will be much smaller than X while…

Machine Learning · Computer Science 2020-11-30 Yuhan Wang, Zijian Lei, Liang Lan

Categorical attributes are those that can take a discrete set of values, e.g., colours. This work is about compressing vectors over categorical attributes to low-dimension discrete vectors. The current hash-based methods compressing vectors…

Machine Learning · Computer Science 2021-12-08 Debajyoti Bera, Rameshwar Pratap, Bhisham Dev Verma

Recently, randomly mapping vectorial data to strings of discrete symbols (i.e., sketches) for fast and space-efficient similarity searches has become popular. Such random mapping is called similarity-preserving hashing and approximates a…

Machine Learning · Computer Science 2019-10-21 Shunsuke Kanda, Yasuo Tabei

We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees…

We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. Our online sketching algorithm compresses an N element dataset to a sketch of size $O(N^b \log^3 N)$ in $O(N^{(b+1)} \log^3 N)$…

Data Structures and Algorithms · Computer Science 2020-09-15 Benjamin Coleman, Richard G. Baraniuk, Anshumali Shrivastava

We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) =…

Data Structures and Algorithms · Computer Science 2024-05-07 Søren Dahlgaard, Mathias Bæk Tejs Langhede, Jakob Bæk Tejs Houen, Mikkel Thorup

Sparse embeddings of data form an attractive class due to their inherent interpretability: Every dimension is tied to a term in some vocabulary, making it easy to visually decipher the latent space. Sparsity, however, poses unique…

Data Structures and Algorithms · Computer Science 2025-09-30 Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini

Matrix sketching is a powerful tool for reducing the size of large data matrices. Yet there are fundamental limitations to this size reduction when we want to recover an accurate estimator for a task such as least square regression. We show…

Data Structures and Algorithms · Computer Science 2024-05-10 Sachin Garg, Kevin Tan, Michał Dereziński

Scalable algorithms to solve optimization and regression tasks, even approximately, are needed to work with large datasets. In this paper we study efficient techniques from matrix sketching to solve a variety of convex constrained regression…

Machine Learning · Computer Science 2019-11-01 Graham Cormode, Charlie Dickens

Similarity-preserving hashing is a core technique for fast similarity searches, and it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques…

Data Structures and Algorithms · Computer Science 2020-09-25 Shunsuke Kanda, Yasuo Tabei

We address the problem of converting large-scale high-dimensional image data into binary codes so that approximate nearest-neighbor search over them can be efficiently performed. Different from most of the existing unsupervised approaches…

Computer Vision and Pattern Recognition · Computer Science 2015-12-02 Tsung-Yu Lin, Tsung-Wei Ke, Tyng-Luh Liu

Approximate Nearest Neighbor (ANN) search and Approximate Kernel Density Estimation (A-KDE) are fundamental problems at the core of modern machine learning, with broad applications in data analysis, information systems, and large-scale…

Machine Learning · Computer Science 2025-10-28 Ved Danait, Srijan Das, Sujoy Bhore

In many real-world problems, we are dealing with collections of high-dimensional data, such as images, videos, text and web documents, DNA microarray data, and more. Often, high-dimensional data lie close to low-dimensional structures…

Computer Vision and Pattern Recognition · Computer Science 2013-02-06 Ehsan Elhamifar, Rene Vidal

Datasets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data. While more data are better than less, diminishing returns suggest that we may not need terabytes of data…

Econometrics · Economics 2020-05-01 Sokbae Lee, Serena Ng

The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data.…

Machine Learning · Statistics 2018-02-08 Panagiotis A. Traganitis, Georgios B. Giannakis

Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-24 Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse