Related papers: Compressing Large Sample Data for Discriminant Ana…

Sketching Datasets for Large-Scale Learning (long version)

This article considers "compressive learning," an approach to large-scale machine learning where datasets are massively compressed before learning (e.g., clustering, classification, or regression) is performed. In particular, a "sketch" is…

Machine Learning · Statistics 2021-06-28 Rémi Gribonval , Antoine Chatalic , Nicolas Keriven , Vincent Schellekens , Laurent Jacques , Philip Schniter

Statistical properties of sketching algorithms

Sketching is a probabilistic data compression technique that has been largely developed in the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a…

Methodology · Statistics 2019-04-04 Daniel Ahfock , William J. Astle , Sylvia Richardson

Matrix sketching for supervised classification with imbalanced classes

Matrix sketching is a recently developed data compression technique. An input matrix A is efficiently approximated with a smaller matrix B, so that B preserves most of the properties of A up to some guaranteed approximation ratio. In so…

Machine Learning · Statistics 2019-12-03 Roberta Falcone , Angela Montanari , Laura Anderlucci

Bayesian Data Sketching for Varying Coefficient Regression Models

Varying coefficient models are popular for estimating nonlinear regression functions in functional data models. Their Bayesian variants have received limited attention in large data applications, primarily due to prohibitively slow…

Machine Learning · Statistics 2025-06-03 Rajarshi Guhaniyogi , Laura Baracaldo , Sudipto Banerjee

Sketching in Bayesian High Dimensional Regression With Big Data Using Gaussian Scale Mixture Priors

Bayesian computation of high dimensional linear regression models with a popular Gaussian scale mixture prior distribution using Markov Chain Monte Carlo (MCMC) or its variants can be extremely slow or completely prohibitive due to the…

Methodology · Statistics 2021-05-12 Rajarshi Guhaniyogi , Aaron Scheffler

Bayesian Compressed Regression

As an alternative to variable selection or shrinkage in high dimensional regression, we propose to randomly compress the predictors prior to analysis. This dramatically reduces storage and computational bottlenecks, performing well when the…

Machine Learning · Statistics 2013-03-26 Rajarshi Guhaniyogi , David B. Dunson

A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified

Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…

Methodology · Statistics 2025-10-08 Amalan Mahendran , Helen Thompson , James M. McGree

Compressive Statistical Learning with Random Feature Moments

We describe a general framework -- compressive statistical learning -- for resource-efficient large-scale learning: the training collection is compressed in one pass into a low-dimensional sketch (a vector of random empirical generalized…

Machine Learning · Statistics 2021-06-23 Rémi Gribonval , Gilles Blanchard , Nicolas Keriven , Yann Traonmilin

Balanced Subsampling for Big Data with Categorical Covariates

Supervised learning under measurement constraints is a common challenge in statistical and machine learning. In many applications, despite extensive design points, acquiring responses for all points is often impractical due to resource…

Methodology · Statistics 2025-03-19 Lin Wang

Large Scale Constrained Linear Regression Revisited: Faster Algorithms via Preconditioning

In this paper, we revisit the large-scale constrained linear regression problem and propose faster methods based on some recent developments in sketching and optimization. Our algorithms combine (accelerated) mini-batch SGD with a new…

Machine Learning · Computer Science 2018-02-12 Di Wang , Jinhui Xu

Sketching as a Tool for Numerical Linear Algebra

This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a…

Data Structures and Algorithms · Computer Science 2015-02-11 David P. Woodruff

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources

Modern statistical analysis often encounters datasets with large sizes. For these datasets, conventional estimation methods can hardly be used immediately because practitioners often suffer from limited computational resources. In most…

Methodology · Statistics 2023-04-14 Shuyuan Wu , Xuening Zhu , Hansheng Wang

Compressive Learning for Semi-Parametric Models

In the compressive learning theory, instead of solving a statistical learning problem from the input data, a so-called sketch is computed from the data prior to learning. The sketch has to capture enough information to solve the problem…

Machine Learning · Statistics 2019-10-23 Michael P. Sheehan , Antoine Gonon , Mike E. Davies

A Compressive Classification Framework for High-Dimensional Data

We propose a compressive classification framework for settings where the data dimensionality is significantly higher than the sample size. The proposed method, referred to as compressive regularized discriminant analysis (CRDA) is based on…

Machine Learning · Statistics 2020-11-13 Muhammad Naveed Tabassum , Esa Ollila

Sketching for Large-Scale Learning of Mixture Models

Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch…

Machine Learning · Computer Science 2017-05-08 Nicolas Keriven , Anthony Bourrier , Rémi Gribonval , Patrick Pérez

Starting Small -- Learning with Adaptive Sample Sizes

For many machine learning problems, data is abundant and it may be prohibitive to make multiple passes through the full training set. In this context, we investigate strategies for dynamically increasing the effective sample size, when…

Machine Learning · Computer Science 2016-10-10 Hadi Daneshmand , Aurelien Lucchi , Thomas Hofmann

Revisiting Data Augmentation in Model Compression: An Empirical and Comprehensive Study

The excellent performance of deep neural networks is usually accompanied by a large number of parameters and computations, which have limited their usage on the resource-limited edge devices. To address this issue, abundant methods such as…

Computer Vision and Pattern Recognition · Computer Science 2023-05-23 Muzhou Yu , Linfeng Zhang , Kaisheng Ma

A Comparison of 10 Sampling Algorithms for Configurable Systems

Almost every software system provides configuration options to tailor the system to the target platform and application scenario. Often, this configurability renders the analysis of every individual system configuration infeasible. To…

Software Engineering · Computer Science 2016-02-17 Flávio Medeiros , Christian Kästner , Márcio Ribeiro , Rohit Gheyi , Sven Apel

Sketched Subspace Clustering

The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data.…

Machine Learning · Statistics 2018-02-08 Panagiotis A. Traganitis , Georgios B. Giannakis

Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression

A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be…

Computation · Statistics 2018-09-14 Lei Han , Kean Ming Tan , Ting Yang , Tong Zhang