Related papers: Divide-and-Conquer Information-Based Optimal Subda…

Information-Based Optimal Subdata Selection for Big Data Linear Regression

Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large data sets due to computational limitations. A critical step in big data analysis is…

Methodology · Statistics 2019-06-27 HaiYing Wang, Min Yang, John Stufken

Efficient Data Reduction Strategies for Big Data and High-Dimensional LASSO Regressions

The IBOSS approach proposed by Wang et al. (2019) selects the most informative subset of n points. It assumes that the ordinary least squares method is used and requires that the number of variables, p, is not large. However, in many…

Methodology · Statistics 2024-01-23 Xin Wang, Min Yang, William Li

Divide-and-conquer methods for big data analysis

In the context of big data analysis, the divide-and-conquer methodology refers to a multiple-step process: first splitting a data set into several smaller ones; then analyzing each set separately; finally combining results from each…

Machine Learning · Statistics 2021-02-23 Xueying Chen, Jerry Q. Cheng, Min-ge Xie

BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges

Data subset selection aims to find a smaller yet informative subset of a large dataset that can approximate the full-dataset training, addressing challenges associated with training neural networks on large-scale datasets. However, existing…

Machine Learning · Computer Science 2024-06-06 Hoyong Choi, Nohyun Ki, Hye Won Chung

Optimal Sensing and Data Estimation in a Large Sensor Network

An energy efficient use of large scale sensor networks necessitates activating a subset of possible sensors for estimation at a fusion center. The problem is inherently combinatorial; to this end, a set of iterative, randomized algorithms…

Information Theory · Computer Science 2017-09-13 Arpan Chattopadhyay, Urbashi Mitra

Dynamic Information Sub-Selection for Decision Support

We introduce Dynamic Information Sub-Selection (DISS), a novel framework of AI assistance designed to enhance the performance of black-box decision-makers by tailoring their information processing on a per-instance basis. Blackbox…

Machine Learning · Computer Science 2024-11-01 Hung-Tien Huang, Maxwell Lennon, Shreyas Bhat Brahmavar, Sean Sylvia, Junier B. Oliva

Selecting the Best Optimizing System

We formulate selecting the best optimizing system (SBOS) problems and provide solutions for those problems. In an SBOS problem, a finite number of systems are contenders. Inside each system, a continuous decision variable affects the…

Methodology · Statistics 2025-11-04 Nian Si, Yifu Tang, Zeyu Zheng

A Cross-Entropy-based Method to Perform Information-based Feature Selection

From a machine learning point of view, identifying a subset of relevant features from a real data set can be useful to improve the results achieved by classification methods and to reduce their time and space complexity. To achieve this…

Machine Learning · Computer Science 2017-05-23 Pietro Cassara, Alessandro Rozza, Mirco Nanni

COMBSS: Best Subset Selection via Continuous Optimization

The problem of best subset selection in linear regression is considered with the aim to find a fixed size subset of features that best fits the response. This is particularly challenging when the total available number of features is very…

Methodology · Statistics 2023-11-28 Sarat Moka, Benoit Liquet, Houying Zhu, Samuel Muller

Optimal Data Selection: An Online Distributed View

The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via…

Machine Learning · Computer Science 2023-12-18 Mariel Werner, Anastasios Angelopoulos, Stephen Bates, Michael I. Jordan

D-optimal Subsampling Design for Massive Data Linear Regression

Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of…

Methodology · Statistics 2025-05-30 Torsten Glemser, Rainer Schwabe

Nearly Optimal Subdata Selection

When, in terms of the number of data points, the size of a dataset exceeds available computing resources, or when labeling is expensive, an attractive solution consists of selecting only some of the data points (subdata) for further…

Methodology · Statistics 2026-04-28 Min Yang, Wei Zheng, John Stufken, Ming-Chung Chang, Ting Tian, Xueqin Wang

Diversity Subsampling: Custom Subsamples from Large Data Sets

Subsampling from a large data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. Diverse (or space-filling) subsampling is an appealing subsampling approach…

Methodology · Statistics 2023-11-27 Boyang Shang, Daniel W. Apley, Sanjay Mehrotra

Data Pruning by Information Maximization

In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall…

Computer Vision and Pattern Recognition · Computer Science 2025-08-15 Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

Superclustering by finding statistically significant separable groups of optimal gaussian clusters

The paper presents the algorithm for clustering a dataset by grouping the optimal, from the point of view of the BIC criterion, number of Gaussian clusters into the optimal, from the point of view of their statistical separability,…

Machine Learning · Computer Science 2023-10-31 Oleg I. Berngardt

SPlit: An Optimal Method for Data Splitting

In this article we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of Support Points (SP), which was initially developed for finding the optimal…

Machine Learning · Statistics 2021-05-10 V. Roshan Joseph, Akhil Vakayil

Positive region preserved random sampling: an efficient feature selection method for massive data

Selecting relevant features is an important and necessary step for intelligent machines to maximize their chances of success. However, intelligent machines generally do not have enough computing resources when faced with huge volumes of data.…

Machine Learning · Computer Science 2025-07-04 Hexiang Bai, Deyu Li, Jiye Liang, Yanhui Zhai

Less Is Better: Unweighted Data Subsampling via Influence Function

In the time of Big Data, training complex models on large-scale data sets is challenging, making it appealing to reduce data volume for saving computation resources by subsampling. Most previous works in subsampling are weighted methods…

Machine Learning · Computer Science 2021-04-14 Zifeng Wang, Hong Zhu, Zhenhua Dong, Xiuqiang He, Shao-Lun Huang

Robust subset selection

The best subset selection (or "best subsets") estimator is a classic tool for sparse regression, and developments in mathematical optimization over the past decade have made it more computationally tractable than ever. Notwithstanding its…

Methodology · Statistics 2022-01-11 Ryan Thompson

An Algorithm for Optimal Partitioning of Data on an Interval

Many signal processing problems can be solved by maximizing the fitness of a segmented model over all possible partitions of the data interval. This letter describes a simple but powerful algorithm that searches the exponentially large…

Numerical Analysis · Mathematics 2025-10-20 Brad Jackson, Jeffrey D. Scargle, David Barnes, Sundararajan Arabhi, Alina Alt, Peter Gioumousis, Elyus Gwin, Paungkaew Sangtrakulcharoen, Linda Tan, Tun Tao Tsai