Related papers: Nearly Optimal Subdata Selection

Optimal subdata selection for linear model selection

If the assumed model does not accurately capture the underlying structure of the data, a statistical method is likely to yield sub-optimal results, and so model selection is crucial in order to conduct any statistical analysis. However, in…

Methodology · Statistics 2023-06-21 Vasilis Chasiotis, Dimitris Karlis

Feature Selection for Data-driven Explainable Optimization

Mathematical optimization, although often leading to NP-hard models, is now capable of solving even large-scale instances within reasonable time. However, the primary focus is often placed solely on optimality. This implies that while…

Optimization and Control · Mathematics 2025-12-23 Kevin-Martin Aigner, Marc Goerigk, Michael Hartisch, Frauke Liers, Arthur Miehlich, Florian Rösel

COMBSS: Best Subset Selection via Continuous Optimization

The problem of best subset selection in linear regression is considered with the aim to find a fixed size subset of features that best fits the response. This is particularly challenging when the total available number of features is very…

Methodology · Statistics 2023-11-28 Sarat Moka, Benoit Liquet, Houying Zhu, Samuel Muller

Solving the Best Subset Selection Problem via Suboptimal Algorithms

Best subset selection in linear regression is well known to be nonconvex and computationally challenging to solve, as the number of possible subsets grows rapidly with increasing dimensionality of the problem. As a result, finding the…

Machine Learning · Statistics 2025-04-01 Vikram Singh, Min Sun

Subdata selection for big data regression: an improved approach

In the big data era researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may…

Methodology · Statistics 2024-11-13 Vasilis Chasiotis, Dimitris Karlis

Approximately Optimal Subset Selection for Statistical Design and Modelling

We study the problem of optimal subset selection from a set of correlated random variables. In particular, we consider the associated combinatorial optimization problem of maximizing the determinant of a symmetric positive definite matrix…

Computation · Statistics 2019-07-12 Yu Wang, Nhu D. Le, James V. Zidek

Towards a statistical theory of data selection under weak supervision

Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the…

Machine Learning · Statistics 2023-10-05 Germain Kolossov, Andrea Montanari, Pulkit Tandon

A Cross-Entropy-based Method to Perform Information-based Feature Selection

From a machine learning point of view, identifying a subset of relevant features from a real data set can be useful to improve the results achieved by classification methods and to reduce their time and space complexity. To achieve this…

Machine Learning · Computer Science 2017-05-23 Pietro Cassara, Alessandro Rozza, Mirco Nanni

Optimal Data Selection: An Online Distributed View

The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via…

Machine Learning · Computer Science 2023-12-18 Mariel Werner, Anastasios Angelopoulos, Stephen Bates, Michael I. Jordan

A Semidefinite Programming Based Search Strategy for Feature Selection with Mutual Information Measure

Feature subset selection, as a special case of the general subset selection problem, has been the topic of a considerable number of studies due to the growing importance of data-mining applications. In the feature subset selection problem…

Machine Learning · Computer Science 2014-11-13 Tofigh Naghibi, Sarah Hoffmann, Beat Pfister

Optimal subsampling algorithm for the marginal model with large longitudinal data

Big data is ubiquitous in practice, and it has also led to a heavy computational burden. To reduce the calculation cost and ensure the effectiveness of parameter estimators, an optimal subset sampling method is proposed to estimate the…

Methodology · Statistics 2023-11-16 Haohui Han, Liya Fu

Information-Based Optimal Subdata Selection for Big Data Linear Regression

Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large data sets due to computational limitations. A critical step in big data analysis is…

Methodology · Statistics 2019-06-27 HaiYing Wang, Min Yang, John Stufken

Sample Complexity of Algorithm Selection Using Neural Networks and Its Applications to Branch-and-Cut

Data-driven algorithm design is a paradigm that uses statistical and machine learning techniques to select from a class of algorithms for a computational problem an algorithm that has the best expected performance with respect to some…

Machine Learning · Computer Science 2024-06-05 Hongyu Cheng, Sammy Khalife, Barbara Fiedorowicz, Amitabh Basu

Diversity Subsampling: Custom Subsamples from Large Data Sets

Subsampling from a large data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. Diverse (or space-filling) subsampling is an appealing subsampling approach…

Methodology · Statistics 2023-11-27 Boyang Shang, Daniel W. Apley, Sanjay Mehrotra

Near Optimal Inference for the Best-Performing Algorithm

Consider a collection of competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best performing algorithm. Specifically, which algorithm is most likely to rank highest on a…

Machine Learning · Computer Science 2025-08-08 Amichai Painsky

Approximate Computation for Big Data Analytics

Over the past few years, research and development has made significant progress on big data analytics. A fundamental issue for big data analytics is efficiency. If the optimal solution is unable to attain or not required or has a…

Databases · Computer Science 2019-01-03 Shuai Ma, Jinpeng Huai

Compute-Constrained Data Selection

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the…

Machine Learning · Computer Science 2025-04-09 Junjie Oscar Yin, Alexander M. Rush

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources

Modern statistical analysis often encounters datasets with large sizes. For these datasets, conventional estimation methods can hardly be used immediately because practitioners often suffer from limited computational resources. In most…

Methodology · Statistics 2023-04-14 Shuyuan Wu, Xuening Zhu, Hansheng Wang

Efficient Approximation Algorithms for Optimal Large-scale Network Monitoring

The growing number of applications that generate vast amounts of data on short time scales renders the problem of partial monitoring, coupled with prediction, a rather fundamental one. We study the aforementioned canonical problem under the…

Data Structures and Algorithms · Computer Science 2016-08-02 Michalis Kallitsis, Stilian Stoev, George Michailidis

On the Subbagging Estimation for Massive Data

This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers. Specifically, for the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each…

Methodology · Statistics 2021-03-05 Tao Zou, Xian Li, Xuan Liang, Hansheng Wang