English
Related papers

Related papers: Nearly Optimal Subdata Selection

200 papers

If the assumed model does not accurately capture the underlying structure of the data, a statistical method is likely to yield sub-optimal results, and so model selection is crucial in order to conduct any statistical analysis. However, in…

Methodology · Statistics 2023-06-21 Vasilis Chasiotis , Dimitris Karlis

Mathematical optimization, although often leading to NP-hard models, is now capable of solving even large-scale instances within reasonable time. However, the primary focus is often placed solely on optimality. This implies that while…

Optimization and Control · Mathematics 2025-12-23 Kevin-Martin Aigner , Marc Goerigk , Michael Hartisch , Frauke Liers , Arthur Miehlich , Florian Rösel

The problem of best subset selection in linear regression is considered with the aim to find a fixed size subset of features that best fits the response. This is particularly challenging when the total available number of features is very…

Methodology · Statistics 2023-11-28 Sarat Moka , Benoit Liquet , Houying Zhu , Samuel Muller

Best subset selection in linear regression is well known to be nonconvex and computationally challenging to solve, as the number of possible subsets grows rapidly with increasing dimensionality of the problem. As a result, finding the…

Machine Learning · Statistics 2025-04-01 Vikram Singh , Min Sun

In the big data era researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may…

Methodology · Statistics 2024-11-13 Vasilis Chasiotis , Dimitris Karlis

We study the problem of optimal subset selection from a set of correlated random variables. In particular, we consider the associated combinatorial optimization problem of maximizing the determinant of a symmetric positive definite matrix…

Computation · Statistics 2019-07-12 Yu Wang , Nhu D. Le , James V. Zidek

Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the…

Machine Learning · Statistics 2023-10-05 Germain Kolossov , Andrea Montanari , Pulkit Tandon

From a machine learning point of view, identifying a subset of relevant features from a real data set can be useful to improve the results achieved by classification methods and to reduce their time and space complexity. To achieve this…

Machine Learning · Computer Science 2017-05-23 Pietro Cassara , Alessandro Rozza , Mirco Nanni

The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via…

Machine Learning · Computer Science 2023-12-18 Mariel Werner , Anastasios Angelopoulos , Stephen Bates , Michael I. Jordan

Feature subset selection, as a special case of the general subset selection problem, has been the topic of a considerable number of studies due to the growing importance of data-mining applications. In the feature subset selection problem…

Machine Learning · Computer Science 2014-11-13 Tofigh Naghibi , Sarah Hoffmann , Beat Pfister

Big data is ubiquitous in practices, and it has also led to heavy computation burden. To reduce the calculation cost and ensure the effectiveness of parameter estimators, an optimal subset sampling method is proposed to estimate the…

Methodology · Statistics 2023-11-16 Haohui Han , Liya Fu

Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinary large data sets due to computational limitations. A critical step in big data analysis is…

Methodology · Statistics 2019-06-27 HaiYing Wang , Min Yang , John Stufken

Data-driven algorithm design is a paradigm that uses statistical and machine learning techniques to select from a class of algorithms for a computational problem an algorithm that has the best expected performance with respect to some…

Machine Learning · Computer Science 2024-06-05 Hongyu Cheng , Sammy Khalife , Barbara Fiedorowicz , Amitabh Basu

Subsampling from a large data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. Diverse (or space-filling) subsampling is an appealing subsampling approach…

Methodology · Statistics 2023-11-27 Boyang Shang , Daniel W. Apley , Sanjay Mehrotra

Consider a collection of competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best performing algorithm. Specifically, which algorithm is most likely to rank highest on a…

Machine Learning · Computer Science 2025-08-08 Amichai Painsky

Over the past a few years, research and development has made significant progresses on big data analytics. A fundamental issue for big data analytics is the efficiency. If the optimal solution is unable to attain or not required or has a…

Databases · Computer Science 2019-01-03 Shuai Ma , Jinpeng Huai

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the…

Machine Learning · Computer Science 2025-04-09 Junjie Oscar Yin , Alexander M. Rush

Modern statistical analysis often encounters datasets with large sizes. For these datasets, conventional estimation methods can hardly be used immediately because practitioners often suffer from limited computational resources. In most…

Methodology · Statistics 2023-04-14 Shuyuan Wu , Xuening Zhu , Hansheng Wang

The growing amount of applications that generate vast amount of data in short time scales render the problem of partial monitoring, coupled with prediction, a rather fundamental one. We study the aforementioned canonical problem under the…

Data Structures and Algorithms · Computer Science 2016-08-02 Michalis Kallitsis , Stilian Stoev , George Michailidis

This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers. Specifically, for the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each…

Methodology · Statistics 2021-03-05 Tao Zou , Xian Li , Xuan Liang , Hansheng Wang
‹ Prev 1 2 3 10 Next ›