Related papers: GLISTER: Generalization based Data Subset Selectio…

A CLIP-Powered Framework for Robust and Generalizable Data Selection

Large-scale datasets have been pivotal to the advancements of deep learning models in recent years, but training on such large datasets invariably incurs substantial storage and computational overhead. Meanwhile, real-world datasets often…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Suorong Yang , Peng Ye , Wanli Ouyang , Dongzhan Zhou , Furao Shen

When Chosen Wisely, More Data Is What You Need: A Universal Sample-Efficient Strategy For Data Augmentation

Data Augmentation (DA) is known to improve the generalizability of deep neural networks. Most existing DA techniques naively add a certain number of augmented samples without considering the quality and the added computational cost of these…

Machine Learning · Computer Science 2022-03-18 Ehsan Kamalloo , Mehdi Rezagholizadeh , Ali Ghodsi

Model Slicing for Supporting Complex Analytics with Elastic Inference Cost and Resource Constraints

Deep learning models have been used to support analytics beyond simple aggregation, where deeper and wider models have been shown to yield great results. These models consume a huge amount of memory and computational operations. However,…

Machine Learning · Computer Science 2021-04-22 Shaofeng Cai , Gang Chen , Beng Chin Ooi , Jinyang Gao

Clustering-based Source-aware Assessment of True Robustness for Learning Models

We introduce a novel validation framework to measure the true robustness of learning models for real-world applications by creating source-inclusive and source-exclusive partitions in a dataset via clustering. We develop a robustness metric…

Machine Learning · Computer Science 2017-04-04 Ozsel Kilinc , Ismail Uysal

Bayesian Generative Active Deep Learning

Deep learning models have demonstrated outstanding performance in several problems, but their training process tends to require immense amounts of computational and human resources for training and labeling, constraining the types of…

Machine Learning · Computer Science 2019-04-29 Toan Tran , Thanh-Toan Do , Ian Reid , Gustavo Carneiro

Accelerating Deep Learning with Dynamic Data Pruning

Deep learning's success has been attributed to the training of large, overparameterized models on massive amounts of data. As this trend continues, model training has become prohibitively costly, requiring access to powerful computing…

Machine Learning · Computer Science 2021-11-25 Ravi S Raju , Kyle Daruwalla , Mikko Lipasti

Exploiting Class Learnability in Noisy Data

In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets…

Machine Learning · Computer Science 2018-11-16 Matthew Klawonn , Eric Heim , James Hendler

Data-Efficient and Robust Task Selection for Meta-Learning

Meta-learning methods typically learn tasks under the assumption that all tasks are equally important. However, this assumption is often not valid. In real-world applications, tasks can vary both in their importance during different…

Machine Learning · Computer Science 2024-05-14 Donglin Zhan , James Anderson

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes…

Computation and Language · Computer Science 2024-09-18 Simon Yu , Liangyu Chen , Sara Ahmadian , Marzieh Fadaee

Learning Gradient Descent: Better Generalization and Longer Horizons

Training deep neural networks is a highly nontrivial task, involving carefully selecting appropriate training algorithms, scheduling step sizes and tuning other hyperparameters. Trying different combinations can be quite labor-intensive and…

Machine Learning · Computer Science 2017-06-13 Kaifeng Lv , Shunhua Jiang , Jian Li

Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation

Data-efficient learning has garnered significant attention, especially given the current trend of large multi-modal models. Recently, dataset distillation has become an effective approach by synthesizing data samples that are essential for…

Machine Learning · Computer Science 2024-08-08 Yue Xu , Yong-Lu Li , Kaitong Cui , Ziyu Wang , Cewu Lu , Yu-Wing Tai , Chi-Keung Tang

Learning to Better Search with Language Models via Guided Reinforced Self-Training

While language models have shown remarkable performance across diverse tasks, they still encounter challenges in complex reasoning scenarios. Recent research suggests that language models trained on linearized search traces toward…

Artificial Intelligence · Computer Science 2025-10-28 Seungyong Moon , Bumsoo Park , Hyun Oh Song

Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering algorithms are infeasible due to their high memory requirements and/or unfavorable runtime…

Data Structures and Algorithms · Computer Science 2026-01-27 Gregor Ulm , Simon Smith , Adrian Nilsson , Emil Gustavsson , Mats Jirstrand

Clust-Splitter - an Efficient Nonsmooth Optimization-Based Algorithm for Clustering Large Datasets

Clustering is a fundamental task in data mining and machine learning, particularly for analyzing large-scale data. In this paper, we introduce Clust-Splitter, an efficient algorithm based on nonsmooth optimization, designed to solve the…

Machine Learning · Computer Science 2026-03-19 Jenni Lampainen , Kaisa Joki , Napsu Karmitsa , Marko M. Mäkelä

Are Bias Mitigation Techniques for Deep Learning Effective?

A critical problem in deep learning is that systems learn inappropriate biases, resulting in their inability to perform well on minority groups. This has led to the creation of multiple algorithms that endeavor to mitigate bias. However, it…

Machine Learning · Computer Science 2024-04-24 Robik Shrestha , Kushal Kafle , Christopher Kanan

GIST: Distributed Training for Large-Scale Graph Convolutional Networks

The graph convolutional network (GCN) is a go-to solution for machine learning on graphs, but its training is notoriously difficult to scale both in terms of graph size and the number of model parameters. Although some work has explored…

Machine Learning · Computer Science 2022-03-15 Cameron R. Wolfe , Jingkang Yang , Arindam Chowdhury , Chen Dun , Artun Bayer , Santiago Segarra , Anastasios Kyrillidis

Learning to Generate Synthetic Training Data using Gradient Matching and Implicit Differentiation

Using huge training datasets can be costly and inconvenient. This article explores various data distillation techniques that can reduce the amount of data required to successfully train deep networks. Inspired by recent ideas, we suggest…

Machine Learning · Computer Science 2022-03-17 Dmitry Medvedev , Alexander D'yakonov

Group Distributionally Robust Dataset Distillation with Risk Minimization

Dataset distillation (DD) has emerged as a widely adopted technique for crafting a synthetic dataset that captures the essential information of a training dataset, facilitating the training of accurate neural models. Its applications span…

Machine Learning · Computer Science 2025-02-04 Saeed Vahidian , Mingyu Wang , Jianyang Gu , Vyacheslav Kungurtsev , Wei Jiang , Yiran Chen

Towards Sustainable Learning: Coresets for Data-efficient Deep Learning

To improve the efficiency and sustainability of learning deep models, we propose CREST, the first scalable framework with rigorous theoretical guarantees to identify the most valuable examples for training non-convex models, particularly…

Machine Learning · Computer Science 2023-06-05 Yu Yang , Hao Kang , Baharan Mirzasoleiman

Active Robust Learning

In many practical applications of learning algorithms, unlabeled data is cheap and abundant whereas labeled data is expensive. Active learning algorithms developed to achieve better performance with lower cost. Usually Representativeness…

Machine Learning · Computer Science 2016-08-26 Hossein Ghafarian , Hadi Sadoghi Yazdi