Related papers: Do We Need More Training Data?

Optimizing the Training Diet: Data Mixture Search for Robust Time Series Forecasting

The standard paradigm for training deep learning models on sensor data assumes that more data is always better. However, raw sensor streams are often imbalanced and contain significant redundancy, meaning that not all data points contribute…

Machine Learning · Computer Science 2025-12-15 Federico Pennino, Maurizio Gabbrielli

Exploring the Limits of Large Scale Pre-training

Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work,…

Machine Learning · Computer Science 2021-10-06 Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi

What Matters in Learning from Large-Scale Datasets for Robot Manipulation

Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe.…

Robotics · Computer Science 2025-06-17 Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Chul Shin, Soroush Nasiriany, Ajay Mandlekar, Danfei Xu

Search Spaces for Neural Model Training

While larger neural models are pushing the boundaries of what deep learning can do, often more weights are needed to train models rather than to run inference for tasks. This paper seeks to understand this behavior using search spaces --…

Machine Learning · Computer Science 2021-05-28 Darko Stosic, Dusan Stosic

Scaling Laws for Mixture Pretraining Under Data Constraints

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable…

Machine Learning · Computer Science 2026-05-18 Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin

Interpretability with Accurate Small Models

Models often need to be constrained to a certain size for them to be considered interpretable. For example, a decision tree of depth 5 is much easier to understand than one of depth 50. Limiting model size, however, often reduces accuracy.…

Machine Learning · Computer Science 2020-07-02 Abhishek Ghose, Balaraman Ravindran

No One Representation to Rule Them All: Overlapping Features of Training Methods

Despite being able to capture a range of features of the data, high accuracy models trained with supervision tend to make similar predictions. This seemingly implies that high-performing models share similar biases regardless of training…

Machine Learning · Computer Science 2022-04-27 Raphael Gontijo-Lopes, Yann Dauphin, Ekin D. Cubuk

Mixtera: A Data Plane for Foundation Model Training

State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious,…

Machine Learning · Computer Science 2026-02-03 Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, Dan Graur, Viktor Gsteiger, Ana Klimovic

From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources

The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a…

Machine Learning · Statistics 2026-02-19 Soham Bakshi, Sunrit Chakraborty

Should We Always Train Models on Fine-Grained Classes?

In classification problems, models must predict a class label based on the input data features. However, class labels are organized hierarchically in many datasets. While a classification task is often defined at a specific level of this…

Machine Learning · Computer Science 2025-09-08 Davide Pirovano, Federico Milanesio, Michele Caselle, Piero Fariselli, Matteo Osella

Image retrieval outperforms diffusion models on data augmentation

Many approaches have been proposed to use diffusion models to augment training datasets for downstream tasks, such as classification. However, diffusion models are themselves trained on large datasets, often with noisy annotations, and it…

Computer Vision and Pattern Recognition · Computer Science 2023-12-01 Max F. Burg, Florian Wenzel, Dominik Zietlow, Max Horn, Osama Makansi, Francesco Locatello, Chris Russell

Data Fusion of Deep Learned Molecular Embeddings for Property Prediction

Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and…

Machine Learning · Computer Science 2025-10-29 Robert J Appleton, Brian C Barnes, Alejandro Strachan

Fine-tuning with Very Large Dropout

It is impossible today to pretend that the practice of machine learning is always compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how…

Machine Learning · Computer Science 2025-03-03 Jianyu Zhang, Léon Bottou

Language Models Improve When Pretraining Data Matches Target Tasks

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…

Computation and Language · Computer Science 2025-07-17 David Mizrahi, Anders Boesen Lindbo Larsen, Jesse Allardice, Suzie Petryk, Yuri Gorokhov, Jeffrey Li, Alex Fang, Josh Gardner, Tom Gunter, Afshin Dehghan

Impact of Training Dataset Size on Neural Answer Selection Models

It is held as a truism that deep neural networks require large datasets to train effective models. However, large datasets, especially with high-quality labels, can be expensive to obtain. This study sets out to investigate (i) how large a…

Information Retrieval · Computer Science 2019-01-31 Trond Linjordet, Krisztian Balog

Deep Learning Training Procedure Augmentations

Recent advances in Deep Learning have greatly improved performance on various tasks such as object detection, image segmentation, sentiment analysis. The focus of most research directions up until very recently has been on beating…

Computer Vision and Pattern Recognition · Computer Science 2022-11-29 Cristian Simionescu

On the Pitfalls of Learning with Limited Data: A Facial Expression Recognition Case Study

Deep learning models need large amounts of data for training. In video recognition and classification, significant advances were achieved with the introduction of new large databases. However, the creation of large databases for training is…

Computer Vision and Pattern Recognition · Computer Science 2021-07-05 Miguel Rodríguez Santander, Juan Hernández Albarracín, Adín Ramírez Rivera

Adaptive mixture approximation for target tracking in clutter

Target tracking represents a state estimation problem recurrent in many practical scenarios like air traffic control, autonomous vehicles, marine radar surveillance and so on. In a Bayesian perspective, when phenomena like clutter are…

Applications · Statistics 2022-11-28 Alessandro D'Ortenzio, Costanzo Manes, Umut Orguner

Towards overcoming data scarcity in materials science: unifying models and datasets with a mixture of experts framework

While machine learning has emerged in recent years as a useful tool for rapid prediction of materials properties, generating sufficient data to reliably train models without overfitting is still impractical for many applications. Towards…

Materials Science · Physics 2022-07-29 Rees Chang, Yu-Xiong Wang, Elif Ertekin

How much data do you need? Part 2: Predicting DL class specific training dataset sizes

This paper targets the question of predicting machine learning classification model performance, when taking into account the number of training examples per class and not just the overall number of training examples. This leads to the a…

Machine Learning · Computer Science 2024-03-12 Thomas Mühlenstädt, Jelena Frtunikj