Related papers: Do We Need More Training Data?

200 papers

The standard paradigm for training deep learning models on sensor data assumes that more data is always better. However, raw sensor streams are often imbalanced and contain significant redundancy, meaning that not all data points contribute…

Machine Learning · Computer Science 2025-12-15 Federico Pennino, Maurizio Gabbrielli

Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work,…

Machine Learning · Computer Science 2021-10-06 Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi

Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe.…

While larger neural models are pushing the boundaries of what deep learning can do, often more weights are needed to train models rather than to run inference for tasks. This paper seeks to understand this behavior using search spaces --…

Machine Learning · Computer Science 2021-05-28 Darko Stosic, Dusan Stosic

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable…

Machine Learning · Computer Science 2026-05-18 Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin

Models often need to be constrained to a certain size for them to be considered interpretable. For example, a decision tree of depth 5 is much easier to understand than one of depth 50. Limiting model size, however, often reduces accuracy.…

Machine Learning · Computer Science 2020-07-02 Abhishek Ghose, Balaraman Ravindran

Despite being able to capture a range of features of the data, high accuracy models trained with supervision tend to make similar predictions. This seemingly implies that high-performing models share similar biases regardless of training…

Machine Learning · Computer Science 2022-04-27 Raphael Gontijo-Lopes, Yann Dauphin, Ekin D. Cubuk

State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious,…

Machine Learning · Computer Science 2026-02-03 Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, Dan Graur, Viktor Gsteiger, Ana Klimovic

The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a…

Machine Learning · Statistics 2026-02-19 Soham Bakshi, Sunrit Chakraborty

In classification problems, models must predict a class label based on the input data features. However, class labels are organized hierarchically in many datasets. While a classification task is often defined at a specific level of this…

Machine Learning · Computer Science 2025-09-08 Davide Pirovano, Federico Milanesio, Michele Caselle, Piero Fariselli, Matteo Osella

Many approaches have been proposed to use diffusion models to augment training datasets for downstream tasks, such as classification. However, diffusion models are themselves trained on large datasets, often with noisy annotations, and it…

Computer Vision and Pattern Recognition · Computer Science 2023-12-01 Max F. Burg, Florian Wenzel, Dominik Zietlow, Max Horn, Osama Makansi, Francesco Locatello, Chris Russell

Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and…

Machine Learning · Computer Science 2025-10-29 Robert J Appleton, Brian C Barnes, Alejandro Strachan

It is impossible today to pretend that the practice of machine learning is always compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how…

Machine Learning · Computer Science 2025-03-03 Jianyu Zhang, Léon Bottou

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…

It is held as a truism that deep neural networks require large datasets to train effective models. However, large datasets, especially with high-quality labels, can be expensive to obtain. This study sets out to investigate (i) how large a…

Information Retrieval · Computer Science 2019-01-31 Trond Linjordet, Krisztian Balog

Recent advances in Deep Learning have greatly improved performance on various tasks such as object detection, image segmentation, and sentiment analysis. The focus of most research directions up until very recently has been on beating…

Computer Vision and Pattern Recognition · Computer Science 2022-11-29 Cristian Simionescu

Deep learning models need large amounts of data for training. In video recognition and classification, significant advances were achieved with the introduction of new large databases. However, the creation of large databases for training is…

Computer Vision and Pattern Recognition · Computer Science 2021-07-05 Miguel Rodríguez Santander, Juan Hernández Albarracín, Adín Ramírez Rivera

Target tracking represents a state estimation problem recurrent in many practical scenarios like air traffic control, autonomous vehicles, marine radar surveillance and so on. In a Bayesian perspective, when phenomena like clutter are…

Applications · Statistics 2022-11-28 Alessandro D'Ortenzio, Costanzo Manes, Umut Orguner

While machine learning has emerged in recent years as a useful tool for rapid prediction of materials properties, generating sufficient data to reliably train models without overfitting is still impractical for many applications. Towards…

Materials Science · Physics 2022-07-29 Rees Chang, Yu-Xiong Wang, Elif Ertekin

This paper targets the question of predicting machine learning classification model performance, when taking into account the number of training examples per class and not just the overall number of training examples. This leads to the a…

Machine Learning · Computer Science 2024-03-12 Thomas Mühlenstädt, Jelena Frtunikj