Related papers: vtreat: a data.frame Processor for Predictive Mode…

Data Complexity-aware Deep Model Performance Forecasting

Deep learning models are widely used across computer vision and other domains. During model induction, selecting the right architecture for a given dataset often relies on repetitive trial-and-error procedures. This procedure…

Machine Learning · Computer Science 2026-01-06 Yen-Chia Chen, Hsing-Kuo Pao, Hanjuan Huang

A Survey of Predictive Modelling under Imbalanced Distributions

Many real-world data mining applications involve obtaining predictive models using data sets with strongly imbalanced distributions of the target variable. Frequently, the least common values of this target variable are associated with…

Machine Learning · Computer Science 2015-05-14 Paula Branco, Luis Torgo, Rita Ribeiro

Learning Defect Prediction from Unrealistic Data

Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for…

Machine Learning · Computer Science 2024-01-23 Kamel Alrashedy, Vincent J. Hellendoorn, Alessandro Orso

Pre-registration for Predictive Modeling

Amid rising concerns of reproducibility and generalizability in predictive modeling, we explore the possibility and potential benefits of introducing pre-registration to the field. Despite notable advancements in predictive modeling,…

Machine Learning · Computer Science 2023-12-01 Jake M. Hofman, Angelos Chatzimparmpas, Amit Sharma, Duncan J. Watts, Jessica Hullman

An Integrated Data Processing Framework for Pretraining Foundation Models

The capability of foundation models relies heavily on large-scale, diverse, and high-quality pretraining data. To improve data quality, researchers and practitioners often have to manually curate datasets from different sources…

Machine Learning · Computer Science 2024-04-24 Yiding Sun, Feng Wang, Yutao Zhu, Wayne Xin Zhao, Jiaxin Mao

Process-BERT: A Framework for Representation Learning on Educational Process Data

Educational process data, i.e., logs of detailed student activities in computerized or online learning platforms, has the potential to offer deep insights into how students learn. One can use process data for many downstream tasks such as…

Machine Learning · Computer Science 2022-04-29 Alexander Scarlatos, Christopher Brinton, Andrew Lan

Predictive Models in Software Engineering: Challenges and Opportunities

Predictive models are among the most important techniques and are widely applied in many areas of software engineering. There have been a large number of primary studies that apply predictive models and that present well-performed studies…

Software Engineering · Computer Science 2020-08-11 Yanming Yang, Xin Xia, David Lo, Tingting Bi, John Grundy, Xiaohu Yang

Uncertainty Estimation in Machine Learning

Most machine learning techniques are based upon statistical learning theory, often simplified for the sake of computing speed. This paper is focused on the uncertainty aspect of mathematical modeling in machine learning. Regression analysis…

Machine Learning · Computer Science 2022-06-07 Valentin Arkov

ModelPred: A Framework for Predicting Trained Model from Training Data

In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning…

Machine Learning · Computer Science 2022-12-27 Yingyan Zeng, Jiachen T. Wang, Si Chen, Hoang Anh Just, Ran Jin, Ruoxi Jia

Fitting Prediction Rule Ensembles to Psychological Research Data: An Introduction and Tutorial

Prediction rule ensembles (PREs) are a relatively new statistical learning method that aims to strike a balance between predictive accuracy and interpretability. Starting from a decision tree ensemble, like a boosted tree ensemble or a…

Applications · Statistics 2023-10-02 Marjolein Fokkema, Carolin Strobl

Learning to Represent and Predict Sets with Deep Neural Networks

In this thesis, we develop various techniques for working with sets in machine learning. Each input or output is not an image or a sequence, but a set: an unordered collection of multiple objects, each object described by a feature vector.…

Machine Learning · Computer Science 2021-03-09 Yan Zhang

Exploring data subsets with vtree

Variable trees are a new method for the exploration of discrete multivariate data. They display nested subsets and corresponding frequencies and percentages. Manual calculation of these quantities can be laborious, especially when there are…

Computation · Statistics 2021-02-08 Nick Barrowman, Richard J. Webster

Adjusting for Bias with Procedural Data

3D software is now capable of producing highly realistic images that look nearly indistinguishable from real images. This raises the question: can real datasets be enhanced with 3D rendered data? We investigate this question. In this…

Computer Vision and Pattern Recognition · Computer Science 2022-04-06 Shesh Narayan Gupta, Nicholas Bear Brown

Multiple Regression for Matrix and Vector Predictors: Models, Theory, Algorithms, and Beyond

Matrix regression plays an important role in modern data analysis due to its ability to handle complex relationships involving both matrix and vector variables. We propose a class of regularized regression models capable of predicting both…

Optimization and Control · Mathematics 2025-01-14 Meixia Lin, Ziyang Zeng, Yangjing Zhang

ProcData: An R Package for Process Data Analysis

Process data refer to data recorded in the log files of computer-based items. These data, represented as timestamped action sequences, track how respondents go about solving the items. Process data analysis aims at…

Computation · Statistics 2020-06-11 Xueying Tang, Susu Zhang, Zhi Wang, Jingchen Liu, Zhiliang Ying

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom, Axel Feldmann, Aleksander Madry

Setting the Standard: Recommended Practices for Data Preprocessing in Data-Driven Climate Prediction

Artificial intelligence (AI) applications, and specifically machine learning (ML) applications, for climate prediction across timescales are proliferating quickly. The emergence of these methods prompts a fresh look at the impact of data preprocessing, a…

Data Analysis, Statistics and Probability · Physics 2025-12-18 Jason C. Furtado, Maria J. Molina, Marybeth C. Arcodia, Weston Anderson, Tom Beucler, John A. Callahan, Laura M. Ciasto, Vittorio A. Gensini, Michelle L'Heureux, Kathleen Pegion, Jhayron S. Pérez-Carrasquilla, Maike Sonnewald, Ken Takahashi, Baoqiang Xiang, Brian G. Zimmerman

Data Engineering for the Analysis of Semiconductor Manufacturing Data

We have analyzed manufacturing data from several different semiconductor manufacturing plants, using decision tree induction software called Q-YIELD. The software generates rules for predicting when a given product should be rejected. The…

Machine Learning · Computer Science 2007-05-23 Peter D. Turney

Developmental Pretraining (DPT) for Image Classification Networks

Against the backdrop of the increasing data requirements of Deep Neural Networks for object recognition, which are growing more untenable by the day, we present Developmental PreTraining (DPT) as a possible solution. DPT is designed as a…

Machine Learning · Computer Science 2023-12-04 Niranjan Rajesh, Debayan Gupta

Removing the influence of a group variable in high-dimensional predictive modelling

In many application areas, predictive models are used to support or make important decisions. There is increasing awareness that these models may contain spurious or otherwise undesirable correlations. Such correlations may arise from a…

Applications · Statistics 2021-09-21 Emanuele Aliverti, Kristian Lum, James E. Johndrow, David B. Dunson