Related papers: An Integrated Data Processing Framework for Pretra…

Enhancing Data Quality in Federated Fine-Tuning of Foundation Models

In the current landscape of foundation model training, there is a significant reliance on public domain data, which is nearing exhaustion according to recent research. To further scale up, it is crucial to incorporate collaboration among…

Machine Learning · Computer Science 2024-03-08 Wanru Zhao, Yaxin Du, Nicholas Donald Lane, Siheng Chen, Yanfeng Wang

Data Complexity-aware Deep Model Performance Forecasting

Deep learning models are widely used across computer vision and other domains. When working on the model induction, selecting the right architecture for a given dataset often relies on repetitive trial-and-error procedures. This procedure…

Machine Learning · Computer Science 2026-01-06 Yen-Chia Chen, Hsing-Kuo Pao, Hanjuan Huang

An Integrated Framework for Process Discovery Algorithm Evaluation

Process mining offers techniques to exploit event data by providing insights and recommendations to improve business processes. The growing number of algorithms for process discovery has raised the question of which algorithms perform best…

Software Engineering · Computer Science 2018-06-20 Toon Jouck, Alfredo Bolt, Benoît Depaire, Massimiliano de Leoni, Wil M. P. van der Aalst

Streamlined Framework for Agile Forecasting Model Development towards Efficient Inventory Management

This paper proposes a framework for developing forecasting models by streamlining the connections between core components of the developmental process. The proposed framework enables swift and robust integration of new datasets,…

Machine Learning · Computer Science 2023-04-14 Jonathan Hans Soeseno, Sergio González, Trista Pei-Chun Chen

Fix your Models by Fixing your Datasets

The quality of underlying training data is crucial for building performant machine learning models with wider generalizability. However, current machine learning (ML) tools lack streamlined processes for improving the data quality. So,…

Machine Learning · Computer Science 2021-12-16 Atindriyo Sanyal, Vikram Chatterji, Nidhi Vyas, Ben Epstein, Nikita Demir, Anthony Corletti

Augmenting data-driven models for energy systems through feature engineering: A Python framework for feature engineering

Data-driven modeling is an approach in energy systems modeling that has been gaining popularity. In data-driven modeling, machine learning methods such as linear regression, neural networks or decision-tree based methods are being applied.…

Machine Learning · Computer Science 2023-01-05 Sandra Wilfling

A Framework for Assessing Achievability of Data-Quality Constraints

Assessing and improving the quality of data are fundamental challenges for data-intensive systems that have given rise to applications targeting transformation and cleaning of data. However, while schema design, data cleaning, and data…

Databases · Computer Science 2017-03-28 Rada Chirkova, Jon Doyle, Juan L. Reutter

Revisiting Data Analysis with Pre-trained Foundation Models

Data analysis focuses on harnessing advanced statistics, programming, and machine learning techniques to extract valuable insights from vast datasets. An increasing volume and variety of research emerged, addressing datasets of diverse…

Databases · Computer Science 2025-01-06 Chen Liang, Donghua Yang, Zheng Liang, Zhiyu Liang, Tianle Zhang, Boyu Xiao, Yuqing Yang, Wenqi Wang, Hongzhi Wang

Latent Iterative Refinement for Modular Source Separation

Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training the model, the user…

Sound · Computer Science 2023-10-17 Dimitrios Bralios, Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux

EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models

In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing…

Computation and Language · Computer Science 2024-06-25 Yixin Ou, Ningyu Zhang, Honghao Gui, Ziwen Xu, Shuofei Qiao, Yida Xue, Runnan Fang, Kangwei Liu, Lei Li, Zhen Bi, Guozhou Zheng, Huajun Chen

Process-BERT: A Framework for Representation Learning on Educational Process Data

Educational process data, i.e., logs of detailed student activities in computerized or online learning platforms, has the potential to offer deep insights into how students learn. One can use process data for many downstream tasks such as…

Machine Learning · Computer Science 2022-04-29 Alexander Scarlatos, Christopher Brinton, Andrew Lan

From Prediction to Understanding: Will AI Foundation Models Transform Brain Science?

Generative pretraining (the "GPT" in ChatGPT) enables language models to learn from vast amounts of internet text without human supervision. This approach has driven breakthroughs across AI by allowing deep neural networks to learn from…

Neurons and Cognition · Quantitative Biology 2025-09-23 Thomas Serre, Ellie Pavlick

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes…

Databases · Computer Science 2019-04-25 Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, Steven Euijong Whang

Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework

Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data…

Machine Learning · Computer Science 2025-03-28 Thomson Yen, Andrew Wei Tung Siah, Haozhe Chen, Tianyi Peng, Daniel Guetta, Hongseok Namkoong

MLLM-DataEngine: An Iterative Refinement Approach for MLLM

Despite the great advance of Multimodal Large Language Models (MLLMs) in both instruction dataset building and benchmarking, the independence of training and evaluation makes current MLLMs hard to further improve their capability under the…

Machine Learning · Computer Science 2023-09-12 Zhiyuan Zhao, Linke Ouyang, Bin Wang, Siyuan Huang, Pan Zhang, Xiaoyi Dong, Jiaqi Wang, Conghui He

Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore,…

Machine Learning · Computer Science 2025-02-20 Manal Rahal, Bestoun S. Ahmed, Gergely Szabados, Torgny Fornstedt, Jorgen Samuelsson

Generative Data Refinement: Just Ask for Better Data

For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of their training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web,…

Machine Learning · Computer Science 2025-09-12 Minqi Jiang, João G. M. Araújo, Will Ellsworth, Sian Gooding, Edward Grefenstette

Simple and Effective Input Reformulations for Translation

Foundation language models learn from their finetuning input context in different ways. In this paper, we reformulate inputs during finetuning for challenging translation tasks, leveraging model strengths from pretraining in novel ways to…

Computation and Language · Computer Science 2026-01-05 Brian Yu, Hansen Lillemark, Kurt Keutzer

An evaluation framework for synthetic data generation models

Nowadays, the use of synthetic data has gained popularity as a cost-efficient strategy for enhancing data augmentation to improve machine learning model performance, as well as for addressing concerns related to sensitive data privacy.…

Machine Learning · Computer Science 2025-10-27 Ioannis E. Livieris, Nikos Alimpertis, George Domalis, Dimitris Tsakalidis

A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT

Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks with different data modalities. A PFM (e.g., BERT, ChatGPT, and GPT-4) is trained on large-scale data which provides a reasonable parameter…

Artificial Intelligence · Computer Science 2023-05-02 Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, Lichao Sun