Related papers: An Integrated Data Processing Framework for Pretra…

200 papers

In the current landscape of foundation model training, there is a significant reliance on public domain data, which is nearing exhaustion according to recent research. To further scale up, it is crucial to incorporate collaboration among…

Machine Learning · Computer Science 2024-03-08 Wanru Zhao, Yaxin Du, Nicholas Donald Lane, Siheng Chen, Yanfeng Wang

Deep learning models are widely used across computer vision and other domains. When working on the model induction, selecting the right architecture for a given dataset often relies on repetitive trial-and-error procedures. This procedure…

Machine Learning · Computer Science 2026-01-06 Yen-Chia Chen, Hsing-Kuo Pao, Hanjuan Huang

Process mining offers techniques to exploit event data by providing insights and recommendations to improve business processes. The growing number of algorithms for process discovery has raised the question of which algorithms perform best…

Software Engineering · Computer Science 2018-06-20 Toon Jouck, Alfredo Bolt, Benoît Depaire, Massimiliano de Leoni, Wil M. P. van der Aalst

This paper proposes a framework for developing forecasting models by streamlining the connections between core components of the developmental process. The proposed framework enables swift and robust integration of new datasets,…

Machine Learning · Computer Science 2023-04-14 Jonathan Hans Soeseno, Sergio González, Trista Pei-Chun Chen

The quality of the underlying training data is crucial for building performant machine learning models with wider generalizability. However, current machine learning (ML) tools lack streamlined processes for improving the data quality. So,…

Machine Learning · Computer Science 2021-12-16 Atindriyo Sanyal, Vikram Chatterji, Nidhi Vyas, Ben Epstein, Nikita Demir, Anthony Corletti

Data-driven modeling is an approach in energy systems modeling that has been gaining popularity. In data-driven modeling, machine learning methods such as linear regression, neural networks or decision-tree based methods are being applied.…

Machine Learning · Computer Science 2023-01-05 Sandra Wilfling

Assessing and improving the quality of data are fundamental challenges for data-intensive systems that have given rise to applications targeting transformation and cleaning of data. However, while schema design, data cleaning, and data…

Databases · Computer Science 2017-03-28 Rada Chirkova, Jon Doyle, Juan L. Reutter

Data analysis focuses on harnessing advanced statistics, programming, and machine learning techniques to extract valuable insights from vast datasets. An increasing volume and variety of research emerged, addressing datasets of diverse…

Databases · Computer Science 2025-01-06 Chen Liang, Donghua Yang, Zheng Liang, Zhiyu Liang, Tianle Zhang, Boyu Xiao, Yuqing Yang, Wenqi Wang, Hongzhi Wang

Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training the model, the user…

In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing…

Computation and Language · Computer Science 2024-06-25 Yixin Ou, Ningyu Zhang, Honghao Gui, Ziwen Xu, Shuofei Qiao, Yida Xue, Runnan Fang, Kangwei Liu, Lei Li, Zhen Bi, Guozhou Zheng, Huajun Chen

Educational process data, i.e., logs of detailed student activities in computerized or online learning platforms, has the potential to offer deep insights into how students learn. One can use process data for many downstream tasks such as…

Machine Learning · Computer Science 2022-04-29 Alexander Scarlatos, Christopher Brinton, Andrew Lan

Generative pretraining (the "GPT" in ChatGPT) enables language models to learn from vast amounts of internet text without human supervision. This approach has driven breakthroughs across AI by allowing deep neural networks to learn from…

Neurons and Cognition · Quantitative Biology 2025-09-23 Thomas Serre, Ellie Pavlick

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes…

Databases · Computer Science 2019-04-25 Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, Steven Euijong Whang

Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data…

Machine Learning · Computer Science 2025-03-28 Thomson Yen, Andrew Wei Tung Siah, Haozhe Chen, Tianyi Peng, Daniel Guetta, Hongseok Namkoong

Despite the great advance of Multimodal Large Language Models (MLLMs) in both instruction dataset building and benchmarking, the independence of training and evaluation makes current MLLMs hard to further improve their capability under the…

Machine Learning · Computer Science 2023-09-12 Zhiyuan Zhao, Linke Ouyang, Bin Wang, Siyuan Huang, Pan Zhang, Xiaoyi Dong, Jiaqi Wang, Conghui He

Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore,…

Machine Learning · Computer Science 2025-02-20 Manal Rahal, Bestoun S. Ahmed, Gergely Szabados, Torgny Fornstedt, Jorgen Samuelsson

For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of their training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web,…

Machine Learning · Computer Science 2025-09-12 Minqi Jiang, João G. M. Araújo, Will Ellsworth, Sian Gooding, Edward Grefenstette

Foundation language models learn from their finetuning input context in different ways. In this paper, we reformulate inputs during finetuning for challenging translation tasks, leveraging model strengths from pretraining in novel ways to…

Computation and Language · Computer Science 2026-01-05 Brian Yu, Hansen Lillemark, Kurt Keutzer

The use of synthetic data has gained popularity as a cost-efficient data augmentation strategy for improving machine learning model performance, as well as for addressing concerns related to sensitive data privacy.…

Machine Learning · Computer Science 2025-10-27 Ioannis E. Livieris, Nikos Alimpertis, George Domalis, Dimitris Tsakalidis

Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks with different data modalities. A PFM (e.g., BERT, ChatGPT, and GPT-4) is trained on large-scale data which provides a reasonable parameter…
