Related papers: Data-Prep-Kit: getting your data ready for LLM app…

Darkit: A User-Friendly Software Toolkit for Spiking Large Language Model

Large language models (LLMs) have been widely applied in various practical applications, typically comprising billions of parameters, with inference processes requiring substantial energy and computational resources. In contrast, the human…

Software Engineering · Computer Science 2024-12-23 Xin Du , Shifan Ye , Qian Zheng , Yangfan Hu , Rui Yan , Shunyu Qi , Shuyang Chen , Huajin Tang , Gang Pan , Shuiguang Deng

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations…

Databases · Computer Science 2026-05-12 Jingzhe Xu , Rui Wang , Jiannan Wang , Guoliang Li

DeepPrep: An LLM-Powered Agentic System for Autonomous Data Preparation

Data preparation, which aims to transform heterogeneous and noisy raw tables into analysis-ready data, remains a major bottleneck in data science. Recent approaches leverage large language models (LLMs) to automate data preparation from…

Databases · Computer Science 2026-02-10 Meihao Fan , Ju Fan , Yuxin Zhang , Shaolei Zhang , Xiaoyong Du , Jie Song , Peng Li , Fuxin Jiang , Tieying Zhang , Jianjun Chen

Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights

Large Language Models for Code (or code LLMs) are increasingly gaining popularity and capabilities, offering a wide array of functionalities such as code completion, code generation, code summarization, test generation, code translation,…

Software Engineering · Computer Science 2024-10-18 Rahul Krishna , Rangeet Pan , Raju Pavuluri , Srikanth Tamilselvam , Maja Vukovic , Saurabh Sinha

Large Language Models as Data Preprocessors

Large Language Models (LLMs), typified by OpenAI's GPT, have marked a significant advancement in artificial intelligence. Trained on vast amounts of text data, LLMs are capable of understanding and generating human-like text across a…

Artificial Intelligence · Computer Science 2024-10-29 Haochen Zhang , Yuyang Dong , Chuan Xiao , Masafumi Oyamada

Empowering Tabular Data Preparation with Language Models: Why and How?

Data preparation is a critical step in enhancing the usability of tabular data and thus boosts downstream data-driven tasks. Traditional methods often face challenges in capturing the intricate relationships within tables and adapting to…

Artificial Intelligence · Computer Science 2025-08-05 Mengshi Chen , Yuxiang Sun , Tengchao Li , Jianwei Wang , Kai Wang , Xuemin Lin , Ying Zhang , Wenjie Zhang

ModelGPT: Unleashing LLM's Capabilities for Tailored Model Generation

The rapid advancement of Large Language Models (LLMs) has revolutionized various sectors by automating routine tasks, marking a step toward the realization of Artificial General Intelligence (AGI). However, they still struggle to…

Machine Learning · Computer Science 2024-02-21 Zihao Tang , Zheqi Lv , Shengyu Zhang , Fei Wu , Kun Kuang

A Survey of Large Language Models

Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. As a major approach,…

Computation and Language · Computer Science 2026-03-19 Wayne Xin Zhao , Kun Zhou , Junyi Li , Tianyi Tang , Xiaolei Wang , Yupeng Hou , Yingqian Min , Beichen Zhang , Junjie Zhang , Zican Dong , Yifan Du , Chen Yang , Yushuo Chen , Zhipeng Chen , Jinhao Jiang , Ruiyang Ren , Yifan Li , Xinyu Tang , Zikang Liu , Peiyu Liu , Jian-Yun Nie , Ji-Rong Wen

Jellyfish: A Large Language Model for Data Preprocessing

This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in…

Artificial Intelligence · Computer Science 2024-10-30 Haochen Zhang , Yuyang Dong , Chuan Xiao , Masafumi Oyamada

StreamLink: Large-Language-Model Driven Distributed Data Engineering System

Large Language Models (LLMs) have shown remarkable proficiency in natural language understanding (NLU), opening doors for innovative applications. We introduce StreamLink - an LLM-driven distributed data system designed to improve the…

Databases · Computer Science 2025-05-29 Dawei Feng , Di Mei , Huiri Tan , Lei Ren , Xianying Lou , Zhangxi Tan

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Despite the advanced intelligence abilities of large language models (LLMs) in various applications, they still face significant computational and storage demands. Knowledge Distillation (KD) has emerged as an effective strategy to improve…

Computation and Language · Computer Science 2024-07-24 Jiaheng Liu , Chenchen Zhang , Jinyang Guo , Yuanxing Zhang , Haoran Que , Ken Deng , Zhiqi Bai , Jie Liu , Ge Zhang , Jiakai Wang , Yanan Wu , Congnan Liu , Wenbo Su , Jiamang Wang , Lin Qu , Bo Zheng

Towards Next-Generation LLM Training: From the Data-Centric Perspective

Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks and domains, with data playing a central role in enabling these advances. Despite this success, the preparation and effective utilization of…

Computation and Language · Computer Science 2026-03-17 Hao Liang , Zhengyang Zhao , Zhaoyang Han , Meiyi Qiang , Xiaochen Ma , Bohan Zeng , Qifeng Cai , Zhiyu Li , Linpeng Tang , Weinan E , Wentao Zhang

Data Processing for the OpenGPT-X Model Family

This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project…

Computation and Language · Computer Science 2025-08-08 Nicolo' Brandizzi , Hammam Abdelwahab , Anirban Bhowmick , Lennard Helmer , Benny Jörg Stein , Pavel Denisov , Qasid Saleem , Michael Fromm , Mehdi Ali , Richard Rutmann , Farzad Naderi , Mohamad Saif Agy , Alexander Schwirjow , Fabian Küch , Luzian Hahn , Malte Ostendorff , Pedro Ortiz Suarez , Georg Rehm , Dennis Wegener , Nicolas Flores-Herr , Joachim Köhler , Johannes Leveling

OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit

We present OnPrem.LLM, a Python-based toolkit for applying large language models (LLMs) to sensitive, non-public data in offline or restricted environments. The system is designed for privacy-preserving use cases and provides prebuilt…

Computation and Language · Computer Science 2025-09-30 Arun S. Maiya

UniDM: A Unified Framework for Data Manipulation with Large Language Models

Designing effective data manipulation methods is a long-standing problem in data lakes. Traditional methods, which rely on rules or machine learning models, require extensive human efforts on training data collection and tuning models.…

Artificial Intelligence · Computer Science 2024-05-13 Yichen Qian , Yongyi He , Rong Zhu , Jintao Huang , Zhijian Ma , Haibin Wang , Yaohua Wang , Xiuyu Sun , Defu Lian , Bolin Ding , Jingren Zhou

VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation

Large Language Models (LLMs) are gaining popularity for hardware design automation, particularly through Register Transfer Level (RTL) code generation. In this work, we examine the current literature on RTL generation using LLMs and…

Hardware Architecture · Computer Science 2025-07-21 Paul E. Calzada , Zahin Ibnat , Tanvir Rahman , Kamal Kandula , Danyu Lu , Sujan Kumar Saha , Farimah Farahmandi , Mark Tehranipoor

Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs

Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them, which is essential for a wide range of data-centric applications. Driven by (i) rising demands for…

Databases · Computer Science 2026-01-27 Wei Zhou , Jun Zhou , Haoyu Wang , Zhenghao Li , Qikang He , Shaokun Han , Guoliang Li , Xuanhe Zhou , Yeye He , Chunwei Liu , Zirui Tang , Bin Wang , Shen Tang , Kai Zuo , Yuyu Luo , Zhenzhe Zheng , Conghui He , Jingren Zhou , Fan Wu

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc…

Machine Learning · Computer Science 2025-12-19 Hao Liang , Xiaochen Ma , Zhou Liu , Zhen Hao Wong , Zhengyang Zhao , Zimo Meng , Runming He , Chengyu Shen , Qifeng Cai , Zhaoyang Han , Meiyi Qiang , Yalin Feng , Tianyi Bai , Zewei Pan , Ziyi Guo , Yizhen Jiang , Jingwen Deng , Qijie You , Peichao Lai , Tianyu Guo , Chi Hsu Tsai , Hengyi Feng , Rui Hu , Wenkai Yu , Junbo Niu , Bohan Zeng , Ruichuan An , Lu Ma , Jihao Huang , Yaowei Zheng , Conghui He , Linpeng Tang , Bin Cui , Weinan E , Wentao Zhang

LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models

Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits…

Computation and Language · Computer Science 2024-11-19 Yungi Kim , Hyunsoo Ha , Seonghoon Yang , Sukyung Lee , Jihoo Kim , Chanjun Park

Chit-Chat or Deep Talk: Prompt Engineering for Process Mining

This research investigates the application of Large Language Models (LLMs) to augment conversational agents in process mining, aiming to tackle its inherent complexity and diverse skill requirements. While LLM advancements present novel…

Artificial Intelligence · Computer Science 2023-07-20 Urszula Jessen , Michal Sroka , Dirk Fahland