Related papers: Data Wrangling Task Automation Using Code-Generati…

Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?

Large language models have recently demonstrated their exceptional capabilities in supporting and automating various tasks. Among the tasks worth exploring for testing large language model capabilities, we considered data preparation, a…

Computation and Language · Computer Science 2025-12-01 Matteo Spreafico, Ludovica Tassini, Camilla Sancricca, Cinzia Cappiello

Quality Assessment of Tabular Data using Large Language Models and Code Generation

Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often struggles with inefficiency, human intervention, and high computational costs. We present a three-stage framework that combines…

Software Engineering · Computer Science 2025-09-23 Ashlesha Akella, Akshar Kaul, Krishnasuri Narayanam, Sameep Mehta

Data Context Informed Data Wrangling

The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process have been carried out using…

Databases · Computer Science 2018-11-26 Martin Koehler, Alex Bogatu, Cristina Civili, Nikolaos Konstantinou, Edward Abel, Alvaro A. A. Fernandes, John Keane, Leonid Libkin, Norman W. Paton

Evolution without Large Models: Training Language Model with Task Principles

A common training approach for language models involves using a large-scale language model to expand a human-provided dataset, which is subsequently used for model training. This method significantly reduces training costs by eliminating the…

Computation and Language · Computer Science 2025-07-09 Minghang Zhu, Shen Gao, Zhengliang Shi, Jiabao Fang, Pengjie Ren, Zhaochun Ren, Zhumin Chen, Shuo Shang

Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or…

Machine Learning · Computer Science 2025-03-11 Tommaso Bendinelli, Artur Dox, Christian Holz

Iterative Data Programming for Expanding Text Classification Corpora

Real-world text classification tasks often require many labeled training examples that are expensive to obtain. Recent advancements in machine teaching, specifically the data programming paradigm, facilitate the creation of training data…

Machine Learning · Computer Science 2020-02-05 Neil Mallinar, Abhishek Shah, Tin Kam Ho, Rajendra Ugrani, Ayush Gupta

An Overview and Discussion on Using Large Language Models for Implementation Generation of Solutions to Open-Ended Problems

Large Language Models offer new opportunities to devise automated implementation generation methods that can tackle problem solving activities beyond traditional methods, which require algorithmic specifications and can use only static…

Computation and Language · Computer Science 2025-01-06 Hashmath Shaik, Alex Doboli

Semantically Aligned Question and Code Generation for Automated Insight Generation

Automated insight generation is a common tactic for helping knowledge workers, such as data scientists, to quickly understand the potential value of new and unfamiliar data. Unfortunately, automated insights produced by large-language…

Software Engineering · Computer Science 2024-05-06 Ananya Singha, Bhavya Chopra, Anirudh Khatry, Sumit Gulwani, Austin Z. Henley, Vu Le, Chris Parnin, Mukul Singh, Gust Verbruggen

Neural Language Generation: Formulation, Methods, and Evaluation

Recent advances in neural network-based generative modeling have reignited the hopes in having computer systems capable of seamlessly conversing with humans and able to understand natural language. Neural architectures have been employed to…

Computation and Language · Computer Science 2020-08-03 Cristina Garbacea, Qiaozhu Mei

Unlock the Potential of Large Language Models for Predictive Tabular Tasks in Data Science with Table-Specific Pretraining

In the domain of data science, the predictive tasks of classification, regression, and imputation of missing values are commonly encountered challenges associated with tabular data. This research endeavors to apply Large Language Models…

Machine Learning · Computer Science 2026-04-23 Yazheng Yang, Yuqi Wang, Yaxuan Li, Sankalok Sen, Lei Li, Lin Qiu, Qi Liu

Data-to-text Generation with Variational Sequential Planning

We consider the task of data-to-text generation, which aims to create textual output from non-linguistic input. We focus on generating long-form text, i.e., documents with multiple paragraphs, and propose a neural model enhanced with a…

Computation and Language · Computer Science 2022-03-01 Ratish Puduppully, Yao Fu, Mirella Lapata

Towards Better Multi-task Learning: A Framework for Optimizing Dataset Combinations in Large Language Models

To efficiently select optimal dataset combinations for enhancing multi-task learning (MTL) performance in large language models, we proposed a novel framework that leverages a neural network to predict the best dataset combinations. The…

Computation and Language · Computer Science 2025-05-06 Zaifu Zhan, Rui Zhang

A Survey on Data Selection for Language Models

A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as…

Computation and Language · Computer Science 2024-08-05 Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey

Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table…

Computation and Language · Computer Science 2024-06-25 Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, Christos Faloutsos

Understanding the Capabilities of Large Language Models for Automated Planning

Automated planning is concerned with developing efficient algorithms to generate plans or sequences of actions to achieve a specific goal in a given environment. Emerging Large Language Models (LLMs) can answer questions, write high-quality…

Artificial Intelligence · Computer Science 2023-05-26 Vishal Pallagani, Bharath Muppasani, Keerthiram Murugesan, Francesca Rossi, Biplav Srivastava, Lior Horesh, Francesco Fabiano, Andrea Loreggia

The Impact of Large Language Models on Task Automation in Manufacturing Services

This paper explores the potential of large language models (LLMs) for task automation in the provision of technical services in the production machinery sector. By focusing on text correction, summarization, and question answering, the…

General Economics · Economics 2025-05-19 Jochen Wulf, Juerg Meierhofer

Deep Sequence Models for Text Classification Tasks

The exponential growth of data generated on the Internet in the current information age is a driving force for the digital economy. Extraction of information is the major value in an accumulated big data. Big data dependency on statistical…

Computation and Language · Computer Science 2022-07-20 Saheed Salahudeen Abdullahi, Sun Yiming, Shamsuddeen Hassan Muhammad, Abdulrasheed Mustapha, Ahmad Muhammad Aminu, Abdulkadir Abdullahi, Musa Bello, Saminu Mohammad Aliyu

LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning

Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks but face challenges in mathematical reasoning, where complex problem-solving requires both linguistic understanding and mathematical…

Computation and Language · Computer Science 2025-03-20 Shuguang Chen, Guang Lin

Exploring Data Augmentation for Code Generation Tasks

Advances in natural language processing, such as transfer learning from pre-trained language models, have impacted how models are trained for programming language tasks too. Previous research primarily explored code pre-training and…

Computation and Language · Computer Science 2023-02-08 Pinzhen Chen, Gerasimos Lampouras

Skill Learning Using Process Mining for Large Language Model Plan Generation

Large language models (LLMs) hold promise for generating plans for complex tasks, but their effectiveness is limited by sequential execution, lack of control flow models, and difficulties in skill retrieval. Addressing these issues is…

Computation and Language · Computer Science 2024-10-18 Andrei Cosmin Redis, Mohammadreza Fani Sani, Bahram Zarrin, Andrea Burattin