Related papers: DACO: Towards Application-Driven and Comprehensive…

AnnotatedTables: A Large Tabular Dataset with Language Model Annotations

Tabular data is ubiquitous in real-world applications and abundant on the web, yet its annotation has traditionally required human labor, posing a significant scalability bottleneck for tabular machine learning. Our methodology can…

Machine Learning · Computer Science 2024-06-25 Yaojie Hu, Ilias Fountalis, Jin Tian, Nikolaos Vasiloglou

Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis

This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research…

Computation and Language · Computer Science 2025-12-02 Angelina Parfenova, Andreas Marfurt, Alexander Denzler, Juergen Pfeffer

Collecting high-quality adversarial data for machine reading comprehension tasks with humans and models in the loop

We present our experience as annotators in the creation of high-quality, adversarial machine-reading-comprehension data for extractive QA for Task 1 of the First Workshop on Dynamic Adversarial Data Collection (DADC). DADC is an emergent…

Computation and Language · Computer Science 2022-06-30 Damian Y. Romero Diaz, Magdalena Anioł, John Culnan

Automated Annotation with Generative AI Requires Validation

Generative large language models (LLMs) can be a powerful tool for augmenting text annotation procedures, but their performance varies across annotation tasks due to prompt quality, text data idiosyncrasies, and conceptual difficulty.…

Computation and Language · Computer Science 2023-06-02 Nicholas Pangakis, Samuel Wolken, Neil Fasching

DAgent: A Relational Database-Driven Data Analysis Report Generation Agent

Relational database-driven data analysis (RDB-DA) report generation, which aims to generate data analysis reports after querying relational databases, has been widely applied in fields such as finance and healthcare. Typically, these tasks…

Databases · Computer Science 2025-04-02 Wenyi Xu, Yuren Mao, Xiaolu Zhang, Chao Zhang, Xuemei Dong, Mengfei Zhang, Yunjun Gao

Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing…

Computation and Language · Computer Science 2025-10-07 Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu

Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI

Automated text annotation is a compelling use case for generative large language models (LLMs) in social media research. Recent work suggests that LLMs can achieve strong performance on annotation tasks; however, these studies evaluate LLMs…

Computation and Language · Computer Science 2024-09-24 Nicholas Pangakis, Samuel Wolken

Flowco: Rethinking Data Analysis in the Age of LLMs

Conducting data analysis typically involves authoring code to transform, visualize, analyze, and interpret data. Large language models (LLMs) are now capable of generating such code for simple, routine analyses. LLMs promise to democratize…

Human-Computer Interaction · Computer Science 2025-04-22 Stephen N. Freund, Brooke Simon, Emery D. Berger, Eunice Jun

Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications

Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the…

Information Retrieval · Computer Science 2025-07-02 Leila Tavakoli, Hamed Zamani

Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow

Industries such as finance, meteorology, and energy generate vast amounts of data daily. Efficiently managing, processing, and displaying this data requires specialized expertise and is often tedious and repetitive. Leveraging large…

Computation and Language · Computer Science 2025-05-20 Wenqi Zhang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Weiming Lu, Yueting Zhuang

Large Language Models for Data Annotation and Synthesis: A Survey

Data annotation and synthesis generally refers to the labeling or generating of raw data with relevant information, which could be used for improving the efficacy of machine learning models. The process, however, is labor-intensive and…

Computation and Language · Computer Science 2024-12-04 Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu

Leveraging Data Recasting to Enhance Tabular Reasoning

Creating challenging tabular inference data is essential for learning complex reasoning. Prior work has mostly relied on two data generation strategies. The first is human annotation, which yields linguistically diverse data but is…

Computation and Language · Computer Science 2022-11-24 Aashna Jena, Vivek Gupta, Manish Shrivastava, Julian Martin Eisenschlos

SLA-Awareness for AI-assisted coding

The integration of AI-assisted coding tools within development environments drastically reduces development time, and allows developers to focus more on creative and critical aspects of software engineering through the use of Code Large…

Software Engineering · Computer Science 2025-03-26 Kishanthan Thangarajah, Arthur Leung, Boyuan Chen, Ahmed E. Hassan

AIANO: Enhancing Information Retrieval with AI-Augmented Annotation

The rise of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) has rapidly increased the need for high-quality, curated information retrieval datasets. These datasets, however, are currently created with off-the-shelf…

Information Retrieval · Computer Science 2026-02-05 Sameh Khattab, Marie Bauer, Lukas Heine, Till Rostalski, Jens Kleesiek, Julian Friedrich

TablePilot: Recommending Human-Preferred Tabular Data Analysis with Large Language Models

Tabular data analysis is crucial in many scenarios, yet efficiently identifying the most relevant data analysis queries and results for a new table remains a significant challenge. The complexity of tabular data, diverse analytical…

Computation and Language · Computer Science 2025-04-01 Deyin Yi, Yihao Liu, Lang Cao, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang

Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond

Large-scale data collection is essential for developing personalized training data, mitigating the shortage of training data, and fine-tuning specialized models. However, creating high-quality datasets quickly and accurately remains a…

Artificial Intelligence · Computer Science 2026-04-21 Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu

Data Wrangling Task Automation Using Code-Generating Language Models

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, often cannot understand the semantic context and deep learning…

Machine Learning · Computer Science 2025-02-25 Ashlesha Akella, Krishnasuri Narayanam

Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets

Data is the engine of modern computer vision, which necessitates collecting large-scale datasets. This is expensive, and guaranteeing the quality of the labels is a major challenge. In this paper, we investigate efficient annotation…

Computer Vision and Pattern Recognition · Computer Science 2021-04-27 Yuan-Hong Liao, Amlan Kar, Sanja Fidler

TACO: Topics in Algorithmic COde generation dataset

We introduce TACO, an open-source, large-scale code generation dataset, with a focus on the topics of algorithms, designed to provide a more challenging training dataset and evaluation benchmark in the field of code generation models. TACO…

Artificial Intelligence · Computer Science 2023-12-29 Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, Ge Li

DART: Open-Domain Structured Data Record to Text Generation

We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-Text annotations can be a costly process, especially when dealing with tables which are the major source of…

Computation and Language · Computer Science 2021-04-13 Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani