Related papers: DACO: Towards Application-Driven and Comprehensive…

200 papers

Tabular data is ubiquitous in real-world applications and abundant on the web, yet its annotation has traditionally required human labor, posing a significant scalability bottleneck for tabular machine learning. Our methodology can…

Machine Learning · Computer Science · 2024-06-25 · Yaojie Hu, Ilias Fountalis, Jin Tian, Nikolaos Vasiloglou

This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research…

Computation and Language · Computer Science · 2025-12-02 · Angelina Parfenova, Andreas Marfurt, Alexander Denzler, Juergen Pfeffer

We present our experience as annotators in the creation of high-quality, adversarial machine-reading-comprehension data for extractive QA for Task 1 of the First Workshop on Dynamic Adversarial Data Collection (DADC). DADC is an emergent…

Computation and Language · Computer Science · 2022-06-30 · Damian Y. Romero Diaz, Magdalena Anioł, John Culnan

Generative large language models (LLMs) can be a powerful tool for augmenting text annotation procedures, but their performance varies across annotation tasks due to prompt quality, text data idiosyncrasies, and conceptual difficulty.…

Computation and Language · Computer Science · 2023-06-02 · Nicholas Pangakis, Samuel Wolken, Neil Fasching

Relational database-driven data analysis (RDB-DA) report generation, which aims to generate data analysis reports after querying relational databases, has been widely applied in fields such as finance and healthcare. Typically, these tasks…

Databases · Computer Science · 2025-04-02 · Wenyi Xu, Yuren Mao, Xiaolu Zhang, Chao Zhang, Xuemei Dong, Mengfei Zhang, Yunjun Gao

Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing…

Computation and Language · Computer Science · 2025-10-07 · Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu

Automated text annotation is a compelling use case for generative large language models (LLMs) in social media research. Recent work suggests that LLMs can achieve strong performance on annotation tasks; however, these studies evaluate LLMs…

Computation and Language · Computer Science · 2024-09-24 · Nicholas Pangakis, Samuel Wolken

Conducting data analysis typically involves authoring code to transform, visualize, analyze, and interpret data. Large language models (LLMs) are now capable of generating such code for simple, routine analyses. LLMs promise to democratize…

Human-Computer Interaction · Computer Science · 2025-04-22 · Stephen N. Freund, Brooke Simon, Emery D. Berger, Eunice Jun

Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the…

Information Retrieval · Computer Science · 2025-07-02 · Leila Tavakoli, Hamed Zamani

Industries such as finance, meteorology, and energy generate vast amounts of data daily. Efficiently managing, processing, and displaying this data requires specialized expertise and is often tedious and repetitive. Leveraging large…

Computation and Language · Computer Science · 2025-05-20 · Wenqi Zhang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Weiming Lu, Yueting Zhuang

Data annotation and synthesis generally refers to the labeling or generating of raw data with relevant information, which could be used for improving the efficacy of machine learning models. The process, however, is labor-intensive and…

Computation and Language · Computer Science · 2024-12-04 · Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu

Creating challenging tabular inference data is essential for learning complex reasoning. Prior work has mostly relied on two data generation strategies. The first is human annotation, which yields linguistically diverse data but is…

Computation and Language · Computer Science · 2022-11-24 · Aashna Jena, Vivek Gupta, Manish Shrivastava, Julian Martin Eisenschlos

The integration of AI-assisted coding tools within development environments drastically reduces development time, and allows developers to focus more on creative and critical aspects of software engineering through the use of Code Large…

Software Engineering · Computer Science · 2025-03-26 · Kishanthan Thangarajah, Arthur Leung, Boyuan Chen, Ahmed E. Hassan

The rise of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) has rapidly increased the need for high-quality, curated information retrieval datasets. These datasets, however, are currently created with off-the-shelf…

Information Retrieval · Computer Science · 2026-02-05 · Sameh Khattab, Marie Bauer, Lukas Heine, Till Rostalski, Jens Kleesiek, Julian Friedrich

Tabular data analysis is crucial in many scenarios, yet efficiently identifying the most relevant data analysis queries and results for a new table remains a significant challenge. The complexity of tabular data, diverse analytical…

Computation and Language · Computer Science · 2025-04-01 · Deyin Yi, Yihao Liu, Lang Cao, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang

Large-scale data collection is essential for developing personalized training data, mitigating the shortage of training data, and fine-tuning specialized models. However, creating high-quality datasets quickly and accurately remains a…

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, often cannot understand the semantic context and deep learning…

Machine Learning · Computer Science · 2025-02-25 · Ashlesha Akella, Krishnasuri Narayanam

Data is the engine of modern computer vision, which necessitates collecting large-scale datasets. This is expensive, and guaranteeing the quality of the labels is a major challenge. In this paper, we investigate efficient annotation…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-27 · Yuan-Hong Liao, Amlan Kar, Sanja Fidler

We introduce TACO, an open-source, large-scale code generation dataset, with a focus on the topics of algorithms, designed to provide a more challenging training dataset and evaluation benchmark in the field of code generation models. TACO…

Artificial Intelligence · Computer Science · 2023-12-29 · Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, Ge Li

We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-Text annotations can be a costly process, especially when dealing with tables which are the major source of…
