Related papers: Cocoon: Semantic Table Profiling Using Large Langu…

Data Cleaning Using Large Language Models

Data cleaning is a crucial yet challenging task in data analysis, often requiring significant manual effort. To automate data cleaning, previous systems have relied on statistical rules derived from erroneous data, resulting in low accuracy…

Databases · Computer Science 2024-10-22 Shuo Zhang, Zezhou Huang, Eugene Wu

LLM-Aided Customizable Profiling of Code Data Based On Programming Language Concepts

Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of…

Software Engineering · Computer Science 2025-03-21 Pankaj Thorat, Adnan Qidwai, Adrija Dhar, Aishwariya Chakraborty, Anand Eswaran, Hima Patel, Praveen Jayachandran

Conf-Profile: A Confidence-Driven Reasoning Paradigm for Label-Free User Profiling

User profiling, as a core technique for user understanding, aims to infer structural attributes from user information. Large Language Models (LLMs) provide a promising avenue for user profiling, yet the progress is hindered by the lack of…

Artificial Intelligence · Computer Science 2025-09-24 Yingxin Li, Jianbo Zhao, Xueyu Ren, Jie Tang, Wangjie You, Xu Chen, Kan Zhou, Chao Feng, Jiao Ran, Yuan Meng, Zhi Wang

Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data

Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leveraged (M)LLMs for data generation, but direct prompting often…

Computer Vision and Pattern Recognition · Computer Science 2025-03-21 Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Rui Wang

Interpreting Performance Profiles with Deep Learning

Profiling tools (also known as profilers) play an important role in understanding program performance at runtime, such as hotspots, bottlenecks, and inefficiencies. While profilers have been proven to be useful, they give extra burden to…

Software Engineering · Computer Science 2025-08-06 Zhuoran Liu

Solving Data Quality Problems with Desbordante: a Demo

Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and…

Databases · Computer Science 2023-07-31 George Chernishev, Michael Polyntsov, Anton Chizhov, Kirill Stupakov, Ilya Shchuckin, Alexander Smirnov, Maxim Strutovsky, Alexey Shlyonskikh, Mikhail Firsov, Stepan Manannikov, Nikita Bobrov, Daniil Goncharov, Ilia Barutkin, Vladislav Shalnev, Kirill Muraviev, Anna Rakhmukova, Dmitriy Shcheka, Anton Chernikov, Mikhail Vyrodov, Yaroslav Kurbatov, Maxim Fofanov, Sergei Belokonnyi, Pavel Anosov, Arthur Saliou, Eduard Gaisin, Kirill Smirnov

FALKON: An Optimal Large Scale Kernel Method

Kernel methods provide a principled way to perform non linear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited…

Machine Learning · Statistics 2018-02-01 Alessandro Rudi, Luigi Carratino, Lorenzo Rosasco

Aligning Large Language Models with Human Opinions through Persona Selection and Value–Belief–Norm Reasoning

Reasoning and predicting human opinions with large language models (LLMs) is essential yet challenging. Current methods employ role-playing with personae but face two major issues: LLMs are sensitive to even a single irrelevant persona,…

Computation and Language · Computer Science 2024-12-17 Do Xuan Long, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen

Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables

Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first…

Databases · Computer Science 2025-04-16 Qixu Chen, Yeye He, Raymond Chi-Wing Wong, Weiwei Cui, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri

ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

Table processing (including cleaning, transformation, augmentation, and matching) is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often…

Artificial Intelligence · Computer Science 2026-05-13 Wei Liu, Yang Gu, Xi Yan, Zihan Nan, Beicheng Xu, Keyao Ding, Bin Cui, Wentao Zhang

Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or…

Machine Learning · Computer Science 2025-03-11 Tommaso Bendinelli, Artur Dox, Christian Holz

Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models

This research aims to unravel how large language models (LLMs) iteratively refine token predictions through internal processing. We utilized a logit lens technique to analyze the model's token predictions derived from intermediate…

Computation and Language · Computer Science 2025-06-10 Jaturong Kongmanee

DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression

Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data. With many open-source LLMs available, selecting the best model for fine-tuning downstream tasks is…

Computation and Language · Computer Science 2025-09-05 Wei Huang, Huang Wei, Yinggui Wang

Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research…

Computation and Language · Computer Science 2025-02-25 Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li

CoRank: LLM-Based Compact Reranking with Document Features for Scientific Retrieval

Scientific retrieval is essential for advancing scientific knowledge discovery. Within this process, document reranking plays a critical role in refining first-stage retrieval results. However, standard LLM listwise reranking faces…

Information Retrieval · Computer Science 2025-08-19 Runchu Tian, Xueqiang Xu, Bowen Jin, SeongKu Kang, Jiawei Han

SEAL: Interactive Tool for Systematic Error Analysis and Labeling

With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards with high aggregate performance. However, many times these models systematically fail on tail data or rare groups not…

Computation and Language · Computer Science 2022-10-13 Nazneen Rajani, Weixin Liang, Lingjiao Chen, Meg Mitchell, James Zou

Semantic Classification of Tabular Datasets via Character-Level Convolutional Neural Networks

A character-level convolutional neural network (CNN) motivated by applications in "automated machine learning" (AutoML) is proposed to semantically classify columns in tabular data. Simulated data containing a set of base classes is first…

Computation and Language · Computer Science 2019-01-25 Paul Azunre, Craig Corcoran, Numa Dhamani, Jeffrey Gleason, Garrett Honke, David Sullivan, Rebecca Ruppel, Sandeep Verma, Jonathon Morgan

Designing Domain-Specific Large Language Models: The Critical Role of Fine-Tuning in Public Opinion Simulation

Large language models (LLMs) have transformed natural language processing, yet face challenges in specialized tasks such as simulating opinions on environmental policies. This paper introduces a novel fine-tuning approach that integrates…

Computation and Language · Computer Science 2024-12-10 Haocheng Lin

Document-Level Tabular Numerical Cross-Checking: A Coarse-to-Fine Approach

Numerical consistency across tables in disclosure documents is critical for ensuring accuracy, maintaining credibility, and avoiding reputational and economic risks. Automated tabular numerical cross-checking presents two significant…

Computation and Language · Computer Science 2025-06-17 Chaoxu Pang, Yixuan Cao, Ganbin Zhou, Hongwei Li, Ping Luo

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demands semantic understanding, beyond traditional value-based predicates. Given enormous…

Databases · Computer Science 2026-05-22 Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang