Related papers: Cocoon: Semantic Table Profiling Using Large Langu…

200 papers

Data cleaning is a crucial yet challenging task in data analysis, often requiring significant manual effort. To automate data cleaning, previous systems have relied on statistical rules derived from erroneous data, resulting in low accuracy…

Databases · Computer Science 2024-10-22 Shuo Zhang, Zezhou Huang, Eugene Wu

Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of…

Software Engineering · Computer Science 2025-03-21 Pankaj Thorat, Adnan Qidwai, Adrija Dhar, Aishwariya Chakraborty, Anand Eswaran, Hima Patel, Praveen Jayachandran

User profiling, as a core technique for user understanding, aims to infer structural attributes from user information. Large Language Models (LLMs) provide a promising avenue for user profiling, yet the progress is hindered by the lack of…

Artificial Intelligence · Computer Science 2025-09-24 Yingxin Li, Jianbo Zhao, Xueyu Ren, Jie Tang, Wangjie You, Xu Chen, Kan Zhou, Chao Feng, Jiao Ran, Yuan Meng, Zhi Wang

Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leveraged (M)LLMs for data generation, but direct prompting often…

Computer Vision and Pattern Recognition · Computer Science 2025-03-21 Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Rui Wang

Profiling tools (also known as profilers) play an important role in understanding program performance at runtime, such as hotspots, bottlenecks, and inefficiencies. While profilers have been proven to be useful, they give extra burden to…

Software Engineering · Computer Science 2025-08-06 Zhuoran Liu

Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and…

Kernel methods provide a principled way to perform nonlinear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited…

Machine Learning · Statistics 2018-02-01 Alessandro Rudi, Luigi Carratino, Lorenzo Rosasco

Reasoning and predicting human opinions with large language models (LLMs) is essential yet challenging. Current methods employ role-playing with personae but face two major issues: LLMs are sensitive to even a single irrelevant persona,…

Computation and Language · Computer Science 2024-12-17 Do Xuan Long, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen

Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first…

Table processing (including cleaning, transformation, augmentation, and matching) is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often…

Artificial Intelligence · Computer Science 2026-05-13 Wei Liu, Yang Gu, Xi Yan, Zihan Nan, Beicheng Xu, Keyao Ding, Bin Cui, Wentao Zhang

High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real-world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or…

Machine Learning · Computer Science 2025-03-11 Tommaso Bendinelli, Artur Dox, Christian Holz

This research aims to unravel how large language models (LLMs) iteratively refine token predictions through internal processing. We utilized a logit lens technique to analyze the model's token predictions derived from intermediate…

Computation and Language · Computer Science 2025-06-10 Jaturong Kongmanee

Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data. With many open-source LLMs available, selecting the best model for fine-tuning downstream tasks is…

Computation and Language · Computer Science 2025-09-05 Wei Huang, Huang Wei, Yinggui Wang

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research…

Computation and Language · Computer Science 2025-02-25 Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li

Scientific retrieval is essential for advancing scientific knowledge discovery. Within this process, document reranking plays a critical role in refining first-stage retrieval results. However, standard LLM listwise reranking faces…

Information Retrieval · Computer Science 2025-08-19 Runchu Tian, Xueqiang Xu, Bowen Jin, SeongKu Kang, Jiawei Han

With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards with high aggregate performance. However, many times these models systematically fail on tail data or rare groups not…

Computation and Language · Computer Science 2022-10-13 Nazneen Rajani, Weixin Liang, Lingjiao Chen, Meg Mitchell, James Zou

A character-level convolutional neural network (CNN) motivated by applications in "automated machine learning" (AutoML) is proposed to semantically classify columns in tabular data. Simulated data containing a set of base classes is first…

Computation and Language · Computer Science 2019-01-25 Paul Azunre, Craig Corcoran, Numa Dhamani, Jeffrey Gleason, Garrett Honke, David Sullivan, Rebecca Ruppel, Sandeep Verma, Jonathon Morgan

Large language models (LLMs) have transformed natural language processing, yet face challenges in specialized tasks such as simulating opinions on environmental policies. This paper introduces a novel fine-tuning approach that integrates…

Computation and Language · Computer Science 2024-12-10 Haocheng Lin

Numerical consistency across tables in disclosure documents is critical for ensuring accuracy, maintaining credibility, and avoiding reputational and economic risks. Automated tabular numerical cross-checking presents two significant…

Computation and Language · Computer Science 2025-06-17 Chaoxu Pang, Yixuan Cao, Ganbin Zhou, Hongwei Li, Ping Luo

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demands semantic understanding, beyond traditional value-based predicates. Given enormous…

Databases · Computer Science 2026-05-22 Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang