Related papers: MATE: Multi-Attribute Table Extraction

Efficiently Estimating Mutual Information Between Attributes Across Tables

Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, it is challenging to efficiently discover relevant external…

Databases · Computer Science 2025-03-06 Aécio Santos, Flip Korn, Juliana Freire

Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval

Retrieving relevant tables containing the necessary information to accurately answer a given question over tables is critical to open-domain question-answering (QA) systems. Previous methods assume the answer to such a question can be found…

Information Retrieval · Computer Science 2025-01-13 Peter Baile Chen, Yi Zhang, Dan Roth

Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective

Table retrieval, essential for accessing information through tabular data, is less explored compared to text retrieval. The row/column structure and distinct fields of tables (including titles, headers, and cells) present unique challenges.…

Information Retrieval · Computer Science 2025-03-05 Da Li, Keping Bi, Jiafeng Guo, Xueqi Cheng

WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses

Data discovery is a major challenge in enterprise data analysis: users often struggle to find data relevant to their analysis goals or even to navigate through data across data sources, each of which may easily contain thousands of tables.…

Databases · Computer Science 2023-01-04 Tianji Cong, James Gale, Jason Frantz, H. V. Jagadish, Çağatay Demiralp

MATE: Multi-view Attention for Table Transformer Efficiency

This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20…

Computation and Language · Computer Science 2021-09-10 Julian Martin Eisenschlos, Maharshi Gor, Thomas Müller, William W. Cohen

Missingness-Adaptive Factor Identification in High-Dimensional Data

Determining the number of factors in high-dimensional factor models remains a fundamental challenge, particularly when data are incomplete. This paper introduces the concept of identifiable factors, those that can be reliably recovered…

Methodology · Statistics 2026-04-21 Ping Zeng, Yicheng Zeng, Lixing Zhu

Scalable Data Discovery Using Profiles

We study the problem of discovering joinable datasets at scale. That is, how to automatically discover pairs of attributes in a massive collection of independent, heterogeneous datasets that can be joined. Exact (e.g., based on distinct…

Databases · Computer Science 2020-12-07 Javier Flores, Sergi Nadal, Oscar Romero

Tablext: A Combined Neural Network And Heuristic Based Table Extractor

A significant portion of the data available today is found within tables. Therefore, it is necessary to use automated table extraction to obtain thorough results when data-mining. Today's popular state-of-the-art methods for table…

Information Retrieval · Computer Science 2021-04-26 Zach Colter, Morteza Fayazi, Zineb Benameur-El, Serafina Kamp, Shuyan Yu, Ronald Dreslinski

Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

Machine-learning from a disparate set of tables, a data lake, requires assembling features by merging and aggregating tables. Data discovery can extend autoML to data tables by automating these steps. We present an in-depth analysis of such…

Databases · Computer Science 2025-05-20 Riccardo Cappuzzo, Aimee Coelho, Felix Lefebvre, Paolo Papotti, Gael Varoquaux

QJoin: Transformation-aware Joinable Data Discovery Using Reinforcement Learning

Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discovery methods are largely designed for equi-joins,…

Databases · Computer Science 2025-12-03 Ning Wang, Sainyam Galhotra

PATE: Proximity-Aware Time series anomaly Evaluation

Evaluating anomaly detection algorithms in time series data is critical as inaccuracies can lead to flawed decision-making in various domains where real-time analytics and data-driven strategies are essential. Traditional performance…

Machine Learning · Computer Science 2024-05-21 Ramin Ghorbani, Marcel J. T. Reinders, David M. J. Tax

Measuring and Predicting the Quality of a Join for Data Discovery

We study the problem of discovering joinable datasets at scale. We approach the problem from a learning perspective relying on profiles. These are succinct representations that capture the underlying characteristics of the schemata and data…

Databases · Computer Science 2023-06-01 Sergi Nadal, Raquel Panadero, Javier Flores, Oscar Romero

MATE: A Model-based Algorithm Tuning Engine

In this paper, we introduce a Model-based Algorithm Tuning Engine, namely MATE, where the parameters of an algorithm are represented as expressions of the features of a target optimisation problem. In contrast to most static…

Neural and Evolutionary Computing · Computer Science 2021-02-16 Mohamed El Yafrani, Marcella Scoczynski Ribeiro Martins, Inkyung Sung, Markus Wagner, Carola Doerr, Peter Nielsen

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

Finding joinable tables in data lakes is a key procedure in many applications such as data integration, data augmentation, data analysis, and data market. Traditional approaches that find equi-joinable tables are unable to deal with…

Information Retrieval · Computer Science 2023-08-31 Yuyang Dong, Kunihiro Takeoka, Chuan Xiao, Masafumi Oyamada

REaR: Retrieve, Expand and Refine for Effective Multitable Retrieval

Answering natural language queries over relational data often requires retrieving and reasoning over multiple tables, yet most retrievers optimize only for query-table relevance and ignore table-table compatibility. We introduce REAR…

Information Retrieval · Computer Science 2025-11-04 Rishita Agarwal, Himanshu Singhal, Peter Baile Chen, Manan Roy Choudhury, Dan Roth, Vivek Gupta

OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories

How can we discover join relationships among columns of tabular data in a data repository? Can this be done effectively when metadata is missing? Traditional column matching works mainly rely on similarity measures based on exact value…

Databases · Computer Science 2024-03-13 Christos Koutras, Jiani Zhang, Xiao Qin, Chuan Lei, Vasileios Ioannidis, Christos Faloutsos, George Karypis, Asterios Katsifodimos

Medical artificial intelligence toolbox (MAIT): an explainable machine learning framework for binary classification, survival modelling, and regression analyses

While machine learning offers diverse techniques suitable for exploring various medical research questions, a cohesive synergistic framework can facilitate the integration and understanding of new approaches within unified model development…

Machine Learning · Computer Science 2025-01-09 Ramtin Zargari Marandi, Anne Svane Frahm, Jens Lundgren, Daniel Dawson Murray, Maja Milojevic

IoT Data Discovery: Routing Table and Summarization Techniques

In this paper, we consider the IoT data discovery problem in very large and growing scale networks. Through analysis, examples, and experimental studies, we show the importance of peer-to-peer, unstructured routing for IoT data discovery…

Networking and Internet Architecture · Computer Science 2022-05-09 Hieu Tran, Son Nguyen, I-Ling Yen, Farokh Bastani

Neural Metric Learning for Fast End-to-End Relation Extraction

Relation extraction (RE) is an indispensable information extraction task in several disciplines. RE models typically assume that named entity recognition (NER) is already performed in a previous step by another independent model. Several…

Computation and Language · Computer Science 2019-08-29 Tung Tran, Ramakanth Kavuluru

Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context

Documents are often used for knowledge sharing and preservation in business and science, within which are tables that capture most of the critical data. Unfortunately, most documents are stored and distributed as PDF or scanned images,…

Computer Vision and Pattern Recognition · Computer Science 2020-12-03 Xinyi Zheng, Doug Burdick, Lucian Popa, Xu Zhong, Nancy Xin Ru Wang