Related papers: Scalable Data Discovery Using Profiles

Measuring and Predicting the Quality of a Join for Data Discovery

We study the problem of discovering joinable datasets at scale. We approach the problem from a learning perspective relying on profiles. These are succinct representations that capture the underlying characteristics of the schemata and data…

Databases · Computer Science 2023-06-01 Sergi Nadal, Raquel Panadero, Javier Flores, Oscar Romero

A New Scale for Attribute Dependency in Large Database Systems

Large, data-centric applications are characterized by their different attributes. Today, the vast majority of large data-centric applications are based on the relational model. The databases are collections of tables and every table…

Information Retrieval · Computer Science 2012-06-28 Soumya Sen, Anjan Dutta, Agostino Cortesi, Nabendu Chaki

Scalable Sampling for High Utility Patterns

Discovering valuable insights from data through meaningful associations is a crucial task. However, it becomes challenging when trying to identify representative patterns in quantitative databases, especially with large datasets, as…

Databases · Computer Science 2024-10-31 Lamine Diop, Marc Plantevit

FREYJA: Efficient Join Discovery in Data Lakes

Data lakes are massive repositories of raw and heterogeneous data, designed to meet the requirements of modern data storage. Nonetheless, this same philosophy increases the complexity of performing discovery tasks to find relevant data for…

Databases · Computer Science 2026-01-23 Marc Maynou, Sergi Nadal, Raquel Panadero, Javier Flores, Oscar Romero, Anna Queralt

Dataset Discovery in Data Lakes

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data…

Databases · Computer Science 2020-11-23 Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, Nikolaos Konstantinou

Scalable Prototype Selection by Genetic Algorithms and Hashing

Classification in the dissimilarity space has become a very active research area since it provides a possibility to learn from data given in the form of pairwise non-metric dissimilarities, which otherwise would be difficult to cope with.…

Machine Learning · Statistics 2017-12-27 Yenisel Plasencia-Calaña, Mauricio Orozco-Alzate, Heydi Méndez-Vázquez, Edel García-Reyes, Robert P. W. Duin

JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery

Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON data sets often do not have associated structural information. Consumers of such datasets are often left to browse…

Databases · Computer Science 2023-07-07 Michael J. Mior

Evaluating Joinable Column Discovery Approaches for Context-Aware Search

Joinable Column Discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for…

Databases · Computer Science 2025-10-29 Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Haritha Ananthakrishnan, Oktie Hassanzadeh, Horst Samulowitz, Kavitha Srinivas

Scientific Dataset Discovery via Topic-level Recommendation

Data intensive research requires the support of appropriate datasets. However, it is often time-consuming to discover usable datasets matching a specific research topic. We formulate the dataset discovery problem on an attributed…

Information Retrieval · Computer Science 2021-06-08 Basmah Altaf, Shichao Pei, Xiangliang Zhang

This work tackles the problem of fuzzy joining of strings that naturally tokenize into meaningful substrings, e.g., full names. Tokenized-string joins have several established applications in the context of data integration and cleaning.…

Information Retrieval · Computer Science 2019-03-25 Ahmed Metwally, Chun-Heng Huang

Multi-Attribute Selectivity Estimation Using Deep Learning

Selectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor…

Databases · Computer Science 2019-06-19 Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, Gautam Das

Scalable Feature Matching Across Large Data Collections

This paper is concerned with matching feature vectors in a one-to-one fashion across large collections of datasets. Formulating this task as a multidimensional assignment problem with decomposable costs (MDADC), we develop extremely fast…

Computation · Statistics 2021-01-07 David Degras

Robust and Scalable Entity Alignment in Big Data

Entity alignment has always had significant uses within a multitude of diverse scientific fields. In particular, the concept of matching entities across networks has grown in significance in the world of social science as communicative…

Social and Information Networks · Computer Science 2020-04-21 James Flamino, Christopher Abriola, Ben Zimmerman, Zhongheng Li, Joel Douglas

From Community Detection to Community Profiling

Most existing community-related studies focus on detection, which aims to find the community membership for each user from user friendship links. However, membership alone, without a complete profile of what a community is and how it…

Social and Information Networks · Computer Science 2017-01-18 Hongyun Cai, Vincent W. Zheng, Fanwei Zhu, Kevin Chen-Chuan Chang, Zi Huang

Scalable Joint Models for Reliable Uncertainty-Aware Event Prediction

Missing data and noisy observations pose significant challenges for reliably predicting events from irregularly sampled multivariate time series (longitudinal) data. Imputation methods, which are typically used for completing the data prior…

Machine Learning · Statistics 2017-08-17 Hossein Soleimani, James Hensman, Suchi Saria

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

Finding joinable tables in data lakes is a key procedure in many applications such as data integration, data augmentation, data analysis, and data market. Traditional approaches that find equi-joinable tables are unable to deal with…

Information Retrieval · Computer Science 2023-08-31 Yuyang Dong, Kunihiro Takeoka, Chuan Xiao, Masafumi Oyamada

Scalable Private Partition Selection via Adaptive Weighting

In the differentially private partition selection problem (a.k.a. private set union, private key discovery), users hold subsets of items from an unbounded universe. The goal is to output as many items as possible from the union of the…

Data Structures and Algorithms · Computer Science 2025-08-12 Justin Y. Chen, Vincent Cohen-Addad, Alessandro Epasto, Morteza Zadimoghaddam

Consistent and Flexible Selectivity Estimation for High-Dimensional Data

Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion. Answering this problem accurately and efficiently is essential to many applications, such as density estimation, outlier detection,…

Databases · Computer Science 2021-05-28 Yaoshu Wang, Chuan Xiao, Jianbin Qin, Rui Mao, Onizuka Makoto, Wei Wang, Rui Zhang, Yoshiharu Ishikawa

Recognizing Variables from their Data via Deep Embeddings of Distributions

A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be…

Machine Learning · Computer Science 2019-09-12 Jonas Mueller, Alex Smola

Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging

Co-clustering simultaneously clusters rows and columns, revealing more fine-grained groups. However, existing co-clustering methods suffer from poor scalability and cannot handle large-scale data. This paper presents a novel and scalable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Zihan Wu, Zhaoke Huang, Hong Yan