Related papers: Scalable Data Discovery Using Profiles

200 papers

We study the problem of discovering joinable datasets at scale. We approach the problem from a learning perspective relying on profiles. These are succinct representations that capture the underlying characteristics of the schemata and data…

Databases · Computer Science 2023-06-01 Sergi Nadal, Raquel Panadero, Javier Flores, Oscar Romero

Large, data-centric applications are characterized by their different attributes. Today, the vast majority of large data-centric applications are based on the relational model. The databases are collections of tables and every table…

Information Retrieval · Computer Science 2012-06-28 Soumya Sen, Anjan Dutta, Agostino Cortesi, Nabendu Chaki

Discovering valuable insights from data through meaningful associations is a crucial task. However, it becomes challenging when trying to identify representative patterns in quantitative databases, especially with large datasets, as…

Databases · Computer Science 2024-10-31 Lamine Diop, Marc Plantevit

Data lakes are massive repositories of raw and heterogeneous data, designed to meet the requirements of modern data storage. Nonetheless, this same philosophy increases the complexity of performing discovery tasks to find relevant data for…

Databases · Computer Science 2026-01-23 Marc Maynou, Sergi Nadal, Raquel Panadero, Javier Flores, Oscar Romero, Anna Queralt

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data…

Databases · Computer Science 2020-11-23 Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, Nikolaos Konstantinou

Classification in the dissimilarity space has become a very active research area since it provides a possibility to learn from data given in the form of pairwise non-metric dissimilarities, which otherwise would be difficult to cope with.…

Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON data sets often do not have associated structural information. Consumers of such datasets are often left to browse…

Databases · Computer Science 2023-07-07 Michael J. Mior

Joinable Column Discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for…

Data intensive research requires the support of appropriate datasets. However, it is often time-consuming to discover usable datasets matching a specific research topic. We formulate the dataset discovery problem on an attributed…

Information Retrieval · Computer Science 2021-06-08 Basmah Altaf, Shichao Pei, Xiangliang Zhang

This work tackles the problem of fuzzy joining of strings that naturally tokenize into meaningful substrings, e.g., full names. Tokenized-string joins have several established applications in the context of data integration and cleaning.…

Information Retrieval · Computer Science 2019-03-25 Ahmed Metwally, Chun-Heng Huang

Selectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor…

Databases · Computer Science 2019-06-19 Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, Gautam Das

This paper is concerned with matching feature vectors in a one-to-one fashion across large collections of datasets. Formulating this task as a multidimensional assignment problem with decomposable costs (MDADC), we develop extremely fast…

Computation · Statistics 2021-01-07 David Degras

Entity alignment has always had significant uses within a multitude of diverse scientific fields. In particular, the concept of matching entities across networks has grown in significance in the world of social science as communicative…

Social and Information Networks · Computer Science 2020-04-21 James Flamino, Christopher Abriola, Ben Zimmerman, Zhongheng Li, Joel Douglas

Most existing community-related studies focus on detection, which aims to find the community membership for each user from user friendship links. However, membership alone, without a complete profile of what a community is and how it…

Social and Information Networks · Computer Science 2017-01-18 Hongyun Cai, Vincent W. Zheng, Fanwei Zhu, Kevin Chen-Chuan Chang, Zi Huang

Missing data and noisy observations pose significant challenges for reliably predicting events from irregularly sampled multivariate time series (longitudinal) data. Imputation methods, which are typically used for completing the data prior…

Machine Learning · Statistics 2017-08-17 Hossein Soleimani, James Hensman, Suchi Saria

Finding joinable tables in data lakes is a key procedure in many applications such as data integration, data augmentation, data analysis, and data market. Traditional approaches that find equi-joinable tables are unable to deal with…

Information Retrieval · Computer Science 2023-08-31 Yuyang Dong, Kunihiro Takeoka, Chuan Xiao, Masafumi Oyamada

In the differentially private partition selection problem (a.k.a. private set union, private key discovery), users hold subsets of items from an unbounded universe. The goal is to output as many items as possible from the union of the…

Data Structures and Algorithms · Computer Science 2025-08-12 Justin Y. Chen, Vincent Cohen-Addad, Alessandro Epasto, Morteza Zadimoghaddam

Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion. Answering this problem accurately and efficiently is essential to many applications, such as density estimation, outlier detection,…

Databases · Computer Science 2021-05-28 Yaoshu Wang, Chuan Xiao, Jianbin Qin, Rui Mao, Onizuka Makoto, Wei Wang, Rui Zhang, Yoshiharu Ishikawa

A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be…

Machine Learning · Computer Science 2019-09-12 Jonas Mueller, Alex Smola

Co-clustering simultaneously clusters rows and columns, revealing more fine-grained groups. However, existing co-clustering methods suffer from poor scalability and cannot handle large-scale data. This paper presents a novel and scalable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Zihan Wu, Zhaoke Huang, Hong Yan