Related papers: Error-Tolerant Big Data Processing

End-to-End Entity Resolution for Big Data: A Survey

One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a…

Databases · Computer Science 2020-08-20 Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, Kostas Stefanidis

Progressive Entity Resolution: A Design Space Exploration

Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require…

Databases · Computer Science 2025-03-12 Jakub Maciejewski, Konstantinos Nikoletos, George Papadakis, Yannis Velegrakis

Scalable and robust set similarity join

Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard…

Databases · Computer Science 2018-03-05 Tobias Christiani, Rasmus Pagh, Johan Sivertsen

Neural Networks for Entity Matching: A Survey

Entity matching is the problem of identifying which records refer to the same real-world entity. It has been actively researched for decades, and a variety of different approaches have been developed. Even today, it remains a challenging…

Databases · Computer Science 2021-06-02 Nils Barlaug, Jon Atle Gulla

Some Algorithms on Exact, Approximate and Error-Tolerant Graph Matching

The graph is one of the most widely used mathematical structures in engineering and science because of its representational power and inherent ability to demonstrate the relationship between objects. The objective of this work is to…

Data Structures and Algorithms · Computer Science 2021-01-01 Shri Prakash Dwivedi

Entity Matching using Large Language Models

Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity. Entity matching is a central step in most data integration pipelines. Many state-of-the-art entity matching methods rely on…

Computation and Language · Computer Science 2024-10-21 Ralph Peeters, Aaron Steiner, Christian Bizer

Prompt-Matcher: Leveraging Large Models to Reduce Uncertainty in Schema Matching Results

Schema matching is the process of identifying correspondences between the elements of two given schemata, essential for database management systems, data integration, and data warehousing. For datasets across different scenarios, the…

Databases · Computer Science 2025-03-07 Longyu Feng, Huahang Li, Chen Jason Zhang

Large Deviations for Sequential Tests of Statistical Sequence Matching

We revisit the problem of statistical sequence matching initiated by Unnikrishnan (TIT 2015) and derive theoretical performance guarantees for sequential tests that have bounded expected stopping times. Specifically, in this problem, one is…

Information Theory · Computer Science 2025-06-05 Lin Zhou, Qianyun Wang, Yun Wei, Jingjing Wang

Supervised machine learning techniques for data matching based on similarity metrics

Businesses, governmental bodies and NGOs have an ever-increasing amount of data at their disposal from which they try to extract valuable information. Often, this needs to be done not only accurately but also within a short time frame.…

Machine Learning · Computer Science 2021-09-16 Pim Verschuuren, Serena Palazzo, Tom Powell, Steve Sutton, Alfred Pilgrim, Michele Faucci Giannelli

Efficient Principal Subspace Projection of Streaming Data Through Fast Similarity Matching

Big data problems frequently require processing datasets in a streaming fashion, either because all data are available at once but collectively are larger than available memory or because the data intrinsically arrive one data point at a…

Computation · Statistics 2018-08-08 Andrea Giovannucci, Victor Minden, Cengiz Pehlevan, Dmitri B. Chklovskii

Robust and Scalable Entity Alignment in Big Data

Entity alignment has always had significant uses within a multitude of diverse scientific fields. In particular, the concept of matching entities across networks has grown in significance in the world of social science as communicative…

Social and Information Networks · Computer Science 2020-04-21 James Flamino, Christopher Abriola, Ben Zimmerman, Zhongheng Li, Joel Douglas

Sequence Alignment Algorithm for Statistical Similarity Assessment

This paper presents a new approach to statistical similarity assessment based on sequence alignment. The algorithm performs mutual matching of two random sequences by successively searching for common elements and by applying sequence…

Signal Processing · Electrical Eng. & Systems 2021-06-09 Jakub Nikonowicz, Łukasz Matuszewski, Paweł Kubczak

Data Partitioning for Parallel Entity Matching

Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-06-29 Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm

PASS-JOIN: A Partition-based Method for Similarity Joins

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string…

Databases · Computer Science 2011-12-01 Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng

Efficient Taxonomic Similarity Joins with Adaptive Overlap Constraint

A similarity join aims to find all similar pairs between two collections of records. Established approaches usually deal with synthetic differences like typos and abbreviations, but neglect the semantic relations between words. Such…

Information Retrieval · Computer Science 2018-10-30 Pengfei Xu, Jiaheng Lu

FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face…

Machine Learning · Computer Science 2026-05-21 Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li

Effective and Efficient Variable-Length Data Series Analytics

In the last twenty years, data series similarity search has emerged as a fundamental operation at the core of several analysis tasks and applications related to data series collections. Many solutions to different mining problems work by…

Databases · Computer Science 2020-09-25 Michele Linardi

Efficient Error-tolerant Search on Knowledge Graphs

Edge-labeled graphs are widely used to describe relationships between entities in a database. Given a query subgraph that represents an example of what the user is searching for, we study the problem of efficiently searching for similar…

Databases · Computer Science 2020-05-12 Zhaoyang Shao, Davood Rafiei, Themis Palpanas

Fine-grained Pattern Matching Over Streaming Time Series

Pattern matching of streaming time series with lower latency under limited computing resources has become a critical problem, especially with the growth of Industry 4.0 and the Industrial Internet of Things. However, against traditional single pattern…

Computer Vision and Pattern Recognition · Computer Science 2017-12-05 Rong Kang, Chen Wang, Peng Wang, Yuting Ding, Jianmin Wang

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a…

Databases · Computer Science 2018-03-20 Yuhang Zhang, Kee Siong Ng, Michael Walker, Pauline Chou, Tania Churchill, Peter Christen