Related papers: Error-Tolerant Big Data Processing
One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a…
Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require…
Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard…
Entity matching is the problem of identifying which records refer to the same real-world entity. It has been actively researched for decades, and a variety of different approaches have been developed. Even today, it remains a challenging…
The graph is one of the most widely used mathematical structures in engineering and science because of its representational power and inherent ability to demonstrate the relationship between objects. The objective of this work is to…
Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity. Entity matching is a central step in most data integration pipelines. Many state-of-the-art entity matching methods rely on…
Schema matching is the process of identifying correspondences between the elements of two given schemata, essential for database management systems, data integration, and data warehousing. For datasets across different scenarios, the…
We revisit the problem of statistical sequence matching initiated by Unnikrishnan (TIT 2015) and derive theoretical performance guarantees for sequential tests that have bounded expected stopping times. Specifically, in this problem, one is…
Businesses, governmental bodies and NGO's have an ever-increasing amount of data at their disposal from which they try to extract valuable information. Often, this needs to be done not only accurately but also within a short time frame.…
Big data problems frequently require processing datasets in a streaming fashion, either because all data are available at once but collectively are larger than available memory or because the data intrinsically arrive one data point at a…
Entity alignment has always had significant uses within a multitude of diverse scientific fields. In particular, the concept of matching entities across networks has grown in significance in the world of social science as communicative…
This paper presents a new approach to statistical similarity assessment based on sequence alignment. The algorithm performs mutual matching of two random sequences by successively searching for common elements and by applying sequence…
Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose…
As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string…
A similarity join aims to find all similar pairs between two collections of records. Established approaches usually deal with synthetic differences like typos and abbreviations, but neglect the semantic relations between words. Such…
Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face…
In the last twenty years, data series similarity search has emerged as a fundamental operation at the core of several analysis tasks and applications related to data series collections. Many solutions to different mining problems work by…
Edge-labeled graphs are widely used to describe relationships between entities in a database. Given a query subgraph that represents an example of what the user is searching for, we study the problem of efficiently searching for similar…
Pattern matching of streaming time series with lower latency under limited computing resource comes to a critical problem, especially as the growth of Industry 4.0 and Industry Internet of Things. However, against traditional single pattern…
Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a…