Related papers: Data Partitioning for Parallel Entity Matching

Parallelizing Query Optimization on Shared-Nothing Architectures

Data processing systems offer an ever increasing degree of parallelism on the levels of cores, CPUs, and processing nodes. Query optimization must exploit high degrees of parallelism in order not to gradually become the bottleneck of query…

Databases · Computer Science 2015-11-06 Immanuel Trummer , Christoph Koch

Methods for Partitioning Data to Improve Parallel Execution Time for Sorting on Heterogeneous Clusters

The aim of the paper is to introduce general techniques in order to optimize the parallel execution time of sorting on a distributed architectures with processors of various speeds. Such an application requires a partitioning step. For…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Christophe Cérin , Jean-Christophe Dubacq , Jean-Louis Roch , the SafeScale Collaboration

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen , Dominik Scheinert , Jonathan Will , Jonathan Bader , Odej Kao

Multi-Resource Parallel Query Scheduling and Optimization

Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and…

Databases · Computer Science 2014-04-01 Minos Garofalakis , Yannis Ioannidis

MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching

Entity Matching (EM), which aims to identify all entity pairs referring to the same real-world entity from relational tables, is one of the most important tasks in real-world data management systems. Due to the labeling process of EM being…

Databases · Computer Science 2023-08-07 Xiaocan Zeng , Pengfei Wang , Yuren Mao , Lu Chen , Xiaoze Liu , Yunjun Gao

Progressive Entity Resolution: A Design Space Exploration

Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require…

Databases · Computer Science 2025-03-12 Jakub Maciejewski , Konstantinos Nikoletos , George Papadakis , Yannis Velegrakis

Optimizing ETL Dataflow Using Shared Caching and Parallelization Methods

Extract-Transform-Load (ETL) handles large amount of data and manages workload through dataflows. ETL dataflows are widely regarded as complex and expensive operations in terms of time and system resources. In order to minimize the time and…

Databases · Computer Science 2014-09-08 Xiufeng Liu

WebParF: A Web partitioning framework for Parallel Crawlers

With the ever proliferating size and scale of the WWW [1] efficient ways of exploring content are of increasing importance. How can we efficiently retrieve information from it through crawling? And in this era of tera and multi-core…

Information Retrieval · Computer Science 2014-06-24 Sonali Gupta , Komal kumar Bhatia , Pikakshi Manchanda

The Battleship Approach to the Low Resource Entity Matching Problem

Entity matching, a core data integration problem, is the task of deciding whether two data tuples refer to the same real-world entity. Recent advances in deep learning methods, using pre-trained language models, were proposed for resolving…

Databases · Computer Science 2023-11-28 Bar Genossar , Avigdor Gal , Roee Shraga

Co-Scheduling Algorithms for High-Throughput Workload Execution

This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several…

Data Structures and Algorithms · Computer Science 2013-05-01 Guillaume Aupy , Manu Shantharam , Anne Benoit , Yves Robert , Padma Raghavan

The Simulation Model Partitioning Problem: an Adaptive Solution Based on Self-Clustering (Extended Version)

This paper is about partitioning in parallel and distributed simulation. That means decomposing the simulation model into a numberof components and to properly allocate them on the execution units. An adaptive solution based on…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-07 Gabriele D'Angelo

Efficiently Scheduling Parallel DAG Tasks on Identical Multiprocessors

Parallel real-time embedded applications can be modelled as directed acyclic graphs (DAGs) whose nodes model subtasks and whose edges model precedence constraints among subtasks. Efficiently scheduling such parallel tasks can be challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-24 Shardul Lendve , Konstantinos Bletsas , Pedro F. Souto

Neural Networks for Entity Matching: A Survey

Entity matching is the problem of identifying which records refer to the same real-world entity. It has been actively researched for decades, and a variety of different approaches have been developed. Even today, it remains a challenging…

Databases · Computer Science 2021-06-02 Nils Barlaug , Jon Atle Gulla

Robust and Scalable Entity Alignment in Big Data

Entity alignment has always had significant uses within a multitude of diverse scientific fields. In particular, the concept of matching entities across networks has grown in significance in the world of social science as communicative…

Social and Information Networks · Computer Science 2020-04-21 James Flamino , Christopher Abriola , Ben Zimmerman , Zhongheng Li , Joel Douglas

Distributed Holistic Clustering on Linked Data

Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link…

Databases · Computer Science 2017-08-31 Markus Nentwig , Anika Groß , Maximilian Möller , Erhard Rahm

Efficient Principal Subspace Projection of Streaming Data Through Fast Similarity Matching

Big data problems frequently require processing datasets in a streaming fashion, either because all data are available at once but collectively are larger than available memory or because the data intrinsically arrive one data point at a…

Computation · Statistics 2018-08-08 Andrea Giovannucci , Victor Minden , Cengiz Pehlevan , Dmitri B. Chklovskii

Data Placement and Replica Selection for Improving Co-location in Distributed Environments

Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases, and systems that support alternative…

Databases · Computer Science 2013-02-19 K. Ashwin Kumar , Amol Deshpande , Samir Khuller

High-Quality Hierarchical Process Mapping

Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation when processing graphs on a parallel computer. When a topology of a distributed system is known an important task…

Data Structures and Algorithms · Computer Science 2020-01-23 Marcelo Fonseca Faraj , Alexander van der Grinten , Henning Meyerhenke , Jesper Larsson Träff , Christian Schulz

Analysis of Distributed Algorithms for Big-data

The parallel and distributed processing are becoming de facto industry standard, and a large part of the current research is targeted on how to make computing scalable and distributed, dynamically, without allocating the resources on…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-10 Rajendra Purohit , K R Chowdhary , S D Purohit

Multifile Partitioning for Record Linkage and Duplicate Detection

Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this…

Methodology · Statistics 2021-10-11 Serge Aleshin-Guendel , Mauricio Sadinle