Related papers: Weighted Random Sampling over Joins

Random Sampling over Spatial Range Joins

Spatial range joins have many applications, including geographic information systems, location-based social networking services, neuroscience, and visualization. However, joins incur not only expensive computational costs but also too large…

Databases · Computer Science 2025-08-22 Daichi Amagata

Subset Sampling over Joins

Subset sampling (also known as Poisson sampling), where the decision to include any specific element in the sample is made independently of all others, is a fundamental primitive in data analytics, enabling efficient approximation by…

Databases · Computer Science 2025-12-19 Aryan Esmailpour, Xiao Hu, Jinchao Huang, Stavros Sintos

Perfect and Maximum Randomness in Stratified Sampling over Joins

Supporting sampling in the presence of joins is an important problem in data analysis, but is inherently challenging due to the need to avoid correlation between output tuples. Current solutions provide either correlated or non-correlated…

Databases · Computer Science 2017-02-15 Niranjan Kamat, Arnab Nandi

Joins on Samples: A Theoretical Guide for Practitioners

Despite decades of research on approximate query processing (AQP), our understanding of sample-based joins has remained limited and, to some extent, even superficial. The common belief in the community is that joining random samples is…

Databases · Computer Science 2020-01-28 Dawei Huang, Dong Young Yoon, Seth Pettie, Barzan Mozafari

Sampling over Union of Joins

Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the…

Databases · Computer Science 2023-03-10 Yurong Liu, Yunlong Xu, Fatemeh Nargesian

Weighted Reservoir Sampling With Replacement from Data Streams

In this work, we present a new random sampling method for data streams where the probability of an element's inclusion in the sample is proportional to a weight associated with that element. Our method is based on sampling with replacement,…

Data Structures and Algorithms · Computer Science 2026-03-18 Adriano Meligrana, Adriano Fazzone

Finding Associations and Computing Similarity via Biased Pair Sampling

This version is ***superseded*** by a full version that can be found at http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger theoretical results and fixes a mistake in the reporting of experiments. Abstract:…

Data Structures and Algorithms · Computer Science 2010-02-17 Andrea Campagna, Rasmus Pagh

Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments

Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at…

Databases · Computer Science 2010-11-11 Edith Cohen, Haim Kaplan, Subhabrata Sen

Reservoir Sampling over Joins

Sampling over joins is a fundamental task in large-scale data analytics. Instead of computing the full join results, which could be massive, a uniform sample of the join results would suffice for many purposes, such as answering analytical…

Databases · Computer Science 2024-04-11 Binyang Dai, Xiao Hu, Ke Yi

Weighted Random Sampling over Data Streams

In this work, we present a comprehensive treatment of weighted random sampling (WRS) over data streams. More precisely, we examine two natural interpretations of the item weights, describe an existing algorithm for each case ([2, 4]),…

Data Structures and Algorithms · Computer Science 2015-07-29 Pavlos S. Efraimidis

Sampling Multiple Nodes in Large Networks: Beyond Random Walks

Sampling random nodes is a fundamental algorithmic primitive in the analysis of massive networks, with many modern graph mining algorithms critically relying on it. We consider the task of generating a large collection of random nodes in…

Social and Information Networks · Computer Science 2021-10-27 Omri Ben-Eliezer, Talya Eden, Joel Oren, Dimitris Fotakis

Sampling to estimate arbitrary subset sums

Starting with a set of weighted items, we want to create a generic sample of a certain size that we can later use to estimate the total weight of arbitrary subsets. For this purpose, we propose priority sampling which tested on Internet…

Data Structures and Algorithms · Computer Science 2007-05-23 Nick Duffield, Carsten Lund, Mikkel Thorup

A Weighted Likelihood Approach Based on Statistical Data Depths

We propose a general approach to construct weighted likelihood estimating equations with the aim of obtaining robust estimates. The weight, attached to each score contribution, is evaluated by comparing the statistical data depth at the model…

Methodology · Statistics 2018-02-16 Claudio Agostinelli

Weighted composite likelihood for linear mixed models in complex samples

Fitting mixed models to complex survey data is a challenging problem. Most methods in the literature, including the most widely used one, require a close relationship between the model structure and the survey design. In this paper we…

Methodology · Statistics 2023-11-23 Thomas Lumley, Xudong Huang

Poisson Sampling over Acyclic Joins

We introduce the problem of Poisson sampling over joins: compute a sample of the result of a join query by conceptually performing a Bernoulli trial for each join tuple, using a non-uniform and tuple-specific probability. We propose an…

Databases · Computer Science 2026-03-17 Liese Bekkers, Frank Neven, Lorrens Pantelis, Stijn Vansummeren

Joint Models with Multiple Markers and Multiple Time-to-event Outcomes Using Variational Approximations

Joint models are well suited to modelling linked data from laboratories and health registers. However, there are few examples of joint models that allow for (a) multiple markers, (b) multiple survival outcomes (including terminal events,…

Methodology · Statistics 2025-12-17 Benjamin Christoffersen, Keith Humphreys, Alessandro Gasparini, Birzhan Akynkozhayev, Hedvig Kjellström, Mark Clements

Preference-driven Similarity Join

Similarity join, which can find similar objects (e.g., products, names, addresses) across different sources, is powerful in dealing with variety in big data, especially web data. Threshold-driven similarity join, which has been extensively…

Databases · Computer Science 2017-07-13 Chuancong Gao, Jiannan Wang, Jian Pei, Rui Li, Yi Chang

Learning from networked examples

Many machine learning algorithms are based on the assumption that training examples are drawn independently. However, this assumption does not hold anymore when learning from a networked sample because two or more training examples may…

Artificial Intelligence · Computer Science 2017-06-06 Yuyi Wang, Jan Ramon, Zheng-Chu Guo

Computationally efficient methods for fitting mixed models to electronic health records data

Motivated by two case studies using primary care records from the Clinical Practice Research Datalink, we describe statistical methods that facilitate the analysis of tall data, with very large numbers of observations. Our focus is on…

Methodology · Statistics 2018-05-14 Kirsty Rhodes, Rebecca Turner, Rupert Payne, Ian White

Vertex-Context Sampling for Weighted Network Embedding

In recent years, network embedding methods have garnered increasing attention because of their effectiveness in various information retrieval tasks. The goal is to learn low-dimensional representations of vertexes in an information network…

Social and Information Networks · Computer Science 2017-11-02 Chih-Ming Chen, Yi-Hsuan Yang, Yian Chen, Ming-Feng Tsai