Related papers: Parallel Streaming Random Sampling
In this paper we study how to perform distinct sampling in the streaming model where data contain near-duplicates. The goal of distinct sampling is to return a distinct element uniformly at random from the universe of elements, given that…
The number of triangles in a graph is a fundamental metric, used in social network analysis, link classification and recommendation, and more. Driven by these applications and the trend that modern graph datasets are both large and dynamic,…
We initiate the study of graph algorithms in the streaming setting on massive distributed and parallel systems inspired by practical data processing systems. The objective is to design algorithms that can efficiently process evolving graphs…
The availability of a large number of processing nodes in a parallel and distributed computing environment enables sophisticated real-time processing over high-speed data streams, as required by many emerging applications. Sliding window…
Often, machine learning applications have to cope with dynamic environments where data are collected in the form of continuous data streams with potentially infinite length and transient behavior. Compared to traditional (batch) data…
We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we…
Stochastic equations play an important role in computational science, due to their ability to treat a wide variety of complex statistical problems. However, current algorithms are strongly limited by their sampling variance, which scales…
A new unequal probability sampling method is proposed. This method is sequential. The decision whether to select each unit is made based on the order in which the units appear. A variant of this method allows selecting a sample from a…
Given a stream of data, a typical approach in streaming algorithms is to design a sophisticated algorithm with small memory that computes a specific statistic over the streaming data. Usually, if one wants to compute a different statistic…
In this paper, we design the first streaming algorithms for the problem of multitasking scheduling on parallel machines with shared processing. In one pass, our streaming approximation schemes can provide an approximate value of the optimal…
In this paper we study the extraction of representative elements in the data stream model in the form of submodular maximization. Different from the previous work on streaming submodular maximization, we are interested only in the recent…
We study the problem of minimizing total completion time on parallel machines subject to varying processing capacity. In this paper, we develop an approximation scheme for the problem under the data stream model where the input data is…
We introduce and study the problem of computing the similarity self-join in a streaming context (SSSJ), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose…
This paper introduces a scheme for data stream processing which is robust to batch duration. Streaming frameworks process streams in batches retrieved at fixed time intervals. In a common setting a pattern recognition algorithm is applied…
In statistical learning for real-world large-scale data problems, one must often resort to "streaming" algorithms which operate sequentially on small batches of data. In this work, we present an analysis of the information-theoretic limits…
The manuscript introduces a method to select a random sample from a stream by deciding on each sampling unit immediately after observing it. The process could be applied to unequal as well as equal probability sampling. The implementation…
We consider the problem of sampling $n$ numbers from the range $\{1,\ldots,N\}$ without replacement on modern architectures. The main result is a simple divide-and-conquer scheme that makes sequential algorithms more cache efficient and…
The practicality of a video surveillance system is limited by the number of queries that can be placed on human resources and their vigilance in response. To transcend this limitation, a major effort under way is to include…
Streaming computation plays an important role in large-scale data analysis. The sliding window model is a model of streaming computation which also captures the recency of the data. In this model, data arrives one item at a time, but only…
A streaming model is one where data items arrive over a long period of time, either one item at a time or in bursts. Typical tasks include computing various statistics over a sliding window of some fixed time horizon. What makes the streaming…