Related papers: Efficient Time-Evolving Stream Processing at Scale

When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing

Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-01-28 Muhammad Anis Uddin Nasir, Gianmarco De Francisci Morales, Nicolas Kourtellis, Marco Serafini

Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random…

Data Structures and Algorithms · Computer Science 2022-07-19 Dimitris Bertsimas, Vassilis Digalakis

Parallel Stream Processing Against Workload Skewness and Variance

Key-based workload partitioning is a common strategy used in parallel stream processing engines, enabling effective key-value tuple distribution over worker threads in a logical operator. While randomized hashing on the keys is capable of…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-12-14 Junhua Fang, Rong Zhang, Tom Z. J. Fu, Zhenjie Zhang, Aoying Zhou, Junhua Zhu

Dynamic Adaptation in Data Storage: Real-Time Machine Learning for Enhanced Prefetching

The exponential growth of data storage demands has necessitated the evolution of hierarchical storage management strategies [1]. This study explores the application of streaming machine learning [3] to revolutionize data prefetching within…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-30 Chiyu Cheng, Chang Zhou, Yang Zhao, Jin Cao

Progressive Temporal Window Widening

This paper introduces a scheme for data stream processing which is robust to batch duration. Streaming frameworks process streams in batches retrieved at fixed time intervals. In a common setting a pattern recognition algorithm is applied…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-20 David Tolpin

Toward Reliable and Rapid Elasticity for Streaming Dataflows on Clouds

The pervasive availability of streaming data is driving interest in distributed Fast Data platforms for streaming applications. Such latency-sensitive applications need to respond to dynamism in the input rates and task behavior using…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-13 Anshu Shukla, Yogesh Simmhan

System-aware dynamic partitioning for batch and streaming workloads

When processing data streams with highly skewed and nonstationary key distributions, we often observe overloaded partitions when the hash partitioning fails to balance data correctly. To avoid slow tasks that delay the completion of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-01 Zoltán Zvara, Péter G. N. Szabó, Balázs Barnabás Lóránt, András A. Benczúr

Managing Large-Scale Transient Data in IoT Systems

The pervasive availability of streaming data is driving interest in distributed Fast Data platforms for streaming applications. Such latency-sensitive applications need to respond to dynamism in the input rates and task behavior using…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-28 Nanjangud C. Narendra, Sambit Nayak, Anshu Shukla

Resource- and Message Size-Aware Scheduling of Stream Processing at the Edge with application to Realtime Microscopy

Whilst computational resources at the cloud edge can be leveraged to improve latency and reduce the costs of cloud services for a wide variety of mobile, web, and IoT applications, such resources are naturally constrained. For distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-20 Ben Blamey, Ida-Maria Sintorn, Andreas Hellander, Salman Toor

Cloudprofiler: TSC-based inter-node profiling and high-throughput data ingestion for cloud streaming workloads

To conduct real-time analytics computations, big data stream processing engines are required to process unbounded data streams at millions of events per second. However, current streaming engines exhibit low throughput and high tuple…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-11 Shinhyung Yang, Jiun Jeong, Bernhard Scholz, Bernd Burgstaller

A Framework for Simulating Real-world Stream Data of the Internet of Things

With the rapid growth in the number of devices of the Internet of Things (IoT), the volume and types of stream data are rapidly increasing in the real world. Unfortunately, the stream data has the characteristics of infinite and periodic…

Performance · Computer Science 2022-12-13 Weirong Xiu, Baozhu Li, Xusheng Du, Zheng Chu

LifeStream: A High-Performance Stream Processing Engine for Periodic Streams

Hospitals around the world collect massive amounts of physiological data from their patients every day. Recently, there has been an increase in research interest to subject this data to statistical analysis to gain more insights and provide…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-04 Anand Jayarajan, Kimberly Hau, Andrew Goodwin, Gennady Pekhimenko

Distributed Data Stream Processing and Edge Computing: A Survey on Resource Elasticity and Future Directions

Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructure, wearable assistance, and Internet of Things, continuous data streams must be processed under very short delays. Several…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-12-05 Marcos Dias de Assuncao, Alexandre da Silva Veith, Rajkumar Buyya

Strider: A Hybrid Adaptive Distributed RDF Stream Processing Engine

Real-time processing of data streams emanating from sensors is becoming a common task in Internet of Things scenarios. The key implementation goal consists in efficiently handling massive incoming data streams and supporting advanced data…

Databases · Computer Science 2017-05-17 Xiangnan Ren, Olivier Curé

Raising the ClaSS of Streaming Time Series Segmentation

Ubiquitous sensors today emit high frequency streams of numerical measurements that reflect properties of human, animal, industrial, commercial, and natural processes. Shifts in such processes, e.g. caused by external events or internal…

Machine Learning · Computer Science 2025-04-04 Arik Ermshaus, Patrick Schäfer, Ulf Leser

Evolving Large-Scale Data Stream Analytics based on Scalable PANFIS

Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning used in these frameworks still uses an offline algorithm model which cannot…

Artificial Intelligence · Computer Science 2018-07-19 Mahardhika Pratama, Choiru Za'in, Eric Pardede

Elastic Scheduling of Intermittent Query Processing in a Cluster Environment

Many applications process a stream of tuples over a window duration, and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches) instead of eagerly…

Databases · Computer Science 2026-05-19 Saranya Chandrasekaran, S. Sudarshan

A Utilization Model for Optimization of Checkpoint Intervals in Distributed Stream Processing Systems

State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-21 Sachini Jayasekara, Aaron Harwood, Shanika Karunasekera

A Scalable and Robust Framework for Data Stream Ingestion

An essential part of building a data-driven organization is the ability to handle and process continuous streams of data to discover actionable insights. The explosive growth of interconnected devices and the social Web has led to a large…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-23 Haruna Isah, Farhana Zulkernine

Improving the performance of bagging ensembles for data streams through mini-batching

Often, machine learning applications have to cope with dynamic environments where data are collected in the form of continuous data streams with potentially infinite length and transient behavior. Compared to traditional (batch) data…

Machine Learning · Computer Science 2021-12-21 Guilherme Cassales, Heitor Gomes, Albert Bifet, Bernhard Pfahringer, Hermes Senger