Related papers: Khaos: Dynamically Optimizing Checkpointing for De…

200 papers

Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-12 Morgan Geldenhuys, Lauritz Thamsen, Odej Kao

Distributed Stream Processing systems have become an essential part of big data processing platforms. They are characterized by the high-throughput processing of near to real-time event streams with the goal of delivering low-latency…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-22 Morgan K. Geldenhuys, Dominik Scheinert, Odej Kao, Lauritz Thamsen

Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-21 George Siachamis, Kyriakos Psarakis, Marios Fragkoulis, Arie van Deursen, Paris Carbone, Asterios Katsifodimos

State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-21 Sachini Jayasekara, Aaron Harwood, Shanika Karunasekera

Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-13 Zhinan Cheng, Qun Huang, Patrick P. C. Lee

Parallel computing is very important to accelerate the performance of software systems. Additionally, considering that a recurring challenge is to process high data volumes continuously, stream processing emerged as a paradigm and software…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-14 Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser

Distributed Stream Processing (DSP) systems are capable of processing large streams of unbounded data, offering high throughput and low latencies. To maintain a stable Quality of Service (QoS), these systems require a sufficient allocation…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-03 Benjamin J. J. Pfister, Dominik Scheinert, Morgan K. Geldenhuys, Odej Kao

Node and link churn in multi-party, cross-region clusters over wide-area networks (WANs) often disrupts distributed training. However, checkpoint-based recovery and cloud-centric autoscaling react slowly and assume centralized control,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-16 Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, Mohsen Guizani, Qirong Ho, Steve Liu

Fault injectors are essential tools for evaluating the reliability and resilience of computing systems. They enable the simulation of hardware and software faults to analyze system behavior under error conditions and assess its ability to…

Hardware Architecture · Computer Science 2026-02-03 Elio Vinciguerra, Enrico Russo, Giuseppe Ascia, Maurizio Palesi

Nowadays, several software systems rely on stream processing architectures to deliver scalable performance and handle large volumes of data in near real-time. Stream processing frameworks facilitate scalable computing by distributing the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-30 Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser

Software as a service (SaaS) has recently attracted much attention because it makes the use of software more convenient and cost-effective. At the same time, rising user expectations for high-quality service such as real-time information…

Software Engineering · Computer Science 2016-04-13 Feng-Lin Li, Chi-Hung Chi, Yue Wang, Cong Liu

Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties making a remote faulty process…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-25 Jovan Nikolic, Nursultan Jubatyrov, Evangelos Pournaras

Distributed Stream Processing Engines (DSPEs) target applications related to continuous computation, online machine learning, and real-time query processing. DSPEs operate on high volumes of data by applying lightweight operations on…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-06 Muhammad Anis Uddin Nasir

Failures in networks result in service disruptions which may cause deteriorated Quality of Service (QoS) for the end users. Since SDN is becoming the mainstream paradigm for networks, implementation of a robust fault tolerance scheme for…

Networking and Internet Architecture · Computer Science 2020-10-26 Baris Yamansavascilar, Ahmet Cihat Baktir, Atay Ozgovde, Cem Ersoy

A distributed system consisting of a huge number of computational entities is prone to faults, because faults in a few nodes cause the entire system to fail. Consequently, fault tolerance of distributed systems is a critical issue.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-30 Junya Nakamura, Yonghwan Kim, Yoshiaki Katayama, Toshimitsu Masuzawa

With the increasing importance of distributed scientific workflows, there is a critical need to ensure Quality of Service (QoS) constraints, such as minimizing time or limiting execution to resource subsets. However, the unpredictable…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-02 Md Hasanur Rashid, Jesun Firoz, Nathan R. Tallent, Luanzheng Guo, Meng Tang, Dong Dai

Real-time data processing applications with low latency requirements have led to the increasing popularity of stream processing systems. While such systems offer convenient APIs that can be used to achieve data parallelism automatically,…

Programming Languages · Computer Science 2022-01-04 Konstantinos Kallas, Filip Niksic, Caleb Stanford, Rajeev Alur

The ability to process large numbers of continuous data streams in a near-real-time fashion has become a crucial prerequisite for many scientific and industrial use cases in recent years. While the individual data streams are usually…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-08-06 Björn Lohrmann, Daniel Warneke, Odej Kao

Operating a distributed data stream processing workload efficiently at scale is hard. The operator of the workload must parallelize and lay out its tasks on resources that match the requirements of the target data rate. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-27 Manu Bansal, Eyal Cidon, Arjun Balasingam, Aditya Gudipati, Christos Kozyrakis, Sachin Katti

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2008-06-20 George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou