Related papers: A Low Overhead Minimum Process Global Snapshop Col…

Minimum Process Coordinated Checkpointing Scheme for Ad Hoc Networks

The wireless mobile ad hoc network (MANET) architecture is one consisting of a set of mobile hosts capable of communicating with each other without the assistance of base stations. This has made possible creating a mobile distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-11-10 Ruchi Tuli, Parveen Kumar

A cooperative partial snapshot algorithm for checkpoint-rollback recovery of large-scale and dynamic distributed systems and experimental evaluations

A distributed system consisting of a huge number of computational entities is prone to faults, because faults in a few nodes cause the entire system to fail. Consequently, fault tolerance of distributed systems is a critical issue.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-30 Junya Nakamura, Yonghwan Kim, Yoshiaki Katayama, Toshimitsu Masuzawa

Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-20 Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu

Analysis of Recent Checkpointing Techniques for Mobile Computing Systems

Recovery from transient failures is one of the prime issues in the context of distributed systems. These systems demand to have transparent yet efficient techniques to achieve the same. Checkpoint is defined as a designated place in a…

Networking and Internet Architecture · Computer Science 2011-09-01 Ruchi Tuli, Parveen Kumar

Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI

Taking snapshots of the state of a distributed computation is useful for off-line analysis of the computational state, for later restarting from the saved snapshot, for cloning a copy of the computation, and for migration to a new cluster.…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-09 Yao Xu, Gene Cooperman

Challenges, Designs, and Performances of a Distributed Algorithm for Minimum-Latency of Data-Aggregation in Multi-Channel WSNs

In wireless sensor networks (WSNs), the sensed data by sensors need to be gathered, so that one very important application is periodical data collection. There is much effort which aimed at the data collection scheduling algorithm…

Data Structures and Algorithms · Computer Science 2018-10-30 Ngoc-Tu Nguyen, Bing-Hong Liu, Shao-I Chu, Hao-Zhe Weng

Distributed Algorithms for Feature Extraction Off-loading in Multi-Camera Visual Sensor Networks

Real-time visual analysis tasks, like tracking and recognition, require swift execution of computationally intensive algorithms. Visual sensor networks can be enabled to perform such tasks by augmenting the sensor network with processing…

Computer Vision and Pattern Recognition · Computer Science 2017-05-24 Emil Eriksson, György Dán, Viktoria Fodor

Distance Based Asynchronous Recovery Approach in Mobile Computing Environment

A mobile computing system is a distributed system in which at least one of the processes is mobile. They are constrained by lack of stable storage, low network bandwidth, mobility, frequent disconnection and limited battery life.…

Databases · Computer Science 2012-06-08 Yogita Khatri

Scheduling Versus Contention for Massive Random Access in Massive MIMO Systems

Massive machine-type communications protocols have typically been designed under the assumption that coordination between users requires significant communication overhead and is thus impractical. Recent progress in efficient activity…

Information Theory · Computer Science 2022-07-12 Justin Kang, Wei Yu

On the Worst-case Communication Overhead for Distributed Data Shuffling

Distributed learning platforms for processing large scale data-sets are becoming increasingly prevalent. In typical distributed implementations, a centralized master node breaks the data-set into smaller batches for parallel processing…

Information Theory · Computer Science 2016-10-03 Mohamed Attia, Ravi Tandon

Asynchronous Checkpoint for Eventually Consistent Databases

We focus on the problem of checkpointing (or taking a snapshot) in fully replicated eventually consistent distributed databases. In particular, we consider the problem of taking Distributed Transaction-Consistent Snapshots (DTCS). A typical…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-19 Raaghav Ravishankar, Sandeep Kulkarni, Nitin H Vaidya

Distributed Algorithm for Dynamic Data-Gathering in Sensor Network

In WSN, each sensor is responsible for sensing environmental conditions and sending them to one or more base stations. Battery-operated sensors are severely constrained by the amount of energy that can be spent for transmitting these…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-16 Subhasis Bhattacharjee

Shared Memory-Aware Latency-Sensitive Message Aggregation for Fine-Grained Communication

Message aggregation is often used with a goal to reduce communication cost in HPC applications. The difference in the order of overhead of sending a message and cost of per byte transferred motivates the need for message aggregation, for…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-07 Kavitha Chandrasekar, Laxmikant Kale

Congestion-aware Distributed Task Offloading in Wireless Multi-hop Networks Using Graph Neural Networks

Computational offloading has become an enabling component for edge intelligence in mobile and smart devices. Existing offloading schemes mainly focus on mobile devices and servers, while ignoring the potential network congestion caused by…

Networking and Internet Architecture · Computer Science 2024-01-23 Zhongyuan Zhao, Jake Perazzone, Gunjan Verma, Santiago Segarra

Enabling Low-Overhead Over-the-Air Synchronization Using Online Learning

Accurate network synchronization is a key enabler for services such as coherent transmission, cooperative decoding, and localization in distributed and cell-free networks. Unlike centralized networks, where synchronization is generally…

Signal Processing · Electrical Eng. & Systems 2023-03-03 Dieter Verbruggen, Hazem Sallouha, Sofie Pollin

Multiple Access for Small Packets Based on Precoding and Sparsity-Aware Detection

Modern mobile terminals often produce a large number of small data packets. For these packets, it is inefficient to follow the conventional medium access control protocols because of poor utilization of service resources. We propose a novel…

Information Theory · Computer Science 2014-09-05 Ronggui Xie, Huarui Yin, Xiaohui Chen, Zhengdao Wang

Distributed Iterative Detection Based on Reduced Message Passing for Networked MIMO Cellular Systems

This paper considers base station cooperation (BSC) strategies for the uplink of a multi-user multi-cell high frequency reuse scenario where distributed iterative detection (DID) schemes with soft/hard interference cancellation algorithms…

Information Theory · Computer Science 2014-01-03 Peng Li, Rodrigo C. de Lamare

Distributed Algorithms for Spectrum Allocation, Power Control, Routing, and Congestion Control in Wireless Networks

We develop distributed algorithms to allocate resources in multi-hop wireless networks with the aim of minimizing total cost. In order to observe the fundamental duplexing constraint that co-located transmitters and receivers cannot operate…

Networking and Internet Architecture · Computer Science 2016-11-15 Yufang Xi, Edmund M. Yeh

JASS: A Flexible Checkpointing System for NVM-based Systems

NVM-based systems are naturally fit candidates for incorporating periodic checkpointing (or snapshotting). This increases the reliability of the system, makes it more immune to power failures, and reduces wasted work in especially an HPC…

Hardware Architecture · Computer Science 2023-01-30 Akshin Singh, Smruti R. Sarangi

Some Problems in Compact Message Passing

This paper seeks to address the question of designing distributed algorithms for the setting of compact memory i.e. sublinear bits working memory for arbitrary connected networks. The nodes in our networks may have much lower internal…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-22 Armando Castañeda, Jonas Lefèvre, Amitabh Trehan