Related papers: A Low Overhead Minimum Process Global Snapshop Col…
The wireless mobile ad hoc network (MANET) architecture consists of a set of mobile hosts capable of communicating with each other without the assistance of base stations. This has made it possible to create a mobile distributed…
A distributed system consisting of a huge number of computational entities is prone to faults, because faults in a few nodes cause the entire system to fail. Consequently, fault tolerance of distributed systems is a critical issue.…
To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory…
Recovery from transient failures is one of the prime issues in the context of distributed systems. These systems demand transparent yet efficient techniques for such recovery. A checkpoint is defined as a designated place in a…
Taking snapshots of the state of a distributed computation is useful for off-line analysis of the computational state, for later restarting from the saved snapshot, for cloning a copy of the computation, and for migration to a new cluster.…
In wireless sensor networks (WSNs), the data sensed by sensors need to be gathered, so periodic data collection is a very important application. Much effort has been aimed at the data collection scheduling algorithm…
Real-time visual analysis tasks, like tracking and recognition, require swift execution of computationally intensive algorithms. Visual sensor networks can be enabled to perform such tasks by augmenting the sensor network with processing…
A mobile computing system is a distributed system in which at least one of the processes is mobile. Such systems are constrained by a lack of stable storage, low network bandwidth, mobility, frequent disconnection, and limited battery life.…
Massive machine-type communications protocols have typically been designed under the assumption that coordination between users requires significant communication overhead and is thus impractical. Recent progress in efficient activity…
Distributed learning platforms for processing large scale data-sets are becoming increasingly prevalent. In typical distributed implementations, a centralized master node breaks the data-set into smaller batches for parallel processing…
We focus on the problem of checkpointing (or taking a snapshot) in fully replicated eventually consistent distributed databases. In particular, we consider the problem of taking Distributed Transaction-Consistent Snapshots (DTCS). A typical…
In a WSN, each sensor is responsible for sensing environmental conditions and sending them to one or more base stations. Battery-operated sensors are severely constrained by the amount of energy that can be spent transmitting these…
Message aggregation is often used with the goal of reducing communication cost in HPC applications. The difference in order between the overhead of sending a message and the cost per byte transferred motivates the need for message aggregation, for…
Computational offloading has become an enabling component for edge intelligence in mobile and smart devices. Existing offloading schemes mainly focus on mobile devices and servers, while ignoring the potential network congestion caused by…
Accurate network synchronization is a key enabler for services such as coherent transmission, cooperative decoding, and localization in distributed and cell-free networks. Unlike centralized networks, where synchronization is generally…
Modern mobile terminals often produce a large number of small data packets. For these packets, it is inefficient to follow the conventional medium access control protocols because of poor utilization of service resources. We propose a novel…
This paper considers base station cooperation (BSC) strategies for the uplink of a multi-user multi-cell high frequency reuse scenario where distributed iterative detection (DID) schemes with soft/hard interference cancellation algorithms…
We develop distributed algorithms to allocate resources in multi-hop wireless networks with the aim of minimizing total cost. In order to observe the fundamental duplexing constraint that co-located transmitters and receivers cannot operate…
NVM-based systems are natural candidates for incorporating periodic checkpointing (or snapshotting). This increases the reliability of the system, makes it more immune to power failures, and reduces wasted work, especially in an HPC…
This paper seeks to address the question of designing distributed algorithms for the setting of compact memory, i.e., working memory of sublinear bits, for arbitrary connected networks. The nodes in our networks may have much lower internal…