Related papers: Understanding Soft Errors in Uncore Components
To protect multicores from soft-error perturbations, resiliency schemes have been developed with high coverage but high power and performance overheads. Emerging safety-critical machine learning applications are increasingly being deployed…
We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time,…
We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience…
Smaller feature size, higher clock frequency and lower power consumption are of core concerns of today's nano-technology, which has been resulted by continuous downscaling of CMOS technologies. The resultant 'device shrinking' reduces the…
Acoustic-sensor-based soft error resilience is particularly promising, since it can verify the absence of soft errors and eliminate silent data corruptions at a low hardware cost. However, the state-of-the-art work incurs a significant…
Rapid CMOS device size reduction resulted in billions of transistors on a chip have led to integration of many cores leading to many challenges such as increased power dissipation, thermal dissipation, occurrence of transient faults and…
Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in…
The memory consistency model is a fundamental system property characterizing a multiprocessor. The relative merits of strict versus relaxed memory models have been widely debated in terms of their impact on performance, hardware complexity…
Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which…
Soft errors in large VLSI circuits pose dramatic influence on computing- and memory-intensive neural network (NN) processing. Understanding the influence of soft errors on NNs is critical to protect against soft errors for reliable NN…
Advancements in multi-core have created interest among many research groups in finding out ways to harness the true power of processor cores. Recent research suggests that on-board component such as cache memory plays a crucial role in…
AIoT processors fabricated with newer technology nodes suffer rising soft errors due to the shrinking transistor sizes and lower power supply. Soft errors on the AIoT processors particularly the deep learning accelerators (DLAs) with…
In stochastic circuits, major sources of error are correlation errors, soft errors and random fluctuation errors that affect the accuracy and reliability of the circuit. The soft error has the effect of changing the correlation status and…
The increasing number of threads inside the cores of a multicore processor, and competitive access to the shared cache memory, become the main reasons for an increased number of competitive cache misses and performance decline. Inevitably,…
High-performance and safety-critical system architects must accurately evaluate the application-level silent data corruption (SDC) rates of processors to soft errors. Such an evaluation requires error propagation all the way from particle…
In contemporary times, the increasing complexity of the system poses significant challenges to the reliability, trustworthiness, and security of the SACRES. Key issues include the susceptibility to phenomena such as instantaneous voltage…
Memory allocation, though constituting only a small portion of the executed code, can have a "butterfly effect" on overall program performance, leading to significant and far-reaching impacts. Despite accounting for just approximately 5% of…
In recent years, high availability and reliability of Data Storage Systems (DSS) have been significantly threatened by soft errors occurring in storage controllers. Due to their specific functionality and hardware-software stack, error…
The ever growing demands of embedded systems to satisfy high computing performance and cost efficiency lead to the trend of using commercial off-the-shelf hardware. However, due to their highly integrated design they are becoming…