Related papers: BEC: Bit-Level Static Analysis for Reliability aga…
Reliability has been a major concern in embedded systems. Higher transistor density and lower voltage supply increase the vulnerability of embedded systems to soft errors. A Single Event Upset (SEU), which is also called a soft error, can…
We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience…
High-performance and safety-critical system architects must accurately evaluate the application-level silent data corruption (SDC) rates of processors to soft errors. Such an evaluation requires error propagation all the way from particle…
In contemporary times, the increasing complexity of the system poses significant challenges to the reliability, trustworthiness, and security of the SACRES. Key issues include the susceptibility to phenomena such as instantaneous voltage…
Efficient low complexity error correcting code(ECC) is considered as an effective technique for mitigation of multi-bit upset (MBU) in the configuration memory(CM)of static random access memory (SRAM) based Field Programmable Gate Array…
Fault injection attacks represent a class of threats that can compromise embedded systems across multiple layers of abstraction, such as system software, instruction set architecture (ISA), microarchitecture, and physical implementation.…
We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time,…
In most error correction coding (ECC) frameworks, the typical error metric is the bit error rate (BER) which measures the number of bit errors. For this metric, the positions of the bits are not relevant to the decoding, and in many noise…
Quantum error mitigation (QEM) is typically viewed as a suite of practical techniques for today's noisy intermediate-scale quantum devices, with limited relevance once fault-tolerant quantum computers become available. In this work, we…
Modern computer scaling trends in pursuit of larger component counts and power efficiency have, unfortunately, lead to less reliable hardware and consequently soft errors escaping into application data ("silent data corruptions").…
Silent Errors within hardware devices occur when an internal defect manifests in a part of the circuit which does not have check logic to detect the incorrect circuit operation. The results of such a defect can range from flipping a single…
High energy particles from cosmic rays or packaging materials can generate a glitch or a current transient (single event transient or SET) in a logic circuit. This SET can eventually get captured in a register resulting in a flip of the…
Stabilizer states are a central resource in quantum information processing, underpinning a wide range of applications. While they can be efficiently generated via Clifford circuits, the presence of coherent errors, such as small-angle…
Software Fault Localization refers to the activity of finding code elements (e.g., statements) that are related to a software failure. The state-of-the-art fault localization techniques, however, produce coarse-grained results that can…
Soft errors have a significant impact on the circuit reliability at nanoscale technologies. At the architectural level, soft errors are commonly modeled by a probabilistic bit-flip model. In developing such abstract fault models, an…
Alpha-particles and cosmic rays cause bit flips in chips. Protection circuits ease the problem, but cost chip area and power, and so designers try hard to optimize them. This leads to bugs: an undetected fault can bring miscalculations, the…
Static analysis plays a crucial role in software vulnerability detection, yet faces a persistent precision-scalability tradeoff. In large codebases like the Linux kernel, traditional static analysis tools often generate excessive false…
Future extreme-scale computer systems may expose silent data corruption (SDC) to applications, in order to save energy or increase performance. However, resilience research struggles to come up with useful abstract programming models for…
Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this…
In large-scale datacenters, memory failure is a common cause of server crashes, with Uncorrectable Errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using…