Related papers: Heterogeneous-Reliability Memory: Exploiting Appli…
Many high-end and next-generation computing systems have incorporated alternative memory technologies to meet performance goals. Since these technologies present distinct advantages and tradeoffs compared to conventional DDR* SDRAM, such as…
Current mobile applications have rapidly growing memory footprints, posing a great challenge for memory system design. Insufficient DRAM main memory will incur frequent data swaps between memory and storage, a process that hurts…
Reliability has emerged as a key topic of interest for researchers around the world to detect and/or mitigate the side effects of decreasing transistor sizes, such as soft errors. Traditional solutions, like DMR and TMR, incur significant…
Quantum Random Access Memory (QRAM) holds the promise of enabling several large scale applications of quantum computers. However, designing fault tolerant QRAMs for large scale applications is still an open problem due to the poor error and…
With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature…
Remote Memory Access (RMA) is an emerging mechanism for programming high-performance computers and datacenters. However, little work exists on resilience schemes for RMA-based applications and systems. In this paper we analyze fault…
Raw bit errors are common in NAND flash memory and will increase in the future. These errors reduce flash reliability and limit the lifetime of a flash memory device. We aim to improve flash reliability with a multitude of low-cost…
In modern systems, DRAM-based main memory is significantly slower than the processor. Consequently, processors spend a long time waiting to access data from main memory, making the long main memory access latency one of the most critical…
AI clusters today are one of the major uses of High Bandwidth Memory (HBM). However, HBM is suboptimal for AI workloads for several reasons. Analysis shows HBM is overprovisioned on write performance, but underprovisioned on density and…
In recent years, high availability and reliability of Data Storage Systems (DSS) have been significantly threatened by soft errors occurring in storage controllers. Due to their specific functionality and hardware-software stack, error…
The continuing advancement of memory technology has not only fueled a surge in performance, but also substantially exacerbated reliability challenges. Traditional solutions have primarily focused on improving the efficiency of protection…
Modern DRAM modules are often equipped with hardware error correction capabilities, especially for DRAM deployed in large-scale data centers, as process technology scaling has increased the susceptibility of these devices to errors. To…
A large language model (LLM) is one of the most important emerging machine learning applications nowadays. However, due to its huge model size and the runtime growth of its memory footprint, LLM inference suffers from the lack of memory…
Due to the diversity and implicit redundancy in terms of processing units and compute kernels, off-the-shelf heterogeneous systems offer the opportunity to detect and tolerate faults during task execution in hardware as well as in software.…
Caching is crucial for enabling high-throughput networks for data intensive applications. Traditional caching technology relies on DRAM, as it can transfer data at a high rate. However, DRAM capacity is subject to contention by most system…
Cloud computing has become indispensable for every digital service, which has exponentially increased its usage. However, a tremendous surge in cloud resource demand undermines service availability, resulting in outages, performance…
The aggressive scaling of technology may have helped to meet the growing demand for higher memory capacity and density, but has also made DRAM cells more prone to errors. Such a reality triggered a lot of interest in modeling DRAM behavior…
In-memory key-value stores provide consistent low-latency access to all objects, which is important for interactive large-scale applications like social media networks or online graph analytics and also opens up new application areas. But,…
High capacity and scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults. This is…
This paper summarizes our work on experimentally characterizing, mitigating, and recovering data retention errors in multi-level cell (MLC) NAND flash memory, which was published in HPCA 2015, and examines the work's significance and future…