Related papers: Understanding and Optimizing Persistent Memory All…
We demonstrate that general-purpose memory allocation involving many threads on many cores can be done with high performance, multicore scalability, and low memory consumption. For this purpose, we have designed and implemented scalloc, a…
The ability to dynamically allocate memory is fundamental in modern programming languages. However, this feature is not adequately supported in current general-purpose PIM devices. To identify key design principles that PIM must consider,…
Persistent memory provides high-performance data persistence at main memory. Memory writes need to be performed in strict order to satisfy storage consistency requirements and enable correct recovery from system crashes. Unfortunately,…
We study memory allocation patterns in DNNs during inference, in the context of large-scale systems. We observe that such memory allocation patterns, in the context of multi-threading, are subject to high latencies, due to \texttt{mutex}…
The scaling law, which indicates that model performance improves with increasing dataset and model capacity, has fueled a growing trend in expanding recommendation models in both industry and academia. However, the advent of large-scale…
Data analytics applications transform raw input data into analytics-specific data structures before performing analytics. Unfortunately, such data ingestion step is often more expensive than analytics. In addition, various types of NVRAM…
The recent emergence of fast, dense, nonvolatile main memory suggests that certain long-lived data might remain in its natural pointer-rich format across program runs and hardware reboots. Operations on such data must be instrumented with…
Memory allocation, though constituting only a small portion of the executed code, can have a "butterfly effect" on overall program performance, leading to significant and far-reaching impacts. Despite accounting for just approximately 5% of…
The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and…
Heap-based exploits that leverage memory management errors continue to pose a significant threat to application security. The root cause of these vulnerabilities are the memory management errors within the applications, however various…
We present GreenMalloc, a multi objective search-based framework for automatically configuring memory allocators. Our approach uses NSGA II and rand_malloc as a lightweight proxy benchmarking tool. We efficiently explore allocator…
Memory allocators hide beneath nearly every application stack, yet their performance footprint extends far beyond their code size. Even small inefficiencies in the allocators ripple through caches and the rest of the memory hierarchy,…
Lock-free data structures are an important tool for the development of concurrent programs as they provide scalability, low latency and avoid deadlocks, livelocks and priority inversion. However, they require some sort of additional support…
Systems that require high-throughput and fault tolerance, such as key-value stores and databases, are looking to persistent memory to combine the performance of in-memory systems with the data-consistent fault-tolerance of nonvolatile…
Understanding the performance of data-parallel workloads when resource-constrained has significant practical importance but unfortunately has received only limited attention. This paper identifies, quantifies and demonstrates memory…
The Memory Reallocation problem asks to dynamically maintain an assignment of given objects of various sizes to non-overlapping contiguous chunks of memory, while supporting updates (insertions/deletions) in an online fashion. The total…
In this work, we present StarMalloc, a verified, security-oriented, concurrent memory allocator that can be used as a drop-in replacement in real-world projects. Using the Steel separation logic framework, we show how to specify and verify…
This paper presents a comprehensive analysis of performance trade offs between implementation choices for transaction runtime systems on persistent memory. We compare three implementations of transaction runtimes: undo logging, redo…
Parallel programmers face the often irreconcilable goals of programmability and performance. HPC systems use distributed memory for scalability, thereby sacrificing the programmability advantages of shared memory programming models.…
Programmers using native languages such as C, C++, or Rust can implement custom memory allocation strategies to improve execution time. In their paper titled "Reconsidering Custom Memory Allocation" almost 25 years ago, Berger et al. showed…