操作系统
LLM serving relies on prefix caching to improve inference performance. As growing contexts push key-value (KV) cache footprint far beyond GPU HBM and CPU DRAM capacity, KV cache is increasingly offloaded to NVMe SSDs. Unfortunately,…
Cities are rapidly deploying sensing infrastructure -- cameras, environmental sensors, and connected kiosks -- that continuously observe public spaces, yet they lack a system architecture governing how applications access, aggregate, and…
GPU-based simulation environments for embodied AI interleave physics simulation (CUDA) and photorealistic rendering (Vulkan) on a single device. We observe that two foundational scenarios -- simulation data generation and RL training -- can…
Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching,…
Modern large multicore systems often run multiple workloads that share CPUs under schedulers such as Linux CFS. To keep CPUs busy, these schedulers load-balance runnable work, causing each workload to execute on many cores. This weakens…
Software stacks embedded on microcontroller-based hardware typically provide rudimentary APIs programmed in C/C++, basic connectivity and, sometimes, a firmware update mechanism. Such coarse mechanisms contrast with widely used APIs and…
Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops,…
The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such…
An OS kernel that runs LLM inference internally can read logit distributions before any text is generated and act on them as a governance primitive. This paper presents ProbeLogits, a kernel-level operation that performs a single forward…
Memory dominates datacenter system cost and power. Memory expansion via Compute Express Link (CXL) is an effective way to provide additional memory at lower cost and power, but its effective use requires software-level tiering for…
Modern storage systems often combine fast cache with slower backend devices to accelerate I/O. As performance gaps narrow, concurrently accessing both devices, rather than relying solely on cache hits, can improve throughput. However, in…
AI coding agents operate directly on users' filesystems, where they regularly corrupt data, delete files, and leak secrets. Current approaches force a tradeoff between safety and autonomy: unrestricted access risks harm, while frequent…
Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of…
Existing software-based memory tiering systems decide which pages to place on the slower or faster tier. However, they do not take into account two important factors that greatly influence application performance: the size of the migrated…
Memory tiering provides a cost-effective solution to increase memory capacity, utilization, and even bandwidth. Memory tiering relies on system software for memory profiling, detection of frequently accessed pages, and page migration. Such…
Serverless providers strive for high resource utilization by optimizing deployment density: how many applications can be deployed per host server. However, achieving high deployment density without compromising application performance or…
Deploying large language models (LLMs) on mobile devices is an emerging trend to enable data privacy and offline accessibility of LLM applications. Modern mobile neural processing units (NPUs) make such deployment increasingly feasible.…
LLM inference powers latency-critical production services nowadays. The bursty nature of inference traffic results in over-provisioning, which in turn leads to resource underutilization. While online-offline colocation promises to utilize…
Current LLM agent frameworks often implement isolation, scheduling, and communication at the application layer, even though these mechanisms are already provided by mature operating systems. Instead of introducing another application-layer…
The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory…