Computer Science
Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment, successfully…
A real-time multicore system requires delay bounds on access to shared resources. These resources include the kernel, which has potentially many non-preemptible critical sections guarded by one or more different synchronization primitives.…
Linux is the foundation of the digital age, accounting for the majority of the cloud and mobile OS markets. Any device that runs Linux uses the Linux page cache, a central pillar in OS and application performance, serving to reduce…
We consider the problem of computing sample points in each connected component of a semi-algebraic set defined by the non-vanishing or the positivity of an n-variate polynomial of degree d, with rational coefficients of bit size bounded by…
KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, which interleave LLM…
Large Language Models (LLMs) have demonstrated impressive progress in complex reasoning tasks, largely driven by the Chain-of-Thought (CoT) paradigm, which decomposes difficult problems into intermediate steps. However, CoT reasoning…
LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g.,…
We study the problem of computing the isolated regular solutions of a system \((f_1,\ldots,f_n)\) of \(n\) polynomial equations in \(n\) variables \((X_1, \dots, X_n)\) over a field of characteristic zero \(k\). We focus on systems with a…
We present a new algorithm for fast matrix multiplication using tensor decompositions which have special features. Thanks to these features we obtain exponents lower than what the rank of the tensor decomposition suggests. In particular for…
This paper presents a generalised symbolic algorithm for solving systems of linear algebraic equations with multi-diagonal coefficient matrices. The algorithm is given in a pseudocode. A theorem which gives the condition for correctness of…
Secure containers isolate each container with its own kernel, mitigating shared-kernel attacks prevalent in traditional container systems. However, existing designs still face a fundamental isolation--performance trade-off. Nested-cloud…
Object-level management of tiered memory has been studied to address the inefficiencies in page-based systems. However, object-level management for CXL-tiered memory remains underexplored due to CXL's tight performance budget and load/store…
Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each…
Real-Time Operating Systems (RTOSes) play a crucial role in safety-critical domains, where deterministic and predictable task execution is essential. Yet they are increasingly exposed to ionizing radiation, which can compromise system…
Linux is increasingly deployed in Low Earth Orbit on commercial off the shelf systems on chip that were not designed for space radiation. Ionizing particles can trigger single event functional interrupts that crash the kernel without…
Using correct design metrics and understanding the limitations of the underlying technology is critical to developing effective scheduling algorithms. Unfortunately, existing scheduling techniques used \emph{incorrect} metrics and had…
Modern LLM serving is increasingly serverless in shape: large model catalogs, long-tail invocations, and multi-tenant demand. Existing GPU serving systems face a tradeoff: dedicated-GPU allocation wastes scarce HBM under sparse traffic,…
A new symbolic algorithm to compute sums of squares multipliers (certificates) to witness the membership of non-negative univariate polynomials in a saturated univariate quadratic module is presented. Certificates are first computed in…
As Rust gains traction for developing safer systems software, a reality check for the microcontroller hardware segment becomes necessary. How ready is the Rust ecosystem for this segment? Can Rust compete with C in practice? This paper…
Cloud Virtual Disk (CVD) placement in Cloud Block Storage (CBS) is critical for resource efficiency and performance isolation. Existing schemes prioritize spatial load balancing by dispersing disks across pods based on configuration-derived…