Computer Science
Safety applications in vehicle-to-everything communications and Cooperative Intelligent Transport Systems rely on reliable and timely message exchange, which in turn depends on accurate modeling of wireless signal propagation. Simulation…
Adjacent GEMM problems that differ by a single 128-element step in N can show throughput that differs by 30% on the same GPU. This pervasive performance ruggedness, invisible to roofline analysis and peak-FLOPs intuition yet dominant for every…
Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment, successfully…
We describe libhmm, a C++20 library for Hidden Markov Model parameter estimation, sequence decoding, and model selection. libhmm addresses two gaps in existing software: the absence of a well-maintained, zero-dependency C++ HMM library…
Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to…
Half precision (FP16) promises to double FFT throughput on GPUs, but the prevailing view is that its 10-bit mantissa makes it unsuitable for radar-grade signal processing. We show this framing is wrong on Apple Silicon: the binding…
A real-time multicore system requires delay bounds on access to shared resources. These resources include the kernel, which has potentially many non-preemptible critical sections guarded by one or more different synchronization primitives.…
In a computer system, multiple indispensable components (such as the CPU and memory) work together to produce an overall effect, which can only be measured on an independently running system. Since the…
Linux is the foundation of the digital age, accounting for the majority of the cloud and mobile OS markets. Any device that runs Linux uses the Linux page cache, a central pillar in OS and application performance, serving to reduce…
KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, which interleave LLM…
Recurring industrial analytics and machine-learning workflows are becoming a major computational burden in modern engineering practice. Large parametric database generation, scheduled model retraining, repeated evaluation pipelines, and…
Neural networks are increasingly deployed in scientific, safety critical, and mission critical pipelines, yet verification and analysis are often performed outside the programming environment that defines and runs the model. This creates a…
Efficient solution of large-scale, ill-conditioned, and indefinite algebraic equations is needed across many computational fields, including multiphysics simulations, machine learning, and data science. Because of their…
LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g.,…
Modern computing systems process jobs with resource requirements such as CPU and memory, which are described by multiresource job (MRJ) queueing models. In practice, job resource requirements are spread out over so many values that it is…
A formulation of elliptic boundary value problems is used to develop the first discrete exterior calculus (DEC) library for massively parallel computations with 3D domains. This can be used for steady-state analysis of any physical process…
This paper presents an experimental performance study of implementations of three symbolic algorithms for solving band matrix systems of linear algebraic equations with heptadiagonal, pentadiagonal, and tridiagonal coefficient matrices. The…
Secure containers isolate each container with its own kernel, mitigating shared-kernel attacks prevalent in traditional container systems. However, existing designs still face a fundamental isolation-performance trade-off. Nested-cloud…
We present the Matlab toolbox MacaulayLab, which implements numerical linear algebra algorithms for solving multivariate polynomial systems and rectangular multiparameter eigenvalue problems. Its structure and functionality are the result…
Object-level management of tiered memory has been studied to address the inefficiencies in page-based systems. However, object-level management for CXL-tiered memory remains underexplored due to CXL's tight performance budget and load/store…