Operating Systems - Scifaro

Vmem: A Lightweight Hot-Upgradable Memory Management for In-production Cloud Environment

Traditional memory management suffers from metadata overhead, architectural complexity, and stability degradation, problems intensified in cloud environments. Existing software/hardware optimizations are insufficient for cloud computing's…

Operating Systems · Computer Science 2025-11-14 Hao Zheng, Qiang Wang, Longxiang Wang, Xishi Qiu, Yibin Shen, Xiaoshe Dong, Naixuan Guan, Jia Wei, Fudong Qiu, Xingjun Zhang, Yun Xu, Mao Zhao, Yisheng Xie, Shenglong Zhao, Min He, Yu Li, Xiao Zheng, Ben Luo, Jiesheng Wu

Work-in-Progress: Function-as-Subtask API Replacing Publish/Subscribe for OS-Native DAG Scheduling

The Directed Acyclic Graph (DAG) task model for real-time scheduling finds its primary practical target in Robot Operating System 2 (ROS 2). However, ROS 2's publish/subscribe API leaves DAG precedence constraints unenforced: a callback may…

Operating Systems · Computer Science 2025-11-12 Takahiro Ishikawa-Aso, Atsushi Yano, Yutaro Kobayashi, Takumi Jin, Yuuki Takano, Shinpei Kato

Integrating Artificial Intelligence into Operating Systems: A Survey on Techniques, Applications, and Future Directions

Heterogeneous hardware and dynamic workloads worsen long-standing OS bottlenecks in scalability, adaptability, and manageability. At the same time, advances in machine learning (ML), large language models (LLMs), and agent-based methods…

Operating Systems · Computer Science 2025-11-12 Yifan Zhang, Xinkui Zhao, Ziying Li, Guanjie Cheng, Jianwei Yin, Lufei Zhang, Zuoning Chen

GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training

The accuracy of large language models (LLMs) improves with increasing model size, but increasing model complexity also poses significant challenges to training stability. Periodic checkpointing is a key mechanism for fault recovery and is…

Operating Systems · Computer Science 2025-11-11 Keyao Zhang, Yiquan Chen, Zhuo Hu, Wenhai Lin, Jiexiong Xu, Wenzhi Chen

Guidelines for Building Indexes on Partially Cache-Coherent CXL Shared Memory

The Partial Cache-Coherence (PCC) model maintains hardware cache coherence only within subsets of cores, enabling large-scale memory sharing with emerging memory interconnect technologies like Compute Express Link (CXL). However,…

Operating Systems · Computer Science 2025-11-11 Fangnuo Wu, Mingkai Dong, Wenjun Cai, Jingsheng Yan, Haibo Chen

Fix: externalizing network I/O in serverless computing

We describe a system for serverless computing where users, programs, and the underlying platform share a common representation of a computation: a deterministic procedure, run in an environment of well-specified data or the outputs of other…

Operating Systems · Computer Science 2025-11-04 Yuhan Deng, Akshay Srivatsan, Sebastian Ingino, Francis Chua, Yasmine Mitchell, Matthew Vilaysack, Keith Winstein

Oneiros: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving

KV cache accelerates LLM inference by avoiding redundant computation, at the expense of memory. To support larger KV caches, prior work extends GPU memory with CPU memory via CPU-offloading. This involves swapping KV cache between GPU and…

Operating Systems · Computer Science 2025-10-31 Ruihao Li, Shagnik Pal, Vineeth Narayan Pullu, Prasoon Sinha, Jeeho Ryoo, Lizy K. John, Neeraja J. Yadwadkar

Tidying Up the Address Space

Memory tiering in datacenters does not achieve its full potential due to hotness fragmentation: the intermingling of hot and cold objects within memory pages. This fragmentation prevents page-based reclamation systems from distinguishing…

Operating Systems · Computer Science 2025-10-23 Vinay Banakar, Suli Yang, Kan Wu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Kimberly Keeton

DFUSE: Strongly Consistent Write-Back Kernel Caching for Distributed Userspace File Systems

Cloud platforms host thousands of tenants that demand POSIX semantics, high throughput, and rapid evolution from their storage layer. Kernel-native distributed file systems supply raw speed, but their privileged code base couples every…

Operating Systems · Computer Science 2025-10-23 Haoyu Li, Jingkai Fu, Qing Li, Windsor Hsu, Asaf Cidon

AgentSight: System-Level Observability for AI Agents Using eBPF

Modern software infrastructure increasingly relies on LLM agents, such as Claude Code and Gemini-cli, for development and maintenance. However, these AI agents differ fundamentally from traditional deterministic software, posing a…

Operating Systems · Computer Science 2025-10-21 Yusheng Zheng, Yanpeng Hu, Tong Yu, Andi Quinn

Proto: A Guided Journey through Modern OS Construction

Proto is a new instructional OS that runs on commodity, portable hardware. It showcases modern features, including per-app address spaces, threading, commodity filesystems, USB, DMA, multicore support, self-hosted debugging, and a window…

Operating Systems · Computer Science 2025-10-21 Wonkyo Choe, Rongxiang Wang, Afsara Benazir, Felix Xiaozhu Lin

Maratona Linux a tale of upgrading from Ubuntu 20.04 to 22.04

Maratona Linux is the development environment used since 2016 on the "Maratona de Programação", ICPC's South American regional contest. It consists of Debian packages that modify a standard Ubuntu installation in order to make it…

Operating Systems · Computer Science 2025-10-20 Davi Antônio da Silva Santos, Bruno César Ribas

Man-Made Heuristics Are Dead. Long Live Code Generators!

Policy design for various systems controllers has conventionally been a manual process, with domain experts carefully tailoring heuristics for the specific instance in which the policy will be deployed. In this paper, we re-imagine policy…

Operating Systems · Computer Science 2025-10-13 Rohit Dwivedula, Divyanshu Saxena, Aditya Akella, Swarat Chaudhuri, Daehyeok Kim

Towards Deterministic Sub-0.5 us Response on Linux through Interrupt Isolation

Real-time responsiveness in Linux is often constrained by interrupt contention and timer handling overhead, making it challenging to achieve sub-microsecond latency. This work introduces an interrupt isolation approach that centralizes and…

Operating Systems · Computer Science 2025-10-13 Zhouyi Zhou, Zhili Liu, Shancong Zhang, Jiemin Li, Dengke Du, Mengke Sun, Zhiqiang Wang, Hongyan Liu, Guokai Xu

An Early Exploration of Deep-Learning-Driven Prefetching for Far Memory

Far-memory systems, where applications store less-active data in more energy-efficient memory media, are increasingly adopted by data centers. However, applications are bottlenecked by on-demand data fetching from far- to local-memory. We…

Operating Systems · Computer Science 2025-10-07 Yutong Huang, Zhiyuan Guo, Yiying Zhang

Ariel OS: An Embedded Rust Operating System for Networked Sensors & Multi-Core Microcontrollers

Large swaths of low-level system software building blocks originally implemented in C/C++ are currently being swapped for equivalent rewrites in Rust, a relatively more secure and dependable programming language. So far, however, no…

Operating Systems · Computer Science 2025-10-02 Elena Frank, Kaspar Schleiser, Romain Fouquet, Koen Zandberg, Christian Amsüss, Emmanuel Baccelli

Joyride: Rethinking Linux's network stack design for better performance, security, and reliability

Contemporary distributed computing workloads, including scientific computation, data mining, and machine learning, increasingly demand OS networking with minimal latency as well as high throughput, security, and reliability. However,…

Operating Systems · Computer Science 2025-09-30 Yanlin Du, Ruslan Nikolaev

Nova: Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization

This paper presents Nova, a real-time scheduling framework for serving agentic vision-language models (VLMs) on a single GPU that balances per-request latency against overall request-processing throughput. Our design begins by enabling effective…

Operating Systems · Computer Science 2025-09-26 Yuhang Xu, Shengzhong Liu, Dong Zhang, Bingheng Yan, Fan Wu, Guihai Chen

Exploiting Page Faults for Covert Communication

We present a novel mechanism to construct a covert channel based on page faults. A page fault is an event that occurs when a process or a thread tries to access a page of memory that is not currently mapped to its address space. The kernel…

Operating Systems · Computer Science 2025-09-26 Sathvik Swaminathan

MVVM: Deploy Your AI Agents-Securely, Efficiently, Everywhere

The rise of AI agents powered by Large Language Models (LLMs) presents critical challenges: how to securely execute and migrate these agents across heterogeneous environments while protecting sensitive user data, maintaining availability…

Operating Systems · Computer Science 2025-09-25 Yiwei Yang, Aibo Hu, Yusheng Zheng, Brian Zhao, Xinqi Zhang, Dawei Xiang, Kexin Chu, Wei Zhang, Andi Quinn