StreamingClaw Technical Report

Jiawei Chen; Zhe Chen; Chaoqun Du; Maokui He; Wei He; Hengtao Li; Qizhen Li; Zide Liu; Hao Ma; Xuhao Pan; Chang Ren; Xudong Rao; Xintian Shen; Chenfeng Wang; Tao Wei; Chengjun Yu; Pengfei Yu; Shengyu Yao; Chunpeng Zhou; Kun Zhan; Lihao Zheng; Pan Zhou; Xuhan Zhu; Yufei Zheng

Computer Vision and Pattern Recognition 2026-03-27 v2

Abstract

Emerging applications such as embodied intelligence, AI hardware, autonomous driving, and intelligent cockpits rely on a real-time perception-decision-action closed loop, which poses stringent challenges for streaming video understanding. However, current agents mostly suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time reasoning and proactive interaction under streaming input. These shortcomings have become a key bottleneck that prevents agents from sustaining perception, making real-time decisions, and executing closed-loop actions in complex real-world environments, constraining their deployment and potential in dynamic, open physical worlds. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. Beyond maintaining full compatibility with the OpenClaw framework, it natively supports real-time, multimodal streaming interactions. StreamingClaw integrates five core capabilities: (1) real-time streaming reasoning; (2) reasoning about future events and proactive interaction as interaction objectives evolve online; (3) multimodal long-term memory storage, hierarchical memory evolution, efficient memory retrieval, and memory sharing across multiple agents; (4) a perception-decision-action closed loop that, in addition to conventional tools and skills, provides streaming tools and action-centric skills tailored for real-world physical environments; and (5) compatibility with the OpenClaw framework, allowing it to leverage the resources and support of the open-source community.

Keywords

data stream processing; video streaming; video understanding

Cite

@article{arxiv.2603.22120,
  title  = {StreamingClaw Technical Report},
  author = {Jiawei Chen and Zhe Chen and Chaoqun Du and Maokui He and Wei He and Hengtao Li and Qizhen Li and Zide Liu and Hao Ma and Xuhao Pan and Chang Ren and Xudong Rao and Xintian Shen and Chenfeng Wang and Tao Wei and Chengjun Yu and Pengfei Yu and Shengyu Yao and Chunpeng Zhou and Kun Zhan and Lihao Zheng and Pan Zhou and Xuhan Zhu and Yufei Zheng},
  journal= {arXiv preprint arXiv:2603.22120},
  year   = {2026}
}

Comments

Under Progress
