Related papers: StreamingClaw Technical Report

StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding

Real-time, continuous understanding of visual signals is essential for real-world interactive AI applications, and poses a fundamental system-level challenge. Existing research on streaming video understanding, however, typically focuses on…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Guowei Tang, Tianwen Qian, Huanran Zheng, Yifei Wang, Xiaoling Wang

Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge

Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long…

Computer Vision and Pattern Recognition · Computer Science 2025-01-24 Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu

StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its…

Computer Vision and Pattern Recognition · Computer Science 2025-12-05 Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, Xiaoling Wang

Learning Streaming Video Representation via Multitask Training

Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie

StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu

ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents

Current embodied intelligent systems still face a substantial gap between high-level reasoning and low-level physical execution in open-world environments. Although Vision-Language-Action (VLA) models provide strong perception and intuitive…

Computer Vision and Pattern Recognition · Computer Science 2026-04-20 Dongjie Huo, Haoyun Liu, Guoqing Liu, Dekang Qi, Zhiming Sun, Maoguo Gao, Jianxin He, Yandan Yang, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu

StreamingRAG: Real-time Contextual Retrieval and Generation Framework

Extracting real-time insights from multi-modal data streams from various domains such as healthcare, intelligent transportation, and satellite remote sensing remains a challenge. High computational demands and limited knowledge scope…

Computer Vision and Pattern Recognition · Computer Science 2025-01-27 Murugan Sankaradas, Ravi K. Rajendran, Srimat T. Chakradhar

StreamChat: Chatting with Streaming Video

This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez

StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and…

Computer Vision and Pattern Recognition · Computer Science 2026-04-30 Haolin Yang, Feilong Tang, Lingxiao Zhao, Xinlin Zhuang, Yifan Lu, Xiang An, Ming Hu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, Muhammad Haris Khan, Imran Razzak

StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for…

Robotics · Computer Science 2026-03-31 Yiran Shi, Dongqi Guo, Tianchen Zhao, Feng Gao, Liangzhi Shi, Chao Yu, ZhiJian Mo, Qihua Xiao, XiaoShuai Peng, Qingmin Liao, Yu Wang

MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address…

Artificial Intelligence · Computer Science 2026-05-15 Shaoan Zhao, Huanlin Gao, Qiang Hui, Ting Lu, Xueqiang Guo, Yantao Li, Xinpei Su, Fuyuan Shi, Chao Tan, Fang Zhao, Kai Wang, Shiguo Lian

StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation

Predicting the future occupancy states of the surrounding environment is a vital task for autonomous driving. However, current best-performing single-modality methods or multi-modality fusion perception methods are only able to predict…

Computer Vision and Pattern Recognition · Computer Science 2024-06-12 Yining Shi, Kun Jiang, Ke Wang, Jiusi Li, Yunlong Wang, Mengmeng Yang, Diange Yang

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with…

Computer Vision and Pattern Recognition · Computer Science 2025-10-13 Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han

StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two…

Computer Vision and Pattern Recognition · Computer Science 2025-10-30 Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu

Towards Streaming Perception

Embodied perception refers to the ability of an autonomous agent to perceive its environment so that it can (re)act. The responsiveness of the agent is largely governed by latency of its processing pipeline. While past work has studied the…

Computer Vision and Pattern Recognition · Computer Science 2020-08-26 Mengtian Li, Yu-Xiong Wang, Deva Ramanan

A Scalable Framework for Multilevel Streaming Data Analytics using Deep Learning

The rapid growth of data in velocity, volume, value, variety, and veracity has enabled exciting new opportunities and presented big challenges for businesses of all types. Recently, there has been considerable interest in developing systems…

Systems and Control · Electrical Eng. & Systems 2019-07-23 Shihao Ge, Haruna Isah, Farhana Zulkernine, Shahzad Khan

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou, Yanhao Zhang, Peng Liu, Qi Qi, Quanlong Zheng, Qi Wu, Zhenyi Liao, Binqiang Pan, Haobo Ji, Haonan Lu

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang

Thinking in Streaming Video

Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers…

Computer Vision and Pattern Recognition · Computer Science 2026-03-16 Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu

A Streaming End-to-End Framework For Spoken Language Understanding

End-to-end spoken language understanding (SLU) has recently attracted increasing interest. Compared to the conventional tandem-based approach that combines speech recognition and language understanding as separate modules, the new approach…

Computation and Language · Computer Science 2021-07-20 Nihal Potdar, Anderson R. Avila, Chao Xing, Dong Wang, Yiran Cao, Xiao Chen