Computer Science

GMOS: Grounding Moving Object Segmentation in 3D Space and Time

Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment,…

Robotics · Computer Science 2026-05-29 Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang

AdaState: Self-Evolving Anchors for Streaming Video Generation

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Yusuf Dalva, Pinar Yanardag

NeuROK: Generative 4D Neural Object Kinematics

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics -- realistic temporal deformations of static objects under…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Chen Geng, Guangzhao He, Yue Gao, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang

Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field

We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Shangjie Xue, Jesse Dill, Dhruv Ahuja, Frank Dellaert, Panagiotis Tsiotras, Danfei Xu

GPIC: A Giant Permissive Image Corpus for Visual Generation

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei

Benchmarking Single-Factor Physical Video-to-Audio Generation

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu

REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image

Reconstructing physically stable 3D scenes from a single RGB image enables casual images to be converted into simulation-ready digital assets for applications such as immersive interaction and content creation. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Xiaoxuan Ma, Jiashun Wang, Nicolas Ugrinovic, Yehonathan Litman, Kris Kitani

Colored Noise Diffusion Sampling

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Hadar Davidson, Noam Issachar, Sagie Benaim

Supercharging Thermal Gaussian Splatting with Depth Estimation

Efficient and robust 3D scene representation is crucial in autonomous driving, robotics, and related fields. While RGB images provide valuable content for 3D reconstruction, other modalities like thermal or depth can enable additional…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Manoj Biswanath, Chenxin Cai, Hannah Schieber, Daniel Roth, Benjamin Busam

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide…

Robotics · Computer Science 2026-05-29 Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi, David Held, Chuang Gan

Veda: Scalable Video Diffusion via Distilled Sparse Attention

Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi

MonoPhysics: Estimating Geometry, Appearance, and Physical Parameters from Monocular Videos

Existing inverse physics methods recover physical parameters from multi-view videos, where geometric constraints across views resolve scale and 3D structure. In monocular settings, however, such constraints are absent, leading to severe…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Daniel Rho, Jun Myeong Choi, Matthew Thornton, Biswadip Dey, Roni Sengupta

VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu, Xiaoye Qu, Wei Wei, Angela Yao

A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms

Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved…

Robotics · Computer Science 2026-05-29 Yufei Jia, Zhanxiang Cao, Mingrui Yu, Heng Zhang, Shenyu Chen, Dixuan Jiang, Meng Li, Xiaofan Li, Yiyang Liu, Junzhe Wu, Zheng Li, XiLin Fang, Tingyu Cui, Shengcheng Fu, Haoyang Li, Anqi Wang, Zifan Wang, Dongjie Zhu, Chenyu Cao, Zhenbiao Huang, Ziang Zheng, Jie Lu, Xin Ma, Zhengyang Wei, Xiang Zhao, Tianyue Zhan, Ye He, Yuxiang Chen, Yizhou Jiang, Yue Li, Haizhou Ge, Yuhang Dong, Fan Jia, Ziheng Zhang, Meng Zhang, Xiwa Deng, Zhixing Chen, Hanyang Shao, Chenxin Dong, Yixuan Li, Yizhi Chen, Bokui Chen, Kaifeng Zhang, Hanqing Cui, Yusen Qin, Ruqi Huang, Lei Han, Tiancai Wang, Xiang Li, Yue Gao, Guyue Zhou

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang

City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

City-scale 3D surface reconstruction from multiview images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF,…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick

Zero-Scan Data Quality: Leveraging Table Format Metadata for Continuous Observability at Scale

Modern table formats such as Apache Iceberg compute and store metadata (commit timestamps, record counts, and column-level statistics such as null counts and value bounds) at write time as part of file writing. These statistics serve query…

Databases · Computer Science 2026-05-29 Mohit Verma, Shantanu Rawat, Christian Bush, Sumedh Sakdeo, Lokesh Amarnath Ravindranathan, Dwarak Bakshi