Computer Science

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable,…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on…

Computation and Language · Computer Science 2026-05-29 Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

EASE Configuration Facilitates A Reproducible Science of LLM Social Simulations

LLMs are increasingly deployed to simulate social interactions, yet many of the existing simulators remain ad hoc and monolithic. This lack of architectural standardization prevents reproducible research and complicates downstream…

Multiagent Systems · Computer Science 2026-05-29 Sneheel Sarangi, Maximilian Puelma Touzel, Aurélien Bück-Kaeffer, Zachary Yang, Jean-François Godbout, Reihaneh Rabbany

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, Mark Boss

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the…

Computation and Language · Computer Science 2026-05-29 Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

Ambient-robust Inverse Rendering using Active RGB-NIR Imaging

Inverse rendering aims to reconstruct geometry and reflectance of objects from images. Despite recent progress, existing methods often produce inaccurate reconstructions that are sensitive to ambient illumination conditions. Here we…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Hoon-Gyu Chung, Jinnyeong Kim, Hyunwoo Kang, Seung-Hwan Baek

GenClaw: Code-Driven Agentic Image Generation

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang, Rui Chen, Weijia Li

OOD-GraphLLM: Graph Large Language Model for Out-of-Distribution Generalized Drug Synergy Prediction

Drug synergy prediction (DSP) aims to identify efficacious drug combinations under various cellular contexts with different targets. However, the continual emergence of novel compounds results in variations in molecular scaffolds and sizes,…

Machine Learning · Computer Science 2026-05-29 Xin Wang, Linxin Xiao, Yang Yao, Wenwu Zhu

Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning

Current plan-based reasoning methods improve large language models (LLMs) by inserting a planning stage before execution, giving rise to the question $\rightarrow$ plan $\rightarrow$ cot paradigm. While effective, a closer examination…

Computation and Language · Computer Science 2026-05-29 Shaojie Wang, Liang Zhang

Reinforcement Learning with Robust Rubric Rewards

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for…

Computation and Language · Computer Science 2026-05-29 Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka

SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World

This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Xin Dong, Weijian Deng, Lihan Zhang, Tianru Dai, Wenfeng Deng, Yansong Tang

GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases

Semi-structured knowledge bases (SKBs) embed textual documents in a typed graph of entities and relations, and underpin applications such as product search, academic paper search, and precision-medicine inquiries. Existing hybrid retrieval…

Information Retrieval · Computer Science 2026-05-29 Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, Jie Liu

BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval

We present BullingerDB, a large-scale benchmark dataset for historical document analysis based on the correspondence of Heinrich Bullinger (1504-1575). The corpus comprises 20,898 pages and 499,222 text lines written by 796 writers over six…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Marco Peer, Anna Scius-Bertrand, Patricia Scheurer, Andreas Fischer

Do Language Models Track Entities Across State Changes?

Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding $\textit{without}$…

Computation and Language · Computer Science 2026-05-29 Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller, Sebastian Schuster, Najoung Kim

How's it going? Reinforcement learning in language models recruits a functional welfare axis

How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to…

Machine Learning · Computer Science 2026-05-29 Andy Q Han, David J. Chalmers, Pavel Izmailov

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation

With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Hao Wu, Xiangyang Luo, Hao Wang, Jiawei Zhang, Yi Zhang, Jinwei Wang

Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables

We use a mean-field-based transformer model to theoretically investigate how auxiliary variables, such as positional encoding, prevent mode collapse of self-attention mechanisms. The use of mean-field transformers to analyze the properties…

Machine Learning · Computer Science 2026-05-29 Masaaki Imaizumi, Masanori Koyama, Noboru Isobe, Kohei Hayashi