Related papers: MIND: Multi-Scale Intent Diffusion for Text-Driven…

InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs

Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for…

Computer Vision and Pattern Recognition · Computer Science 2025-12-15 Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Lan Xu, Jingyi Yu, Jingya Wang

SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control

Controlling physics-based humanoids from natural-language instructions is a critical step toward general-purpose embodied agents. However, existing methods remain constrained by a tension between semantic expressiveness and physical…

Graphics · Computer Science 2026-05-26 Jingyan Zhang, Han Liang, Ruichi Zhang, Bin Li, Juze Zhang, Xin Chen, Jingya Wang, Lan Xu, Jingyi Yu

HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion Generation

Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite progress in performance, existing offline methods that generate fixed-length motions with a fixed number of agents are inherently…

Computer Vision and Pattern Recognition · Computer Science 2026-01-29 Mengge Liu, Yan Di, Gu Wang, Yun Qu, Dekai Zhu, Yanyan Li, Xiangyang Ji

Text-driven Human Motion Generation with Motion Masked Diffusion Model

Text-driven human motion generation is a multimodal task that synthesizes human motion sequences conditioned on natural language. It requires the model to satisfy textual descriptions under varying conditional inputs, while generating…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Xingyu Chen

MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks

Multimodal fusion leverages information across modalities to learn better feature representations with the goal of improving performance in fusion-based tasks. However, multimodal datasets, especially in medical settings, are typically…

Machine Learning · Computer Science 2025-02-05 Alejandro Guerra-Manzanares, Farah E. Shamout

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

The human-like form of humanoid robots positions them uniquely to achieve the agility and versatility in motor skills that humans possess. Learning from human demonstrations offers a scalable approach to acquiring these capabilities.…

Robotics · Computer Science 2025-11-14 Qiayuan Liao, Takara E. Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, C. Karen Liu

LEAD: Latent Realignment for Human Motion Diffusion

Our goal is to generate realistic human motion from natural language. Modern methods often face a trade-off between model expressiveness and text-to-motion alignment. Some align text and motion latent spaces but sacrifice expressiveness;…

Computer Vision and Pattern Recognition · Computer Science 2024-10-21 Nefeli Andreou, Xi Wang, Victoria Fernández Abrevaya, Marie-Paule Cani, Yiorgos Chrysanthou, Vicky Kalogeiton

INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM

Traditional control and planning for robotic manipulation heavily rely on precise physical models and predefined action sequences. While effective in structured environments, such approaches often fail in real-world scenarios due to…

Robotics · Computer Science 2025-08-08 Jin Wang, Weijie Wang, Boyuan Deng, Heng Zhang, Rui Dai, Nikos Tsagarakis

Hierarchical Intention-Aware Expressive Motion Generation for Humanoid Robots

Effective human-robot interaction requires robots to identify human intentions and generate expressive, socially appropriate motions in real-time. Existing approaches often rely on fixed motion libraries or computationally expensive…

Robotics · Computer Science 2025-09-30 Lingfan Bao, Yan Pan, Tianhu Peng, Dimitrios Kanoulas, Chengxu Zhou

SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control

Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment…

Robotics · Computer Science 2025-11-25 Yuxuan Wang, Haobin Jiang, Shiqing Yao, Ziluo Ding, Zongqing Lu

Executing your Commands via Motion Diffusion in Latent Space

We study a challenging task, conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Since human motions are highly diverse…

Computer Vision and Pattern Recognition · Computer Science 2023-05-22 Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, Gang Yu

Mimic Intent, Not Just Trajectories

While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches like Vision-Language-Action (VLA) models still struggle with adaptation to…

Robotics · Computer Science 2026-03-31 Renming Huang, Chendong Zeng, Wenjing Tang, Jintian Cai, Cewu Lu, Panpan Cai

LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning

General-purpose humanoid robots are expected to interact intuitively with humans, enabling seamless integration into daily life. Natural language provides the most accessible medium for this purpose. However, translating language into…

Robotics · Computer Science 2025-05-01 Yiyang Shao, Xiaoyu Huang, Bike Zhang, Qiayuan Liao, Yuman Gao, Yufeng Chi, Zhongyu Li, Sophia Shao, Koushil Sreenath

MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection

Recent advances in AI-generated content (AIGC) have significantly accelerated image editing techniques, driving increasing demand for diverse and fine-grained edits. Despite these advances, existing image editing methods still face…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Shuyu Wang, Weiqi Li, Qian Wang, Shijie Zhao, Jian Zhang

SCENIC: Scene-aware Semantic Navigation with Instruction-guided Control

Synthesizing natural human motion that adapts to complex environments while allowing creative control remains a fundamental challenge in motion synthesis. Existing models often fall short, either by assuming flat terrain or lacking the…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Xiaohan Zhang, Sebastian Starke, Vladimir Guzov, Zhensong Zhang, Eduardo Pérez Pellitero, Gerard Pons-Moll

MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on…

Robotics · Computer Science 2026-03-16 Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Zhangrui Guo, Zunnan Xu, Xiaofan Liu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li

MIND Your Reasoning: A Meta-Cognitive Intuitive-Reflective Network for Dual-Reasoning in Multimodal Stance Detection

Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing methods predominantly operate by learning to fuse modalities. They lack an explicit reasoning process to discern how inter-modal…

Computation and Language · Computer Science 2026-01-06 Bingbing Wang, Zhengda Jin, Bin Liang, Wenjie Li, Jing Li, Ruifeng Xu, Min Zhang

MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple…

Artificial Intelligence · Computer Science 2026-04-15 Shufang Lin, Muyang Chen, Xiabing Zhou, Rongrong Zhang, Dayou Zhang, Fangxin Wang

HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation

Text-driven person image generation is an emerging and challenging task in cross-modality image generation. Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.…

Computer Vision and Pattern Recognition · Computer Science 2022-11-14 Kaiduo Zhang, Muyi Sun, Jianxin Sun, Binghao Zhao, Kunbo Zhang, Zhenan Sun, Tieniu Tan

Context-aware Human Intent Inference for Improving Human Machine Cooperation

The ability of human beings to precisely recognize others' intents is a significant mental activity in reasoning about actions, such as what other people are doing and what they will do next. Recent research has revealed that human…

Human-Computer Interaction · Computer Science 2018-03-13 Xiang Zhang