AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Authors: Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan

Computer Vision · Artificial Intelligence · 2026-05 · v1

Abstract

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

Cite

@article{arxiv.2605.29488,
  title  = {AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling},
  author = {Yiheng Li and Zhuo Li and Ruibing Hou and Yingjie Chen and Hong Chang and Hao Liu and Shiguang Shan},
  journal= {arXiv preprint arXiv:2605.29488},
  year   = {2026}
}
