SiMO: Single-Modality-Operable Multimodal Collaborative Perception

Jiageng Wen; Shengjie Zhao; Bing Li; Jiafeng Huang; Kenan Ye; Hao Deng

Computer Vision and Pattern Recognition 2026-03-10 v1

Abstract

Collaborative perception integrates multi-agent perspectives to extend the sensing range and overcome occlusion. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure, especially when a key sensor such as LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single-modality features and the downstream modules. This paper is the first in the field of collaborative perception to address this challenge, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle the remaining modal features during modal failures while keeping the semantic space consistent. Additionally, through the proposed "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses modality competition, an issue generally overlooked by existing methods, and ensures the independence of each modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities. Implementation details are available at https://github.com/dempsey-wen/SiMO.
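
To make the idea of fusing a variable number of modality features concrete, here is a minimal sketch in Python. It is not the paper's LAMMA module, whose design is not described in the abstract. It assumes, purely for illustration, that each available modality contributes one feature vector and that fusion is a softmax-weighted average driven by a hypothetical query vector. The names fuse_modalities, lidar, camera, and query are all assumptions.

```python
import math

def fuse_modalities(features, query):
    """Softmax-weighted average over however many modality features are present.

    features: list of equal-length float lists, one per available modality.
    query:    float list of the same length (a hypothetical learned query).
    The output has the same length no matter how many modalities are given.
    """
    if not features:
        raise ValueError("at least one modality feature is required")
    dim = len(query)
    # Scaled dot-product score of each modality feature against the query.
    scores = [
        sum(f_k * q_k for f_k, q_k in zip(f, query)) / math.sqrt(dim)
        for f in features
    ]
    # Numerically stable softmax over the modalities that are present.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Convex combination of the available features.
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]

# Illustrative (made-up) feature vectors for two modalities.
lidar = [1.0, 0.0, 2.0, -1.0]
camera = [0.5, 1.5, -0.5, 0.0]
query = [0.2, -0.1, 0.4, 0.3]

fused_both = fuse_modalities([lidar, camera], query)
fused_camera_only = fuse_modalities([camera], query)  # LiDAR unavailable
```

In this toy setting the output is always a convex combination of the input features, so when only the camera feature remains the fused result is the camera feature itself, and downstream code receives a vector of the same size either way. Whether SiMO achieves its semantic consistency this way is not stated in the abstract; the linked repository is the reference for the actual method.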

Keywords

multimodal learning; cooperative perception; multimodal emotion recognition

Cite

@article{arxiv.2603.08240,
  title  = {SiMO: Single-Modality-Operable Multimodal Collaborative Perception},
  author = {Jiageng Wen and Shengjie Zhao and Bing Li and Jiafeng Huang and Kenan Ye and Hao Deng},
  journal= {arXiv preprint arXiv:2603.08240},
  year   = {2026}
}

Comments

Accepted to ICLR 2026. This arXiv version includes an additional appendix (Appendix 15) with further philosophical discussion that is not part of the official peer-reviewed ICLR version.
