Optimizing Speech Multi-View Feature Fusion through Conditional Computation

Weiqiao Shan; Yuhao Zhang; Yuchen Han; Bei Li; Xiaofeng Zhao; Yuang Li; Min Zhang; Hao Yang; Tong Xiao; Jingbo Zhu

Audio and Speech Processing | 2025-01-15 | v1 | Artificial Intelligence, Computation and Language, Sound

Abstract

Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.
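
The two ideas named in the abstract, conflicting update directions and gated fusion of two feature views, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `gated_fusion`), the softmax gate, and the view-level dropout rule are assumptions chosen for illustration only. A negative cosine similarity between the gradients contributed by two views is one common way to say that their update directions conflict.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two gradient vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(xs):
    """Numerically stable softmax over a short list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(ssl_feat, fbank_feat, gate_logits, drop_prob=0.0, rng=random):
    """Fuse two equal-length feature vectors with softmax gate weights.

    Hypothetical view-level dropout: with probability drop_prob, one view
    is chosen at random and the other is zeroed out, so the model also
    sees single-view inputs during training.
    """
    weights = softmax(gate_logits)
    if rng.random() < drop_prob:
        keep = rng.randrange(2)
        weights = [1.0 if i == keep else 0.0 for i in range(2)]
    return [weights[0] * s + weights[1] * f for s, f in zip(ssl_feat, fbank_feat)]


# Opposite gradients give a cosine of -1, i.e. a direct conflict.
conflict = cosine([1.0, 0.0], [-1.0, 0.0])

# Equal gate logits average the two views when no dropout is applied.
fused = gated_fusion([1.0, 1.0], [3.0, 3.0], gate_logits=[0.0, 0.0])
```

In a real system the gate logits would come from a learned network and the features would be frame-level tensors; the sketch only shows the arithmetic of weighting and dropping views.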

Keywords

self-supervised speech learning; speech enhancement; speech recognition and language modeling

Cite

@article{arxiv.2501.08057,
  title  = {Optimizing Speech Multi-View Feature Fusion through Conditional Computation},
  author = {Weiqiao Shan and Yuhao Zhang and Yuchen Han and Bei Li and Xiaofeng Zhao and Yuang Li and Min Zhang and Hao Yang and Tong Xiao and Jingbo Zhu},
  journal= {arXiv preprint arXiv:2501.08057},
  year   = {2025}
}

Comments

ICASSP 2025
