Tempo estimation as fully self-supervised binary classification

Sound 2024-01-18 v1 Machine Learning Audio and Speech Processing

Abstract

This paper addresses the problem of global tempo estimation in musical audio. Because annotating tempo is time-consuming and requires musical expertise, few publicly available data sources exist for training machine learning models on this task. To alleviate this issue, we propose a fully self-supervised approach that does not rely on any human-labeled data. Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo, which makes them easy to adapt to downstream tasks. Recent work in self-supervised tempo estimation aimed to learn a tempo-specific representation that was subsequently used to train a supervised classifier. We instead reformulate the task as a binary classification problem: predicting whether a target track has the same tempo as a reference or a different one. Whereas the earlier approach still requires labeled training data for the final classification model, ours trains on arbitrary unlabeled music data combined with time-stretching, and uses a small set of synthetically created reference samples to predict the final tempo. An evaluation against the state of the art shows highly competitive performance when the constraint of finding the precise tempo octave is relaxed.
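The same-or-different formulation can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how pairs are built, so the rate set, the 50/50 sampling, and the choice to stretch one clip twice are all assumptions. The helper names (`naive_stretch`, `make_pair`, `predict_tempo`) are hypothetical, and `naive_stretch` is a simple resampler standing in for a real time-stretching algorithm (unlike one, it also shifts pitch).

```python
import numpy as np


def naive_stretch(signal, rate):
    """Speed up (rate > 1) or slow down (rate < 1) by linear-interpolation resampling.

    Stand-in for a proper time-stretcher; resampling changes pitch as well as tempo.
    """
    n_out = int(round(len(signal) / rate))
    src_pos = np.arange(n_out) * rate  # fractional read positions in the input
    return np.interp(src_pos, np.arange(len(signal)), signal)


def make_pair(clip, rng, rates=(0.8, 0.9, 1.0, 1.1, 1.2), p_same=0.5):
    """Build one self-supervised training pair from an unlabeled clip.

    Returns (view_a, view_b, label), where label is 1 if both views were
    stretched by the same rate (same tempo) and 0 otherwise.
    The rate set and p_same are illustrative assumptions.
    """
    rate_a = rng.choice(rates)
    if rng.random() < p_same:
        rate_b = rate_a
    else:
        rate_b = rng.choice([r for r in rates if r != rate_a])
    label = int(rate_a == rate_b)
    return naive_stretch(clip, rate_a), naive_stretch(clip, rate_b), label


def predict_tempo(score_fn, target, references):
    """Pick the tempo of the reference the classifier rates most likely 'same'.

    references: dict mapping a tempo in BPM to a reference signal.
    score_fn(target, reference): hypothetical trained classifier returning a
    same-tempo score (higher means more likely the same tempo).
    """
    return max(references, key=lambda bpm: score_fn(target, references[bpm]))
```

No tempo annotation enters at any point: the training label comes only from the stretch rates, and the only known tempi at prediction time are those of the synthetic references.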

Cite

@article{arxiv.2401.08891,
  title  = {Tempo estimation as fully self-supervised binary classification},
  author = {Florian Henkel and Jaehun Kim and Matthew C. McCallum and Samuel E. Sandberg and Matthew E. P. Davies},
  journal= {arXiv preprint arXiv:2401.08891},
  year   = {2024}
}

Comments

Accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024
