Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

Jongpil Lee; Taejun Kim; Jiyoung Park; Juhan Nam

Sound 2017-12-05 v1 Machine Learning Multimedia Audio and Speech Processing

Abstract

Music, speech, and acoustic scene sounds are often handled separately in the audio domain because of their different signal characteristics. However, as the image domain advances rapidly with versatile image classification models, it is necessary to study extensible classification models in the audio domain as well. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity. One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules, and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. We also visualize the filters along the layers and compare the characteristics of the learned filters.
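To make the architectural terms in the abstract concrete, the following is a minimal NumPy sketch of a forward pass through a sample-level network with residual connections, squeeze-and-excitation gating, and multi-level concatenation. It is an illustration under stated assumptions, not the authors' implementation: the filter size of 3, the pooling size of 3, the channel count, the number of blocks, the input length of 3^6 samples, the random weights, and every function and parameter name (`conv1d`, `res_se_block`, `init_params`, `forward`, and so on) are hypothetical choices made for this example.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv1d(x, w, b, stride=1, pad=0):
    """1D convolution. x: (C_in, T), w: (C_out, C_in, K), b: (C_out,)."""
    x = np.pad(x, ((0, 0), (pad, pad)))
    k = w.shape[2]
    n_out = (x.shape[1] - k) // stride + 1
    # Gather every length-K window: result has shape (C_in, n_out, K).
    idx = np.arange(n_out)[:, None] * stride + np.arange(k)[None, :]
    windows = x[:, idx]
    return np.einsum("oik,itk->ot", w, windows) + b[:, None]


def max_pool(x, size=3):
    """Non-overlapping max pooling along time. x: (C, T)."""
    t = (x.shape[1] // size) * size
    return x[:, :t].reshape(x.shape[0], -1, size).max(axis=2)


def squeeze_excite(x, w_down, w_up):
    """Rescale each channel by a gate computed from its time average."""
    squeezed = x.mean(axis=1)                       # (C,)
    gate = sigmoid(w_up @ relu(w_down @ squeezed))  # (C,)
    return x * gate[:, None]


def res_se_block(x, p):
    """Small-filter conv, SE gating, residual add, then pooling."""
    y = relu(conv1d(x, p["w"], p["b"], pad=1))
    y = squeeze_excite(y, p["w_down"], p["w_up"])
    return max_pool(relu(y + x), size=3)


def init_params(rng, channels=8, n_blocks=3, n_classes=4):
    # All sizes here are assumptions chosen to keep the example small.
    blocks = [
        {
            "w": 0.1 * rng.standard_normal((channels, channels, 3)),
            "b": np.zeros(channels),
            "w_down": 0.1 * rng.standard_normal((channels // 2, channels)),
            "w_up": 0.1 * rng.standard_normal((channels, channels // 2)),
        }
        for _ in range(n_blocks)
    ]
    return {
        "stem_w": 0.1 * rng.standard_normal((channels, 1, 3)),
        "stem_b": np.zeros(channels),
        "blocks": blocks,
        "out_w": 0.1 * rng.standard_normal((n_classes, channels * n_blocks)),
        "out_b": np.zeros(n_classes),
    }


def forward(waveform, params):
    """waveform: 1D array of raw samples. Returns (features, class probabilities)."""
    # Strided stem convolution operating directly on raw samples.
    h = relu(conv1d(waveform[None, :], params["stem_w"], params["stem_b"], stride=3))
    pooled = []
    for p in params["blocks"]:
        h = res_se_block(h, p)
        pooled.append(h.max(axis=1))  # global max pool of this level
    # Multi-level concatenation: join the pooled outputs of all blocks.
    features = np.concatenate(pooled)
    logits = params["out_w"] @ features + params["out_b"]
    e = np.exp(logits - logits.max())
    return features, e / e.sum()


rng = np.random.default_rng(0)
params = init_params(rng)
waveform = rng.standard_normal(3 ** 6)  # 729 synthetic samples, not real audio
features, probs = forward(waveform, params)
print(features.shape, probs.shape)  # (24,) (4,)
```

With these assumed sizes, the time axis shrinks from 729 samples to 243 after the strided stem and then to 81, 27, and 9 after the three blocks, so each level summarizes the waveform at a coarser resolution before its pooled output is concatenated into the final feature vector.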

Keywords

audio classification, speech recognition, audio-visual speech recognition

Cite

@article{arxiv.1712.00866,
  title  = {Raw Waveform-based Audio Classification Using Sample-level CNN Architectures},
  author = {Jongpil Lee and Taejun Kim and Jiyoung Park and Juhan Nam},
  journal = {arXiv preprint arXiv:1712.00866},
  year   = {2017}
}

Comments

NIPS, Machine Learning for Audio Signal Processing Workshop (ML4Audio), 2017
