auDeep: Unsupervised Learning of Representations from Audio with Deep RNNs ( JMLR 2017 )

https://arxiv.org/pdf/1712.04382.pdf

Abstract
Recurrent Seq2Seq AE
System Overview
Experiments
1. Three audio classification tasks
2. Baselines
3. Results

Abstract

auDeep

Python toolkit for deep unsupervised representation learning from acoustic data
architecture
- seq2seq … consider temporal dynamics
provide an extensive CLI in addition to a Python API

code: https://github.com/auDeep/auDeep

Results are competitive with SOTA in audio classification

2. Recurrent Seq2Seq AE

Extends the seq2seq (RNN enc-dec model)

input sequence : fed to a multi-layered RNN
final hidden state fed to FC layer
output : fed to decoder RNN & reconstruct the input sequence

Loss : RMSE (reconstruction loss)
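As a minimal sketch of the reconstruction loss, the function below computes the root mean square error between an input sequence and its reconstruction. The function name `rmse` and the toy values are illustrative assumptions, not taken from the auDeep code base.

```python
import math

def rmse(target, reconstruction):
    """Root mean square error over all values of two equally shaped sequences."""
    if len(target) != len(reconstruction):
        raise ValueError("sequences must have the same number of frames")
    squared_errors = [
        (t - r) ** 2
        for target_frame, recon_frame in zip(target, reconstruction)
        for t, r in zip(target_frame, recon_frame)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy spectrogram: 2 time steps x 3 frequency bins (made-up values)
spectrogram = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
reconstruction = [[0.0, 1.0, 2.0], [3.0, 4.0, 3.0]]
loss = rmse(spectrogram, reconstruction)
print(loss)
```

A perfect reconstruction gives a loss of zero; larger deviations in any frequency bin at any time step increase it.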

Details:

For faster convergence, the expected decoder output from the previous time step (rather than the decoder's own prediction) is fed as the input to the decoder RNN.
Learned representation: the activations of the FC layer
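The forward pass described above can be sketched roughly as follows. This is a simplified NumPy illustration, not auDeep's implementation: it assumes single-layer tanh RNNs instead of multi-layered RNNs, untrained random weights, a linear output projection, and that the FC output is used as the decoder's initial hidden state. All names (`init_rnn`, `rnn_step`, `forward`, `out_proj`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_rnn(input_dim, hidden_dim):
    """Random weights for a single-layer tanh RNN (simplifying assumption)."""
    return {
        "W_in": rng.normal(0.0, 0.1, (input_dim, hidden_dim)),
        "W_h": rng.normal(0.0, 0.1, (hidden_dim, hidden_dim)),
        "b": np.zeros(hidden_dim),
    }

def rnn_step(params, x_t, h_prev):
    """One recurrent update: new hidden state from current input and previous state."""
    return np.tanh(x_t @ params["W_in"] + h_prev @ params["W_h"] + params["b"])

def forward(spectrogram, encoder, fc, decoder, out_proj):
    n_steps, n_freq = spectrogram.shape
    # Encoder: consume the input sequence, keep only the final hidden state.
    h = np.zeros(encoder["W_h"].shape[0])
    for x_t in spectrogram:
        h = rnn_step(encoder, x_t, h)
    # FC layer: its activations are the learned representation.
    representation = np.tanh(h @ fc["W"] + fc["b"])
    # Decoder input at step t is the expected output of step t-1,
    # i.e. the input sequence shifted by one frame with a zero first frame.
    decoder_inputs = np.vstack([np.zeros((1, n_freq)), spectrogram[:-1]])
    h_dec = representation  # assumption: representation initialises the decoder state
    outputs = []
    for x_t in decoder_inputs:
        h_dec = rnn_step(decoder, x_t, h_dec)
        outputs.append(h_dec @ out_proj["W"] + out_proj["b"])
    reconstruction = np.array(outputs)
    # RMSE reconstruction loss
    loss = np.sqrt(np.mean((reconstruction - spectrogram) ** 2))
    return representation, reconstruction, loss

n_steps, n_freq, hidden = 20, 16, 8
encoder = init_rnn(n_freq, hidden)
decoder = init_rnn(n_freq, hidden)
fc = {"W": rng.normal(0.0, 0.1, (hidden, hidden)), "b": np.zeros(hidden)}
out_proj = {"W": rng.normal(0.0, 0.1, (hidden, n_freq)), "b": np.zeros(n_freq)}
spectrogram = rng.normal(size=(n_steps, n_freq))  # stand-in for a real spectrogram

representation, reconstruction, loss = forward(spectrogram, encoder, fc, decoder, out_proj)
print(representation.shape, reconstruction.shape)
```

In a real setup the weights would be trained by minimising the loss; after training, only the encoder and FC layer are needed to extract the representation.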

Input data = spectrograms

time dependent sequences of frequency vectors.
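To make the "sequence of frequency vectors" view concrete, here is a small sketch that turns a waveform into a log-magnitude spectrogram with a plain short-time Fourier transform. The frame length, hop length, and test tone are arbitrary assumptions for illustration and do not reproduce auDeep's own spectrogram extraction settings.

```python
import numpy as np

def log_spectrogram(signal, frame_length=256, hop_length=128):
    """Return a (time steps, frequency bins) log-magnitude spectrogram."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # One frequency vector per frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-8)

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate      # one second of audio
signal = np.sin(2 * np.pi * 440.0 * t)        # 440 Hz test tone
spec = log_spectrogram(signal)
print(spec.shape)  # (time steps, frequency bins)
```

Each row of `spec` is one time step, which is what the encoder RNN consumes one frame at a time.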

Two of the key strengths

(i) fully unsupervised training
(ii) the ability to account for the temporal dynamics of sequences

3. System Overview

extracted features can be exported to CSV or ARFF for further processing

( ex. classification with alternate algorithms )
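As an illustration of what such an export makes possible, the sketch below writes feature vectors to CSV using only the Python standard library. The column layout (`filename`, `label`, `feature_i`) and the function name are assumptions for this example; they are not auDeep's actual export format or API.

```python
import csv
import io

def write_features_csv(handle, filenames, labels, features):
    """Write one row per audio instance: filename, label, then the feature values."""
    n_features = len(features[0])
    writer = csv.writer(handle)
    writer.writerow(["filename", "label"] + [f"feature_{i}" for i in range(n_features)])
    for name, label, vector in zip(filenames, labels, features):
        writer.writerow([name, label] + list(vector))

# Made-up instances; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_features_csv(
    buffer,
    filenames=["a.wav", "b.wav"],
    labels=["park", "beach"],
    features=[[0.1, -0.2], [0.3, 0.4]],
)
csv_text = buffer.getvalue()
print(csv_text)
```

A file in this shape can be loaded by most external tools, so the learned representations can be passed to any classifier of choice.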

4. Experiments

(1) Three audio classification tasks.

(a) Acoustic scene classification
- dataset : TUT Acoustic Scenes 2017 (TUT AS 2017)
(b) Environmental sound classification (ESC)
- dataset : ESC-10 and ESC-50
(c) Music genre classification
- dataset : GTZAN

Train multiple autoencoder configurations using auDeep

& Perform feature-level fusion of the learned representations.

$\rightarrow$ fused representations are evaluated using the built-in MLP with the same cross-validation setup as used for the baseline systems on the TUT AS 2017, ESC-10, and ESC-50 data sets.
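Feature-level fusion can be sketched as follows, assuming it means concatenating, per audio instance, the representations learned by the different autoencoder configurations. The function name and the toy vectors are hypothetical.

```python
def fuse_features(*feature_sets):
    """Concatenate, per instance, the feature vectors produced by several models."""
    n_instances = len(feature_sets[0])
    if any(len(fs) != n_instances for fs in feature_sets):
        raise ValueError("all feature sets must describe the same instances")
    return [
        [value for fs in feature_sets for value in fs[i]]
        for i in range(n_instances)
    ]

# Two instances, represented by two different (made-up) autoencoder configurations
features_a = [[0.1, 0.2], [0.3, 0.4]]   # 2-dimensional representation
features_b = [[1.0], [2.0]]             # 1-dimensional representation
fused = fuse_features(features_a, features_b)
print(fused)
```

The fused vectors are simply longer feature vectors, so the same classifier and cross-validation setup can be applied to them unchanged.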

(2) Baselines

(a) CNN (Piczak, 2015a)

(b) Sparse coding approach ( Henaff et al., 2011 )

SoundNet (Aytar et al., 2016) performs better than auDeep, but the comparison is not fair:
- auDeep was trained using ESC-10 and ESC-50 data only
- SoundNet was pre-trained on an external corpus of 2+ million videos.

(3) Results

AAA (All About AI)

Seunghan Lee