WaveNet : A Generative Model for Raw Audio (2016)

Contents

  0. Abstract
  1. WaveNet
    (1) Dilated Causal Convolutions


0. Abstract

WaveNet : a deep neural network (DNN) for generating raw audio waveforms

  • fully probabilistic
  • autoregressive


1. WaveNet

  • introduces a new generative model that operates directly on the raw audio waveform

  • joint probability of waveform \(\mathbf{x}=\left\{x_{1}, \ldots, x_{T}\right\}\) :

    factorized as …

    \(p(\mathbf{x})=\prod_{t=1}^{T} p\left(x_{t} \mid x_{1}, \ldots, x_{t-1}\right)\).

  • similar to PixelCNNs, the conditional probability distribution is modelled by a stack of CONVOLUTIONAL layers

  • outputs a categorical distribution over the next value \(x_t\) with a softmax layer

  • optimized to maximize the log-likelihood of the data w.r.t. the parameters
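
The factorization above can be checked numerically. The sketch below is only an illustration: `toy_logits` is a hypothetical stand-in for the convolutional stack (it is not from the paper), and the sample values are assumed to be already quantized to integer class indices.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(history, num_classes):
    # Hypothetical stand-in for the network: slightly favors
    # repeating the most recent sample. Sees only PAST samples.
    logits = [0.0] * num_classes
    if history:
        logits[history[-1]] = 2.0
    return logits

def log_likelihood(x, num_classes):
    # log p(x) = sum_t log p(x_t | x_1, ..., x_{t-1})
    total = 0.0
    for t in range(len(x)):
        probs = softmax(toy_logits(x[:t], num_classes))
        total += math.log(probs[x[t]])
    return total
```

Training would adjust the parameters of the real network to push this sum up; the toy model here has no parameters and only shows how the per-step categorical terms add up.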


(1) Dilated Causal Convolutions

figure2

key of WaveNet : “causal convolutions”

  • ensure “time ordering” ( = no cheating ) : the prediction of \(x_t\) depends only on \(x_1, \ldots, x_{t-1}\), never on future timesteps

    ( \(\approx\) masked convolution in PixelCNN, for images )

  • for 1D data (e.g. audio), easy to implement

    ( just shift the output of a normal convolution by a few timesteps )
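
A minimal sketch of a 1D causal convolution in plain Python. It left-pads the input with zeros, which is assumed here to have the same effect as shifting the output of a normal convolution; the function name and the zero padding are illustrative choices, not the paper's implementation.

```python
def causal_conv1d(x, w):
    """y[t] = sum_j w[j] * x[t - j], with x[t] treated as 0 for t < 0."""
    k = len(w)
    # Left-pad so that output t only sees inputs at positions <= t.
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(w[j] * padded[t + (k - 1) - j] for j in range(k))
        for t in range(len(x))
    ]
```

Changing a future input leaves earlier outputs untouched: `causal_conv1d([1, 2, 3, 4], w)` and `causal_conv1d([1, 2, 99, 4], w)` agree on their first two entries for any filter `w`.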


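Training & Inference

  • [ Training ] : predictions for all timesteps can be made in parallel ( all timesteps of the ground truth \(\mathbf{x}\) are KNOWN )

    \(\rightarrow\) faster to train than RNNs

  • [ Inference ] : predictions are made sequentially ( each predicted sample is fed back into the network to predict the next one )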
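
The sequential nature of inference can be sketched as a sampling loop. `toy_next_distribution` is a hypothetical placeholder for the trained network (an assumption for illustration only); the point is that step t cannot start until the sample from step t-1 exists.

```python
import random

def toy_next_distribution(history, num_classes):
    # Hypothetical stand-in for the network's softmax output:
    # uniform at the start, then mostly repeats the last sample.
    if not history:
        return [1.0 / num_classes] * num_classes
    probs = [0.1 / (num_classes - 1)] * num_classes
    probs[history[-1]] = 0.9
    return probs

def generate(num_steps, num_classes, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(num_steps):
        # Each step conditions on everything generated so far ...
        probs = toy_next_distribution(samples, num_classes)
        nxt = rng.choices(range(num_classes), weights=probs, k=1)[0]
        # ... and its sample is fed back as input for the next step.
        samples.append(nxt)
    return samples
```

During training the loop is unnecessary, because the whole ground-truth sequence is available and every conditional can be evaluated at once.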


Problem : causal convolutions require many layers ( or large filters ) to increase the receptive field

\(\rightarrow\) use dilated convolutions to increase the receptive field! ( without greatly increasing computational cost )
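
A rough way to see the gain is to count the receptive field of a stack of causal convolutions. The helper below is a sketch; the doubling dilation schedule (1, 2, 4, 8) is an assumption chosen for illustration, since these notes do not state a schedule.

```python
def receptive_field(kernel_size, dilations):
    # Each layer with dilation d extends the receptive field
    # by (kernel_size - 1) * d input samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Four layers, filter length 2, no dilation.
plain = receptive_field(2, [1, 1, 1, 1])
# Same depth and filter length, dilation doubling per layer.
dilated = receptive_field(2, [1, 2, 4, 8])
```

With the same number of layers and the same number of weights per layer, the undilated stack covers 5 samples while the dilated one covers 16.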


Dilated Convolution

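  • use “skipping” for “efficiency” : the filter is applied over an area larger than its length by skipping input values with a certain step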

  • similar to pooling / strided convolutions

    ( BUT, difference : output has the SAME size as the input )
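
The "skipping" and the "same size" property can be sketched as follows. This is an illustrative plain-Python version with zero left-padding (an assumption, not the paper's code); `dilation=1` reduces to an ordinary causal convolution.

```python
def dilated_causal_conv1d(x, w, dilation):
    """y[t] = sum_j w[j] * x[t - j * dilation], zeros before the start."""
    k = len(w)
    pad = (k - 1) * dilation
    # Left-pad so the output keeps the input length and stays causal.
    padded = [0.0] * pad + list(x)
    return [
        sum(w[j] * padded[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ]
```

For example, with `x = [1, 2, 3, 4, 5]`, `w = [1, 1]` and `dilation=2`, each output adds the current sample and the one two steps back, and the result still has 5 entries (unlike pooling or a strided convolution).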

figure2
