Audio Self-supervised Learning: A Survey (arXiv 2022)

https://arxiv.org/pdf/2203.01205.pdf

https://www.researchgate.net/publication/358974871_Audio_Self-supervised_Learning_A_Survey/link/6225b0ed84ce8e5b4d0cdbf4/download


Contents

  1. Abstract
  2. Audio SSL
  3. Downstream audio Tasks & Benchmarks


Abstract

1. Self-supervised Learning: A General Overview

figure2


2. Audio SSL

figure2

LIM [36], COLA [37], CLAR [33], Fonseca et al. [38]

  • extend the SimCLR approach to learning auditory representations ( a minimal contrastive-loss sketch follows after this list )

  • LIM [36]

    • directly processes speech samples, aiming to maximise the “local mutual information” between the encoded representations of speech chunks sampled from the same utterance
  • COLA [37], Fonseca et al. [38]

    • take segments randomly extracted from time-frequency features along the temporal direction

    • Fonseca et al. [38]

      • several stochastic data augmentations

        • ex) random size cropping and Gaussian noise addition
      • proposed “mix-back” for additional augmentation

        = mixes the incoming patch with a background patch, while ensuring that the incoming patch remains dominant in the mixture.

  • CLAR [33]

    • paired views of the model’s input

      = generated by applying data augmentations to raw audio signals and time-frequency audio features

    • combining the contrastive loss with a supervised loss ( such as the CE loss ) can provide significant improvements

  • Wang [88]

    • also suggests training audio SSL models on different formats of an audio sample
    • training objective = maximise the agreement between (1) the raw waveform and (2) its spectral representation
  • BYOL-A [89]

    • adapts BYOL to the audio domain
    • learns representations from a single audio segment without using negative samples
  • Audio2Vec [49], Speech2Vec [48]

    • inspired by Word2Vec [47]

    • learn audio representations using CBoW and skip-gram formulations.

      • CBoW : (input, output) = (past&future, middle)
        • effective for acoustic scene classification in [90]
      • Skip-gram: (input, output) = (middle, past&future)
    • Audio2Vec vs. Speech2Vec

      • (1) Audio Segmentation

        • Speech2Vec : applies audio segmentation, by using an explicit forced alignment technique

          • to isolate audio slices corresponding to each word

            ( thus, may introduce supervision to some extent )

        • Audio2Vec : requires no explicit assistance ( removing the need for supervision )

      • (2) Architecture

        • Speech2Vec : built based on an RNN encoder-decoder
        • Audio2Vec : built of stacks of CNN blocks
      • (3) Input

        • Speech2Vec : the Mel-spectrogram
        • Audio2Vec : Mel-Frequency Cepstral Coefficients (MFCCs)
      • (4) Temporal Gap ( only Audio2Vec )

        • requests the model to estimate the absolute time distance between two (randomly sampled) slices taken from the same audio clip

          ( presents the idea of measuring the relative positions of audio components as a pretext task )

  • Carr et al. [67]

    • training strategy based on permutations
      • ex) training a model that can reorder shuffled patches of an audio spectrogram
    • also leverages differentiable ranking to integrate permutation inversions into end-to-end training
      • enables solving the permutation inversion for the whole set of permutations
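
These SimCLR-style audio methods share one training recipe: encode two views (segments or augmentations) of the same clip and pull them together against in-batch negatives. Below is a minimal PyTorch sketch of such a segment-level contrastive loss with a COLA-style bilinear similarity; the embedding size, temperature, and the use of random tensors in place of a real encoder are illustrative assumptions, not the exact setup of any cited work.

```python
import torch
import torch.nn.functional as F

def cola_style_contrastive_loss(anchors, positives, bilinear_w, temperature=0.1):
    """Segment-level contrastive loss with in-batch negatives.

    anchors, positives : (B, D) embeddings of two segments / views of the
                         same clip (e.g. random crops of a log-mel spectrogram)
    bilinear_w         : (D, D) learnable matrix for a COLA-style bilinear
                         similarity; other works use cosine similarity instead
    """
    # Similarity of every anchor with every positive in the batch: (B, B).
    sim = anchors @ bilinear_w @ positives.t() / temperature
    # The matching positive sits on the diagonal; the remaining entries
    # of each row act as negatives.
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(sim, targets)

# Toy usage with random embeddings standing in for real encoder outputs.
B, D = 8, 128
w = torch.randn(D, D, requires_grad=True)
loss = cola_style_contrastive_loss(torch.randn(B, D), torch.randn(B, D), w)
loss.backward()
```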


figure2

Predictive model using an auto-encoder

( exploits a masked acoustic model (MAM) )

Reconstruct ENTIRELY : [92]–[94]

  • Mockingjay [92]
    • takes the Mel-spectrogram as input acoustic features
    • exploits transformers to encode randomly masked frames into audio representations.
  • Audio ALBERT [94]

    • same network architecture as Mockingjay

    • but the parameters are shared across all its transformer encoder layers

      $\rightarrow$ achieving faster inference and increased training speed

  • TERA [93]
    • TERA = Transformer Encoder Representations from Alteration
    • extends the masking procedure
      • ex) replacing contiguous time segments with random content
      • ex) masking along the channel (frequency) axis
      • ex) applying Gaussian noise


Reconstruct ONLY MASKED : [95]

  • DAPC [95]
    • only predicts the missing components along the time and frequency axes of an audio spectrogram ( see the masking sketch after this list )
    • can be seen as an extension of CBoW
    • input masked spectrogram : generated using SpecAugment [34]
      • \(\therefore\) the missing parts to be predicted are not only temporal frames, but also frequency bins.
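
Both reconstruction styles reduce to an L1 (or L2) loss on spectrogram frames; they differ only in whether the loss covers every frame (Mockingjay / Audio ALBERT / TERA) or just the masked positions (DAPC). A minimal sketch, where the masking ratio, time-only masking, and the L1 criterion are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def mam_loss(model, spec, mask_ratio=0.15, only_masked=False):
    """Masked acoustic model objective on a (B, T, F) spectrogram batch.

    only_masked=False : reconstruct the ENTIRE input (Mockingjay-style)
    only_masked=True  : restrict the loss to masked frames (DAPC-style)
    """
    # Randomly mask a subset of time frames (time masking only; TERA and
    # DAPC additionally alter frequency bins / add noise).
    mask = torch.rand(spec.shape[:2], device=spec.device) < mask_ratio   # (B, T)
    masked_spec = spec.masked_fill(mask.unsqueeze(-1), 0.0)

    recon = model(masked_spec)                    # (B, T, F) reconstruction
    if only_masked:
        return F.l1_loss(recon[mask], spec[mask])
    return F.l1_loss(recon, spec)
```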

Various pretrain tasks:

  • PASE [96]

    • PASE = problem agnostic speech encoder

    • combines a CNN encoder with “multiple neural decoders ( =workers )”

      ( which aim at solving regression or binary discrimination tasks )

    • (1) Regression tasks

      • recovering the raw audio waveform, the log power spectrogram, MFCCs, and prosody.
    • (2) Binary discrimination tasks ( contrastive learning )

      • by maximising local and global mutual information ( similarly to LIM [36] ).

      $\rightarrow$ Each self-supervised task is expected to provide a different view of the speech signal! ( a multi-worker loss sketch follows after this list )

    • architecture: SincNet [103]

      • to process the raw waveform as the encoder input
      • performs a convolution with a set of parametrised Sinc functions that implement rectangular band-pass filters.
  • PASE+ [97]

    • PASE+ = PASE + (1) + (2) + (3)
      • (1) additional data augmentation techniques
      • (2) more effective workers
      • (3) the CNN encoder is combined with a Quasi-Recurrent Neural Network (QRNN)
        • for capturing long-term dependencies in sequential data in a more efficient way
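
Conceptually, PASE sums the losses of all workers on top of one shared encoder. The sketch below keeps only L1 regression workers and leaves the encoder, worker heads, and target features abstract; the class and argument names are hypothetical, and the binary/contrastive workers are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PASELikeModel(nn.Module):
    """Shared encoder + several small decoders ('workers') whose losses are summed."""

    def __init__(self, encoder, worker_heads):
        super().__init__()
        self.encoder = encoder                   # e.g. a SincNet-style CNN
        self.workers = nn.ModuleDict(worker_heads)

    def loss(self, wav, targets):
        """targets: dict mapping worker name -> hand-crafted feature
        (waveform, log-power spectrogram, MFCCs, prosody, ...)."""
        h = self.encoder(wav)                    # shared representation
        per_worker = {name: F.l1_loss(head(h), targets[name])
                      for name, head in self.workers.items()}
        return sum(per_worker.values()), per_worker
```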


CPC [40]

  • effectively learns representations by predicting the future in a latent space using an autoregressive (AR) model
    • promising results for audio, images, text processing, and reinforcement learning.
  • architecture for Audio …
    • (1) “strided CNN” to encode raw audio to its latent representation.
    • (2) “GRU-RNN” to aggregate the information from all the past timesteps to form a context vector.
  • Contrastive learning
    • contrasts the true future representation against noise representations, given the aggregated context vector ( an InfoNCE sketch follows after this list )
    • Time-domain data augmentation ( such as WavAugment [98] )
      • ex) pitch modification, additive noise, reverberation, band reject filtering, or time masking
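
A compressed view of the CPC objective, as referenced above: the context vector at time \(t\) predicts the latent \(k\) steps ahead, and an InfoNCE loss contrasts that true latent against negatives drawn from other positions or utterances. Shapes, the step-specific linear predictor, and the negative-sampling scheme are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def cpc_infonce_step(c_t, z_future, z_negatives, W_k):
    """InfoNCE loss for one prediction step k.

    c_t         : (B, C)    context vectors from the autoregressive model
    z_future    : (B, D)    true latents k steps in the future
    z_negatives : (B, N, D) latents sampled from other time steps / clips
    W_k         : (C, D)    step-specific linear prediction matrix
    """
    pred = c_t @ W_k                                         # (B, D)
    pos = (pred * z_future).sum(-1, keepdim=True)            # (B, 1)
    neg = torch.einsum('bd,bnd->bn', pred, z_negatives)      # (B, N)
    logits = torch.cat([pos, neg], dim=1)                    # (B, 1 + N)
    # The true future latent is always placed at index 0.
    targets = torch.zeros(c_t.size(0), dtype=torch.long, device=c_t.device)
    return F.cross_entropy(logits, targets)
```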


CPC2 [98]

  • two modifications to CPC

    • (1) GRU-RNN of CPC $\rightarrow$ a two-layer LSTM-RNN

    • (2) linear prediction network $\rightarrow$ a single multi-head transformer layer


figure2

Wav2vec [84]

  • adjusts the CPC structure to a fully convolutional architecture
    • CNN (1) ( encoder network ) : produces latent representations \(z\) from the raw audio
    • CNN (2) ( context network ) : aggregates global context information into a context vector \(c_i\) for each time step
  • substantially improves a character-based ASR system
  • minimising a contrastive loss for each step \(k=1, \ldots, K\) ( a numerical sketch follows after this list ) :
    • \(L_k=-\sum_{i=1}^{T-k}\left(\log \sigma\left(z_{i+k}^T h_k\left(c_i\right)\right)+\lambda\, \mathbb{E}\left[\log \sigma\left(-\tilde{z}^T h_k\left(c_i\right)\right)\right]\right)\).
      • \(\sigma(x)=1 /(1+\exp (-x))\)
      • \(\sigma\left(z_{i+k}^T h_k\left(c_i\right)\right)\) : the probability of \(z_{i+k}\) being the true future sample
      • \(h_k\left(c_i\right)=W_k c_i+b_k\) : a step-specific affine transformation
      • \(\tilde{z}\) : distractor (negative) samples drawn from a proposal distribution
  • final loss : \(L=\sum_{k=1}^K L_k\)
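
To make the notation concrete, the snippet below evaluates \(L_k\) for one step \(k\) given latents \(z\), context outputs \(c\), and the affine predictor \(h_k\); the uniform distractor sampling over positions of the same sequence and the choice of \(\lambda\) as the number of negatives are assumptions kept schematic.

```python
import torch
import torch.nn.functional as F

def wav2vec_step_loss(z, c, W_k, b_k, k, num_distractors=10):
    """Evaluate L_k for one step k.

    z : (T, D) latent representations, c : (T, C) context vectors,
    h_k(c_i) = W_k c_i + b_k with W_k : (C, D), b_k : (D,).
    """
    T = z.size(0)
    h = c[: T - k] @ W_k + b_k                                # h_k(c_i), (T-k, D)
    pos = F.logsigmoid((z[k:] * h).sum(-1))                   # log sigma(z_{i+k}^T h_k(c_i))
    # Distractors: drawn uniformly from other positions of the same sequence
    # (a placeholder for wav2vec's actual sampling scheme).
    idx = torch.randint(0, T, (T - k, num_distractors), device=z.device)
    neg = F.logsigmoid(-(z[idx] * h.unsqueeze(1)).sum(-1)).mean(-1)
    lam = num_distractors    # lambda tied to the number of negatives (assumption)
    return -(pos + lam * neg).sum()
```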


VQ-Wav2vec [85]

  • similar to a vector-quantised VAE (VQ-VAE)

    \(\rightarrow\) exploits a vector quantisation module after the wav2vec encoder

  • aims to find, for each representation, the closest embedding from a fixed size codebook \(e \in \mathbb{R}^{V \times d}\)

    • codebook : contains \(V\) representations of size \(d\).
  • “Discrete representations” are fed into the context network & optimised in the same way as for wav2vec.

  • uses Gumbel-Softmax to overcome the non-differentiability caused by the argmax operation


Mode collapse in single codebook

( Using a single codebook for coding representations tends to cause mode collapse )

\(\rightarrow\) solution : multiple codebooks are used as in product quantisation!

  • product quantisation = choosing quantised representations from multiple codebooks and concatenating them.

  • Given \(G\) codebooks with \(V\) entries \(e \in \mathbb{R}^{V \times d / G}\), one entry from each codebook is selected.

  • A linear transformation is applied after concatenating the selected codewords.

  • probabilities for choosing the \(v\)-th codebook entry for group \(g\) : \(p_{g, v}=\frac{e^{\left(l_{g, v}+n_v\right) / \tau}}{\sum_{k=1}^V e^{\left(l_{g, k}+n_k\right) / \tau}}\).

    • where \(l \in \mathbb{R}^{G \times V}\) represents the logits obtained by projecting the encoded dense representation
    • \[n=-\log (-\log (u))\]
      • \(u\) are uniform samples from \(U(0,1)\)

    \(\rightarrow\) codeword \(i\) in group \(g\) is chosen by \(\operatorname{argmax}_i p_{g, i}\) ( see the sketch below ).
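
This selection maps directly onto the Gumbel-Softmax estimator: hard one-hot choices in the forward pass, soft gradients in the backward pass. A minimal sketch, assuming the logits have already been projected to shape (B, G, V) and omitting the final linear transformation:

```python
import torch
import torch.nn.functional as F

def product_quantise(logits, codebooks, tau=2.0):
    """Pick one entry per codebook group with Gumbel-Softmax and concatenate.

    logits    : (B, G, V)      projection of the dense representation
    codebooks : (G, V, d // G) G codebooks with V entries each
    """
    B, G, V = logits.shape
    # Hard one-hot selection in the forward pass, straight-through
    # (soft) gradients in the backward pass.
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)    # (B, G, V)
    selected = torch.einsum('bgv,gvd->bgd', one_hot, codebooks)       # (B, G, d/G)
    # A linear transformation usually follows the concatenation (omitted here).
    return selected.reshape(B, -1)                                    # (B, d)
```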


figure2

K-means clustering

( can also be used for differentiable vector quantisation, as an alternative to Gumbel-Softmax )

  • the codeword with the closest distance to the dense representation \(z\) is selected.
  • additional terms are added to the wav2vec objective function ( see the sketch after this list )
    • \(L=\sum_k L_k+\left(\left\|\operatorname{sg}(z)-q\right\|^2+\gamma\left\|z-\operatorname{sg}(q)\right\|^2\right)\).
  • term \(\left\|\operatorname{sg}(z)-q\right\|^2\) = freezes the encoder output \(z\) and forces the codewords \(q\) to be closer to the encoder output.
  • term \(\left\|z-\operatorname{sg}(q)\right\|^2\) = drives each encoder output to be close to one codeword, which is one centroid of the K-means clustering.
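
The two auxiliary terms are the familiar VQ-VAE codebook and commitment losses, implemented with stop-gradient (detach) and a straight-through estimator. A minimal sketch; the Euclidean nearest-neighbour search and the default value of \(\gamma\) are illustrative assumptions:

```python
import torch

def kmeans_vq_terms(z, codebook, gamma=0.25):
    """Auxiliary VQ terms added to the wav2vec loss.

    z        : (B, D) encoder outputs
    codebook : (V, D) codewords (K-means centroids)
    """
    # Nearest codeword (Euclidean distance) for every encoder output.
    dist = torch.cdist(z, codebook)                            # (B, V)
    q = codebook[dist.argmin(dim=-1)]                          # (B, D)

    codebook_loss = ((z.detach() - q) ** 2).sum(-1).mean()     # ||sg(z) - q||^2
    commitment    = ((z - q.detach()) ** 2).sum(-1).mean()     # ||z - sg(q)||^2
    # Straight-through estimator: gradients flow back to z as if q were z.
    q_st = z + (q - z).detach()
    return q_st, codebook_loss + gamma * commitment
```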


Wav2vec 2.0

  • Wav2vec and VQ-Wav2vec : motivated by CPC

    • process the audio input in only one (forward) direction
  • Wav2vec 2.0

    • exploits a bidirectional masked predictive coding (MPC) model
  • representations \(z\) are partly MASKED before being fed to a transformer network

  • jointly trained to contrast the true representations from distractors, given the contextualised representations.

  • similar to VQ-Wav2vec, Wav2vec 2.0 applies product quantisation to the latent representations

  • however, the quantised vector \(q_t\) for each time step in Wav2vec 2.0 is not fed into the context network, but only used in the objective function ( see the sketch after this list ):

    • \(L=\mathbb{E}\left[-\log \frac{e^{c_t^T q_t / \tau}}{\sum_{\tilde{q} \sim Q_t} e^{c_t^T \tilde{q} / \tau}}\right]\).
    • where the candidate set \(Q_t\) comprises \(q_t\) and \(K\) distractors.
  • Regularised by a diversity loss \(L_d\)

    • to encourage the model to use \(V\) codebook entries equally often
    • \(L_d=\frac{1}{G V} \sum_{g=1}^G-H\left(\bar{p}_g\right)=\frac{1}{G V} \sum_{g=1}^G \sum_{v=1}^V \bar{p}_{g, v} \log \bar{p}_{g, v}\).
  • Wav2vec 2.0 has been explored from the perspective of domain shift in [111]
  • Finding (1): matching the conditions of the pre-training and test data is very important in order to achieve satisfactory speech recognition results.
  • Finding (2): pre-training on multiple domains can improve the generalisation ability of the learnt representations.
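
Putting the two terms together: the contrastive part scores the true quantised latent against \(K\) distractors with a cosine similarity, and the diversity part penalises unequal codebook usage. The sketch below leaves the distractor sampling and the weighting between the two losses as assumptions:

```python
import torch
import torch.nn.functional as F

def w2v2_contrastive_loss(c_t, q_t, distractors, tau=0.1):
    """Contrastive term: identify the true quantised latent among distractors.

    c_t : (B, D) context outputs at masked positions
    q_t : (B, D) true quantised latents; distractors : (B, K, D)
    """
    candidates = torch.cat([q_t.unsqueeze(1), distractors], dim=1)     # (B, K+1, D)
    sims = F.cosine_similarity(c_t.unsqueeze(1), candidates, dim=-1) / tau
    targets = torch.zeros(c_t.size(0), dtype=torch.long, device=c_t.device)
    return F.cross_entropy(sims, targets)

def diversity_loss(p_bar):
    """p_bar : (G, V) average softmax codebook usage over a batch (L_d above)."""
    G, V = p_bar.shape
    return (p_bar * torch.log(p_bar + 1e-7)).sum() / (G * V)
```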


Summary: Wav2vec audio SSL models

  • learn latent representations without considering specific tasks for pre-training.
  • After pre-training, they are fine-tuned for downstream tasks in an additional step.
  • Wav2vec-U [116]
    • Wav2vec-U = Wav2vec Unsupervised
    • learns a map from audio representations to phonemes directly without supervision.
    • GAN architecture
      • \(G\) : the generator uses Wav2vec 2.0 to extract speech representations and generates a phoneme sequence from them with the help of a clustering method
      • \(D\) : the generated phoneme sequence tries to fool a discriminator \(D\) that is conditioned on real phoneme sequences from unlabelled text


Phonetic clustering in SeqRAAE [87]

  • The idea of grouping quantised audio representations into phoneme sequences

  • discrete representation is learnt in an AE architecture with vector quantisation.

  • consecutive repeated quantised representations are further grouped to form phonetic units

    ( = Each phoneme can therefore correspond to several repeated codewords, which is similar to the format of Connectionist Temporal Classification (CTC) [117] )


Hidden unit BERT (HuBERT) [99]

  • does not apply contrastive learning for training the same MPC model, and avoids vector quantisation.

  • each of the learnt audio representations is paired with a pseudo-label, obtained by applying K-means clustering to the MFCCs of the input audio ( see the sketch below )

  • benefits from cluster ensembles

    ( \(\because\) K-means clusterings with different numbers of cluster centres provide targets of different granularity )
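
A minimal sketch of how such first-iteration pseudo-labels could be produced: cluster MFCC frames with K-means and use the cluster indices as frame-level targets for the masked prediction model. The use of librosa and scikit-learn, and all hyper-parameters, are illustrative assumptions rather than HuBERT's exact recipe (later iterations re-cluster the model's own hidden features instead).

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def mfcc_pseudo_labels(waveforms, sr=16000, n_clusters=100):
    """Frame-level pseudo-labels for a first HuBERT-style training iteration."""
    frames = [librosa.feature.mfcc(y=w, sr=sr, n_mfcc=13).T for w in waveforms]
    feats = np.concatenate(frames, axis=0)            # (total_frames, 13)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
    # One discrete target per frame; the masked prediction model is then
    # trained with cross-entropy against these cluster indices.
    return [km.predict(f) for f in frames]
```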

Methods for the speech enhancement (SE) task [118], [119]

  • share similar structure with the auto-encoding predictive model
  • goal: noisy audio input \(\rightarrow\) clean speech.
      • noisy = clean speech mixed with a noise recording


Very recent works in audio SSL

  • solve typically challenging tasks, such as
    • speech enhancement [120]–[122]
    • source separation [123]
  • CAE (clean auto-encoder) and MAE (mixture auto-encoder) [120]
    • a pair of variational auto-encoders
    • CAE : encodes clean speech
      • by minimising the reconstruction error of its input spectrogram.
    • MAE : encodes a noisy utterance
      • forces the encoded representation into the same latent space as the CAE, by using cycle-consistency loss terms
    • learns a mapping from the domain of mixtures to the domain of clean sounds without using paired training examples.
  • MixIT [124]
    • MixIT = Mixture Invariant Training
    • for solving unsupervised sound separation
    • Separation Network
      • Input = a mixture of multiple single-channel acoustic mixtures (MOM)
        • each acoustic mixture comprises several speech sources
      • Goal = decompose the MOM into separate audio sources
        • these are then selected and re-mixed to approximate each acoustic mixture of the MOM
      • ( similarly to Permutation Invariant Training (PIT) [125], the remix matrix is optimised by choosing the best match between the separated sources and the acoustic mixtures; see the objective below )
  • Denoising pretraining [123]
    • an alternative solution to the permutation switching problem of source separation
    • pretraining task = speech denoising
    • fine-tuning task = source separation
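
Written out, the MixIT objective (referred to above) searches over binary remix matrices \(A\) that assign each of the \(M\) estimated sources \(\hat{s}\) to one of the two reference mixtures \(x_1, x_2\), with \(\mathcal{L}\) a signal-level reconstruction loss such as negative SNR (schematic form):

\[
L_{\text{MixIT}}\left(x_1, x_2, \hat{s}\right)=\min_{A \in \mathbb{B}^{2 \times M}} \sum_{i=1}^{2} \mathcal{L}\left(x_i,\left[A \hat{s}\right]_i\right)
\]

where each column of \(A\) contains exactly one non-zero entry, so that every estimated source is assigned to exactly one of the two reference mixtures.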


PSE = Personalised Speech Enhancement ( an SE system specialised for a particular person )

  • two SSL algorithms :

    • (1) pseudo speech enhancement (PseudoSE)
    • (2) Contrastive mixtures (CM)

    for extracting speaker-specific discriminative features.

  • (1) PseudoSE model
    • trained to recover a premixture signal from a pseudo-source
      • premixture signal = clean speech contaminated by noise
      • pseudo-source = a mixup of the premixture signal and additional noise
  • (2) CM method
    • generalises the training via contrastive learning
    • positive & negative
      • positive : shares the same premixture signal (but deformed with different additional noises),
      • negative : stems from two different premixture sources mixed with the same additional noise.
    • trained to recover premixture sources rather than clean speech


Data purification (DP) [126]

  • introduced in the pseudo speech enhancement training

  • a separate model is trained to estimate the segmental SNR of the premixture signals

    ( measuring the varying importance of the different audio frames )


3. Downstream audio Tasks & Benchmarks

Several different downstream audio tasks have been considered for empirically measuring the audio representation quality!

  • (1) Automatic Speech Recognition (ASR)
    • used for evaluating all Wav2vec-based methods [81], [84], [85].
  • (2) Speaker Identification [36], [45], [103]
  • (3) Speech emotion recognition [32], [157]–[159]
  • (4) Speech machine translation [160]
  • (5) pitch detection [161]
  • (6) acoustic scene classification [90]


The following describes some publicly available benchmarks that enable a fair comparison between different audio SSL algorithms.


(1) Zero resource Speech challenge (ZeroSpeech) [162]

a) Challenge in 2015

  • task of unsupervised discovery of linguistic units from raw speech in an unknown language

  • tasks are split into two tracks

    • (1) unsupervised sub-word modelling
    • (2) spoken term discovery

    \(\rightarrow\) each focusing on a different level of linguistic structure.


(1) unsupervised sub-word modelling

  • aims at constructing a representation of speech sounds that is robust to within- and between-speaker variation and supports word identification

(2) spoken term discovery

  • aims at unsupervised discovery of ‘words’, taking raw speech as input.


b) Challenge in 2017

extends the study to variation in language and speaker

  • considering the topic of cross-language generalisation and speaker adaptation


c) Challenge in 2019

addresses the problem of building a speech synthesiser without any text or phonetic labels.

goal: discover sub-word units in an unsupervised way given raw audio.


d) Challenge in 2021

several tasks for spoken language modelling, based on speech-only as well as visually-grounded input.


(2) Speech processing Universal PERformance Benchmark (SUPERB)

Presents a standard and comprehensive testbed for evaluation that can be generally applied to pre-trained models on various tasks.

10 downstream tasks are provided

  • Phoneme Recognition, Automatic Speech Recognition, Keyword Spotting, Query by Example Spoken Term Detection, Speaker Identification, Automatic Speaker Verification, Speaker Diarisation, Intent Classification, Slot Filling, and Emotion Recognition.


(3) LeBenchmark [164]

another reproducible and multifaceted benchmark for evaluating speech SSL models for the French language.

4 tasks

  • Speech Recognition (ASR), Spoken Language Understanding (SLU), Speech Translation (AST), and Emotion Recognition (AER).


(4) Libri-Light [165]

benchmark specifically designed for the task of ASR with limited or no supervision

  • based on spoken English audio collected from open-source audio books of the LibriVox project.


(5) HEAR

HEAR = Holistic Evaluation of Audio Representations

  • provides a benchmark suite covering both speech and non-speech tasks
  • goal : create an audio representation that is as holistic as the human ear


3 main tasks in HEAR 2021

  • word classification, pitch detection, and sound event detection
