Unsupervised Contrastive Learning of Sound Event Representations (arXiv, 2020)
- Abstract
- Introduction
- Method
- Stochastic sampling of data views
- Mix-back
- Stochastic Data Augmentation
- Architectures
Unsupervised contrastive learning is explored as a way to learn sound event representations.
The proposed pretext task contrasts differently augmented views of the same sound event.
Views are computed primarily by mixing training examples with unrelated backgrounds,
followed by other data augmentations.
1. Introduction
Sound event recognition (SER)
Two largest labeled SER datasets
- (1) AudioSet
- provides a massive amount of content but the official release does not include waveforms, and the labelling in some classes is less precise
- (2) FSD50K
- consists of open-licensed audio curated with a more thorough labeling process,
- but the data amount is more limited.
Related works in SSL
(1) Early work on SSL for sound event representation learning [7]
- uses triplet loss-based training, creating anchor-positive pairs via simple audio transformations
- e.g., adding noise or mixing examples.
(2) Predicting the long-term temporal structure of continuous recordings captured with an acoustic sensor network [12]
(3) Two pretext tasks [11]
- a) estimating the time distance between pairs of audio segments
- b) reconstructing a spectrogram patch from past and future patches
This work: pretext task of contrasting differently augmented views of sound events
- different views = mixing training examples with unrelated background examples, followed by other data augmentations.
- evaluated via linear evaluation
- and on two downstream sound event classification tasks
2. Method
(1) Stochastic sampling of data views
- Input \(\mathcal{X}\) : log-mel spectrograms of audio clips
- Sample 2 views, each a time-frequency (TF) patch
- \(x_i \in \mathcal{X}\) and \(x_j \in \mathcal{X}\) are selected randomly over the length of the clip spectrogram
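The sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `patch_frames` parameter, and the spectrogram dimensions are all assumptions made for the example, and the two offsets are drawn independently, so the patches may overlap.

```python
import numpy as np

def sample_two_views(spec, patch_frames, rng):
    """Draw two TF patches at independent random time offsets.

    spec: log-mel spectrogram, shape (n_frames, n_mels).
    patch_frames: patch length in frames (hypothetical parameter).
    """
    n_frames = spec.shape[0]
    if n_frames < patch_frames:
        raise ValueError("clip is shorter than the requested patch length")
    # Offsets are uniform over all valid start positions in the clip.
    starts = rng.integers(0, n_frames - patch_frames + 1, size=2)
    x_i = spec[starts[0]:starts[0] + patch_frames]
    x_j = spec[starts[1]:starts[1] + patch_frames]
    return x_i, x_j

rng = np.random.default_rng(0)
spec = rng.normal(size=(500, 64))  # stand-in for a real log-mel spectrogram
x_i, x_j = sample_two_views(spec, patch_frames=101, rng=rng)
```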
(2) Mix-back
Mixing the (1) incoming patch \(x_i\) with a (2) background patch, \(b_i\)
- \(x_i^m=(1-\lambda) x_i+\lambda\left[E\left(x_i\right) / E\left(b_i\right)\right] b_i\) (Eq. 1)
- \(\lambda \sim \mathcal{U}(0, \alpha)\).
- \(\alpha \in[0,1]\) is the mixing hyper-parameter (typically small)
- \(E(\cdot)\) : energy of a given patch.
- The energy adjustment in Eq. 1 ensures that \(x_i\) is always dominant over \(b_i\), even if \(E\left(b_i\right) \gg E\left(x_i\right)\),
- preventing aggressive transformations that may make the pretext task too difficult
(Details: before applying Eq. 1, patches are converted to linear scale by inverting the log of the log-mel, which allows energy-wise compensation. Mix-back is then applied, and the output \(x_i^m\) is converted back to log scale.)
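A minimal NumPy sketch of Eq. 1, including the log/linear conversion, might look like this. It rests on two assumptions that the notes do not specify: the log-mel uses the natural logarithm, and the energy \(E(\cdot)\) is the sum of the linear-scale values of the patch. The function and argument names are hypothetical.

```python
import numpy as np

def mix_back(x_log, b_log, alpha, rng):
    """Mix patch x with background b following Eq. 1.

    x_log, b_log: log-mel patches of identical shape (natural log assumed).
    alpha: upper bound of the uniform mixing coefficient, in [0, 1].
    """
    # Undo the log so energies can be compared on a linear scale.
    x = np.exp(x_log)
    b = np.exp(b_log)
    lam = rng.uniform(0.0, alpha)
    # Energy taken as the sum of linear-scale values (assumed definition).
    energy_x = np.sum(x)
    energy_b = np.sum(b)
    mixed = (1.0 - lam) * x + lam * (energy_x / energy_b) * b
    # Back to log scale for the rest of the pipeline.
    return np.log(mixed)
```

Under this energy definition, the rescaled background has the same energy as \(x_i\), so the mixture keeps the energy of \(x_i\) while the background contributes only a fraction \(\lambda \leq \alpha\) of it.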
Background patches \(b\)
are randomly drawn from the training set (excluding the input clip \(\mathcal{X}\)). The motivation is two-fold:
- (1) shared information across positives is decreased by mixing \(x_i\) and \(x_j\) with different backgrounds
- (2) semantic information is preserved due to sound transparency (i.e., a mixture of two sound events inherits the classes of the constituents) and the fact that the positive patch is always predominant in the mixture.
Is mix-back a data augmentation?
- Yes, but it is treated separately from the others because it involves two input patches.
(3) Stochastic Data Augmentation
Adopt DAs directly computable over TF patches (rather than waveforms)
- simple for on-the-fly computation
Transform \(x_i^m\) into the input patch \(\tilde{x}_i\) for the encoder network.
Consider DAs both from computer vision and audio literature
- random resized cropping (RRC)
- random time/frequency shifts
- compression
- SpecAugment
- Gaussian noise addition
- Gaussian blurring.
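As an illustration of augmentations computed directly on TF patches, the sketch below applies one time mask, one frequency mask, and additive Gaussian noise. It is a simplified stand-in for SpecAugment-style masking and Gaussian noise addition, not the augmentation policy used in the paper. All hyper-parameter names are hypothetical, and filling masked cells with the patch mean is an assumption.

```python
import numpy as np

def mask_and_add_noise(patch, max_time_mask, max_freq_mask, noise_std, rng):
    """Apply one time mask, one frequency mask, then additive Gaussian noise.

    patch: TF patch, shape (n_frames, n_mels).
    max_time_mask, max_freq_mask: largest mask widths (hypothetical values,
    must not exceed the corresponding patch dimension).
    """
    out = patch.copy()  # leave the caller's array untouched
    n_frames, n_mels = out.shape
    fill = out.mean()  # masked cells are set to the patch mean (assumption)

    # Mask a random band of consecutive time frames.
    t_width = rng.integers(0, max_time_mask + 1)
    t_start = rng.integers(0, n_frames - t_width + 1)
    out[t_start:t_start + t_width, :] = fill

    # Mask a random band of consecutive mel bins.
    f_width = rng.integers(0, max_freq_mask + 1)
    f_start = rng.integers(0, n_mels - f_width + 1)
    out[:, f_start:f_start + f_width] = fill

    return out + rng.normal(0.0, noise_std, size=out.shape)
```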
(4) Architectures
a) Encoder
CNN-based network \(f_\theta\)
- extracts the embedding \(h_i=f_\theta\left(\tilde{x}_i\right)\) from the augmented patch \(\tilde{x}_i\)
b) Projection head
Simple projection network \(g_{\varphi}\)
- consists of an MLP with one hidden layer, batch normalization, and a ReLU
- outputs an L2-normalized low-dimensional representation \(z_i = g_{\varphi}\left(h_i\right)\)
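A forward pass through such a projection head can be sketched in NumPy as below. The layer order (linear, batch normalization, ReLU, linear) is an assumption, since the notes only list the components, and the batch normalization here uses batch statistics without a learned scale or shift. Weight names and dimensions are hypothetical.

```python
import numpy as np

def projection_head(h, w1, b1, w2, b2, eps=1e-5):
    """Hidden linear layer -> batch norm -> ReLU -> output layer -> L2 norm.

    h: encoder embeddings, shape (batch, d_in).
    w1, b1, w2, b2: hypothetical weights and biases of the two linear layers.
    """
    a = h @ w1 + b1
    # Normalize each hidden unit over the batch (no learned scale or shift).
    a = (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)
    a = np.maximum(a, 0.0)  # ReLU
    z = a @ w2 + b2
    # L2-normalize each row so dot products become cosine similarities.
    return z / np.linalg.norm(z, axis=1, keepdims=True)
```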
c) Contrastive Loss
NT-Xent loss
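\(\ell_{i j}=-\log \frac{\exp \left(z_i \cdot z_j / \tau\right)}{\sum_{v=1}^{2 N} \mathbb{1}_{v \neq i} \exp \left(z_i \cdot z_v / \tau\right)}\)
- \(\tau\) : temperature
- \(N\) : batch size (each example yields two views, hence \(2N\) embeddings)
- \(\mathbb{1}_{v \neq i}\) : indicator that excludes the similarity of \(z_i\) with itself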
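The loss above can be computed for a whole batch with a few lines of NumPy. This is a sketch under one layout assumption that is not from the paper: rows \(2k\) and \(2k+1\) of the input hold the two views of example \(k\). The function name is hypothetical, and the result is averaged over all \(2N\) anchors.

```python
import numpy as np

def nt_xent(z, tau):
    """Mean NT-Xent loss over a batch of 2N L2-normalized embeddings.

    z: shape (2N, d); rows 2k and 2k+1 are assumed to be the two views
    of example k.
    tau: temperature.
    """
    n = z.shape[0]
    sim = (z @ z.T) / tau
    np.fill_diagonal(sim, -np.inf)  # implements the indicator v != i
    # Log of the denominator, computed stably by subtracting the row max.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    partner = np.arange(n) ^ 1  # pairs 0<->1, 2<->3, ...
    log_num = sim[np.arange(n), partner]
    return float(np.mean(log_denom - log_num))
```

The loss is small when each embedding is closest to its partner view and grows when other embeddings in the batch are more similar than the partner.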