BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation (IJCNN, 2021)
https://arxiv.org/pdf/2103.06695.pdf
Contents
- Abstract
- Introduction
- BYOL-A
  - Pre-normalization
  - Mixup for foreground acoustic event
  - RRC for all content details
  - Post-Normalization
- Experiments
Abstract
New general-purpose audio representation learning approach!
No negative samples ( \(\approx\) BYOL )
- without expecting relationships between different time segments of audio samples.
Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A)
- creates contrasts in an augmented audio segment pair derived from a single audio segment
- combination of normalization and augmentation techniques
1. Introduction
Contrastive Learning
- use positive and negative samples
- requires a large number of negative samples
- SimCLR [7] : relies on large batch sizes to supply enough negative samples
- MoCo [6] : maintains a large queue to accommodate a larger number of negative samples
- BYOL [8] : does not use negative samples
Bootstrap Your Own Latent (BYOL)
- no negative samples
- directly minimizes the MSE of embeddings originating from the same input ( see the sketch below )
- collapsed representations? system architecture and training algorithm can avoid this problem!
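A minimal PyTorch sketch of this objective, i.e. the MSE between L2-normalized embeddings of two augmented views of the same input (function and variable names here are mine, not from the paper):

```python
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """MSE between L2-normalized embeddings of two views of one input.

    online_pred: output of the online network's predictor, shape (batch, dim).
    target_proj: output of the target (momentum) network's projector, same shape,
                 treated as a constant (stop-gradient).
    """
    q = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)
    # ||q - z||^2 = 2 - 2 * <q, z> when both are unit vectors
    return (2 - 2 * (q * z).sum(dim=-1)).mean()
```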
SSL in Audio
COLA [14]: learns general-purpose representations and outperforms previous methods
Others: utilize the time-series aspect of audio signals
- assumption: audio segments cropped closer in time = closer representations (?)
\(\rightarrow\) contradictory use cases can be found easily!
- ex) repetitive sounds like music can have similar content in remote time segments, because music compositions, by their nature, repeat motifs.
- ex) short acoustic events (e.g., a single knock, a gunshot) occur within a short duration
- even adjacent segments (e.g., a knock followed by a footstep) can differ in content for acoustic events.
- similar problems can also happen when we use contrastive learning [11] [14] or triplet loss [12] [13] because the comparison of multiple samples is the core of their loss calculation.
BYOL-A
Address these problems by learning general-purpose audio representations from a single audio segment!!
Focus on learning
- (A) the foreground acoustic event sound as a dominant sound representation
- (B) the sound texture details
for describing general-purpose representation.
(A) Foreground acoustic event
- can be better learned from samples with random background variations while the foreground is kept unchanged
- mixing in a small amount of other sounds can approximate making variations on the background
\(\rightarrow\) adopt “mixup”
(B) Sound texture
- sounds from an acoustic scene or a sound texture can vary in pitch/speed/time, while the details stay consistent
- details can be learned under random variations of pitch/speed/time shifts
\(\rightarrow\) use approximation of audio “pitch shifting” and “time stretching” techniques
Create changes on a pair of segments originating from exactly the same segment, not from multiple segments
Contributions
- (1) Propose learning general-purpose audio representations from a single audio segment
- (2) BYOL for Audio (BYOL-A)
- learns representations from a single audio segment input with a dedicated audio augmentation module that focuses on foreground and content details
- (3) Propose to learn:
- a) foreground sound by combining pre-normalization and mixup
- b) content details through approximation of pitch shifting and time stretching.
- (4) Extensive ablation studies
2. BYOL-A
General-purpose audio representations from a single audio segment
(1) Input : audio preprocessed as a log-scaled mel-spectrogram ( = time-frequency feature; see the sketch after this list )
(2) Data Augmentation: replace the augmentation module in BYOL with ours
- so that the learning system can handle audio and create contrasts in augmented views
- the augmentation module consists of 4 blocks
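For step (1), a sketch of the waveform-to-log-mel preprocessing using torchaudio; all parameter values below are illustrative placeholders, not necessarily the settings used in the paper:

```python
import torch
import torchaudio

# Waveform -> log-scaled mel-spectrogram (time-frequency feature).
# Parameter values are illustrative, not necessarily those used in the paper.
to_melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=160,   # 10 ms hop at 16 kHz
    n_mels=64,
)

def log_mel_spectrogram(waveform: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Return the log-scaled mel-spectrogram of a mono waveform."""
    return torch.log(to_melspec(waveform) + eps)
```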
Data Augmentation blocks
- (1) Pre-Normalization block
  - normalizes a single input audio segment ( for stability )
  - the normalized input is duplicated into 2 copies
- (2) Mixup block
  - creates two outputs that are mixes of a) the normalized inputs & b) randomly chosen past normalized inputs
  - designed to create contrast for learning foreground acoustic event representations
- (3) Random Resize Crop (RRC) block
  - resizes and crops the outputs randomly
  - approximates pitch shifting and time stretching in time-frequency features
- (4) Post-Normalization block
  - adjusts statistical drifts caused by the former augmentations
\(\rightarrow\) the module as a whole creates contrasts focusing on foreground acoustic events and all content details.
(1) Pre-Normalization
- input is normalized to \(\tilde{x}=\frac{x-\mu}{\sigma}\)
- stabilizes computations in the system in two ways
- (1) by mitigating augmentation parameter sensitivity
- which enables following blocks to assume that input range virtually follows \(N(0,1)\).
- (2) by normalizing statistical differences between training datasets.
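A minimal sketch of the instance-wise pre-normalization described by the formula above (my own code, not the authors'):

```python
import torch

def pre_normalize(lms: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Instance-wise normalization of one log-mel spectrogram:
    subtract its own mean and divide by its own standard deviation,
    so downstream blocks can assume the input roughly follows N(0, 1)."""
    mu, sigma = lms.mean(), lms.std()
    return (lms - mu) / (sigma + eps)
```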
(2) Mixup for foreground acoustic event
Input = normalized log-mel spectrogram audio
Mixup block
- mixes in a randomly selected past input audio in a small ratio
- the added audio becomes part of the background sound in the mixed audio
- similar to mix-back [11], which adds a random sample from a dataset as background sound
( but the purpose of mix-back is to create a set of positive samples sharing less information in the contrastive learning setting )
- Original mixup = applied to both inputs X & labels Y
- BYOL-A = applied only to X
- As the audio is log-scaled, we convert the input to a linear scale before the mixup calculation and convert it back to a log scale again
\(\rightarrow\) coined log-mixup-exp
\(\tilde{x}_i=\log \left((1-\lambda) \exp \left(x_i\right)+\lambda \exp \left(x_k\right)\right)\).
- \(x_k\) : mixing counterpart
- \(\lambda\) : mixing ratio sampled from uniform distribution \(U(0.0, \alpha)\)
( instead of from a beta distribution in the original mixup )
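A sketch of log-mixup-exp following the equation above; the class name, the FIFO memory of past normalized inputs, its size, and the default \(\alpha\) are my own illustrative assumptions:

```python
import random
import torch

class LogMixupExp:
    """Mix the current normalized log-mel spectrogram with a randomly chosen
    past input in a small ratio, computing the mix in the linear domain.
    Assumes all inputs share the same shape."""

    def __init__(self, alpha: float = 0.2, memory_size: int = 2048):
        self.alpha = alpha            # upper bound of mixing ratio lambda (illustrative value)
        self.memory = []              # FIFO buffer of past normalized inputs (illustrative size)
        self.memory_size = memory_size

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        if self.memory:
            x_k = random.choice(self.memory)        # mixing counterpart
            lam = random.uniform(0.0, self.alpha)   # lambda ~ U(0, alpha)
            # log-mixup-exp: mix in the linear domain, then return to log scale
            mixed = torch.log((1 - lam) * x.exp() + lam * x_k.exp())
        else:
            mixed = x
        # store the (unmixed) normalized input for future mixing
        self.memory.append(x.detach())
        self.memory = self.memory[-self.memory_size:]
        return mixed
```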
(3) RRC for all content details
approximation of pitch shifting and time stretching of input audio log-mel spectrograms
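A possible sketch of random resize cropping on a log-mel spectrogram (my own simplified version: the crop is kept inside the input, and the scale ranges below are illustrative):

```python
import random
import torch
import torch.nn.functional as F

def random_resize_crop(lms: torch.Tensor,
                       freq_scale=(0.6, 1.0),
                       time_scale=(0.6, 1.0)) -> torch.Tensor:
    """Crop a randomly sized/located patch of the spectrogram and resize it
    back to the original shape, approximating pitch shift (frequency axis)
    and time stretch (time axis). `lms` has shape (1, n_mels, n_frames)."""
    _, n_mels, n_frames = lms.shape
    # random crop size along frequency and time
    fh = max(1, int(n_mels * random.uniform(*freq_scale)))
    tw = max(1, int(n_frames * random.uniform(*time_scale)))
    # random crop position
    f0 = random.randint(0, n_mels - fh)
    t0 = random.randint(0, n_frames - tw)
    patch = lms[:, f0:f0 + fh, t0:t0 + tw]
    # resize the patch back to the original spectrogram size
    patch = F.interpolate(patch.unsqueeze(0), size=(n_mels, n_frames),
                          mode='bicubic', align_corners=True)
    return patch.squeeze(0)
```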
(4) Post-Normalization
The former augmentation blocks (mixup and RRC) can cause statistical drift in their outputs
\(\rightarrow\) Thus, normalize as \(\sim N(0,1)\).
Difference from pre-normalization?
- uses BATCH statistics ( not INSTANCE-WISE )
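A minimal sketch of post-normalization using the current batch's statistics (my own code; whether running statistics are maintained is not covered here):

```python
import torch

def post_normalize(batch: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize augmented views with batch statistics (one mean/std over the
    whole batch), compensating the statistical drift from mixup and RRC so the
    encoder sees inputs roughly distributed as N(0, 1)."""
    mu, sigma = batch.mean(), batch.std()
    return (batch - mu) / (sigma + eps)
```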