MAE-AST: Masked Autoencoding Audio Spectrogram Transformer (Interspeech, 2022)
https://www.isca-speech.org/archive/pdfs/interspeech_2022/baade22_interspeech.pdf
Contents
- Abstract
- Introduction and Related Work
- MAE-AST
- Model Architecture
- Mask Sampling
- Joint Discriminative and Generative Pretraining
- Experiments
- SSAST vs. MAE-AST
- Masking Strategies
- Pretext Task Ablation
Abstract
MAE-AST
- improvement over the recent SSAST
- leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining
- integrate MAE into the SSAST
- DEEP encoder operates on only unmasked input
- SHALLOW decoder operates on encoder outputs and mask tokens
- provide a 3× speedup and 2× memory usage reduction over the vanilla SSAST
1. Introduction and Related Work
Self-Supervised Audio Spectrogram Transformer (SSAST)
- first patch-based and fully self-attention-based pretraining strategy for audio
- achieved remarkable results for both
- (1) audio event classification
- (2) speech classification
- limitation : massive computational overhead … \(O(N^2)\) self-attention over all (masked and unmasked) tokens
\(\rightarrow\) MAE : completely discards masked input tokens during the encoding
Propose MAE-AST = (1) MAE + (2) SSAST
Findings :
- (1) MAE-AST pretrains using significantly less time (3×) and memory (2×) than the SSAST
- (2) MAE-AST outperforms the SSAST under a shared encoder depth on several downstream tasks,
- (3) MAE-AST performs well with only a generative objective, while the SSAST sees a significant performance drop
2. MAE-AST (Masked Autoencoding Audio Spectrogram Transformer)
(1) Model Architecture
a) Input Processing
( Identical to the SSAST )
- input : 16 kHz audio waveform
- convert : into 128-dimensional log Mel filterbank features
- with a frame length of 25 ms
- with a frame shift of 10 ms
- normalize : to \(N(0, 0.5^2)\) …. same as AST and SSAST
- split spectrogram into patches ( see the sketch below )
- (patch-based) 16 filter × 16 frame tokens
- (frame-based) 128 filter × 2 frame tokens
- masking
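A minimal sketch of the patch-based variant of this input pipeline. It assumes torchaudio's Kaldi-compatible filterbank; the helper name `spectrogram_patches` and the per-utterance normalization are illustrative (AST/SSAST normalize with fixed dataset-level statistics instead):

```python
import torch
import torchaudio

def spectrogram_patches(waveform: torch.Tensor, patch_frames: int = 16, patch_mels: int = 16):
    """waveform: (1, num_samples) mono audio at 16 kHz -> (num_tokens, 256) patch tokens."""
    # 128-dim log Mel filterbank, 25 ms frame length, 10 ms frame shift
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, sample_frequency=16000.0, num_mel_bins=128,
        frame_length=25.0, frame_shift=10.0)                          # (T, 128)
    # roughly N(0, 0.5^2); the papers use fixed dataset mean/std rather than per-utterance stats
    fbank = (fbank - fbank.mean()) / (2 * fbank.std())
    # patch-based tokenization: 16 frames x 16 mel bins per token
    T = (fbank.shape[0] // patch_frames) * patch_frames
    fbank = fbank[:T].reshape(T // patch_frames, patch_frames, 128 // patch_mels, patch_mels)
    patches = fbank.permute(0, 2, 1, 3).reshape(-1, patch_frames * patch_mels)
    return patches
```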
b) Positional Embeddings
use fixed sinusoidal positional embeddings for both patch-based and frame-based tokenization
input spectrogram data = only variable-length in time
\(\rightarrow\) use 1D positional embeddings for both tokenizations
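Since these embeddings are fixed rather than learned, they can be written out directly; a sketch of the standard 1D sinusoidal formulation assumed here:

```python
import torch

def sinusoidal_positional_embeddings(num_positions: int, dim: int = 768) -> torch.Tensor:
    """Fixed 1D sinusoidal embeddings; defined for any sequence length, so variable-length
    fine-tuning inputs need no interpolation or truncation."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (N, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-torch.log(torch.tensor(10000.0)) / dim))          # (dim/2,)
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
```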
c) Encoder
( standard ViT architecture )
- flatten the unmasked input patches or frames via a linear projection ( into 768-dim embeddings )
- add sinusoidal positional embeddings to these tokens before the encoder
( positional embeddings are also added to all tokens, including mask tokens, before the decoder )
d) Decoder
( only used during pretraining & composed of standard transformer encoder blocks )
- input = encoder output & mask tokens
- mask tokens : represented by a shared, learned embedding
e) Settings
- encoder : 6 layers
- decoder : 2 layers
- both the encoder and decoder use : 12 heads and a width of 768
- no CLS token
( mean pool the encoder output last hidden states for classification tasks )
( for a fair comparison between MAE-like and traditional BERT-like architectures )
( see the architecture sketch below )
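A minimal sketch of the pretraining forward pass with these settings. The module is hypothetical; `nn.TransformerEncoderLayer` stands in for the ViT-style blocks, and details such as norm placement and initialization are not taken from the paper:

```python
import torch
import torch.nn as nn

class MAEASTSketch(nn.Module):
    def __init__(self, token_dim=256, dim=768, heads=12, enc_layers=6, dec_layers=2):
        super().__init__()
        self.embed = nn.Linear(token_dim, dim)                           # flatten + linear projection
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), enc_layers)   # deep: unmasked tokens only
        self.decoder = nn.TransformerEncoder(make_layer(), dec_layers)   # shallow: pretraining only
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))           # shared, learned mask embedding

    def forward(self, tokens, pos_emb, keep_idx, mask_idx):
        # tokens: (B, N, token_dim); pos_emb: (1, N, dim); keep_idx / mask_idx: unmasked / masked positions
        x = self.embed(tokens) + pos_emb
        enc = self.encoder(x[:, keep_idx])                               # encoder never sees mask tokens
        mask = self.mask_token.expand(tokens.shape[0], len(mask_idx), -1) + pos_emb[:, mask_idx]
        dec = self.decoder(torch.cat([enc, mask], dim=1))                # encoder outputs + mask tokens
        # mean-pooled encoder states for classification (no CLS token); decoded mask tokens for the pretext heads
        return enc.mean(dim=1), dec[:, -len(mask_idx):]
```

Because the deep encoder runs only on the ~25% of tokens that are kept, while the mask tokens pass only through the 2-layer decoder, most of the quadratic self-attention cost disappears, which is where the reported time and memory savings come from.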
Adopt the input pipelining and loss from the SSAST
But with 2 minor changes :
- (1) overlapping
- AST paper = overlap (O)
- SSAST = overlap (X) during pretraining … no cheating
- however, the SSAST adds overlap back during fine-tuning
( to match the original AST paper, via interpolating learned positional embeddings )
- MAE-AST = overlap (X) for both pretraining & fine-tuning
- (2) positional embedding (PE)
- SSAST = interpolation or truncation of PE to support variable-length inputs during fine-tuning
- MAE-AST = no modifications to PE for fine-tuning ( see the sketch below )
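An illustrative contrast for point (2), assuming a learned PE tensor of shape (old_len, dim) on the SSAST side; the exact interpolation SSAST uses may differ:

```python
import torch
import torch.nn.functional as F

def adapt_learned_pe(pe: torch.Tensor, new_len: int) -> torch.Tensor:
    """SSAST-style: interpolate (or truncate) learned positional embeddings along time."""
    return F.interpolate(pe.t().unsqueeze(0), size=new_len,
                         mode="linear", align_corners=False).squeeze(0).t()

# MAE-AST: fixed sinusoidal embeddings need no adaptation; simply recompute them at the
# new length, e.g. sinusoidal_positional_embeddings(new_len) from the earlier sketch.
```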
(2) Mask Sampling
2 considerations
- (1) % of masking
- (2) amount of chunking between masked tokens
Too little of (1) & (2) \(\rightarrow\) the pretext task becomes too easy
a higher masking ratio \(\rightarrow\) provides significant speedups (for MAE-style encoding, since fewer tokens enter the encoder)
8 overall masking and input strategies via combinations of the following factors:
- (1) Patch-Based vs. Frame-Based inputs
- (2) Fully random vs. chunked masking
- (3) Different masking ratios.
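A sketch of factor (2), with a hypothetical helper covering both fully random (chunk_size = 1) and chunked (chunk_size > 1) masking; the paper's exact chunk sampling may differ:

```python
import torch

def sample_mask(num_tokens: int, mask_ratio: float = 0.75, chunk_size: int = 1) -> torch.Tensor:
    """Boolean mask with roughly mask_ratio of tokens masked.
    chunk_size > 1 masks contiguous runs, removing local context and hardening the pretext task."""
    target = int(num_tokens * mask_ratio)
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    while int(mask.sum()) < target:
        start = int(torch.randint(0, num_tokens, (1,)))
        mask[start:start + chunk_size] = True            # may slightly overshoot the target ratio
    return mask
```

Only the unmasked tokens are fed to the encoder, so a higher mask_ratio directly shrinks the encoder's sequence length, which is the source of the speedup noted above.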
(3) Joint Discriminative and Generative Pretraining
SSAST : combines both a discriminative (classification) and a generative (reconstruction) loss
MAE-AST : adopts this strategy, with small differences :
Prediction heads : 2-layer vs. 1-layer NN
- SSAST : the encoder output is fed into 2 separate 2-layer NNs
- 1 for reconstruction & 1 for classification
- MAE-AST : the decoder output is fed into a single linear layer for each
Loss functions ( identical to SSAST )
- (1) Reconstruction loss = MSE between the unnormalized output and the normalized input
- (2) Classification loss = InfoNCE
- obtain negative samples from other masked inputs within the same audio segment
- Final Loss = two losses are summed and balanced by a factor \(\lambda\)
- same \(\lambda=10\) as SSAST.
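A minimal sketch of the combined objective for one audio segment, under the single-linear-head setup described above (head shapes and the InfoNCE details are assumptions in the spirit of SSAST, not the authors' code):

```python
import torch
import torch.nn.functional as F

def joint_loss(dec_out, targets, recon_head, class_head, lam=10.0):
    """dec_out: (M, dim) decoder outputs at the masked positions of one segment.
    targets: (M, token_dim) normalized input patches at those same positions."""
    # generative objective: reconstruct the normalized input patches
    loss_r = F.mse_loss(recon_head(dec_out), targets)
    # discriminative objective: InfoNCE, negatives are the other masked patches of the same segment
    c = class_head(dec_out)                              # (M, token_dim)
    logits = c @ targets.t()                             # score every prediction against every patch
    labels = torch.arange(c.shape[0], device=c.device)   # the matching patch is the positive
    loss_c = F.cross_entropy(logits, labels)
    return loss_c + lam * loss_r                         # summed, balanced by lambda = 10
```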