SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (arxiv, 2019)
https://arxiv.org/pdf/2103.06695.pdf
Contents
- Abstract
- Introduction
- Augmentation Policy
Abstract
SpecAugment,
- simple DA for speech recognition
- applied directly to the feature inputs of a NN (i.e., filter bank coefficients)
- consists of ..
- warping the features
- masking blocks of frequency channels
- masking blocks of time steps
1. Introduction
Data augmentation for ASR
( ASR = Automatic Speech Recognition )
- Vocal Tract Length Normalization [11]
- Synthesize noisy audio [12]
- Speed perturbation for LVSCR tasks in [13]
- Use of an acoustic room simulator [14]
- Data augmentation for keyword spotting in [15, 16]
- Feature drop-outs for training multi-stream ASR systems [17]
SpecAugment
-
operates on the log mel spectrogram of the input audio
( rather than the raw audio itself )
-
simple & computationally cheap
-
consists of three kinds of deformations of the log mel spectrogram
- (1) Time warping
- a deformation of the TS in the time direction
- (2) Time masking
- (3) Frequency masking
- (1) Time warping
2. Augmentation Policy
- individual augmentations applied to a single input
- log mel spectrograms are normalized ( = zero mean )
- \(\therefore\) imputing masked values to zer o = setting it to mean value
- can apply multiple masks