SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (arxiv, 2019)

https://arxiv.org/pdf/2103.06695.pdf


Contents

  • Abstract
  • 1. Introduction
  • 2. Augmentation Policy


Abstract

SpecAugment

  • a simple data augmentation (DA) method for speech recognition
  • applied directly to the feature inputs of a neural network (i.e., filter bank coefficients)
  • consists of:
    • warping the features
    • masking blocks of frequency channels
    • masking blocks of time steps


1. Introduction

Data augmentation for ASR

( ASR = Automatic Speech Recognition )

  • Vocal Tract Length Normalization [11]
  • Synthesize noisy audio [12]
  • Speed perturbation for LVCSR (large vocabulary continuous speech recognition) tasks [13]
  • Use of an acoustic room simulator [14]
  • Data augmentation for keyword spotting [15, 16]
  • Feature drop-outs for training multi-stream ASR systems [17]


SpecAugment

  • operates on the log mel spectrogram of the input audio

    ( rather than the raw audio itself )

  • simple & computationally cheap

  • consists of three kinds of deformations of the log mel spectrogram

    • (1) Time warping
      • a deformation of the time series (TS) along the time axis
    • (2) Time masking
    • (3) Frequency masking
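
The three deformations can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation. The function names, the parameter names `W`, `F`, `T`, and the `(mel channels, time steps)` array layout are assumptions made for this example. In particular, `time_warp` is a simplified stand-in that resamples the time axis piecewise-linearly around one randomly shifted anchor point; it is not an image-warping routine.

```python
import numpy as np


def time_warp(spec, W, rng):
    """Simplified time warp (assumption: piecewise-linear resampling in time).

    spec has shape (num_mel_channels, num_time_steps).
    """
    tau = spec.shape[1]
    if W <= 0 or tau <= 2 * W + 2:
        return spec.copy()  # too short to warp safely
    # Pick an anchor time step and move it left or right by at most W steps.
    src = int(rng.integers(W + 1, tau - W - 1))
    dst = src + int(rng.integers(-W, W + 1))
    out_t = np.arange(tau)
    # For each output time step, find the (fractional) input time step it reads from.
    in_t = np.interp(out_t, [0, dst, tau - 1], [0, src, tau - 1])
    # Linearly interpolate every mel channel at those input positions.
    return np.stack([np.interp(in_t, out_t, row) for row in spec])


def freq_mask(spec, F, rng):
    """Zero out a block of f consecutive mel channels, with f drawn from [0, F]."""
    nu = spec.shape[0]
    f = int(rng.integers(0, min(F, nu) + 1))
    f0 = int(rng.integers(0, nu - f + 1))
    out = spec.copy()
    out[f0:f0 + f, :] = 0.0
    return out


def time_mask(spec, T, rng):
    """Zero out a block of t consecutive time steps, with t drawn from [0, T]."""
    tau = spec.shape[1]
    t = int(rng.integers(0, min(T, tau) + 1))
    t0 = int(rng.integers(0, tau - t + 1))
    out = spec.copy()
    out[:, t0:t0 + t] = 0.0
    return out


rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 200))  # synthetic stand-in for a log mel spectrogram
augmented = time_mask(freq_mask(time_warp(spec, 40, rng), 27, rng), 100, rng)
```

All three functions return a new array of the same shape as the input, so they can be chained in any order.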


2. Augmentation Policy

figure2

  • individual augmentations applied to a single input
  • log mel spectrograms are normalized to have zero mean
    • \(\therefore\) setting masked values to zero = setting them to the mean value
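
A small numerical check of this point, as a sketch: the synthetic data, the mask positions, and the choice to normalize with a single mean over the whole array are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a log mel spectrogram, shape (mel channels, time steps).
log_mel = rng.normal(loc=-4.0, scale=2.0, size=(80, 300))

# Normalize to zero mean (assumption: one global mean over the whole array).
mean = log_mel.mean()
normalized = log_mel - mean

# Mask by writing zeros into the normalized features.
masked = normalized.copy()
masked[10:20, :] = 0.0   # a frequency mask over channels 10..19
masked[:, 50:90] = 0.0   # a time mask over time steps 50..89

# Undo the normalization: every masked cell now holds the mean value.
restored = masked + mean
```

Because the features are centered, the value 0 written by a mask corresponds to the mean in the original scale, so no extra fill value needs to be computed.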


figure2

  • can apply multiple masks
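
Applying several masks to one input could look like the sketch below. The function name `apply_masks` and the parameter names (`F`, `T` for the maximum mask widths, `m_F`, `m_T` for the number of frequency and time masks) are assumptions for this illustration.

```python
import numpy as np


def apply_masks(spec, F, m_F, T, m_T, rng):
    """Apply m_F frequency masks and m_T time masks to a zero-mean spectrogram.

    spec has shape (num_mel_channels, num_time_steps). Masks may overlap.
    """
    nu, tau = spec.shape
    out = spec.copy()
    for _ in range(m_F):
        f = int(rng.integers(0, min(F, nu) + 1))    # mask width in channels
        f0 = int(rng.integers(0, nu - f + 1))       # first masked channel
        out[f0:f0 + f, :] = 0.0
    for _ in range(m_T):
        t = int(rng.integers(0, min(T, tau) + 1))   # mask width in time steps
        t0 = int(rng.integers(0, tau - t + 1))      # first masked time step
        out[:, t0:t0 + t] = 0.0
    return out


rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 400))  # synthetic stand-in for a log mel spectrogram
augmented = apply_masks(spec, F=15, m_F=2, T=50, m_T=2, rng=rng)
```

Each mask is sampled independently, so two masks can overlap or have width zero; the total masked area therefore varies from call to call.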
