SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition (Interspeech, 2020)

https://www.isca-speech.org/archive/pdfs/interspeech_2020/song20b_interspeech.pdf


Contents

  1. Abstract
  2. SpecSwap


Abstract

SpecSwap

  • a simple data augmentation (DA) method for automatic speech recognition (ASR)
  • acts directly on the spectrogram of input utterances
  • swaps blocks of frequency channels and blocks of time steps

Applied to: Transformer-based networks


SpecSwap

( Inspired by SpecAugment )

  • also deforms data at spectrogram level
  • consists of two kinds of deformations of the log-mel spectrogram
    • (1) time swapping
    • (2) frequency swapping
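
The two deformations can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`swap_blocks`, `spec_swap`), the `(time, frequency)` array layout, the default maximum widths, and the way block positions are sampled are all assumptions made for this example.

```python
import numpy as np

def swap_blocks(spec, axis, max_width, rng):
    """Exchange two non-overlapping, equal-width blocks along `axis`.

    The sampling scheme below is an assumption for illustration, not the
    policy from the paper.
    """
    out = np.swapaxes(spec, 0, axis).copy()  # bring the target axis to the front
    n = out.shape[0]
    # The width is capped at n // 2 so that two disjoint blocks always fit.
    width = int(rng.integers(0, min(max_width, n // 2) + 1))
    if width > 0:
        start_a = int(rng.integers(0, n - 2 * width + 1))
        start_b = int(rng.integers(start_a + width, n - width + 1))
        block_a = out[start_a:start_a + width].copy()
        out[start_a:start_a + width] = out[start_b:start_b + width]
        out[start_b:start_b + width] = block_a
    return np.swapaxes(out, 0, axis)

def spec_swap(spec, max_time_width=40, max_freq_width=7, rng=None):
    """Apply time swapping, then frequency swapping, to a (time, freq) array."""
    rng = np.random.default_rng() if rng is None else rng
    out = swap_blocks(spec, axis=0, max_width=max_time_width, rng=rng)  # (1) time
    out = swap_blocks(out, axis=1, max_width=max_freq_width, rng=rng)   # (2) frequency
    return out

# Toy log-mel spectrogram: 100 frames x 80 mel channels.
rng = np.random.default_rng(0)
spec = np.arange(100 * 80, dtype=np.float32).reshape(100, 80)
augmented = spec_swap(spec, rng=rng)
```

The output has the same shape and the same set of values as the input; only their positions along the time and frequency axes change.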

The idea of swapping blocks comes from the authors' previous work

  • [15] : permutation strategy
    • reconstructs frames from a permuted speech feature sequence
    • Limitations
      • (a) Requires modifying the attention structure and constructing a special attention mask
      • (b) Requires fine-tuning after pre-training


Question) Can we decouple the permutation strategy from the model structure?

Solution) Apply the permutation directly to the spectrogram

( = swapping blocks of features either in time-domain or frequency-domain )
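
To make the "block swap = permutation" view concrete, here is a small sketch. The helper `block_swap_permutation` and the toy array are hypothetical and only meant to illustrate the idea: a block swap is an index permutation applied to the input, so the model and its attention masks stay untouched.

```python
import numpy as np

def block_swap_permutation(n, start_a, start_b, width):
    """Index permutation exchanging [start_a, start_a+width) and [start_b, start_b+width).

    Hypothetical helper for illustration; block A must come before block B.
    """
    if not (0 <= start_a and start_a + width <= start_b and start_b + width <= n):
        raise ValueError("blocks must be in range, ordered, and non-overlapping")
    perm = np.arange(n)
    perm[start_a:start_a + width] = np.arange(start_b, start_b + width)
    perm[start_b:start_b + width] = np.arange(start_a, start_a + width)
    return perm

# Toy "spectrogram": 6 frames x 4 frequency channels.
spec = np.arange(6 * 4).reshape(6, 4)

# Time swapping: permute rows (frames 0-1 trade places with frames 3-4).
time_perm = block_swap_permutation(n=6, start_a=0, start_b=3, width=2)
time_swapped = spec[time_perm]

# Frequency swapping: permute columns (channel 0 trades places with channel 2).
freq_perm = block_swap_permutation(n=4, start_a=0, start_b=2, width=1)
freq_swapped = spec[:, freq_perm]

print(time_perm)  # [3 4 2 0 1 5]
```

Because the permutation only reorders the input features, applying it a second time restores the original spectrogram, and nothing downstream needs to know the swap happened.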


figure2
