SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition

Interspeech 2020


Seunghan Lee


SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition (Interspeech, 2020)

https://www.isca-speech.org/archive/pdfs/interspeech_2020/song20b_interspeech.pdf

Contents

Abstract
SpecSwap

Abstract

SpecSwap

a simple data augmentation (DA) method for automatic speech recognition (ASR)
acts directly on the spectrogram of input utterances
swaps blocks of frequency channels and blocks of time steps

Architecture: Transformer-based networks

SpecSwap

( Inspired by SpecAugment )

like SpecAugment, deforms the data at the spectrogram level
consists of two kinds of deformation of the log-mel spectrogram
- (1) time swapping
- (2) frequency swapping

The idea of swapping blocks comes from the authors' previous work [15].

[15] : permutation strategy
- works by reconstructing frames from a permuted speech feature sequence
- Limitations
  - (a) Requires modifying the attention structure and constructing a special attention mask
  - (b) The model must be fine-tuned after pre-training

Question) Can we decouple the permutation strategy from the model structure?

Solution) Yes, by applying the permutation directly to the spectrogram

( = swapping blocks of features either in time-domain or frequency-domain )
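
The block swapping described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `swap_blocks` and `spec_swap`, the `(n_mels, n_frames)` array layout, the uniform sampling of block width and block positions, and the width limits in the usage lines are all assumptions made for the example.

```python
import numpy as np


def swap_blocks(spec, axis, max_width, rng):
    """Swap two non-overlapping, equal-width blocks of `spec` along `axis`.

    Hypothetical helper: the block width and both start positions are drawn
    uniformly, which is an assumption of this sketch.
    """
    n = spec.shape[axis]
    # Two blocks of width w must both fit on the axis, so cap w at n // 2.
    w = int(rng.integers(0, min(max_width, n // 2) + 1))
    out = spec.copy()
    if w == 0:
        return out  # nothing to swap

    # First block starts early enough to leave room for the second one.
    start1 = int(rng.integers(0, n - 2 * w + 1))
    # Second block starts at or after the end of the first block.
    start2 = int(rng.integers(start1 + w, n - w + 1))

    idx1 = [slice(None)] * spec.ndim
    idx2 = [slice(None)] * spec.ndim
    idx1[axis] = slice(start1, start1 + w)
    idx2[axis] = slice(start2, start2 + w)

    # Read from the untouched input, write into the copy.
    out[tuple(idx1)] = spec[tuple(idx2)]
    out[tuple(idx2)] = spec[tuple(idx1)]
    return out


def spec_swap(spec, max_time_width, max_freq_width, rng):
    """Apply frequency swapping, then time swapping, to a log-mel spectrogram.

    Assumes `spec` has shape (n_mels, n_frames).
    """
    spec = swap_blocks(spec, axis=0, max_width=max_freq_width, rng=rng)  # frequency
    spec = swap_blocks(spec, axis=1, max_width=max_time_width, rng=rng)  # time
    return spec


# Usage with a stand-in spectrogram (80 mel channels, 200 frames).
rng = np.random.default_rng(0)
log_mel = np.arange(80 * 200, dtype=np.float32).reshape(80, 200)
augmented = spec_swap(log_mel, max_time_width=40, max_freq_width=10, rng=rng)
```

Because the swap only touches the input array, a function like this can sit in the data pipeline in front of any model, which is the decoupling the question above asks for.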
