Self-Supervised Learning with Random-Projection Quantizer for Speech Recognition (ICML, 2022)

https://arxiv.org/pdf/2202.01855.pdf


Contents

  1. Abstract
  2. Introduction
  3. Related Work
  4. SSL with Random Projection Quantizer
    1. Random-projection Quantizer
    2. Pre-training
    3. Fine-tuning
    4. Understanding the effectiveness of random-projection quantizer
  5. Experiments
    1. Data: LibriSpeech
    2. Data: Multilingual Tasks
    3. Quantization Quality
    4. Effect of Pre-training Data Size


Abstract

SSL for speech recognition

Pretext task = predict the masked speech signals, in the form of discrete labels


Discrete labels : generated with a random-projection quantizer

  • Quantizer : projects speech inputs with a randomly initialized matrix

    ( & does a nearest-neighbor lookup in a randomly-initialized codebook )

  • Random-projection quantizer is not trained!!

    ( = No update for matrix & codebook )

    \(\rightarrow\) makes the approach flexible and compatible with universal speech recognition architectures.


Experiment on LibriSpeech

  • similar word-error-rates to previous work using SSL with non-streaming models
  • lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with streaming models


1. Introduction

Trend of SSL for speech recognition : BERT-inspired algorithms

Challenge in building BERT-style SSL for speech

= to bridge the gap between continuous speech signals and the discrete text tokens


Solution : learning speech representation or quantized representation

2 limitations of integrating representation learning & SSL

  • (1) Model architecture limitation
    • often requires the model to play the role of providing speech representations while still being effective for the downstream tasks.
    • An effective representation model, however, may not always be effective for the downstream tasks.
    • For example, a good representation learning model may require accessing the future context of the utterance, while downstream tasks may require a low latency model which prohibits the access of the future context.
  • (2) Increased complexity.
    • The objectives of representation learning and SSL are not always aligned
    • Complexity of designing both algorithms and finding their balance can impede the research development.
    • This complexity can also motivate the field toward designing more complicated algorithms instead of finding a simple and effective alternative.


BEST-RQ

BEST-RQ = BERT-based Speech pre-Training with Random-projection Quantizer

  • simple and effective SSL algorithm for speech recognition
  • Masked Modeling
    • (1) mask speech signals
    • (2) feed them to the encoder part of the speech recognition model
    • (3) predict the masked region based on unmasked parts
  • learning targets = labels provided by random-projection quantizer
    • random-projection quantizer projects speech signals with a randomly initialized matrix, and finds the nearest vector in a randomly initialized codebook
    • The index of that vector is the target label


2. Related Work

Previous work on SSL for speech recognition = focuses on learning speech representations


wav2vec (Schneider et al., 2019)

  • applies contrastive learning (CL) to learn representations of future frames based on the past context

vq-wav2vec (Baevski et al., 2020a)

  • uses wav2vec to learn the representations & quantizes them to discrete tokens
  • performs BERT-style pre-training to further improve the representation learning.

DiscreteBERT (Baevski et al., 2019)

  • extends vq-wav2vec by finetuning the BERT-pre-trained model on the downstream tasks.

wav2vec 2.0 (Baevski et al., 2020b)

  • uses CL with both past and future context to predict the representation of the masked parts.

HuBERT (Hsu et al., 2021)

  • uses k-means to learn the initial quantizer that maps speech signals to discrete labels

  • performs BERT-style pre-training

    ( = inputs are masked speech signals & targets are discrete labels )

  • further uses the pretrained model as the new quantizer to train a new iteration of the model

w2v-BERT (Chung et al., 2021)

  • uses a sub-network of the model to perform CL to learn speech representations
  • uses the rest of the network to perform BERT-style pre-training


BEST-RQ

distinguishes itself from these works by

  • avoiding the requirement of representation learning
  • separating the quantizer from the speech recognition model


Quantizer

  • projects input signals with a random matrix

    ( = which is similar to performing dimension reduction for the input signals )

  • results = prediction target for SSL


Similar approach: BEiT (Bao et al., 2021)

  • trains a VQ-VAE (van den Oord et al., 2018) as the quantizer
  • uses the VQ-VAE to perform BERT-style SSL
  • (difference) BEST-RQ does not require training the quantizer


3. SSL with Random Projection Quantizer

BEST-RQ

  • applies a random-projection quantizer to map speech signals to discrete labels

  • quantizer randomly initializes a (1) matrix and a (2) codebook

    • uses the matrix to project the input speech signals
    • uses the codebook to find the nearest vector

    ( both are fixed during pre-training )

  • input data : normalized to \(N(0,1)\)

    \(\rightarrow\) critical for preventing the random projection from collapsing to a small subset of codes.

  • masked parts : replaced with noise sampled from \(N(0,0.1^2)\)

    ( see the masking sketch below )
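
A minimal masking sketch in NumPy, assuming (T, d) speech features that are already normalized; the function name, `mask_prob`, and `span_len` defaults are illustrative placeholders rather than the paper's exact masking parameters.

```python
# Minimal masking sketch (NumPy). Assumes feats are speech features already
# normalized to zero mean / unit variance; mask_prob and span_len are
# illustrative placeholders, not the paper's exact masking parameters.
import numpy as np

def mask_features(feats, mask_prob=0.01, span_len=8, rng=None):
    """feats: (T, d) normalized features -> (masked_feats, frame_mask)."""
    rng = rng or np.random.default_rng()
    T, d = feats.shape
    frame_mask = np.zeros(T, dtype=bool)
    # sample span starts; each start masks span_len consecutive frames
    for start in np.nonzero(rng.random(T) < mask_prob)[0]:
        frame_mask[start:start + span_len] = True
    masked = feats.copy()
    # masked frames are replaced with noise drawn from N(0, 0.1^2)
    masked[frame_mask] = rng.normal(0.0, 0.1, size=(int(frame_mask.sum()), d))
    return masked, frame_mask
```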


(figure)


(1) Random-projection Quantizer

Input vector \(x\) (\(d\)-dimensional vector computed from speech signals)

Discrete labels \(y\)

  • \(y=\underset{i}{\operatorname{argmin}}\left\|\operatorname{norm}_{l 2}\left(c_i\right)-\operatorname{norm}_{l 2}(A x)\right\|\)

    • \(A\) : a randomly initialized \(h \times d\) matrix
    • \(C=\left\{c_1, \ldots, c_n\right\}\) : a set of \(n\) randomly initialized \(h\)-dimensional vectors (the codebook)
  • Projection matrix \(A\) uses Xavier initialization

  • Codebook \(C\) is initialized from a standard normal distribution

    ( a minimal sketch of the quantizer is below )
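
A minimal NumPy sketch of the random-projection quantizer under the definitions above; the function names are assumptions, and the Xavier-uniform bound is one common variant of Xavier initialization.

```python
# Minimal sketch of the random-projection quantizer (NumPy). A and the codebook C
# are randomly initialized once and never updated; the Xavier-uniform formula is
# one common variant and is an assumption here.
import numpy as np

def init_quantizer(d, h, n, seed=0):
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (d + h))                 # Xavier/Glorot-style bound
    A = rng.uniform(-limit, limit, size=(h, d))    # random projection matrix, frozen
    C = rng.standard_normal(size=(n, h))           # random codebook, frozen
    return A, C

def quantize(x, A, C):
    """x: (T, d) speech features -> (T,) discrete labels (codebook indices)."""
    proj = x @ A.T                                             # (T, h)
    proj = proj / np.linalg.norm(proj, axis=-1, keepdims=True) # norm_l2(A x)
    codes = C / np.linalg.norm(C, axis=-1, keepdims=True)      # norm_l2(c_i)
    dists = np.linalg.norm(proj[:, None, :] - codes[None, :, :], axis=-1)  # (T, n)
    return dists.argmin(axis=-1)                               # y = argmin_i ||.||
```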


(2) Pre-training

A softmax layer is added on top of the ASR encoder to learn to predict the quantized speech labels ( sketched below )

Random-projection quantizer = independent of the ASR encoder

\(\rightarrow\) pre-training is flexible & can work with different architectures of the ASR encoder
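
A minimal PyTorch sketch of the pre-training objective, assuming an encoder that maps (B, T, d) features to (B, T, enc_dim) outputs; the class and argument names are made up for illustration.

```python
# Minimal PyTorch sketch of the pre-training objective: a softmax layer on top of
# the ASR encoder predicts the quantizer's labels, with the loss computed only on
# masked frames. Class/argument names and the encoder interface are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BestRQPretrainHead(nn.Module):
    def __init__(self, encoder, enc_dim, codebook_size):
        super().__init__()
        self.encoder = encoder                             # e.g. a Conformer stack
        self.to_codes = nn.Linear(enc_dim, codebook_size)  # softmax layer over codes

    def forward(self, masked_feats, labels, frame_mask):
        """masked_feats: (B, T, d); labels: (B, T) long; frame_mask: (B, T) bool."""
        hidden = self.encoder(masked_feats)                # (B, T, enc_dim)
        logits = self.to_codes(hidden)                     # (B, T, codebook_size)
        # cross-entropy only on the masked positions
        return F.cross_entropy(logits[frame_mask], labels[frame_mask])
```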


Study the effectiveness of the algorithm on both non-streaming and streaming models

  • use Conformer (Gulati et al., 2020) as the building block.


a) Non-streaming models

BERT-style pre-training is designed for non-streaming models

  • uses both past and future context to learn to predict the quantized labels of the masked speech signals.


b) Streaming models

  • Streaming architectures, however, are less well-studied in previous SSL work than non-streaming architectures
  • Propose two pre-training algorithms that are compatible with the streaming architecture:
    • (a) Streaming pre-train
    • (b) Non-Streaming pre-train


(a) Streaming pre-train

  • BEST-RQ does not require learning quantization & focuses only on training the ASR encoder

    \(\rightarrow\) largely benefits the streaming models.

  • Pre-training for streaming models follows the same setup as non-streaming models,

    but the ASR encoder now learns to predict the quantized labels of the masked part based only on the past context


(b) Non-Streaming pre-train

  • neural network architectures like Transformer/Conformer allow switching from non-streaming to streaming behavior by adding a mask for the future context within the same model

    \(\rightarrow\) one can therefore perform pre-training with a non-streaming setup for streaming models ( see the attention-mask sketch below )
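
A minimal PyTorch sketch of the future-context mask that runs the same model in streaming mode; the helper name is an assumption, and a strictly causal mask is shown only for simplicity.

```python
# Minimal PyTorch sketch of a future-context mask for streaming behavior:
# True entries block attention to future frames. The helper name is an assumption;
# a strictly causal mask is shown for simplicity.
import torch

def future_context_mask(T, device=None):
    """(T, T) boolean mask; True = position must NOT be attended to."""
    return torch.triu(torch.ones(T, T, dtype=torch.bool, device=device), diagonal=1)

# usage with a standard attention layer:
# attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
# out, _ = attn(x, x, x, attn_mask=future_context_mask(x.size(1)))
```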


(3) Fine-tuning

Focus on end-to-end models with RNN transducers (Graves, 2012)

  • decoder uses LSTMs for the prediction network.
  • an additional projection layer is added on top of the pre-trained encoder to help it adapt to the downstream ASR task.

The encoder is also updated during supervised fine-tuning ( a sketch of the fine-tuning setup is below ).
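
A minimal PyTorch sketch of the fine-tuning wiring described above, with hypothetical names; the RNN-T joint network and transducer loss are omitted.

```python
# Minimal PyTorch sketch of fine-tuning: an extra projection layer on top of the
# pre-trained encoder, which itself stays trainable; the RNN-T prediction network
# (LSTM) and joint network are only indicated. All names are assumptions.
import torch.nn as nn

class FinetunedEncoder(nn.Module):
    def __init__(self, pretrained_encoder, enc_dim, out_dim):
        super().__init__()
        self.encoder = pretrained_encoder        # initialized from pre-training, not frozen
        self.proj = nn.Linear(enc_dim, out_dim)  # added layer to adapt to the ASR task

    def forward(self, feats):
        return self.proj(self.encoder(feats))    # fed into the RNN-T joint network

# RNN-T prediction network over previously emitted tokens (decoder side):
# predictor = nn.LSTM(input_size=emb_dim, hidden_size=pred_dim, batch_first=True)
```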



(4) Understanding the Effectiveness of the Random-projection Quantizer

random projection = dimension reduction for the speech signals

random codebook = approximate discrete representation of the speech data distribution.


2 Questions

  • Q1) how good is the resulting quantization quality with this quantizer?
  • Q2) how much does the quantization quality affect the effectiveness of the self-supervised learning?


Answer by comparing our quantizer with VQ-VAEs

  • VQ-VAEs also provide a discrete representation for the speech signals
    • but do so by learning a representation in the latent space that best preserves the speech data.
  • Thus, comparing with VQ-VAE checks:
    • (1) quantization quality of our quantizer
    • (2) effect of representation learning for SSL


4. Experiments

(1) Data: LibriSpeech

a) vs. Non-streaming models

(figure)

Result: WERs similar to previous SSL work with non-streaming models


b) vs. Streaming models

(figure)

Result: lower WERs and lower latency than wav2vec 2.0 and w2v-BERT


(2) Data: Multilingual Tasks

(figure)


(3) Quantization Quality

(figure)

Result: all quantizers lead to similar WERs

\(\rightarrow\) indicates that better quantization quality does not necessarily translate to better SSL quality


(4) Effect of Pre-training Data Size

(figure)
