CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification (arXiv 2022)
https://arxiv.org/pdf/2203.06760.pdf
Contents
- Abstract
- Introduction
- Cross-Model Knowledge Distillation (CMKD)
- CNNs
- ASTs
- Difference between CNN and AST
- Knowledge Distillation
- Experiment Settings
- Datasets
- Training Settings
- FSD50K experiments
- Audioset and ESC-50 experiments
Abstract
Audio classification
- CNNs : have been the de-facto standard building block for end-to-end audio classification models
- Self-attention mechanisms : have been shown to outperform CNNs
- ex) Audio Spectrogram Transformer (AST)
Cross-Model Knowledge Distillation (CMKD)
In this paper, we find an intriguing interaction between the two very different models
→ CNN and AST models are good teachers for each other … via knowledge distillation (KD)
Experiments with this CNN/Transformer Cross-Model Knowledge Distillation (CMKD)
- achieve new SOTA performance on FSD50K, AudioSet, and ESC-50.
1. Introduction
(1) Audio classification
History
- (1) hand-crafted features & hidden Markov models (HMMs)
- (2) CNNs : aim to learn a direct mapping from audio waveforms or spectrograms to corresponding labels
- (3) Self-attention : outperforms CNNs
CNN vs. Transformer
- CNN :
- built-in inductive biases
- ex) spatial locality and translation equivariance
- well suited to spectrogram based end-to-end audio classification.
- Transformer :
- do not have such built-in inductive biases
- learn in a more data-driven manner, making them more flexible.
- perform better, but are less computationally efficient than CNN models on long audio input due to their O(n²) complexity.
Intriguing interaction between the two very different models
→ CNN and AST models are good teachers for each other.
- via knowledge distillation (KD)
- the student model's performance improves & becomes better than the teacher's
Cross-Model Knowledge Distillation (CMKD)
- knowledge distillation framework between a CNN and a Transformer model
- (1) Distillation works bi-directionally
- (a) CNN→Transformer & (b) Transformer→CNN
- ( in general ) in KD, the teacher needs to be stronger than the student
→ but for CMKD, a weak teacher can still improve a student’s performance
- (2) Student outperforms the teacher after knowledge distillation
- even when the teacher is originally stronger!
- (3) KD between two models of the same class leads to a much smaller or no performance improvement
- (4) Simple EfficientNet KD-CNN model with mean pooling outperforms the much larger AST model
- on the FSD50K and ESC-50 datasets
Contribution
- first to explore bidirectional knowledge distillation between CNN and Transformer models
- conduct extensive experiments on standard audio classification datasets & find the optimal knowledge distillation setting
- Small and efficient CNN models match or outperform previous SOTA
2. Cross-Model Knowledge Distillation
( Architecture of CNN and AST models )
2-1. CNNs
2-2. AST
2-3. Main difference between these two classes of models
2-4. Knowledge distillation setting and notation
(1) CNNs
CNN model without an attention module [20]
- best CNN model on the audio classification task
Procedures
- [ Input ] input audio waveform of t seconds is converted into a sequence of 128 dim log Mel filterbank (fbank) features computed with a 25 ms Hanning window every 10 ms.
- result : 128×100t spectrogram ( = input to CNN )
- [ Output ] output of the penultimate layer = size (⌈128/c⌉, ⌈100t/c⌉, d) in frequency, time, and embedding dimension
- where c is the feature downsampling factor of the CNN
- mainly use EfficientNet-B2 [21] as the backbone ( see the shape-flow sketch after this list )
- [ Mean pooling ] time and frequency mean pooling
- produce d dim spectrogram-level representation
- [ Final result ] via linear layer
- sigmoid (for multi-label classification)
- softmax (for single-label classification)
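As a rough illustration of the shape flow above, here is a minimal sketch assuming a torchaudio fbank front end and a generic ImageNet-style backbone (e.g. EfficientNet-B2) used as a feature extractor; `wav_to_fbank` and `AudioCNNClassifier` are illustrative names, not the paper's code.

```python
import torch
import torch.nn as nn
import torchaudio

# Front end: waveform -> 128-dim log Mel filterbank (fbank) features,
# 25 ms Hanning window with a 10 ms shift, i.e. ~100 frames per second.
def wav_to_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        window_type="hanning",
        num_mel_bins=128,
        frame_length=25.0,
        frame_shift=10.0,
    )  # shape: (~100t, 128) for a t-second clip

# Illustrative wrapper: CNN backbone -> time/frequency mean pooling -> linear head
class AudioCNNClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, n_classes: int, multi_label: bool = True):
        super().__init__()
        self.backbone = backbone      # e.g. an EfficientNet-B2 feature extractor (penultimate layer)
        self.head = nn.Linear(embed_dim, n_classes)
        self.multi_label = multi_label

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, 128, 100t) spectrogram
        feat = self.backbone(spec)    # (batch, d, ceil(128/c), ceil(100t/c)), c = downsampling factor
        feat = feat.mean(dim=[2, 3])  # frequency/time mean pooling -> (batch, d) clip-level representation
        logits = self.head(feat)      # (batch, n_classes)
        # sigmoid for multi-label, softmax for single-label classification
        return torch.sigmoid(logits) if self.multi_label else logits.softmax(dim=-1)
```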
(2) Audio Spectrogram Transformers
Original AST model proposed in [11]
- has the best performance on the audio classification task.
Procedures
- [ Input ] same as CNN
- converted to a 128×100t spectrogram in the same way as the CNN model
- [ Patching ] split the spectrogram ( see the patching sketch after this list )
- into a sequence of N patches of size 16×16
- with an overlap of 6 in both time and frequency dimension
- number of patches : N = 12⌈(100t−16)/10⌉
- [ Flatten ] flatten each 16×16 patch to a 1D patch embedding of size d
- via a linear projection layer ( = patch embedding layer )
- [ POS ] add a trainable positional embedding (also of size d) to each patch embedding
- [ CLS token ] append a [CLS] token at the beginning of the sequence
The resulting sequence is then input to a standard Transformer encoder
- [ Output of the [CLS] token ]
- serves as the audio spectrogram representation
- [ Final result ] via a linear layer
- sigmoid (for multi-label classification)
- softmax (for single-label classification)
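A minimal sketch of the patching step, emulating the overlapping 16×16 split with `nn.Unfold` plus a linear patch embedding; the actual AST code may implement this differently (e.g. as a strided convolution), so the point here is only the shapes.

```python
import math
import torch
import torch.nn as nn

t = 10                                      # clip length in seconds
spec = torch.randn(1, 1, 128, 100 * t)      # (batch, channel, freq, time) spectrogram

# 16x16 patches with an overlap of 6 in both dimensions -> stride 16 - 6 = 10
patches = nn.Unfold(kernel_size=16, stride=10)(spec)    # (batch, 16*16, N)
patches = patches.transpose(1, 2)                       # (batch, N, 256)

# Linear projection ( = patch embedding layer ): flattened patch -> d-dim embedding
d = 768
x = nn.Linear(16 * 16, d)(patches)                      # (batch, N, d)

# A trainable positional embedding is then added to each patch embedding, a [CLS]
# token is prepended, and the sequence goes through a standard Transformer encoder.
print(x.shape[1], 12 * math.ceil((100 * t - 16) / 10))  # both 1188 patches for t = 10
```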
Pretrained weights
For both CNN and AST models , use ImageNet pretraining
- from public model checkpoints
CNN weight: Channel difference
- image : 3-channel ↔ audio : 1-channel spectrogram
- solution : average the weights corresponding to each of the 3 input channels of the vision model checkpoints ( see the sketch below )
- ( = equivalent to expanding a 1-channel spectrogram to 3 channels with the same content, but is computationally more efficient )
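A minimal sketch of the channel-averaging trick, shown on torchvision's ImageNet-pretrained EfficientNet-B2; the layer path `model.features[0][0]` is specific to torchvision and may not match the checkpoint actually used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b2, EfficientNet_B2_Weights

# ImageNet-pretrained vision model: the first conv expects 3-channel RGB input
model = efficientnet_b2(weights=EfficientNet_B2_Weights.IMAGENET1K_V1)
old_conv = model.features[0][0]              # Conv2d(3, 32, kernel_size=3, stride=2, ...)

# Average the weights over the 3 input channels -> a 1-channel conv for spectrograms.
# Equivalent to replicating the spectrogram into 3 identical channels, but cheaper.
new_conv = nn.Conv2d(
    1, old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
    bias=old_conv.bias is not None,
)
with torch.no_grad():
    new_conv.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))
model.features[0][0] = new_conv              # the model now accepts (batch, 1, 128, 100t) input
```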
(3) Difference between CNN and AST
see Introduction
(4) Knowledge distillation
Original knowledge distillation setup [28] with consistent teaching [29]
- this simple setup works better than the more complex attention distillation strategy [17] for the audio classification task.
Procedure
- step 1) Train the teacher model
- step 2) During the student model training, feed the input audio spectrogram with the exact same augmentations to teacher and student models (consistent teaching)
Loss for the student model training:
- Loss = λ·Loss_g(ψ(Z_s), y) + (1−λ)·Loss_d(ψ(Z_s), ψ(Z_t/τ)) ( sketched in code below )
- Loss_g = ground truth loss, Loss_d = distillation loss
- ψ = activation (sigmoid or softmax), Z_s / Z_t = student / teacher logits, y = label, τ = temperature
Details
- teacher model is frozen during the student model training
- use the Kullback-Leibler divergence as Loss_d
- only apply τ on the teacher logits
- fix λ=0.5 and do not scale it with τ.
- Loss functions :
- CE loss & softmax : for single-label classification tasks such as ESC-50
- BCE loss & sigmoid : for multi-label classification tasks such as FSD50K and AudioSet.
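A minimal sketch of one student training step under this setup, written for the single-label (softmax + CE) case; `student` and `teacher` are placeholder modules that return logits, τ = 2.5 is only an illustrative temperature, and the `batchmean` reduction is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lam, tau = 0.5, 2.5   # lambda is fixed at 0.5; tau is the KD temperature (value illustrative)

def student_kd_loss(student: nn.Module, teacher: nn.Module,
                    spec: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Consistent teaching: teacher and student receive the *same* augmented spectrogram `spec`.
    z_s = student(spec)                 # student logits: (batch, n_classes)
    with torch.no_grad():               # the teacher is frozen during student training
        z_t = teacher(spec)             # teacher logits

    # Ground-truth loss: CE with softmax for single-label tasks such as ESC-50
    loss_g = F.cross_entropy(z_s, y)

    # Distillation loss: KL divergence; the temperature is applied to the teacher logits only
    loss_d = F.kl_div(
        F.log_softmax(z_s, dim=-1),         # psi(Z_s)
        F.softmax(z_t / tau, dim=-1),       # psi(Z_t / tau)
        reduction="batchmean",
    )
    return lam * loss_g + (1 - lam) * loss_d
```

For multi-label tasks such as FSD50K and AudioSet, the ground-truth term would instead be `F.binary_cross_entropy_with_logits(z_s, y)`, with sigmoid taking the place of softmax as ψ.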
3. Experiment Settings
(1) Datasets
3 widely-used audio classification datasets
- FSD50K [32], AudioSet [33], and ESC-50 [34]
FSD50K dataset [32]
- collection of sound event audio clips with 200 classes
- ( train, val, eval ) = ( 37,134, 4,170, 10,231 )
- variable length from 0.3 to 30 s with an average of 7.6s.
- sample audio at 16kHz and trim all clips to 10s.
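A minimal sketch of this preprocessing step (mono mixdown, 16 kHz resampling, and fixing clips to 10 s); zero-padding clips shorter than 10 s is an assumption on top of the trimming mentioned above.

```python
import torch
import torch.nn.functional as F
import torchaudio

def load_clip(path: str, target_sr: int = 16000, clip_seconds: int = 10) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                          # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                      # mix down to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    target_len = target_sr * clip_seconds
    if wav.shape[1] > target_len:                            # trim long clips to 10 s
        wav = wav[:, :target_len]
    else:                                                    # zero-pad short clips to 10 s (assumption)
        wav = F.pad(wav, (0, target_len - wav.shape[1]))
    return wav                                               # (1, 160000) at 16 kHz
```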
We use the FSD50K dataset for the majority of our experiments in this paper (Section 4) for three reasons.
- (1) allows more rigorous experiments, since it has an official training, validation, and evaluation split
- (2) publicly available dataset, thus easy to reproduce
- (3) moderate size (50K samples)
- compared with AudioSet (2M samples) and ESC-50 (2K samples), this size allows us to conduct extensive experiments with our computational resources
AudioSet [33]
- collection of over 2 million 10-second audio clips excised from YouTube videos
- labeled with the sounds that the clip contains from a set of 527 labels.
- ( balanced training, full training, and evaluation ) = ( 22k, 2M, 20k )
- use AudioSet to study the generalization ability of the proposed method on larger datasets.
ESC-50 [34]
- consists of 2,000 5-second environmental audio recordings organized into 50 classes.
- use ESC-50 to study the transferability of models trained with the proposed knowledge distillation method
(2) Training Settings
4. FSD50K experiments
(1) Which model is a good teacher?
- ( color coding in the referenced results: red = CNN, blue = AST )
Finding 1) CNNs and ASTs are good teachers for each other
- While KD improves the student model's performance in almost all settings, we find that models always prefer a teacher of the other model type
Finding 2) For both directions, the student model matches or outperforms its teacher
- ( in the referenced results, ∗ denotes that the student model outperforms the teacher model )
Finding 3) The strongest teacher is not the best teacher
- both CNN and AST perform better with a smaller (and weaker) teacher
Finding 4) Self-KD leads to smaller or no improvement
Finding 5) Iterative knowledge distillation does not further improve model performance
5. AudioSet and ESC-50 Experiments
(1) AudioSet Experiments
To study the impact of training data size…
→ conduct experiments on both the (1) balanced and (2) full training set
Goal : study the generalization of the proposed method
→ thus do not search the KD hyperparameters with AudioSet
( just re-use the optimal KD setting found with the FSD50K dataset )
Finding 1) CMKD works out-of-the-box
- KD works well on AudioSet for both CNN→AST and AST→CNN, demonstrating that the proposed cross-model knowledge distillation generalizes across audio classification tasks
- training KD models with more epochs can lead to further performance improvement
( consistent with our finding that KD models are less prone to overfitting )
Finding 2) KD leads to a larger improvement on smaller datasets
- KD is more effective when the model is trained with the smaller balanced training set
( ∵ AST and CNN models get closer with more training data, and cross-model knowledge distillation thus plays a smaller role )
Finding 3) The advantage of KD narrows after weight averaging and ensemble
Comparison with SOTA