Mel-Frequency Cepstral Coefficients (MFCCs)
Reference: https://www.youtube.com/watch?v=fMqL5vckiU0&list=PL-wATfeyAMNrtbkCNsLcpoAyBBRJZVlnf
1. Introduction
Mel-Frequency Cepstral Coefficients
- Cepstral: Cepstrum \(\leftrightarrow\) Spectrum ( the first four letters, “spec”, reversed to “ceps” )
How to compute the cepstrum? ( \(F\) : Fourier transform, \(F^{-1}\) : its inverse )
\(C(x(t))=F^{-1}[\log (F[x(t)])]\).
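A minimal NumPy sketch of the formula above. It assumes the real cepstrum, i.e. the log is taken of the magnitude \(|F[x(t)]|\) rather than of the complex spectrum, and it adds a small epsilon to avoid \(\log 0\). The function name `real_cepstrum` and the test signal (a decaying sinusoid plus an echo) are made up for illustration.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum."""
    spectrum = np.fft.fft(x)
    # Assumption of this sketch: a tiny epsilon guards against log(0).
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)
    return np.fft.ifft(log_magnitude).real

# Hypothetical test signal: a decaying sinusoid plus a copy delayed by 100 samples.
n = np.arange(1024)
base = np.exp(-n / 50) * np.sin(2 * np.pi * 0.05 * n)
delay = 100
signal = base.copy()
signal[delay:] += 0.5 * base[:-delay]

cepstrum = real_cepstrum(signal)

# Search for the largest cepstral value away from the low-quefrency region.
peak = int(np.argmax(cepstrum[50:512])) + 50
print(peak)
```

The echo shows up as a peak at the quefrency index equal to the delay, which is the kind of periodic structure in the spectrum that the cepstrum exposes.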
2. Vocal tract
The vocal tract acts as a filter on speech
- vocal tract : the path the sound travels along on its way out
3. Decomposing Speech
\(\rightarrow\) Peaks of spectral envelope, or formants, carry the identity of sound!
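We can see “speech” as a “convolution of (1) with (2)” in the time domain, which becomes a product of their spectra in the frequency domain:
- (1) vocal tract frequency response, \(H(f)\)
- (2) glottal pulse, \(E(f)\)
\(X(f)=E(f) \cdot H(f)\), where \(X(f)\) is the spectrum of the speech signal.
Taking the log turns the product into a sum, so the two components can be separated:
\(\log (X(f))=\log (E(f) \cdot H(f))\).
\(\log (X(f))=\log (E(f))+\log (H(f))\).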
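The relations between \(X\), \(E\) and \(H\) above can be checked numerically. The sketch below is only an illustration: the two random arrays are stand-ins for a glottal excitation and a vocal tract impulse response, not real speech data, and the log is applied to magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(64)  # stand-in for the glottal excitation (hypothetical data)
h = rng.standard_normal(16)  # stand-in for the vocal tract impulse response (hypothetical data)

# Time domain: speech is the convolution of the two.
x = np.convolve(e, h)

# Frequency domain: zero-pad everything to the length of x before the FFT.
N = len(x)
X = np.fft.rfft(x, N)
E = np.fft.rfft(e, N)
H = np.fft.rfft(h, N)

# Convolution in time equals multiplication of spectra.
product_holds = np.allclose(X, E * H)

# The log of the product equals the sum of the logs (on magnitudes).
log_sum_holds = np.allclose(np.log(np.abs(X)), np.log(np.abs(E)) + np.log(np.abs(H)))

print(product_holds, log_sum_holds)
```

Because the log spectrum is a sum, the slowly varying part (vocal tract) and the rapidly varying part (glottal pulse) can be separated by a further transform.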
4. Liftering
Removing the high-quefrency values ( i.e. the “glottal pulse” ), which leaves the low-quefrency part ( the spectral envelope )!
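A minimal sketch of liftering, assuming a full-length (symmetric) log spectrum as input. The function name `lifter_low_quefrency`, the `cutoff` parameter and the synthetic log spectrum are all hypothetical: a slow cosine stands in for the spectral envelope and a fast ripple stands in for the glottal pulse harmonics.

```python
import numpy as np

def lifter_low_quefrency(log_spectrum, cutoff):
    """Keep only cepstral values below `cutoff` and return the smoothed log spectrum."""
    cepstrum = np.fft.ifft(log_spectrum).real
    keep = np.zeros(len(cepstrum))
    keep[:cutoff] = 1.0
    if cutoff > 1:
        # The cepstrum of a symmetric log spectrum is symmetric, so keep the mirrored side too.
        keep[-(cutoff - 1):] = 1.0
    return np.fft.fft(cepstrum * keep).real

N = 256
k = np.arange(N)
envelope = np.cos(2 * np.pi * 2 * k / N)       # slow variation (low quefrency)
ripple = 0.3 * np.cos(2 * np.pi * 40 * k / N)  # fast variation (high quefrency)

smoothed = lifter_low_quefrency(envelope + ripple, cutoff=10)
print(np.allclose(smoothed, envelope))
```

With the cutoff placed between the two components, the ripple is removed and only the slow envelope remains.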
5. Calculating MFCCs
Waveform \(\rightarrow\) DFT \(\rightarrow\) Log-amplitude Spectrum \(\rightarrow\) Mel-scaling \(\rightarrow\) Discrete cosine transform
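The pipeline can be sketched in plain NumPy as below. This is a simplified illustration, not a reference implementation: all function names and default parameters (`n_fft`, `hop`, `n_mels`, `n_mfcc`) are assumptions, the mel formula is the common \(2595 \log_{10}(1+f/700)\) variant, and the mel filterbank is applied to the power spectrum before the log, which is the order common implementations use and differs slightly from the step order listed above.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct2(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sr, n_fft=512, hop=256, n_mels=40, n_mfcc=13):
    # Waveform -> windowed frames
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    # DFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Mel-scaling, then log (epsilon avoids log(0))
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # Discrete cosine transform, keeping the first n_mfcc coefficients
    return dct2(log_mel, n_mfcc)

# Hypothetical input: one second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
coeffs = mfcc(tone, sr)
print(coeffs.shape)
```

Each row of `coeffs` holds the coefficients of one frame.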
But why use the discrete cosine transform?
( = similar to an inverse Fourier transform )
- simplified version of FT
- get real-valued coefficient
- decorrelate energy in different mel bands
- reduce # of dim to represent spectrum
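Two of the points above (real-valued coefficients, fewer dimensions) can be illustrated with a small sketch. It builds an orthonormal DCT-II matrix by hand; the name `dct_matrix` and the synthetic log-mel frame (a constant level plus one smooth bump) are hypothetical, and the choice of 40 bands and 13 kept coefficients is only an example.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; each row is one cosine basis vector."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return basis

n_mels = 40
D = dct_matrix(n_mels)

# Hypothetical smooth log-mel frame: a constant level plus one broad bump.
bands = np.arange(n_mels)
log_mel = 5.0 + 2.0 * np.exp(-((bands - 20) / 8.0) ** 2)

# The DCT of a real vector is real (no complex phase to deal with).
coeffs = D @ log_mel

# Keep the first 13 coefficients and reconstruct from them alone.
truncated = coeffs.copy()
truncated[13:] = 0.0
approx = D.T @ truncated

energy_kept = np.sum(coeffs[:13] ** 2) / np.sum(coeffs ** 2)
max_error = np.max(np.abs(approx - log_mel))
print(energy_kept, max_error)
```

For a smooth spectrum like this one, almost all of the energy sits in the first few coefficients, so dropping the rest loses very little.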
How many coefficients to use?
- First 12~13 coefficients ( the low-order ones, i.e. the slowly varying spectral shape )
  - 1st : Most information
    - corresponds to “formants”, “spectral envelope”
  - Last : Least information
- Use \(\Delta\) and \(\Delta \Delta\) MFCCs
  \(\rightarrow\) about 39 coefficients per frame ( 13 MFCCs + 13 \(\Delta\) + 13 \(\Delta \Delta\) )
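A minimal sketch of how the 39 values per frame come together. It assumes simple central differences along time (`np.gradient`) for \(\Delta\) and \(\Delta \Delta\); speech toolkits usually compute deltas with a regression over a window of frames, so treat this as a simplification. The random array is only a placeholder for real MFCCs.

```python
import numpy as np

def delta(features):
    """Frame-to-frame rate of change (frames on axis 0), via central differences."""
    return np.gradient(features, axis=0)

rng = np.random.default_rng(0)
mfccs = rng.standard_normal((100, 13))  # placeholder: 100 frames x 13 MFCCs

d1 = delta(mfccs)  # delta
d2 = delta(d1)     # delta-delta

# Stack static, delta and delta-delta features per frame.
stacked = np.hstack([mfccs, d1, d2])
print(stacked.shape)
```

The deltas describe how the coefficients change over time, which the static MFCCs of a single frame cannot capture.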