Audio data for DL

Reference : https://www.youtube.com/watch?v=fMqL5vckiU0&list=PL-wATfeyAMNrtbkCNsLcpoAyBBRJZVlnf


figure2


1. Waveform

Key concepts

  • Period ( = seconds / cycle )
    • inverse: Frequency ( = cycles / second, in Hz )
  • Amplitude

figure2

figure2


Mathematical expression

\(y(t)=A \sin (2 \pi f t+\varphi)\).

  • \(t\) : time index
  • \(A\) : amplitude
  • \(f\) : frequency
  • \(\varphi\) : phase
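The expression above can be evaluated directly. The following is a minimal sketch; the function name `sine_wave` and the values \(A = 1\), \(f = 2\) Hz, \(\varphi = 0\) are illustrative assumptions, not taken from the reference.

```python
import math

def sine_wave(t, amplitude, frequency, phase=0.0):
    """Evaluate y(t) = A * sin(2*pi*f*t + phase) at time t (in seconds)."""
    return amplitude * math.sin(2 * math.pi * frequency * t + phase)

# Illustrative values (assumptions): A = 1.0, f = 2 Hz, phase = 0.
A, f, phi = 1.0, 2.0, 0.0
period = 1.0 / f  # seconds per cycle, the inverse of the frequency

y_start = sine_wave(0.0, A, f, phi)         # start of a cycle
y_peak = sine_wave(period / 4, A, f, phi)   # a quarter cycle later: the peak, equal to A
y_next = sine_wave(period, A, f, phi)       # one full period later: back at the start
```

With zero phase, the wave starts at 0, reaches its amplitude after a quarter of a period, and repeats after one full period.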


figure2


2. Frequency/pitch & Amplitude/loudness

LOW frequency / LOW amplitude

HIGH frequency / HIGH amplitude

figure2


HIGH frequency \(\rightarrow\) HIGH pitch

HIGH amplitude \(\rightarrow\) LOUD sound


3. Sampling

figure2

  • sampling period : \(T\)
    • time index: \(t_n = n \cdot T\)
  • sampling rate : \(1/T\) ( samples per second )
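As a rough sketch of the sampling step, the snippet below evaluates a sine wave only at the time indices \(t_n = n \cdot T\). The helper name `sample_sine` and the values (1 Hz tone, 8 samples per second) are assumptions chosen to keep the numbers easy to check.

```python
import math

def sample_sine(frequency, sample_rate, duration, amplitude=1.0, phase=0.0):
    """Sample A*sin(2*pi*f*t + phase) at t_n = n*T, where T = 1/sample_rate."""
    T = 1.0 / sample_rate                      # sampling period
    num_samples = int(round(duration * sample_rate))
    times = [n * T for n in range(num_samples)]
    values = [amplitude * math.sin(2 * math.pi * frequency * t + phase) for t in times]
    return times, values

# Illustrative values (assumptions): a 1 Hz tone, 8 samples per second, 1 second long.
times, values = sample_sine(frequency=1.0, sample_rate=8, duration=1.0)
```

Here `times` holds the 8 sampling instants spaced `T = 0.125` seconds apart, and `values` holds the amplitude of the wave at each of them.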


figure2


4. Aliasing vs. Quantization

(1) Aliasing ( = X-axis )

  • original signal (RED) : high frequency
  • reconstructed signal (BLUE) : low frequency

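\(\rightarrow\) frequencies ABOVE a certain threshold ( half the sampling rate, the Nyquist frequency ) cannot be captured, so they are reconstructed as lower frequencies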
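Aliasing can be checked numerically. In this sketch (all values are assumptions), a 7 Hz sine is sampled at only 10 samples per second, so it lies above half the sampling rate (5 Hz). Its samples turn out to be identical to those of a sign-flipped 3 Hz sine, which is why the reconstructed signal looks like a low-frequency wave.

```python
import math

sample_rate = 10                      # samples per second (assumed)
nyquist = sample_rate / 2             # half the sampling rate: 5 Hz
f_high = 7                            # original frequency, above 5 Hz
f_alias = abs(f_high - sample_rate)   # frequency it is mistaken for: 3 Hz

def sample(frequency, n):
    """n-th sample of a unit-amplitude sine at the given frequency."""
    return math.sin(2 * math.pi * frequency * n / sample_rate)

high = [sample(f_high, n) for n in range(20)]
low = [-sample(f_alias, n) for n in range(20)]   # 3 Hz sine with flipped sign

# The two sample sequences are indistinguishable up to floating-point error.
max_diff = max(abs(h - l) for h, l in zip(high, low))
```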

figure2


(2) Quantization ( = Y-axis )

figure2
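A minimal sketch of quantization, assuming amplitudes are normalized to the range [-1, 1] and mapped to signed integer codes. The function names `quantize` and `dequantize` are hypothetical, and real converters may round or scale differently.

```python
def quantize(value, bit_depth):
    """Map an amplitude in [-1.0, 1.0] to a signed integer code with bit_depth bits."""
    levels = 2 ** bit_depth
    max_code = levels // 2 - 1        # e.g. 32767 for 16 bits
    min_code = -(levels // 2)         # e.g. -32768 for 16 bits
    code = round(value * max_code)
    return max(min_code, min(max_code, code))   # clip to the representable range

def dequantize(code, bit_depth):
    """Map an integer code back to an approximate amplitude."""
    return code / (2 ** bit_depth // 2 - 1)

x = 0.3
error_16 = abs(dequantize(quantize(x, 16), 16) - x)   # fine grid: small error
error_3 = abs(dequantize(quantize(x, 3), 3) - x)      # coarse grid: large error
```

Fewer bits mean a coarser grid on the Y-axis, so the rounding error for the same amplitude grows.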


5. Analog-to-Digital Conversion (ADC)

[X] sample signal at uniform time intervals

[Y] quantize with (limited number of) bits

figure2


ex) CD :

  • sampling rate = 44100 Hz
  • bit depth = 16 bits / channel


6. 1 min = xx Byte?

Sampling rate = 44100Hz

  • 44100 points per second

Bit depth = 16 bit

  • amplitude is quantized into 16 bits ( \(2^{16}\) possibilities)


Total Memory of Sound in 1 minute ( in .wav file )

  • number of bits per second : \(16 \times 44,100 = 705,600\)
  • number of megabits per second : \((16 \times 44,100) / 1,048,576 \approx 0.673\)
  • number of megabytes per second : \((16 \times 44,100) / (1,048,576\times8) \approx 0.0841\)
  • number of megabytes per minute : \((16 \times 44,100) / (1,048,576\times8)\) \(\times 60 \approx 5.05\text{MB}\) ( per channel )
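The same arithmetic can be written out as a short script. This is a sketch that assumes a single (mono) channel, uncompressed samples, and 1 MB = 1,048,576 bytes; file headers are ignored.

```python
sample_rate = 44_100   # samples per second
bit_depth = 16         # bits per sample
seconds = 60           # one minute
channels = 1           # assumption: mono; multiply accordingly for stereo

bits_per_second = sample_rate * bit_depth * channels
bytes_per_minute = bits_per_second * seconds // 8
megabytes_per_minute = bytes_per_minute / 1_048_576
```

Under these assumptions, one minute of audio takes 5,292,000 bytes, which is a little over 5 MB per channel.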

\(\rightarrow\) to reduce file size, we use compressed formats such as .mp3!


7. Fourier Transform

from TIME domain to FREQUENCY domain

( but time information is lost )


decompose a sound into a sum of sine waves ( oscillating at different frequencies )

figure2


ex) decompose into 2 sine waves

\(s=A_1 \sin \left(2 \pi f_1 t+\varphi_1\right)+A_2 \sin \left(2 \pi f_2 t+\varphi_2\right)\).

  • \(A_1=0.5, f_1=4, \varphi_1=0\).
  • \(A_2=1.5, f_2=1.5, \varphi_2=0\).
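To illustrate the decomposition, the sketch below builds the two-sine signal from the example and recovers both frequencies and amplitudes with a naive discrete Fourier transform. The sampling rate (32 Hz) and duration (2 s) are assumptions, chosen so that both frequencies fall exactly on a frequency bin; in practice an FFT routine from a numerical library would be used instead of this slow loop.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive DFT; returns the magnitude of each frequency bin from 0 up to Nyquist."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
        mags.append(abs(total))
    return mags

sample_rate = 32   # samples per second (assumed)
duration = 2       # seconds (assumed), so bins are 1/duration = 0.5 Hz apart
n = sample_rate * duration

# s = A1*sin(2*pi*f1*t) + A2*sin(2*pi*f2*t) with the values from the example.
signal = [
    0.5 * math.sin(2 * math.pi * 4.0 * i / sample_rate)
    + 1.5 * math.sin(2 * math.pi * 1.5 * i / sample_rate)
    for i in range(n)
]

mags = dft_magnitudes(signal)
# Scale so that a sine of amplitude A shows up as roughly A in its bin.
amplitudes = [2 * m / n for m in mags]
# Keep only the bins with noticeable energy: {frequency in Hz: amplitude}.
peaks = {k / duration: round(a, 3) for k, a in enumerate(amplitudes) if a > 0.1}
```

Only two bins carry energy, at 1.5 Hz and 4 Hz, with amplitudes matching \(A_2\) and \(A_1\). Note that the result says which frequencies are present but not when they occur.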


figure2

  • decompose into multiple waves


8. Short Time Fourier Transform (STFT)

problem: TIME INFORMATION is lost due to FT

solution : Short Time Fourier Transform (STFT)

  • (1) compute multiple FFTs over successive intervals ( frames ) of the signal

    • able to preserve TIME info
  • (2) FIXED frame size

    • ex) 2048 samples per interval
  • (3) output = SPECTROGRAM

    ( = time + frequency + magnitude )
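The three steps above can be sketched as follows. This is a simplified illustration, not a production STFT: the frame size is 32 samples rather than 2048 so that the naive DFT stays fast, and the hop size, Hann window, and test signal (4 Hz for one second, then 16 Hz) are assumptions.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive DFT magnitudes of one frame, for bins 0 .. len(frame) // 2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def stft_magnitudes(signal, frame_size, hop_size):
    """Split the signal into fixed-size frames and compute the spectrum of each one."""
    spectrogram = []
    for start in range(0, len(signal) - frame_size + 1, hop_size):
        frame = signal[start:start + frame_size]
        # Hann window (a common choice, assumed here) to reduce edge effects.
        windowed = [
            x * 0.5 * (1 - math.cos(2 * math.pi * i / frame_size))
            for i, x in enumerate(frame)
        ]
        spectrogram.append(dft_magnitudes(windowed))
    return spectrogram   # indexed as [frame (time)][bin (frequency)] -> magnitude

sample_rate = 64   # samples per second (assumed)
frame_size = 32    # samples per frame (the notes mention 2048 as a typical value)
hop_size = 16      # samples between frame starts (assumed)

# Test signal: 4 Hz during the first second, 16 Hz during the second one.
signal = [
    math.sin(2 * math.pi * (4 if i < sample_rate else 16) * i / sample_rate)
    for i in range(2 * sample_rate)
]

spectrogram = stft_magnitudes(signal, frame_size, hop_size)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spectrogram]
peak_freqs = [b * sample_rate / frame_size for b in peak_bins]
```

Unlike a single Fourier transform over the whole signal, `peak_freqs` shows the dominant frequency changing from 4 Hz in the first frame to 16 Hz in the last one, so the time information is preserved.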


figure2


9. Pre-processing pipeline for Audio

(1) DL

figure2


(2) (Traditional) ML

figure2

\(\rightarrow\) requires a lot of feature engineering!


10. Mel-Frequency Cepstral Coefficients (MFCCs)

figure2


MFCCs

  • Frequency domain feature
  • Capture timbral/textural aspects of sound
  • Approximate human auditory system
  • 13 to 40 coefficients
  • Calculated at each frame
    • need to perform STFT first!


Applications:

  • speech recognition
  • music genre classification

figure2
