Quantization
Contents
- Float32 vs. Float16 vs. BFloat16
 - Min & Max Range Comparison
 - Converting Data Type
 
1. Float32 vs. Float16 vs. BFloat16
(1) Float32

- 32 = 1 (sign) + 8 (exponent) + 23 (mantissa)
 - \((-1)^{\text {sign }} 2^{(\text {exponent }-127)} \times 1 \text {.mantissa }\).
 
Example (an 8-bit field, such as the exponent, read as an unsigned integer):

- \[81 = 2^{6} + 2^4 + 2^0 = 64 + 16 + 1\]
 - (max): \(2^7 + 2^6 + \cdots + 2^0 = 2^8 - 1 = 255\)
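To make the Float32 layout concrete, here is a minimal sketch that splits a value into its three fields and then rebuilds it with the formula above. It uses only Python's standard `struct` module; the helper names `float32_fields` and `float32_value` are made up for this illustration, and the formula as written only covers normalized values (not zero, subnormals, infinity, or NaN).

```python
import struct

def float32_fields(x):
    """Store x as an IEEE 754 float32 and return (sign, exponent, mantissa) as integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, the fractional part of 1.mantissa
    return sign, exponent, mantissa

def float32_value(sign, exponent, mantissa):
    """Apply (-1)^sign * 2^(exponent - 127) * 1.mantissa (normalized values only)."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

sign, exponent, mantissa = float32_fields(-6.5)
print(sign, exponent, hex(mantissa))            # 1 129 0x500000
print(float32_value(sign, exponent, mantissa))  # -6.5
```

Here -6.5 = -1.625 × 2^2, so the stored exponent is 2 + 127 = 129 and the mantissa field encodes the fraction 0.625.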
 
(2) Float16

- 16 = 1 (sign) + 5 (exponent) + 10 (mantissa)
 - \[(-1)^{\text {sign }} 2^{(\text {exponent }-15)} \times 1 \text {.mantissa }\]
 
(3) BFloat16
(Brain Float 16)

- 16 = 1 (sign) + 8 (exponent) + 7 (mantissa)
 - \((-1)^{\text {sign }} 2^{(\text {exponent }-127)} \times 1 . \text { mantissa }\).
 
(4) Float8

- 8 = 1 (sign) + 4 (exponent) + 3 (mantissa)
 - \((-1)^{\text {sign }} 2^{(\text {exponent }-7)} \times 1 \text {.mantissa }\).
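The four formats share one formula and differ only in field widths and bias. As a rough sketch, the hypothetical helper `decode` below applies that formula to a raw bit pattern, assuming the bias is \(2^{(\text{exponent bits} - 1)} - 1\) (which gives 127, 15, 127, and 7) and that the value is normalized.

```python
def decode(bits, exp_bits, man_bits):
    """Decode a normalized value: (-1)^sign * 2^(exponent - bias) * 1.mantissa."""
    bias = 2 ** (exp_bits - 1) - 1                       # 127, 15, 127, 7
    sign = (bits >> (exp_bits + man_bits)) & 1           # top bit
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    return (-1) ** sign * 2.0 ** (exponent - bias) * (1 + mantissa / 2 ** man_bits)

# The value 1.5 written in each format (bit patterns worked out by hand).
print(decode(0x3FC00000, 8, 23))  # Float32  -> 1.5
print(decode(0x3E00, 5, 10))      # Float16  -> 1.5
print(decode(0x3FC0, 8, 7))       # BFloat16 -> 1.5
print(decode(0x3C, 4, 3))         # Float8   -> 1.5
```

Note that the BFloat16 pattern `0x3FC0` is simply the upper 16 bits of the Float32 pattern `0x3FC00000`, because both use an 8-bit exponent.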
 
2. Min & Max Range Comparison
- Float32: \(\left[-3.4 \times 10^{38}, 3.4 \times 10^{38}\right]\)
- Float16: \(\left[-6.55 \times 10^4, 6.55 \times 10^4\right]\)
- BFloat16: \(\left[-3.39 \times 10^{38}, 3.39 \times 10^{38}\right]\)
- Float8: \([-240,240]\)
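These ranges can be derived from the bit layouts. The sketch below assumes the IEEE 754 convention that the all-ones exponent is reserved (for infinity and NaN), so the largest usable exponent field is \(2^{\text{exponent bits}} - 2\) and the largest significand is 1.11...1 in binary. The function name `max_finite` is hypothetical. Some Float8 variants use a different reserved encoding and therefore a different maximum; the value 240 corresponds to the convention assumed here.

```python
def max_finite(exp_bits, man_bits):
    """Largest finite value, assuming the all-ones exponent field is reserved."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exponent = (2 ** exp_bits - 2) - bias   # largest non-reserved exponent, unbiased
    max_significand = 2 - 2 ** -man_bits        # 1.111...1 in binary
    return 2.0 ** max_exponent * max_significand

formats = {"Float32": (8, 23), "Float16": (5, 10), "BFloat16": (8, 7), "Float8": (4, 3)}
for name, (e, m) in formats.items():
    print(f"{name}: {max_finite(e, m):.3g}")
# Float32: 3.4e+38
# Float16: 6.55e+04
# BFloat16: 3.39e+38
# Float8: 240
```

Float32 and BFloat16 land on nearly the same range because both have 8 exponent bits; the mantissa width only changes the last factor.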
3. Converting Data Type
(1) Float32 \(\rightarrow\) Float16

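- The exponent shrinks from 8 bits to 5 bits (bits are dropped from the front)
 - The mantissa shrinks from 23 bits to 10 bits (bits are dropped from the back)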
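Losing mantissa bits shows up as a loss of precision. A small sketch using the standard `struct` module's half-precision format (`e`) illustrates this; `to_float16` is a hypothetical helper name. One caveat: `struct` rounds to the nearest Float16 value rather than simply cutting off the trailing bits, so the results can differ slightly from pure truncation.

```python
import struct

def to_float16(x):
    """Round x to the nearest Float16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for x in (0.1, 3.14159265, 1000.3):
    print(x, "->", to_float16(x))
# 0.1 -> 0.0999755859375
# 3.14159265 -> 3.140625
# 1000.3 -> 1000.5
```

With only 10 mantissa bits, values between 512 and 1024 are spaced 0.5 apart, which is why 1000.3 snaps to 1000.5.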
 
Problem? Float overflow!
- The value may fall outside the range that Float16 can represent!
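The overflow can be reproduced with the standard library alone. In this sketch, `fits_in_float16` is a hypothetical helper; it relies on `struct.pack` raising `OverflowError` when a value is too large for the half-precision `e` format. Other libraries may handle the same situation differently, for example by returning infinity.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite Float16 value, about 6.55e4

def fits_in_float16(x):
    """Return True if x can be packed as a finite Float16 value."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

print(fits_in_float16(FLOAT16_MAX))  # True
print(fits_in_float16(70000.0))      # False
print(fits_in_float16(1e38))         # False
```

Both 70000.0 and 1e38 are ordinary Float32 values, yet neither survives the conversion to Float16.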
 
Solution? BFloat16!
(2) Float32 \(\rightarrow\) BFloat16
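Because BFloat16 keeps all 8 exponent bits of Float32, one simple way to picture the conversion is to keep the upper 16 bits of the Float32 pattern and drop the lower 16 mantissa bits. The sketch below does exactly that; the helper names are hypothetical, and plain truncation is an assumption made for clarity (practical implementations typically round rather than truncate).

```python
import struct

def float32_to_bfloat16_bits(x):
    """Keep the top 16 bits of the float32 pattern: sign, 8 exponent bits, 7 mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16

def bfloat16_bits_to_float(bits16):
    """Pad the 16 bits with zeros to recover a float32 pattern, then read its value."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

for x in (70000.0, 1e38, 1.5):
    b = float32_to_bfloat16_bits(x)
    print(x, "->", hex(b), "->", bfloat16_bits_to_float(b))
```

For example, 70000.0 becomes the pattern `0x4788`, which reads back as 69632.0: the value loses precision but does not overflow, unlike the Float16 case.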
