Quantization

Float32 vs Float16 vs BFloat16


Seunghan Lee

Contents

Float32 vs. Float16 vs. BFloat16
Min & Max Range Comparison
Converting Data Type

1. Float32 vs. Float16 vs. BFloat16

(1) Float32

32 = 1 (sign) + 8 (exponent) + 23 (mantissa)
\((-1)^{\text {sign }} 2^{(\text {exponent }-127)} \times 1 \text {.mantissa }\).

Example (reading the 8-bit exponent field as an unsigned integer):

\[01010001_2 = 2^{6} + 2^4 + 2^0 = 64 + 16 + 1 = 81\]
(max): \(2^7 + 2^6 + \cdots + 2^0 = 2^8 - 1 = 255\)
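
The bit layout above can be checked directly in code. The following is a minimal sketch, assuming normal (non-zero, finite) inputs and using only Python's standard `struct` module; the helper name `decode_float32` is made up for illustration.

```python
import struct

def decode_float32(x: float) -> dict:
    """Split a number into its Float32 sign, exponent and mantissa fields."""
    # Reinterpret the 4 bytes of the Float32 as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, the fraction after "1."
    # Rebuild the value with the formula above (normal numbers only).
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)
    return {"sign": sign, "exponent": exponent, "mantissa": mantissa, "value": value}

fields = decode_float32(-6.5)
print(fields)
```

Here \(-6.5 = -1.625 \times 2^2\), so the stored exponent is \(2 + 127 = 129\) and the rebuilt value equals the input.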

(2) Float16

16 = 1 (sign) + 5 (exponent) + 10 (mantissa)
\[(-1)^{\text {sign }} 2^{(\text {exponent }-15)} \times 1 \text {.mantissa }\]
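
As a worked example of this formula (the bit pattern is chosen purely for illustration), take sign \(0\), exponent \(10001_2 = 17\), and mantissa \(0100000000_2\):

\[(-1)^{0} \times 2^{(17-15)} \times 1.01_2 = 4 \times 1.25 = 5\]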

(3) BFloat16

(Brain Float 16)

16 = 1 (sign) + 8 (exponent) + 7 (mantissa)
\((-1)^{\text {sign }} 2^{(\text {exponent }-127)} \times 1 \text {.mantissa }\).

(4) Float8

8 = 1 (sign) + 4 (exponent) + 3 (mantissa)
\((-1)^{\text {sign }} 2^{(\text {exponent }-7)} \times 1 \text {.mantissa }\).

2. Min & Max Range Comparison

Float32: \(\left[-3.4 \times 10^{38}, 3.4 \times 10^{38}\right]\)

Float16: \(\left[-6.55 \times 10^4, 6.55 \times 10^4\right]\)

BFloat16: \(\left[-3.39 \times 10^{38}, 3.39 \times 10^{38}\right]\)

Float8: \([-240,240]\)
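
These ranges follow from the bit counts listed earlier. The sketch below recomputes them under one assumption: the all-ones exponent code is reserved (as IEEE 754 does for infinity and NaN), so the largest usable exponent code is one below the maximum. The function name `max_normal` is hypothetical. Other 8-bit conventions reserve fewer codes and would give a different Float8 limit.

```python
def max_normal(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value, assuming the all-ones exponent code is reserved."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest usable exponent code is (2^bits - 2); subtract the bias.
    largest_exponent = (2 ** exponent_bits - 2) - bias
    # Significand 1.111...1 in binary, i.e. 2 - 2^(-mantissa_bits).
    largest_significand = 2 - 2.0 ** -mantissa_bits
    return 2.0 ** largest_exponent * largest_significand

# (exponent bits, mantissa bits) as listed in section 1.
formats = {
    "Float32": (8, 23),
    "Float16": (5, 10),
    "BFloat16": (8, 7),
    "Float8": (4, 3),
}
for name, (e, m) in formats.items():
    print(f"{name}: +/-{max_normal(e, m):.4g}")
```

BFloat16 reaches almost the same limit as Float32 because both use 8 exponent bits; only the mantissa, and therefore the precision, differs.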

3. Converting Data Type

(1) Float32 \(\rightarrow\) Float16

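Exponent: bits are dropped starting from the front (8 → 5 bits)
Mantissa: bits are dropped starting from the back (23 → 10 bits)

Problem? Float overflow!

The value may exceed the range that Float16 can represent!

Solution? BFloat16!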
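
The overflow can be reproduced with Python's standard `struct` module, whose `e` format is IEEE half precision. This is only a sketch: `struct` raises `OverflowError` for out-of-range values, and the hypothetical helper `to_float16` maps that to infinity as an illustrative choice, which is what many numeric libraries return for such a cast.

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip x through half precision; return +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # The value lies outside the Float16 range (about +/-65504).
        return float("inf") if x > 0 else float("-inf")

in_range = to_float16(65504.0)     # largest finite Float16 value
overflowed = to_float16(70000.0)   # fits in Float32, not in Float16
print(in_range, overflowed)
```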

(2) Float32 \(\rightarrow\) BFloat16
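
Because BFloat16 keeps the same 8 exponent bits as Float32, one simple way to convert is to keep the top 16 bits of the Float32 pattern and drop the lower 16 mantissa bits. The sketch below does exactly that with truncation; production implementations typically round to nearest instead, and both function names are made up for illustration.

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Keep the top 16 bits of the Float32 pattern (truncation, no rounding)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16

def bfloat16_bits_to_float(bits16: int) -> float:
    """Pad the 16-bit pattern with zeros to read it back as a Float32."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value

x = 70000.0  # outside the Float16 range, well inside the BFloat16 range
approx = bfloat16_bits_to_float(float32_to_bfloat16_bits(x))
print(x, "->", approx)
```

The value stays finite, but only 7 mantissa bits survive, so the result is coarser than the input: the range is preserved at the cost of precision.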
