Byte Latent Transformer: Patches Scale Better Than Tokens (FAIR 2024)
Pagnoni, Artidoro, et al. "Byte Latent Transformer: Patches Scale Better Than Tokens." arXiv preprint arXiv:2412.09871 (2024).
( https://arxiv.org/pdf/2412.09871 )
Reference: https://www.youtube.com/watch?v=jjwkjYEbejk
1. Abstract
Byte Latent Transformer (BLT)
- Byte-level LLM architecture
- Performance: matches tokenization-based LLM performance at scale
  ( with significant improvements in inference efficiency and robustness )

Details
- Encodes bytes into dynamically sized patches
  ( = primary units of computation )
- Patches are segmented based on the entropy of the next byte
2. Overall Architecture
3. Various Patching Techniques
Entropy Patching
Uses next-byte entropies from a small byte LM
$\rightarrow$ The boundary where a new patch starts is determined based on this entropy! ( see the sketch below )
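A minimal sketch of this rule in Python. Assumptions: a small byte-level LM `byte_lm` that returns next-byte logits of shape `(seq_len, 256)`, an illustrative threshold value, and a hypothetical function name; the paper's exact boundary criteria may differ in detail.

```python
import torch
import torch.nn.functional as F

def entropy_patch_boundaries(byte_ids: torch.Tensor, byte_lm, threshold: float = 2.0):
    """Return positions where the predicted next-byte entropy exceeds the
    threshold; each such position starts a new patch."""
    logits = byte_lm(byte_ids)                                 # (seq_len, 256) next-byte logits
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)   # Shannon entropy per position
    return (entropy > threshold).nonzero(as_tuple=True)[0]     # indices of patch starts
```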
4. Byte Latent Transformer (BLT)
Key idea: uses both byte and patch units!
- Local: bytes
- Global: patches ( = groups of bytes )
Summary:
- (1) Language modeling is done at the byte level
- (2) The heavy computation is grouped and performed at the patch level ( by the latent global transformer )
$\rightarrow$ Consists of a Global model & a Local model
(1) [Local] Local Encoder & Decoder
Takes byte-level input and produces byte-level output
Encoder & Decoder ( cross-attention; see the sketch after this list )
- Encoder:
  - Query: patches
  - Key/Values: bytes
- Decoder:
  - Query: bytes
  - Key/Values: patches
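A minimal sketch of this query / key-value pattern using plain `nn.MultiheadAttention`. The shapes, dimensions, and standalone modules are illustrative assumptions; the actual local encoder/decoder are small transformers that combine such cross-attention with byte-level layers and the n-gram embeddings below.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8

# Encoder cross-attention: queries = patches, keys/values = bytes
enc_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
# Decoder cross-attention: queries = bytes, keys/values = patches
dec_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

bytes_h = torch.randn(1, 128, d_model)   # byte-level hidden states
patches = torch.randn(1, 16, d_model)    # patch-level hidden states

# Encoder: pool byte information into patch representations
patch_out, _ = enc_cross_attn(query=patches, key=bytes_h, value=bytes_h)
# Decoder: distribute patch information back to byte positions
byte_out, _ = dec_cross_attn(query=bytes_h, key=patch_out, value=patch_out)
```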
Encoder hash n-gram Embeddings
$e_i = x_i + \sum_{n=3}^{8} E_n^{\text{hash}}\left(\operatorname{Hash}\left(g_{i, n}\right)\right)$
Interpretation: the final embedding of the $i$-th byte = (1) + (2)
- (1) the unigram embedding of the $i$-th byte
- (2) the sum of the hashed n-gram embeddings ( $n = 3, \dots, 8$ ) of the n-grams ending at the $i$-th byte ( see the sketch below )
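A minimal sketch of this embedding sum. Assumptions: the bucket count, embedding size, and Python's built-in `hash` as a stand-in for the paper's rolling hash; `embed_bytes` and `hash_ngram` are hypothetical names.

```python
import torch
import torch.nn as nn

VOCAB, D, NUM_BUCKETS = 256, 512, 50_000

byte_emb = nn.Embedding(VOCAB, D)   # unigram byte embedding x_i
hash_embs = nn.ModuleDict({str(n): nn.Embedding(NUM_BUCKETS, D) for n in range(3, 9)})

def hash_ngram(ngram: tuple, num_buckets: int) -> int:
    # Illustrative stand-in for the paper's rolling hash function.
    return hash(ngram) % num_buckets

def embed_bytes(byte_ids: list[int]) -> torch.Tensor:
    """Return the embedding e_i for every byte position i."""
    uni = byte_emb(torch.tensor(byte_ids))              # (seq_len, D) unigram embeddings
    out = []
    for i in range(len(byte_ids)):
        e_i = uni[i]
        for n in range(3, 9):                           # n = 3 .. 8
            if i + 1 >= n:                              # n-gram g_{i,n} ends at byte i
                g = tuple(byte_ids[i - n + 1 : i + 1])
                bucket = torch.tensor(hash_ngram(g, NUM_BUCKETS))
                e_i = e_i + hash_embs[str(n)](bucket)   # add hashed n-gram embedding
        out.append(e_i)
    return torch.stack(out)                             # (seq_len, D)
```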
(2) [Global] Latent Global Transformer
Performs next-patch prediction in the latent (patch) space
The local decoder can only predict well if this transformer produces good patch representations!
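To tie the local and global pieces together, a minimal sketch of the overall forward flow. All module names, signatures, and the byte-to-patch mapping `patch_ids` are illustrative assumptions, not the paper's actual interfaces.

```python
def blt_forward(byte_ids, patch_ids, local_encoder, global_transformer, local_decoder):
    # 1) Local encoder: embeds bytes (unigram + hash n-grams) and pools them
    #    into patch representations via patch-query cross-attention
    byte_h, patch_h = local_encoder(byte_ids, patch_ids)
    # 2) Latent global transformer: the heavy computation, run once per patch,
    #    producing the representation used to predict the next patch
    patch_h = global_transformer(patch_h)
    # 3) Local decoder: byte queries cross-attend to the patch representations
    #    and predict the next byte at every position
    next_byte_logits = local_decoder(byte_h, patch_h)
    return next_byte_logits
```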
Other: Entropy Model
A byte position with high next-byte entropy $\rightarrow$ the boundary of a new patch