LLM Inference를 위한 라이브러리

Hugging Face Transformers
ollama
vLLM
Code 예시
1. Hugging Face Transformers
2. vLLM

1. Hugging Face Transformers

(1) 기본적인 특징

가장 널리 사용되는 라이브러리
- 다양한 LLM을 쉽게 로드하고 Inference할 수 있도록 지원
PyTorch와 TensorFlow를 모두 지원

(2) 상세 특징

Pipeline 함수를 통해 전처리/후처리과정을 모델과 연결하여 쉽게 사용 가능
lang chain과 함께 활용 가능
AutoModelForCausalLM 등을 활용해 편리하게 사용할 수 있음
accelerate, bitsandbytes와 같은 라이브러리와 함께 사용하면 메모리 최적화도 가능

(3) 한계점

추론 시 Bottleneck 등이 존재
(특히 quantized된 모델에 대해서는 느린 추론 속도)
- 이유? Float 연산 지원에 최적! (quantized된) INT에는 그닥…

(4) pipeline

전처리 & 후처리 과정이 매우 간편해짐!

Text generation
fill mask
feature extraction
ner
zero-shot classification
translation
sumarization

2. ollama

(1) 기본적인 특징

로컬 PC에서 LLM을 쉽게 실행할 수 있음
LLama, Mistral등 다양한 모델 지원

(2) 상세 특징

weight, 설정, 데이터셋 등을 통합하여 Modelfile로 관리함
무료 opensource 도구로 제공, REST API를 통해 접근 가능
CUDA 플랫폼 이외에도 OK

3. vLLM

(1) 기본적인 특징

고성능 Inference 엔진으로 (특히 Serving에 최적화)
throughput을 2배 이상으로 향상시킴!

(2) 상세 특징

PagedAttention을 활용한 효율적인 KV 캐싱 기법을 제공
Continuous batching을 활용하여 메모리 공간을 효율적으로 활용
vllm.engine.AsyncEngine 등을 사용하여 빠르고 확장성 있는 Inference를 수행할

(3) Paged attention

다음 post 글 참고하기!

(4) Continuous Batching

다양한 요청을 실시간으로 효율적으로 처리하는 기법

기존의 DL: 일반적으로 일정한 크기의 배치(batch)
continuous batching: 새로운 요청이 들어올 때마다 유동적으로 배치를 구성

기존 배칭 (Static Batching)	Continuous Batching
정해진 크기의 배치를 만들어 처리	새로운 요청이 들어오면 기존 배치에 즉시 추가
모든 요청이 준비될 때까지 대기	요청이 준비된 즉시 처리
GPU 자원을 비효율적으로 사용할 가능성	GPU를 최대한 활용하여 더 많은 요청을 동시에 처리

작동 방식

여러 사용자가 LLM에 질문을 보낼 때, 요청을 하나씩 기다렸다가 배치로 묶어서 처리하는 것이 아니라,
들어오는 즉시 기존 배치에 포함시키고,
GPU 자원을 끊김 없이 연속적으로 사용하도록 최적화함.

\(\rightarrow\) 즉, 연속적으로 배치(batch)를 업데이트하면서 최대한 많은 요청을 효율적으로 처리하는 방식

(5) 한계점

활용 가능한 inference 모델이 하나로 제한됨
- instance를 생성하고 나서, 다른 모델을 불러올 경우, 각 과정마다 destroy해줘야
transformer library와의 충돌하는 경우 O

(6) vLLM Details

기본적 사용법:

LLM을 instance로 생성
Generate 함수를 통해 문장을 생성

Generate 함수

SamplingParams를 통해 문장을 생성
SamplingParams의 arguments: 아래 참조

4. Code 예시

(1) Hugging Face Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

input_text = "What is the capital of France?"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

(2) vllm

from vllm import LLM, SamplingParams

model_name = "meta-llama/Llama-2-7b-chat-hf"
llm = LLM(model=model_name, dtype="bfloat16")

sampling_params = SamplingParams(max_tokens=50)
output = llm.generate("What is the capital of France?", sampling_params)

print(output[0].outputs[0].text)

Twitter Facebook LinkedIn

LLM Inference를 위한 라이브러리

Seunghan Lee

LLM Inference를 위한 라이브러리

Contents

1. Hugging Face Transformers

(1) 기본적인 특징

(2) 상세 특징

(3) 한계점

(4) pipeline

2. ollama

(1) 기본적인 특징

(2) 상세 특징

3. vLLM

(1) 기본적인 특징

(2) 상세 특징

(3) Paged attention

(4) Continuous Batching

작동 방식

(5) 한계점

(6) vLLM Details

4. Code 예시

(1) Hugging Face Transformers

(2) vllm

You May Also Enjoy