(Reference: the Ready-To-Use Tech YouTube lectures)
1. The 4 forms of BERT fine-tuning (see the sketch below this list)
- 1) Sequence Pair Classification (determining the relationship between two sentences)
- 2) Single Sentence Classification (classifying a single sentence)
- 3) Question Answering
- 4) Single Sentence Tagging (token-level tagging, e.g. named entity recognition)
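Each of these four forms corresponds to a dedicated head class in Hugging Face transformers. A minimal sketch (bert-base-uncased is only used as an example checkpoint, not the one from the lecture):
from transformers import (
    BertForSequenceClassification,  # 1) & 2): sentence-pair / single-sentence classification
    BertForQuestionAnswering,       # 3): QA (predicts answer start/end positions)
    BertForTokenClassification,     # 4): token-level tagging, e.g. NER
)
# e.g. a 2-class single-sentence classifier on top of pre-trained BERT
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)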
2. Hugging Face
Mainly Transformer-based models
1) Overview of the left-side menu
- (a) Summary of Tasks: introduces the fine-tuning approaches
- see above
- (b) Summary of the Models: introduces the pre-training model architectures
- 1) Autoregressive models
- 2) Autoencoding models
- 3) Seq2Seq models
- 4) Multimodal models
- 5) Retrieval-based models
- (c) Model sharing and uploading: sharing/publishing models with other users
- (d) Models
- BERT, ALBERT, BART, GPT, T5, …
- ex) BERT: (a loading sketch follows this list)
- BertTokenizer
- BertTokenizerFast
- …
- BertForSequenceClassification (for PyTorch)
- BertForNextSentencePrediction
- …
- TFBertForSequenceClassification (for TensorFlow)
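As a rough sketch of the naming convention: the plain classes are PyTorch, the TF-prefixed ones are TensorFlow, and both load the same checkpoint (bert-base-uncased is an assumed example here):
from transformers import BertTokenizerFast, BertForSequenceClassification, TFBertForSequenceClassification
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model_pt = BertForSequenceClassification.from_pretrained('bert-base-uncased')    # PyTorch
model_tf = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')  # TensorFlow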
2) Official pre-trained models
Advanced Guides: the official models
- Pretrained Models: introduces the architectures
- bert-base-uncased
- bert-large-uncased
- …
- bert-base-multilingual-uncased
- bert-base-multilingual-cased
- …
- gpt-large
- …
- cased: case-sensitive (distinguishes upper/lower case)
- uncased: case-insensitive (text is lowercased)
- base & large: model size (see the config sketch below)
- Pull a pre-trained model from here and fine-tune it on your own data/task!
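The base/large difference is literally model size. A minimal sketch for checking it from the model configs (the printed numbers are the well-known BERT sizes, added here as comments):
from transformers import AutoConfig
config_base = AutoConfig.from_pretrained('bert-base-uncased')
config_large = AutoConfig.from_pretrained('bert-large-uncased')
print(config_base.hidden_size, config_base.num_hidden_layers)    # 768, 12
print(config_large.hidden_size, config_large.num_hidden_layers)  # 1024, 24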
3) Standardized pre-train & fine-tune workflow
examples)
BART
from transformers import BartModel
from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
# 1) tokenizer
kobart_tokenizer = get_kobart_tokenizer()
# 2) pre-trained model
model = BartModel.from_pretrained(get_pytorch_kobart_model())
Electra
from transformers import ElectraTokenizerFast, ElectraModel, TFElectraModel
# 1) tokenizer
electra_tokenizer = ElectraTokenizerFast.from_pretrained('kykim/electra-kor-base')
# 2) pre-trained model
## PyTorch
model_pt = ElectraModel.from_pretrained('kykim/electra-kor-base')
## TF
model_tf = TFElectraModel.from_pretrained('kykim/electra-kor-base')
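Whatever the architecture, usage follows the same standardized pattern: tokenize, then forward through the model. A minimal sketch using the electra_tokenizer and model_pt loaded above (the example sentence is arbitrary):
inputs = electra_tokenizer('예시 문장입니다.', return_tensors='pt')  # arbitrary example sentence
outputs = model_pt(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)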
4) Example: generating sentences with GPT-3
- 1) import the libraries (pre-trained model & tokenizer)
from transformers import BertTokenizerFast,TFGPT2LMHeadModel,GPT2LMHeadModel
- 2) load the tokenizer & model
tokenizer = BertTokenizerFast.from_pretrained('kykim/gpt3-kor-small-based-on-gpt2')
model = GPT2LMHeadModel.from_pretrained('kykim/gpt3-kor-small-based-on-gpt2', pad_token_id=0)
- 3) input & output
- xxx_text: the text
- xxx_ids: the tokens corresponding to that text
input_text = '인생이'  # prompt, roughly "life is ..."
input_ids = tokenizer.encode(input_text,return_tensors='pt')
input_ids = input_ids[:,1:]  # drop the 0th token ([CLS])
output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
'인생이 너무 행복하고 행복하다'
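By default generate() does greedy decoding with a short max_length, which is why the output above is short and repetitive. A hedged sketch of common decoding options (the parameter values are illustrative assumptions, not the lecture's settings):
output_ids = model.generate(
    input_ids,
    max_length=50,           # generate up to 50 tokens
    do_sample=True,          # sample instead of greedy decoding
    top_k=50,                # restrict sampling to the 50 most likely tokens
    top_p=0.95,              # nucleus sampling
    no_repeat_ngram_size=2,  # avoid repeating the same 2-gram
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))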