Reference: Do it! BERT와 GPT로 배우는 자연어처리
4. Document Classification
- Classify which polarity (positive / negative) a document carries
Steps
- 1) Tokenize the input sentence
- 2) Prepend a [CLS] token and append a [SEP] token
- 3) Feed the sequence to BERT and take pooler_output
    - a sentence-level vector ( = the [CLS] token embedding passed through one feed-forward layer ); see the sketch below
- 4) Attach an additional classifier module on top
    - size : (768, 2)
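A minimal sketch of steps 1)-4) using a plain Hugging Face BertModel plus an nn.Linear head. This is only an illustration (it assumes 'beomi/kcbert-base' is reachable and uses a bare linear layer), not the exact module the library attaches later.
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('beomi/kcbert-base')
bert = BertModel.from_pretrained('beomi/kcbert-base')
head = nn.Linear(768, 2)                      # (768, 2): hidden size -> 2 polarity classes (illustrative)

inputs = tokenizer(['너무 재미있는 영화였다'], return_tensors='pt')   # 1) tokenize; 2) [CLS]/[SEP] are added automatically
outputs = bert(**inputs)                      # 3) run BERT
sentence_vec = outputs.pooler_output          # [CLS] embedding after one feed-forward (dense + tanh) layer
logits = head(sentence_vec)                   # 4) shape (1, 2): scores for negative / positive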
Model update
- 1) the BERT layers at the front
- 2) the classifier module at the back
- Fine-tuning : train only 2), or train 1) + 2) together ( see the sketch below )
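A hedged sketch of the two options, assuming model is a transformers BertForSequenceClassification like the one loaded in step 7) below (BERT body under model.bert, head under model.classifier):
# Option A: train only 2), the classifier head; freeze the BERT body
for param in model.bert.parameters():
    param.requires_grad = False

# Option B: train 1) + 2) together (full fine-tuning); every parameter stays trainable, which is the default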
TRAIN
Steps
-———————————————————————
- 1) set up arguments ( dict )
- 2) fix the random seed & set up the logger
-———————————————————————
- 3) download the data
- 4) prepare the tokenizer
- 5) preprocess the data
- 6) prepare the data loaders ( train & val )
-———————————————————————
- 7) load the pre-trained model
- 8) define the task
- 9) define the trainer
- 10) train the model
-———————————————————————
1) set up arguments ( dict )
import torch
from ratsnlp.nlpbook.classification import ClassificationTrainArguments as CLSargs

pretrained_dir = 'beomi/kcbert-base'   # must be available on the Hugging Face Hub
downstream_dir = '/gdrive/My Drive/nlpbook/checkpoint-doccls'
data_name = 'nsmc'

args = CLSargs(
    pretrained_model_name = pretrained_dir,
    downstream_corpus_name = data_name,
    downstream_model_dir = downstream_dir,
    batch_size = 32 if torch.cuda.is_available() else 4,
    learning_rate = 5e-5,
    max_seq_length = 128,
    epochs = 3,
    tpu_cores = 0 if torch.cuda.is_available() else 8,   # fall back to 8 TPU cores only when no GPU is available
    seed = 7
)
2) fix the random seed & set up the logger
from ratsnlp import nlpbook
nlpbook.set_seed(args)
nlpbook.set_logger(args)
3) download the data
from Korpora import Korpora

Korpora.fetch(
    corpus_name = args.downstream_corpus_name,   # nsmc
    root_dir = args.downstream_corpus_root_dir,
    force_download = True
)
4) prepare the tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    args.pretrained_model_name,   # 'beomi/kcbert-base'
    do_lower_case = False
)
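A quick check of what the tokenizer produces (the exact WordPiece split depends on the kcbert vocabulary, so the output is only indicative):
print(tokenizer.tokenize('재미없는 영화'))       # WordPiece subword tokens
print(tokenizer('재미없는 영화')['input_ids'])   # integer ids with [CLS] ... [SEP] added automatically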
5) preprocess the data ( train & val )
from ratsnlp.nlpbook.classification import NsmcCorpus, ClassificationDataset

corpus = NsmcCorpus()
train_dataset = ClassificationDataset(
    args = args,
    corpus = corpus,
    tokenizer = tokenizer,
    mode = 'train'
)
val_dataset = ClassificationDataset(
    args = args,
    corpus = corpus,
    tokenizer = tokenizer,
    mode = 'test'   # NSMC has no separate validation split, so the test split is used for validation
)
Fields contained in train_dataset[0] ( see the sketch below )
- 1) input_ids : integer token indices
- 2) attention_mask : 1 for real tokens / 0 for padding
- 3) token_type_ids : segment ids
- 4) label : the target y value
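For example, the four fields of one sample can be printed like this (a sketch; the attribute names follow ratsnlp's ClassificationFeatures and may differ by version):
sample = train_dataset[0]
print(sample.input_ids[:10])        # integer indices, padded out to max_seq_length (128)
print(sample.attention_mask[:10])   # 1 = real token, 0 = padding
print(sample.token_type_ids[:10])   # segment ids (all 0 for a single sentence)
print(sample.label)                 # 0 = negative, 1 = positive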
6) prepare the data loaders ( train & val )
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

train_dataloader = DataLoader(
    train_dataset,
    batch_size = args.batch_size,
    sampler = RandomSampler(train_dataset, replacement = False),
    collate_fn = nlpbook.data_collator,
    drop_last = False,
    num_workers = args.cpu_workers
)
val_dataloader = DataLoader(
    val_dataset,
    batch_size = args.batch_size,
    sampler = SequentialSampler(val_dataset),
    collate_fn = nlpbook.data_collator,
    drop_last = False,
    num_workers = args.cpu_workers
)
7) load the pre-trained model
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig.from_pretrained(
    args.pretrained_model_name,
    num_labels = corpus.num_labels   # 2 for NSMC (negative / positive)
)
model = BertForSequenceClassification.from_pretrained(
    args.pretrained_model_name,
    config = config
)
8) define the task
from ratsnlp.nlpbook.classification import ClassificationTask
task = ClassificationTask(model, args)
9) define the trainer
trainer = nlpbook.get_trainer(args)
10) train the model
trainer.fit(
    task,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader
)
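After training finishes, a Lightning checkpoint should sit under downstream_model_dir; a minimal check (the exact file name pattern is set by the trainer and may differ):
import glob
print(glob.glob(args.downstream_model_dir + '/*.ckpt'))   # e.g. ['.../epoch=...ckpt']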
INFERENCE
Steps
-———————————————————————
- 1) set up arguments ( dict )
- 2) prepare the tokenizer
- 3) load the checkpoint
- 4) load the pre-trained model
- 5) inject the checkpoint ( + switch to eval mode )
- 6) inference function
- 7) web service
-———————————————————————
1) set up arguments ( dict )
import torch
from ratsnlp.nlpbook.classification import ClassificationDeployArguments as CLSargs

pretrained_dir = 'beomi/kcbert-base'   # must be available on the Hugging Face Hub
downstream_dir = '/gdrive/My Drive/nlpbook/checkpoint-doccls'
data_name = 'nsmc'

args = CLSargs(
    pretrained_model_name = pretrained_dir,
    downstream_model_dir = downstream_dir,
    max_seq_length = 128
)
2) prepare the tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    args.pretrained_model_name,
    do_lower_case = False
)
3) load the checkpoint
model_weight = torch.load(
    args.downstream_model_checkpoint_fpath,
    map_location = torch.device("cpu")
)
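The keys in the saved state_dict carry a "model." prefix from the Lightning task wrapper, which is why step 5) strips it; a quick look (a sketch):
print(list(model_weight['state_dict'].keys())[:3])   # e.g. ['model.bert.embeddings.word_embeddings.weight', ...]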
4) load the pre-trained model
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig.from_pretrained(
    args.pretrained_model_name,
    num_labels = model_weight['state_dict']['model.classifier.bias'].shape.numel()   # recover the number of labels from the saved classifier bias
)
model = BertForSequenceClassification(config)   # build the architecture only; weights are injected in the next step
5) inject the checkpoint ( + switch to eval mode )
model.load_state_dict(
    {k.replace("model.", ""): v for k, v in model_weight['state_dict'].items()}   # strip the Lightning task's "model." prefix
)
model.eval()   # evaluation mode: disables dropout
6) inference function
def inference_fn(sentence):
    inputs_dict = tokenizer(
        [sentence],
        max_length = args.max_seq_length,
        padding = 'max_length',
        truncation = True
    )
    inputs_tensor = {k: torch.tensor(v) for k, v in inputs_dict.items()}
    with torch.no_grad():
        outputs = model(**inputs_tensor)
        prob = outputs.logits.softmax(dim = 1)
        prob_POS = prob[0][1].item()
        prob_NEG = prob[0][0].item()
        result = "POS" if torch.argmax(prob) == 1 else "NEG"
    return {
        'sentence' : sentence,
        'prediction' : result,
        'positive_data' : f"긍정{round(prob_POS, 4)}",      # 긍정 = positive
        'negative_data' : f"부정{round(prob_NEG, 4)}",      # 부정 = negative
        'positive_width' : f"{round(prob_POS, 4) * 100}%",
        'negative_width' : f"{round(prob_NEG, 4) * 100}%"
    }
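Called on its own, the function can be sanity-checked before wiring it to the web app; the sentence below is an arbitrary example and the numbers will vary:
print(inference_fn('생각보다 훨씬 재미있었다'))
# {'sentence': ..., 'prediction': 'POS', 'positive_data': '긍정0.98...', ...}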
7) web service
from ratsnlp.nlpbook.classification import get_web_service_app
app = get_web_service_app(inference_fn)
app.run()