Reference: Do it! BERT와 GPT로 배우는 자연어처리

4. Document Classification

Document Classification

  • Classify which polarity a document carries (e.g., positive vs. negative)

  • Procedure

    • 1) tokenize the input sentence
    • 2) prepend a [CLS] token and append a [SEP] token
    • 3) feed the result to BERT and take pooler_output
      • a sentence-level vector ( = the [CLS] token embedding run through one FFNN )
    • 4) attach an additional classifier module on top
      • shape : (768, 2)
  • What gets updated

    • 1) the BERT layers at the front
    • 2) the classifier module at the back

    • Fine-tuning : train only 2), or train both 1) and 2) ( see the sketch below )
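
A minimal sketch of steps 1)-4) and the freeze-the-encoder fine-tuning option, assuming the Hugging Face transformers API and the same kcbert-base checkpoint used below ( the example sentence is arbitrary ):

import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('beomi/kcbert-base')
bert = BertModel.from_pretrained('beomi/kcbert-base')
classifier = nn.Linear(768, 2) # the extra classifier module, shape (768, 2)

# 1)-2) tokenize; the tokenizer adds [CLS]/[SEP] automatically
inputs = tokenizer('재미있는 영화였다', return_tensors = 'pt')

# 3) pooler_output : the [CLS] embedding run through one FFNN (with tanh)
pooled = bert(**inputs).pooler_output # shape (1, 768)

# 4) polarity logits from the classifier head
logits = classifier(pooled) # shape (1, 2)

# fine-tuning option "2) only" : freeze every BERT parameter
for p in bert.parameters():
    p.requires_grad = False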


TRAIN

Steps

-———————————————————————

  • 1) set arguments ( dict )
  • 2) fix the random seed & set up the logger

-———————————————————————

  • 3) download the data
  • 4) prepare the tokenizer
  • 5) preprocess the data
  • 6) prepare the data loaders ( train & val )

-———————————————————————

  • 7) load the pre-trained model
  • 8) define the task
  • 9) define the trainer
  • 10) train the model

-———————————————————————


1) Set arguments ( dict )

import torch
from ratsnlp.nlpbook.classification import ClassificationTrainArguments as CLSargs

pretrained_dir = 'beomi/kcbert-base' # must be available on the Hugging Face Hub
downstream_dir = '/gdrive/My Drive/nlpbook/checkpoint-doccls'
data_name = 'nsmc'

args = CLSargs(
    pretrained_model_name = pretrained_dir,
    downstream_corpus_name = data_name,
    downstream_model_dir = downstream_dir,
    batch_size = 32 if torch.cuda.is_available() else 4,
    learning_rate = 5e-5,
    max_seq_length = 128,
    epochs = 3,
    tpu_cores = 0 if torch.cuda.is_available() else 8, # fall back to 8 TPU cores when no GPU is available
    seed=7
)


2) Fix the random seed & set up the logger

from ratsnlp import nlpbook

nlpbook.set_seed(args)
nlpbook.set_logger(args)


3) Download the data

from Korpora import Korpora

Korpora.fetch(
    corpus_name = args.downstream_corpus_name, # nsmc
    root_dir = args.downstream_corpus_root_dir,
    force_download = True
)
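
After the fetch, NSMC arrives as two TSV files under root_dir. A quick check ( the exact layout is an assumption about Korpora's defaults ):

import os
print(os.listdir(os.path.join(args.downstream_corpus_root_dir, 'nsmc')))
# expected : ['ratings_train.txt', 'ratings_test.txt']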


4) Prepare the tokenizer

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    args.pretrained_model_name, # 'beomi/kcbert-base'
    do_lower_case = False
)
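
To see what the tokenizer produces, a small illustrative check ( the sentence is arbitrary ):

encoded = tokenizer('이 영화 재밌다')
print(encoded['input_ids']) # integer indices, starting with [CLS] and ending with [SEP]
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'])) # the corresponding WordPiece tokens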


5) Preprocess the data ( train & val )

from ratsnlp.nlpbook.classification import NsmcCorpus, ClassificationDataset

corpus = NsmcCorpus()
train_dataset = ClassificationDataset(
    args = args,
    corpus = corpus,
    tokenizer = tokenizer,
    mode = 'train'
)

val_dataset = ClassificationDataset(
    args = args,
    corpus = corpus,
    tokenizer = tokenizer,
    mode = 'test' # NSMC has no separate validation split, so its test split serves as validation
)

What train_dataset[0] contains ( inspected in the sketch below )

  • 1) input_ids : integer token indices
  • 2) attention_mask : real token (1) / padding (0)
  • 3) token_type_ids : segment IDs
  • 4) label : the target y
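
A quick look at those four fields ( a sketch; it assumes each dataset item exposes them as attributes, as ratsnlp's feature objects do ):

features = train_dataset[0]
print(features.input_ids[:10]) # integer indices, padded out to max_seq_length (128)
print(features.attention_mask[:10]) # 1 for real tokens, 0 for padding
print(features.token_type_ids[:10]) # segment IDs; all 0 for single-sentence input
print(features.label) # 0 = negative, 1 = positive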


6) Prepare the data loaders ( train & val )

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
train_dataloader = DataLoader(
    train_dataset,
    batch_size = args.batch_size,
    sampler = RandomSampler(train_dataset, replacement = False),
    collate_fn = nlpbook.data_collator,
    drop_last = False,
    num_workers = args.cpu_workers
)

val_dataloader = DataLoader(
    val_dataset,
    batch_size = args.batch_size,
    sampler = SequentialSampler(val_dataset),
    collate_fn = nlpbook.data_collator,
    drop_last = False,
    num_workers = args.cpu_workers
)
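
A sanity check on one batch ( a sketch; it assumes nlpbook.data_collator returns a dict of tensors keyed like the fields above, with labels batched under 'labels' ):

batch = next(iter(train_dataloader))
print(batch['input_ids'].shape) # (batch_size, max_seq_length), e.g. (32, 128)
print(batch['labels'].shape) # (batch_size,)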


7) Load the pre-trained model

from transformers import BertConfig, BertForSequenceClassification

config = BertConfig.from_pretrained(
    args.pretrained_model_name,
    num_labels = corpus.num_labels
)

model = BertForSequenceClassification.from_pretrained(
    args.pretrained_model_name,
    config = config
)
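
num_labels determines the size of the freshly initialized classification head, which is the part fine-tuning must learn from scratch:

print(model.classifier)
# Linear(in_features=768, out_features=2, bias=True) for NSMC's two labels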


8) Define the task

from ratsnlp.nlpbook.classification import ClassificationTask

task = ClassificationTask(model, args)


9) Define the trainer

trainer = nlpbook.get_trainer(args)


10) Train the model

trainer.fit(
    task,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader
)


INFERENCE

Steps

-———————————————————————

  • 1) set arguments ( dict )
  • 2) prepare the tokenizer

-———————————————————————

  • 3) load the checkpoint
  • 4) load the pre-trained model
  • 5) inject the checkpoint weights ( + switch to eval mode )

-———————————————————————

  • 6) inference function
  • 7) web service

-———————————————————————


1) Set arguments ( dict )

import torch
from ratsnlp.nlpbook.classification import ClassificationDeployArguments as CLSargs

pretrained_dir = 'beomi/kcbert-base' # must be available on the Hugging Face Hub
downstream_dir = '/gdrive/My Drive/nlpbook/checkpoint-doccls'
data_name = 'nsmc'

args = CLSargs(
    pretrained_model_name = pretrained_dir,
    downstream_model_dir = downstream_dir,
    max_seq_length = 128
)


2) Prepare the tokenizer

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    args.pretrained_model_name, # 'beomi/kcbert-base'
    do_lower_case = False
)


3) Load the checkpoint

model_weight = torch.load(
    args.downstream_model_checkpoint_fpath,
    map_location = torch.device("cpu") # load onto CPU so inference works without a GPU
)
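
What the loaded checkpoint looks like ( a sketch ): PyTorch Lightning saves the task's weights under 'state_dict', each key prefixed with the attribute name the task used for the model, hence the "model." prefix stripped in step 5:

print(list(model_weight['state_dict'].keys())[:3])
# e.g. ['model.bert.embeddings.word_embeddings.weight', ...]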


4) Load the pre-trained model

from transformers import BertConfig, BertForSequenceClassification

config = BertConfig.from_pretrained(
    args.pretrained_model_name,
    # recover num_labels (2) from the classifier bias stored in the checkpoint
    num_labels = model_weight['state_dict']['model.classifier.bias'].shape.numel()
)

model = BertForSequenceClassification(config) # architecture only; weights are injected in step 5


5) Inject the checkpoint weights ( + switch to eval mode )

model.load_state_dict(
    # strip the "model." prefix that the Lightning task added to every key
    {k.replace("model.", ""): v for k, v in model_weight['state_dict'].items()}
)

model.eval() # evaluation mode: disables dropout, etc.


6) Inference function

def inference_fn(sentence):
    inputs_dict = tokenizer(
        [sentence],
        max_length = args.max_seq_length,
        padding = 'max_length',
        truncation = True
    )

    inputs_tensor = {k: torch.tensor(v) for k, v in inputs_dict.items()}

    with torch.no_grad():
        outputs = model(**inputs_tensor)
        prob = outputs.logits.softmax(dim = 1)
        prob_POS = prob[0][1].item()
        prob_NEG = prob[0][0].item()
        result = "POS" if torch.argmax(prob) == 1 else "NEG"

    return {
        'sentence' : sentence,
        'prediction' : result,
        'positive_data' : f"positive {round(prob_POS, 4)}",
        'negative_data' : f"negative {round(prob_NEG, 4)}",
        'positive_width' : f"{round(prob_POS, 4) * 100}%",
        'negative_width' : f"{round(prob_NEG, 4) * 100}%",
    }
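
Example call ( the numbers are illustrative, not actual model output ):

print(inference_fn('재미있어요'))
# {'sentence': '재미있어요', 'prediction': 'POS',
#  'positive_data': 'positive 0.97', 'negative_data': 'negative 0.03',
#  'positive_width': '97.0%', 'negative_width': '3.0%'}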


7) Web service

from ratsnlp.nlpbook.classification import get_web_service_app

app = get_web_service_app(inference_fn)
app.run()
