CNN for NLP Hands-on Practice

( Reference : “딥러닝을 이용한 자연어 처리 입문” (Introduction to Natural Language Processing with Deep Learning), https://wikidocs.net/book/2155 )


As examples of applying CNNs to NLP, we will work through the following three exercises.

  • 1) Classifying IMDB reviews with a 1D CNN
  • 2) Classifying spam mail with a 1D CNN
  • 3) Classifying Naver movie reviews with a multi-kernel 1D CNN


1. Classifying IMDB reviews with a 1D CNN

  • 1) Import the required libraries
from tensorflow.keras import datasets
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model


  • 2) Load the data ( using only the top 10,000 words )
    • Pad every review to a maximum length of 200
vocab_size = 10000
(X_train, y_train), (X_test, y_test) = datasets.imdb.load_data(num_words = vocab_size)

max_len = 200
X_train = pad_sequences(X_train, maxlen = max_len)
X_test = pad_sequences(X_test, maxlen = max_len)
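
A quick sanity check on the result ( the IMDB dataset ships 25,000 reviews each for train and test ):

# After padding, every review is a length-200 sequence of integer word indices
print(X_train.shape, X_test.shape)  # (25000, 200) (25000, 200)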


  • 3) Set the embedding dimension ( = 256 )

    ( + batch size is set to 256 as well )

embedding_dim = 256
batch_size = 256


  • 4) Define the model
    • Embedding: 10,000 dimensions $\rightarrow$ 256 dimensions
    • a Dropout layer is used
    • ReLU is used as the activation function
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))  # 10,000-word vocab -> 256-dim vectors
model.add(Dropout(0.3))
model.add(Conv1D(256, 3, padding='valid', activation='relu'))  # 256 filters, kernel size 3
model.add(GlobalMaxPooling1D())  # max over the time axis -> fixed-size vector
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # binary output: positive / negative


  • 5) EarlyStopping & saving the best weights
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 3)
mc = ModelCheckpoint('best_model.h5', monitor = 'val_acc', mode = 'max', verbose = 1, save_best_only = True)


  • 6) Model Fitting
    • optimizer : Adam
model.compile(optimizer='adam', loss = 'binary_crossentropy', metrics = ['acc'])
history = model.fit(X_train, y_train, epochs = 20, batch_size = batch_size, validation_data = (X_test, y_test), callbacks=[es, mc])
loaded_model = load_model('best_model.h5')
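
To check how well the restored best checkpoint generalizes, it can be evaluated on the test set ( a minimal sketch using the standard Keras evaluate API; the exact number depends on the run ):

# metrics=['acc'] makes evaluate() return [loss, accuracy]
loss, acc = loaded_model.evaluate(X_test, y_test)
print("Test accuracy: {:.4f}".format(acc))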


2. Classifying spam mail with a 1D CNN

2-1. Preprocessing

  • 1) Import the required libraries
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


  • 2) Load the data and convert the targets to 0 and 1

    ( + remove duplicate mails )

urllib.request.urlretrieve("https://raw.githubusercontent.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/master/spam.csv", filename="spam.csv")
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1','v2']]
data['v1'] = data['v1'].replace(['ham','spam'],[0,1])
data.drop_duplicates(subset=['v2'], inplace=True) 


  • 3) We can see that the data is somewhat imbalanced

    ( 4,516 ham mails vs. 653 spam mails )

print(data.groupby('v1').size().reset_index(name='count'))
   v1  count
0   0   4516
1   1    653
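
The same imbalance expressed as a ratio, for a quick sanity check:

# Roughly 87% ham vs. 13% spam
print(data['v1'].value_counts(normalize = True))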


  • 4) Tokenize using the 1,000 most frequent words ( + integer encoding )
    • sequences : the integer-encoded result after tokenization
X_data = data['v2']
y_data = data['v1']

vocab_size = 1000
tokenizer = Tokenizer(num_words = vocab_size)
tokenizer.fit_on_texts(X_data)
sequences = tokenizer.texts_to_sequences(X_data) 
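
To see what the encoding produces, one raw mail can be compared with its encoded form ( note that words outside the top 1,000 are silently dropped by the Tokenizer, since no oov_token was given ):

# First mail: raw text vs. integer-encoded sequence
print(X_data[0])
print(sequences[0])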


  • 5) train : test ratio = 8 : 2
n_of_train = int(len(sequences) * 0.8)
n_of_test = int(len(sequences) - n_of_train)
X_data = sequences


  • 6) Apply padding ( based on the longest mail, 172 tokens! )
max_len = 172
data = pad_sequences(X_data, maxlen = max_len)


  • 7) Train & Test Split
    • Overall shape : 5,169 × 172
    • Train shape : 4,135 × 172
    • Test shape : 1,034 × 172
X_test = data[n_of_train:] 
y_test = np.array(y_data[n_of_train:])
X_train = data[:n_of_train] 
y_train = np.array(y_data[:n_of_train])
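
A quick check that the resulting shapes match the figures above:

print(X_train.shape, y_train.shape)  # (4135, 172) (4135,)
print(X_test.shape, y_test.shape)    # (1034, 172) (1034,)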


2-2. Modeling

( Most of the steps are similar to those in 1. Classifying IMDB reviews with a 1D CNN. )

from tensorflow.keras.layers import Dense, Conv1D, GlobalMaxPooling1D, Embedding, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint


model = Sequential()
model.add(Embedding(vocab_size, 32))
model.add(Dropout(0.2))
model.add(Conv1D(32, 5, strides=1, padding='valid', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])


es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 3)
mc = ModelCheckpoint('best_model.h5', monitor = 'val_acc', mode = 'max', verbose = 1, save_best_only = True)
history = model.fit(X_train, y_train, epochs = 10, batch_size=64, validation_split=0.2, callbacks=[es, mc])
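
Because validation_split=0.2 carves the validation set out of the training data, the test split from step 7 remains untouched and can serve as a final check ( a minimal sketch using the standard Keras evaluate API ):

# Accuracy on the 1,034 held-out test mails
print("\nTest accuracy: %.4f" % (model.evaluate(X_test, y_test)[1]))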


3. Classifying Naver movie reviews with a multi-kernel 1D CNN

3-1. Preprocessing

The preprocessing steps are omitted here; they are assumed to produce X_train, y_train, X_test, y_test together with tokenizer, okt, stopwords, vocab_size, and max_len, all of which are used below.
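
For reference, a minimal sketch of what that preprocessing typically looks like for the Naver movie review corpus; the file name ( ratings_train.txt from the NSMC dataset ), the stopword list, and max_len below are illustrative assumptions, not the exact values of the omitted step:

import pandas as pd
from konlpy.tag import Okt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

okt = Okt()
stopwords = ['의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로', '자', '에', '와', '한', '하다']  # illustrative stopword list

# Morpheme-tokenize each review (with stemming) and drop stopwords
train_data = pd.read_table('ratings_train.txt').dropna(subset = ['document'])
X_tokens = [[w for w in okt.morphs(doc, stem = True) if w not in stopwords]
            for doc in train_data['document']]

# Integer-encode and pad every review to the same length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_tokens)
vocab_size = len(tokenizer.word_index) + 1
max_len = 30  # illustrative fixed length
X_train = pad_sequences(tokenizer.texts_to_sequences(X_tokens), maxlen = max_len)
y_train = train_data['label'].values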


3-2. Modeling

  • 1) Import the required libraries
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense, Input, Flatten, Concatenate
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint


  • 2) Set the hyperparameters
embedding_dim = 128
dropout_prob = (0.5, 0.8)
num_filters = 128


  • 3) Build the model
model_input = Input(shape = (max_len,))
z = Embedding(vocab_size, embedding_dim, input_length = max_len, name="embedding")(model_input)
z = Dropout(dropout_prob[0])(z)


Three convolutional filter sizes will be used:

  • kernel sizes of 3, 4, and 5
  • stride = 1
  • activation function : ReLU
conv_blocks = []

for sz in [3, 4, 5]:
    conv = Conv1D(filters = num_filters,
                  kernel_size = sz,
                  padding = "valid",
                  activation = "relu",
                  strides = 1)(z)
    conv = GlobalMaxPooling1D()(conv)
    conv = Flatten()(conv)  # no-op here: GlobalMaxPooling1D already returns a flat vector
    conv_blocks.append(conv)


The outputs of the three branches are concatenated, and the result is passed through the final Dense layers.

z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
z = Dropout(dropout_prob[1])(z)
z = Dense(128, activation="relu")(z)
model_output = Dense(1, activation="sigmoid")(z)

model = Model(model_input, model_output)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])
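
Since the model is built with the functional API and an explicit Input layer, its structure can be inspected immediately; the summary should show the three parallel Conv1D branches merging into a single Concatenate layer:

model.summary()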


  • 4) Train the model
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
mc = ModelCheckpoint('CNN_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)

model.fit(X_train, y_train, batch_size = 64, epochs=10, validation_data = (X_test, y_test), verbose=2, callbacks=[es, mc])


3-3. Predicting a Review

**Input : a specific piece of text**

  • ex) “이거 너무 재밋다….담에또언제와서 봐야하려나~~”

Output : the probability that the review is positive/negative

  • ex) 94.8% probability of a positive review
def sentiment_predict(x):
  x = okt.morphs(x, stem=True) # 1) Tokenize into morphemes (with stemming)
  x = [word for word in x if word not in stopwords] # 2) Remove stopwords
  encoded = tokenizer.texts_to_sequences([x]) # 3) Integer encoding
  pad_new = pad_sequences(encoded, maxlen = max_len) # 4) Padding
  score = float(model.predict(pad_new)[0][0]) # 5) Predict (scalar from the (1, 1) output)
  if score > 0.5:
    print("{:.2f}% probability of a positive review.\n".format(score * 100))
  else:
    print("{:.2f}% probability of a negative review.\n".format((1 - score) * 100))