[ Recommender System ]

25. Improving RS Performance with Image Data

( Reference: Fastcampus recommender system lecture )


1. Data Overview

  • Data used: Amazon data
    • AMAZON_FASHION_5.json
    • All_Beauty_5.json
    • Luxury_Beauty_5.json
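
A minimal sketch of reading one of these files. It assumes the 5-core files store one JSON review object per line and that each review carries the `overall`, `reviewerID`, and `asin` fields used later in this post. The helper name `load_reviews` is hypothetical.

```python
import json

def load_reviews(path):
    """Read a JSON-lines review file into a list of dicts (assumed format)."""
    reviews = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                reviews.append(json.loads(line))
    return reviews
```

Calling `load_reviews('AMAZON_FASHION_5.json')` would then return one dict per review, ready to be turned into rows of a DataFrame.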

2. Pre-trained CNN

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image

Model used: ResNet18

model = models.resnet18(pretrained=True)
layer = model._modules.get('avgpool')
model.eval()

Image preprocessing

# transforms.Scale was removed from torchvision; Resize is its replacement
scaler = transforms.Resize((224, 224))
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
to_tensor = transforms.ToTensor()
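
To make the effect of these transforms concrete, the sketch below reproduces by hand what `ToTensor` followed by `Normalize` does to a single 8-bit RGB pixel: scale the values to [0, 1], then subtract the per-channel mean and divide by the per-channel standard deviation. The helper `normalize_pixel` is hypothetical and exists only for illustration.

```python
MEAN = [0.485, 0.456, 0.406]  # ImageNet channel means used above
STD = [0.229, 0.224, 0.225]   # ImageNet channel standard deviations used above

def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to the values the network would see."""
    scaled = [c / 255.0 for c in rgb]  # what ToTensor does to 8-bit images
    return [(x - m) / s for x, m, s in zip(scaled, MEAN, STD)]

print(normalize_pixel((255, 255, 255)))  # a white pixel
```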

Image \(\rightarrow\) feature vector

  • Given an image file name, the function returns the 512-dimensional feature vector that embeds that image
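def get_vector(image_name):
    # Force 3 channels so grayscale or RGBA images match the normalization stats
    img = Image.open(image_name).convert('RGB')
    # Add a batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
    t_img = normalize(to_tensor(scaler(img))).unsqueeze(0)

    # Buffer for the 512-dimensional output of the avgpool layer
    my_embedding = torch.zeros(512)

    def copy_data(m, i, o):
        # o has shape (1, 512, 1, 1); flatten it to (512,)
        my_embedding.copy_(o.data.reshape(o.data.size(1)))

    h = layer.register_forward_hook(copy_data)
    with torch.no_grad():  # inference only, no gradients needed
        model(t_img)
    h.remove()
    return my_embedding.cpu().detach().numpy()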
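
The 512 comes from the `avgpool` layer: for a 224x224 input, the last convolutional stage of ResNet18 produces 512 feature maps of size 7x7, and `avgpool` averages each map down to a single number. A small NumPy sketch of that step, using random values as a stand-in for real activations (only the shapes matter here):

```python
import numpy as np

# Stand-in for the last convolutional output of ResNet18 on one 224x224 image:
# (batch, channels, height, width) = (1, 512, 7, 7). Random values, for shape only.
rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(1, 512, 7, 7))

# Global average pooling: average every 7x7 map down to a single value.
pooled = feature_maps.mean(axis=(2, 3), keepdims=True)  # shape (1, 512, 1, 1)

# Same flattening as o.data.reshape(o.data.size(1)) in the hook.
embedding = pooled.reshape(pooled.shape[1])
print(embedding.shape)  # (512,)
```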

The final combined_df has 6 columns:

  • category / rating / customer ID / product ID / filename / image_vec
import pandas as pd

# The first five columns come from data_list; image_vec is computed from filename
combined_df = pd.DataFrame(data=data_list, columns=['category', 'overall', 'reviewerID', 'asin', 'filename'])
combined_df['image_vec'] = combined_df['filename'].apply(get_vector)

Blank entries in the image vector are filled with 0.

import numpy as np

def check_vector(vector):
  # Replace empty entries with 0.0 and keep at most 512 dimensions
  return np.array([0.0 if str(x) == '' else float(x) for x in vector])[:512]

df = combined_df.copy()
df['image_vec'] = df['image_vec'].apply(check_vector)

3. K-means Clustering

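from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

train_df, test_df = train_test_split(df, test_size=0.2, random_state=1234)
# Stack the per-row vectors into an (n_samples, 512) matrix
X_train = np.array([list(x) for x in train_df['image_vec'].values])
# 3 clusters, one per product category
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
kmeans.labels_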
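
Once fitted, k-means assigns a vector to the cluster whose centroid is closest in Euclidean distance. Below is a minimal sketch of that assignment step, with made-up 2-D centroids standing in for the learned 512-dimensional ones. `assign_cluster` is a hypothetical helper, not part of scikit-learn.

```python
import math

def assign_cluster(x, centroids):
    """Return the index of the centroid nearest to x (Euclidean distance)."""
    distances = [math.dist(x, c) for c in centroids]
    return distances.index(min(distances))

# Toy centroids for illustration only
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(assign_cluster((4.0, 6.0), centroids))  # 1
```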

4. Evaluation

test_df['prediction'] = test_df['image_vec'].apply(lambda x: kmeans.predict([x])[0])
test_df.head()

figure2

The results are ambiguous.

test_df.groupby('category').count()

figure2

test_df.groupby('prediction').count()

figure2

This is probably because the three categories contain very similar products.

( e.g., electronics vs. food vs. beauty products might have been separated much better. )
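
Because cluster ids are arbitrary, looking at the two `groupby` counts separately does not show how categories and clusters overlap. One way to check, sketched here with made-up labels rather than the actual data, is to count (category, cluster) pairs and compute purity: the fraction of items that fall in the majority category of their cluster. The function `cluster_purity` is hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(categories, clusters):
    """Fraction of items belonging to the majority category of their cluster."""
    per_cluster = defaultdict(Counter)
    for cat, clu in zip(categories, clusters):
        per_cluster[clu][cat] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(categories)

# Toy labels for illustration only
cats = ["fashion", "fashion", "beauty", "beauty", "luxury", "luxury"]
clus = [0, 0, 0, 1, 1, 2]
print(cluster_purity(cats, clus))
```

A purity near 1.0 would mean the image clusters line up with the product categories. With three categories, a value near 1/3 is the lowest possible and would mean the clusters carry little category information.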

5. KNN

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
y_train = train_df['overall'].values
neigh.fit(X_train, y_train)
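
Note that the rating (`overall`) is used as the class label here. What the classifier does at prediction time can be sketched without scikit-learn: find the k training vectors closest to the query and return the most common label among them. This is a simplified illustration with toy 2-D points, and its tie-breaking may differ from `KNeighborsClassifier`. `knn_predict` is a hypothetical helper.

```python
import math
from collections import Counter

def knn_predict(query, X, y, k=3):
    """Majority label among the k training points nearest to query."""
    order = sorted(range(len(X)), key=lambda i: math.dist(query, X[i]))
    votes = Counter(y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data for illustration only
X_toy = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
y_toy = [5.0, 5.0, 4.0, 1.0, 1.0]  # ratings play the role of class labels
print(knn_predict((0.2, 0.2), X_toy, y_toy))  # 5.0
```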

Prediction results:

test_df['prediction'] = test_df['image_vec'].apply(lambda x: neigh.predict([x])[0])
test_df.head()

figure2

Accuracy: about 63.5% (174/274)

test_df[test_df.overall == test_df.prediction].count()

figure2
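
The accuracy figure is the number of test rows whose predicted rating equals the true rating, divided by the total number of test rows. A small sketch of that calculation follows. `accuracy` is a hypothetical helper and the toy lists are not the real data.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

print(accuracy([5, 4, 5, 1], [5, 5, 5, 1]))  # 0.75
print(round(174 / 274 * 100, 1))             # 63.5, the figure reported above
```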
