[ Recommender System ]
25.Image data로 RS 성능 올리기
( 참고 : Fastcampus 추천시스템 강의 )
1. 데이터 소개
- 사용할 데이터 : Amazon data
- AMAZON_FASHION_5.json
- All_Beauty_5.json
- Luxury_Beauty_5.json
2. Pre-trained CNN
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image
사용할 모델 : Resnet18
model = models.resnet18(pretrained=True)
layer = model._modules.get('avgpool')
model.eval()
Image 전처리 과정
scaler = transforms.Scale((224, 224))
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
to_tensor = transforms.ToTensor()
Image \(\rightarrow\) feature vector로
- image명을 입력하면, 해당 image가 embedding된 512차원의 feature vector가 반환된다
def get_vector(image_name):
img = Image.open(image_name)
t_img = Variable(normalize(to_tensor(scaler(img))).unsqueeze(0))
my_embedding = torch.zeros(512)
def copy_data(m, i, o):
my_embedding.copy_(o.data.reshape(o.data.size(1)))
h = layer.register_forward_hook(copy_data)
model(t_img)
h.remove()
return my_embedding.cpu().detach().numpy()
최종적으로 생성된 combined_df : 5개의 column
- 카테고리 / 평점 / 고객 ID / 제품 ID / filename / image_vec
combined_df = pd.DataFrame(data=data_list,columns=['category', 'overall', 'reviewerID', 'asin', 'filename'])
combined_df['image_vec'] = combined_df['filename'].apply(lambda x: get_vector(x))
image vector에서 빈칸의 경우 0으로 채워줌
def check_vector(vector):
return np.array([0.0 if str(x) == '' else float(x) for x in vector])[:512]
df = combined_df.copy()
df['image_vec'] = df['image_vec'].apply(lambda x: check_vector(x))
3. K-means Clustering
train_df, test_df = train_test_split(df, test_size=0.2, random_state=1234)
X_train = np.array([list(x) for x in train_df['image_vec'].values])
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
kmeans.labels_
4. Evaluation
test_df['prediction'] = test_df['image_vec'].apply(lambda x: kmeans.predict([x])[0])
test_df.head()
애매한 결과..
test_df.groupby('category').count()
test_df.groupby('prediction').count()
아마 너무 비슷한 제품들 끼리 구분을 해서 그런 듯 하다.
( ex. 전자제품 vs 음식 vs 뷰티제품 이면 훨씬 잘 구분했을수도? )
5. KNN
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
y_train = train_df['overall'].values
neigh.fit(X_train, y_train)
예측 결과 :
test_df['prediction'] = test_df['image_vec'].apply(lambda x: neigh.predict([x])[0])
test_df.head()
Accuarcy : 약 63.5% (174/274)
test_df[test_df.overall == test_df.prediction].count()