OVR (with Embedded Football Dataset)
- with Football dataset
The Football dataset is a network of 115 nodes divided into 12 communities. Using the adjacency information between these nodes, the network was previously embedded so that the original connectivity is preserved in the embedding space; here those embedding vectors are used to classify each node into its community.
- embedded with LINE (first-order proximity / negative sampling)
( The Football dataset embedded with LINE will be classified using a OVR (One-Versus-Rest) classifier. )
1. Import Libraries & Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
- Of the Football dataset's 12 communities, the smallest contains only 5 nodes. Because of this imbalance, SMOTE (an oversampling technique) will also be tried for classification, and its performance compared against the model without it.
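For intuition only: SMOTE synthesizes new minority-class samples by interpolating between a real sample and one of its k nearest neighbors. A minimal sketch of that idea (the function name and arguments below are illustrative, not the imblearn implementation):

# Illustrative only: how SMOTE creates one synthetic minority-class sample.
# x_i is a minority sample, neighbors is an array of its k nearest minority neighbors.
def smote_like_sample(x_i, neighbors, rng):
    x_nn = neighbors[rng.integers(len(neighbors))]  # pick one neighbor at random
    lam = rng.random()                              # interpolation factor in [0, 1)
    return x_i + lam * (x_nn - x_i)                 # new point on the segment between x_i and x_nn

# e.g. smote_like_sample(x_minority[0], x_minority[1:], np.random.default_rng(42))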
from imblearn.over_sampling import SMOTE
- The previously embedded data was saved as a CSV file and is loaded back here.
ev = pd.read_csv('[Football]Embedded_with_FirstOrder.csv')
ev = ev.drop(ev.columns[0],axis=1)
ev.shape
(115, 11)
- The embedding vectors of the first five nodes are as follows:
ev.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.865126 | 0.732464 | 0.654681 | -0.280288 | -0.416516 | 0.779290 | 1.989182 | 0.944528 | 0.758910 | 0.924716 | 7 |
| 1 | -0.315168 | -1.665299 | -0.984810 | 1.077798 | 0.511267 | 0.939566 | 1.635527 | -0.366913 | -0.451699 | 1.780345 | 0 |
| 2 | -0.569846 | -0.199044 | 1.784970 | 0.186517 | 2.154936 | -0.550533 | -0.937430 | 0.107572 | 1.074133 | 0.326420 | 2 |
| 3 | 0.832763 | 0.221549 | -0.575225 | -0.686977 | -1.096524 | 0.453152 | -0.012188 | 0.983878 | 0.942373 | 0.570720 | 3 |
| 4 | 0.117742 | 0.502898 | 0.749028 | 0.396632 | -0.188808 | 0.286299 | 1.366271 | -0.073079 | -0.786144 | -0.255506 | 7 |
- Community 5 has only 5 nodes, which is the reason for trying SMOTE.
ev['Label'].value_counts()
6 13
9 12
3 12
2 11
11 10
8 10
4 10
0 9
7 8
1 8
10 7
5 5
Name: Label, dtype: int64
SMOTE
sm = SMOTE(random_state=42,k_neighbors=2)           # k_neighbors must be smaller than the smallest class size (5)
k = sm.fit_sample(ev.iloc[:,0:10],ev.iloc[:,10])    # fit_sample is the older imblearn API (fit_resample in newer versions)
ev2 = pd.DataFrame(k[0])
ev2['Label'] = k[1]
ev2 = ev2.sample(frac=1).reset_index(drop=True)     # shuffle the oversampled rows
- Every community now has 13 nodes (the size of the largest community).
ev2['Label'].value_counts()
11 13
10 13
9 13
8 13
7 13
6 13
5 13
4 13
3 13
2 13
1 13
0 13
Name: Label, dtype: int64
train & test split
- The data was split into 70% train and 30% test.
(1) SMOTE (X)
test_index1 = ev.groupby('Label').apply(lambda x: x.sample(frac=0.3)).index.levels[1]  # sample 30% of each community for the test set
train_index1 = set(np.arange(0,ev.shape[0])) - set(test_index1)                        # the remaining nodes form the train set
train1 = ev.loc[train_index1]
test1 = ev.loc[test_index1]
train_X1 = np.array(train1.iloc[:,0:10])
train_y1 = np.array(train1.iloc[:,10]).flatten()
test_X1 = np.array(test1.iloc[:,0:10])
test_y1 = np.array(test1.iloc[:,10]).flatten()
train_X1.shape, test_X1.shape, train_y1.shape, test_y1.shape
((80, 10), (35, 10), (80,), (35,))
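For reference, an equivalent stratified 70/30 split could also be done with scikit-learn's train_test_split (a sketch only; scikit-learn and the variable names below are not used elsewhere in this notebook):

from sklearn.model_selection import train_test_split

# stratify keeps the community proportions equal in train and test
train_X1_alt, test_X1_alt, train_y1_alt, test_y1_alt = train_test_split(
    ev.iloc[:, 0:10].values, ev['Label'].values,
    test_size=0.3, stratify=ev['Label'], random_state=42)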
(2) SMOTE (O)
test_index2 = ev2.groupby('Label').apply(lambda x: x.sample(frac=0.3)).index.levels[1]
train_index2 = set(np.arange(0,ev2.shape[0])) - set(test_index2)
train2 = ev2.loc[train_index2]
test2 = ev2.loc[test_index2]
train_X2 = np.array(train2.iloc[:,0:10])
train_y2 = np.array(train2.iloc[:,10]).flatten()
test_X2 = np.array(test2.iloc[:,0:10])
test_y2 = np.array(test2.iloc[:,10]).flatten()
train_X2.shape, test_X2.shape, train_y2.shape, test_y2.shape
((108, 10), (48, 10), (108,), (48,))
2. Define Functions
The following helper functions were written to implement OVR:
- 1) matrix multiplication
- 2) sigmoid
- 3) standard scaler
- 4) loss function
# 1) matrix multiplication (linear layer): x @ W + b
def mul(W,b,x):
    return np.dot(x,W)+b

# 2) sigmoid; the small 0.0001 offset is a numerical-stability tweak,
#    and [:,0] flattens the (n,1) result to shape (n,)
def sigmoid(x):
    k = 1 / (1 + np.exp(-x+0.0001))
    return k[:,0]

# 3) standard scaler (zero mean, unit variance); defined but not used below
def standard_scaler(x):
    mean = np.mean(x)
    std = np.std(x)
    return (x-mean)/std

# 4) binary cross-entropy loss; the 0.0001 terms avoid log(0)
def loss_func(y_hat,y):
    total_loss = np.mean(y*np.log(y_hat+0.0001) + (1-y)*np.log(1-y_hat+0.0001))
    return -total_loss
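In equation form, these helpers implement the standard sigmoid and the mean binary cross-entropy (the 0.0001 terms in the code are only small numerical-stability offsets):

$$
\sigma(z) = \frac{1}{1+e^{-z}}, \qquad
\mathcal{L}(\hat{y}, y) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \Big]
$$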
3. Train Model
(1) Logistic Regression
- OVR consists of several binary Logistic Regression classifiers, one per community.
# predict the probability of the positive class for the test samples
def predict(test_X,W,b):
    result = sigmoid(np.dot(test_X, W) + b)
    return result

# binary logistic regression trained with full-batch gradient descent
def logreg(x,y,epoch,lr):
    W = np.random.rand(x.shape[1],1)            # random weight initialization
    b = np.random.rand(1)
    for ep in range(epoch+1):
        Z = mul(W,b,x)                          # linear scores
        y_hat = sigmoid(Z)                      # predicted probabilities
        loss = loss_func(y_hat,y)
        dw = np.matmul(x.T,y_hat-y)/x.shape[0]  # weight gradient (averaged over samples)
        db = np.sum(y_hat-y)                    # bias gradient (summed, not averaged)
        W = W-lr*dw.reshape(-1,1)
        b = b-lr*db
        if ep>0 and ep % 10000 == 0:
            print('epoch :',ep,' loss :',loss)
    print('------------------------------------------ final loss :',loss,'---')
    return W,b
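The update rule implemented above is plain full-batch gradient descent on the cross-entropy loss, with learning rate $\eta$ (note that the bias gradient in the code uses a sum rather than a mean):

$$
W \leftarrow W - \eta\,\frac{1}{n}\,X^{\top}(\hat{y}-y), \qquad
b \leftarrow b - \eta \sum_{i=1}^{n}(\hat{y}_i - y_i)
$$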
OVR (One-Versus-Rest)
- Oversampling with SMOTE inside the OVR loop turned out to perform worse than not using it, so the model without SMOTE was chosen as the final model; the SMOTE variant is left commented out in the code below.
def OVR(train_x,train_y,test_x,test_y,epoch,lr):
    pred_result = []
    real_result = []
    for index in ev['Label'].unique():              # one binary classifier per community
        train_y2 = (train_y == index).astype(int)   # current community vs. the rest
        test_y2 = (test_y == index).astype(int)
        ''' oversampling with SMOTE in OVR
        sm = SMOTE(random_state=42,k_neighbors=3)
        smote_x,smote_y = sm.fit_sample(train_x,train_y2)
        ind = np.arange(smote_x.shape[0])
        np.random.shuffle(ind)
        smote_x,smote_y = smote_x[ind],smote_y[ind]
        W,b = logreg(smote_x,smote_y,epoch,lr)
        print('------------------------------------------ Classifier ',index,'done---')
        '''
        W,b = logreg(train_x,train_y2,epoch,lr)
        y_pred = predict(test_x,W,b)
        pred_result.append(y_pred)
        real_result.append(test_y2)
    # each node is assigned to the classifier with the highest predicted probability
    pred_OH = (pred_result == np.amax(pred_result,axis=0)).astype('int')
    act_OH = np.concatenate(real_result).ravel().reshape(ev.iloc[:,-1].nunique(),-1)
    return pred_OH,act_OH
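As a quick sanity check, the same one-vs-rest scheme could be reproduced with scikit-learn (a sketch under the assumption that scikit-learn is installed; it is not part of the original pipeline and its results are not shown here):

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# one binary logistic regression per community, as in the hand-written OVR above
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_clf.fit(train_X1, train_y1)
sk_pred = ovr_clf.predict(test_X1)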
- Confusion Matrix
# conf_mat[i][j] = number of test nodes whose actual community is i and predicted community is j
def confusion_matrix(actual,prediction):
    n = actual.shape[0]
    conf_mat = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            conf_mat[i][j] += len(np.intersect1d(np.nonzero(actual[i]),np.nonzero(prediction[j])))
    return conf_mat
4. Result
1. SMOTE (X)
prediction1,actual1 = OVR(train_X1,train_y1,test_X1,test_y1,20000,0.0025)
epoch : 10000 loss : 0.24007866286417195
epoch : 20000 loss : 0.23680118954105106
------------------------------------------ final loss : 0.23680118954105106 ---
epoch : 10000 loss : 0.1752521261784632
epoch : 20000 loss : 0.16416244141006825
------------------------------------------ final loss : 0.16416244141006825 ---
epoch : 10000 loss : 0.281384114717651
epoch : 20000 loss : 0.27806784014081265
------------------------------------------ final loss : 0.27806784014081265 ---
epoch : 10000 loss : 0.22297134652014933
epoch : 20000 loss : 0.21767287570497387
------------------------------------------ final loss : 0.21767287570497387 ---
epoch : 10000 loss : 0.22175818987175216
epoch : 20000 loss : 0.211198716837698
------------------------------------------ final loss : 0.211198716837698 ---
epoch : 10000 loss : 0.12114989191594872
epoch : 20000 loss : 0.11273809885920652
------------------------------------------ final loss : 0.11273809885920652 ---
epoch : 10000 loss : 0.2649164780890776
epoch : 20000 loss : 0.2519730950150898
------------------------------------------ final loss : 0.2519730950150898 ---
epoch : 10000 loss : 0.22101384088099474
epoch : 20000 loss : 0.2078649825932543
------------------------------------------ final loss : 0.2078649825932543 ---
epoch : 10000 loss : 0.1767867072945053
epoch : 20000 loss : 0.1425810221363643
------------------------------------------ final loss : 0.1425810221363643 ---
epoch : 10000 loss : 0.22525643967274248
epoch : 20000 loss : 0.2156514249701329
------------------------------------------ final loss : 0.2156514249701329 ---
epoch : 10000 loss : 0.11773905299477097
epoch : 20000 loss : 0.09568321094024893
------------------------------------------ final loss : 0.09568321094024893 ---
epoch : 10000 loss : 0.23128075596452985
epoch : 20000 loss : 0.22795853056835433
------------------------------------------ final loss : 0.22795853056835433 ---
confusion_without_smote = confusion_matrix(actual1, prediction1)
confusion_without_smote
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.],
[0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0.],
[0., 0., 1., 0., 0., 0., 2., 0., 0., 0., 0., 0.],
[0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 2., 1., 0., 0., 0.],
[0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
[0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 2., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.]])
2. SMOTE (O)
prediction2,actual2 = OVR(train_X2,train_y2,test_X2,test_y2,100000,0.0005)
epoch : 10000 loss : 0.2983699795044497
epoch : 20000 loss : 0.2611569455527149
epoch : 30000 loss : 0.2473366437738021
epoch : 40000 loss : 0.24164112733430088
epoch : 50000 loss : 0.23881470383005338
epoch : 60000 loss : 0.23719708232203154
epoch : 70000 loss : 0.23618992626970173
epoch : 80000 loss : 0.23553188933471786
epoch : 90000 loss : 0.23508881092170372
epoch : 100000 loss : 0.23478409834385966
------------------------------------------ final loss : 0.23478409834385966 ---
epoch : 10000 loss : 0.2400181946008694
epoch : 20000 loss : 0.18745882470144323
epoch : 30000 loss : 0.15880505308162365
epoch : 40000 loss : 0.14176660822950332
epoch : 50000 loss : 0.13120364338910293
epoch : 60000 loss : 0.12428912713895102
epoch : 70000 loss : 0.1194388832204144
epoch : 80000 loss : 0.1158095823084796
epoch : 90000 loss : 0.11295044593793865
epoch : 100000 loss : 0.11060850003748399
------------------------------------------ final loss : 0.11060850003748399 ---
epoch : 10000 loss : 0.2999803097390051
epoch : 20000 loss : 0.2639149114555777
epoch : 30000 loss : 0.2446725952739122
epoch : 40000 loss : 0.23379872998522944
epoch : 50000 loss : 0.22746777609393065
epoch : 60000 loss : 0.22369106581188636
epoch : 70000 loss : 0.2213713025379446
epoch : 80000 loss : 0.21989517760674207
epoch : 90000 loss : 0.21891950759605758
epoch : 100000 loss : 0.21825060262288792
------------------------------------------ final loss : 0.21825060262288792 ---
epoch : 10000 loss : 0.29455031245577284
epoch : 20000 loss : 0.26564345125146954
epoch : 30000 loss : 0.25095026384575914
epoch : 40000 loss : 0.24249639463859832
epoch : 50000 loss : 0.23716841034005542
epoch : 60000 loss : 0.23360593701021243
epoch : 70000 loss : 0.23113276614192327
epoch : 80000 loss : 0.22937279992261464
epoch : 90000 loss : 0.2280983187354759
epoch : 100000 loss : 0.2271630848001625
------------------------------------------ final loss : 0.2271630848001625 ---
epoch : 10000 loss : 0.3548198834773136
epoch : 20000 loss : 0.28671006912526875
epoch : 30000 loss : 0.249277348920404
epoch : 40000 loss : 0.228683775225805
epoch : 50000 loss : 0.21714141128012227
epoch : 60000 loss : 0.2104198099948162
epoch : 70000 loss : 0.20630635570095104
epoch : 80000 loss : 0.203656329801001
epoch : 90000 loss : 0.20186713234708797
epoch : 100000 loss : 0.20061007578940213
------------------------------------------ final loss : 0.20061007578940213 ---
epoch : 10000 loss : 0.13292111805959692
epoch : 20000 loss : 0.11483439477752716
epoch : 30000 loss : 0.10402357803414848
epoch : 40000 loss : 0.0969316477847122
epoch : 50000 loss : 0.09192692327305411
epoch : 60000 loss : 0.08818877266003355
epoch : 70000 loss : 0.08527045381252621
epoch : 80000 loss : 0.08291184267647564
epoch : 90000 loss : 0.08095258903322274
epoch : 100000 loss : 0.07928894046006787
------------------------------------------ final loss : 0.07928894046006787 ---
epoch : 10000 loss : 0.3481952743350057
epoch : 20000 loss : 0.2879064519008239
epoch : 30000 loss : 0.25728219983436634
epoch : 40000 loss : 0.24083588231068212
epoch : 50000 loss : 0.23140603463888046
epoch : 60000 loss : 0.22563980878392964
epoch : 70000 loss : 0.22190968000637687
epoch : 80000 loss : 0.21938406070985333
epoch : 90000 loss : 0.21761203854263705
epoch : 100000 loss : 0.21633418543777017
------------------------------------------ final loss : 0.21633418543777017 ---
epoch : 10000 loss : 0.21266317034523596
epoch : 20000 loss : 0.17930406028222048
epoch : 30000 loss : 0.16222107154326448
epoch : 40000 loss : 0.15248490782314048
epoch : 50000 loss : 0.14636718911043783
epoch : 60000 loss : 0.14219674031088794
epoch : 70000 loss : 0.13916115360760709
epoch : 80000 loss : 0.13683515684757225
epoch : 90000 loss : 0.1349811000740059
epoch : 100000 loss : 0.13345820228775343
------------------------------------------ final loss : 0.13345820228775343 ---
epoch : 10000 loss : 0.2834878544223111
epoch : 20000 loss : 0.24590509223846155
epoch : 30000 loss : 0.23321726562913037
epoch : 40000 loss : 0.22755168297689163
epoch : 50000 loss : 0.22456861010391957
epoch : 60000 loss : 0.22283217057783475
epoch : 70000 loss : 0.22175305410255083
epoch : 80000 loss : 0.22105140386565358
epoch : 90000 loss : 0.22058002097684232
epoch : 100000 loss : 0.2202555050137814
------------------------------------------ final loss : 0.2202555050137814 ---
epoch : 10000 loss : 0.38969972975605066
epoch : 20000 loss : 0.302637098419762
epoch : 30000 loss : 0.25944308523026194
epoch : 40000 loss : 0.2352142918299519
epoch : 50000 loss : 0.22007405152167986
epoch : 60000 loss : 0.20981538612630823
epoch : 70000 loss : 0.20241610880226202
epoch : 80000 loss : 0.19681442532338628
epoch : 90000 loss : 0.1924126924031136
epoch : 100000 loss : 0.18885366241819537
------------------------------------------ final loss : 0.18885366241819537 ---
epoch : 10000 loss : 0.2756251024544689
epoch : 20000 loss : 0.22718486773731691
epoch : 30000 loss : 0.20643472565578336
epoch : 40000 loss : 0.1960051361906093
epoch : 50000 loss : 0.1900684757832861
epoch : 60000 loss : 0.18636604290448683
epoch : 70000 loss : 0.1838893551023386
epoch : 80000 loss : 0.18213763841611183
epoch : 90000 loss : 0.18084148412847992
epoch : 100000 loss : 0.17984649603065841
------------------------------------------ final loss : 0.17984649603065841 ---
epoch : 10000 loss : 0.24759561724848939
epoch : 20000 loss : 0.21656257829847408
epoch : 30000 loss : 0.20394407013269233
epoch : 40000 loss : 0.19822450449413484
epoch : 50000 loss : 0.19510911766714892
epoch : 60000 loss : 0.19313480791405166
epoch : 70000 loss : 0.1917541088358366
epoch : 80000 loss : 0.19072929630772226
epoch : 90000 loss : 0.1899402062585381
epoch : 100000 loss : 0.18931780789395422
------------------------------------------ final loss : 0.18931780789395422 ---
confusion_with_smote = confusion_matrix(actual2, prediction2)
confusion_with_smote
array([[0., 0., 2., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
[0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0.],
[2., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 2., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 2.],
[0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0.],
[0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0.],
[0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 3., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 3.]])
5. Evaluation
def f1_scores(con,score):
    # score = 0 : micro / score = 1 : macro / score = 2 : weighted macro
    # (1) Micro F1 (for single-label multiclass this equals overall accuracy)
    if score==0:
        return np.diag(con).sum()/con.sum()
    rec,pre,f1 = [],[],[]
    for i in range(con.shape[0]):
        recall = con[i][i] / con[i].sum()       # row i = actual community i
        precision = con[i][i] / con[:,i].sum()  # column i = predicted community i
        f1_score = 2*recall*precision / (recall+precision)
        rec.append(recall)
        pre.append(precision)
        f1.append(f1_score)
    # (2) Macro F1 : unweighted mean of per-community F1
    if score==1:
        return np.average(f1)
    # (3) Weighted Macro F1 : weighted by the number of actual nodes per community
    elif score==2:
        w = [con[x].sum() for x in range(con.shape[0])]
        return np.average(f1,weights=w)
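A call would look like the following (illustrative only; the resulting scores are not reproduced here):

# micro / macro / weighted-macro F1 for the model without SMOTE
micro_f1    = f1_scores(confusion_without_smote, 0)
macro_f1    = f1_scores(confusion_without_smote, 1)
weighted_f1 = f1_scores(confusion_without_smote, 2)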
Conclusion
- Perhaps because there is simply too little data (115 nodes, with as few as 5 per community), the performance does not look very good: the confusion matrices above show that most test nodes are misclassified.