OVR (with Embedded Football Dataset)

  • with the Football dataset

The Football dataset is a network of 115 nodes divided into 12 groups (communities). Using the adjacency information between these nodes, the nodes were embedded into a low-dimensional space so that their original connection structure is preserved as well as possible; those embedding vectors are what will be classified here.

  • embedded with LINE (first-order proximity / negative sampling )

    ( The Football dataset embedded with LINE will be classified using an OVR (One-Versus-Rest) classifier. )
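  • ( As a brief recap of the embedding method, not part of this notebook's code: LINE's first-order proximity models the probability of an edge between nodes $v_i$ and $v_j$ directly from their embedding vectors, and the embeddings are trained so that this probability is high for connected pairs; negative sampling was used in the optimization, as noted above. )

$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^{\top}\vec{u}_j)}, \qquad O_1 = -\sum_{(i,j) \in E} w_{ij}\,\log p_1(v_i, v_j)$$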

1. Import Libraries & Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
  • Of the 12 groups (communities) in the Football dataset, the smallest one contains only 5 nodes. Because of this imbalance, SMOTE (one of the oversampling techniques) will be used for classification, and the result will be compared with classification done without it.
from imblearn.over_sampling import SMOTE
  • The previously embedded data was saved as a csv file and is loaded again here.
ev = pd.read_csv('[Football]Embedded_with_FirstOrder.csv')
ev = ev.drop(ev.columns[0],axis=1)
ev.shape
(115, 11)
  • The embedding vectors of the first 5 nodes look as follows.
ev.head()
0 1 2 3 4 5 6 7 8 9 Label
0 0.865126 0.732464 0.654681 -0.280288 -0.416516 0.779290 1.989182 0.944528 0.758910 0.924716 7
1 -0.315168 -1.665299 -0.984810 1.077798 0.511267 0.939566 1.635527 -0.366913 -0.451699 1.780345 0
2 -0.569846 -0.199044 1.784970 0.186517 2.154936 -0.550533 -0.937430 0.107572 1.074133 0.326420 2
3 0.832763 0.221549 -0.575225 -0.686977 -1.096524 0.453152 -0.012188 0.983878 0.942373 0.570720 3
4 0.117742 0.502898 0.749028 0.396632 -0.188808 0.286299 1.366271 -0.073079 -0.786144 -0.255506 7
  • Community 5 contains only 5 nodes; this is why SMOTE is used.
ev['Label'].value_counts()
6     13
9     12
3     12
2     11
11    10
8     10
4     10
0      9
7      8
1      8
10     7
5      5
Name: Label, dtype: int64

SMOTE

sm = SMOTE(random_state=42,k_neighbors=2)
# note: newer versions of imbalanced-learn rename fit_sample to fit_resample
k = sm.fit_sample(ev.iloc[:,0:10],ev.iloc[:,10])
ev2 = pd.DataFrame(k[0])
ev2['Label'] = k[1]
ev2 = ev2.sample(frac=1).reset_index(drop=True)   # shuffle the oversampled rows
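  • For intuition (a toy sketch, not the actual imbalanced-learn internals): SMOTE creates each synthetic sample by picking a minority-class point, one of its k nearest neighbours from the same class, and a random point on the line segment between them.
np.random.seed(42)
x_i  = np.array([0.0, 1.0])        # a minority-class sample
x_nn = np.array([1.0, 3.0])        # one of its k nearest same-class neighbours
lam  = np.random.rand()            # random interpolation factor in [0, 1)
x_new = x_i + lam * (x_nn - x_i)   # synthetic sample on the segment between x_i and x_nn
x_new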
  • Every community now has 13 nodes (the size of the largest community).
ev2['Label'].value_counts()
11    13
10    13
9     13
8     13
7     13
6     13
5     13
4     13
3     13
2     13
1     13
0     13
Name: Label, dtype: int64

train & test split

  • The data was split into train & test sets at a 70% : 30% ratio.

(1) SMOTE (X)

test_index1 = ev.groupby('Label').apply(lambda x: x.sample(frac=0.3)).index.levels[1]
train_index1 = set(np.arange(0,ev.shape[0])) - set(test_index1)
train1 = ev.loc[train_index1]
test1 = ev.loc[test_index1]
train_X1 = np.array(train1.iloc[:,0:10])
train_y1 = np.array(train1.iloc[:,10]).flatten()
test_X1 = np.array(test1.iloc[:,0:10])
test_y1 = np.array(test1.iloc[:,10]).flatten()
train_X1.shape, test_X1.shape, train_y1.shape, test_y1.shape
((80, 10), (35, 10), (80,), (35,))
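  • ( Side note, not used in the original run: an equivalent stratified 70% : 30% split can also be obtained with scikit-learn's train_test_split, where stratify keeps the per-community proportions in both sets. )
from sklearn.model_selection import train_test_split

tr_X, te_X, tr_y, te_y = train_test_split(
    ev.iloc[:,0:10].values, ev.iloc[:,10].values,
    test_size=0.3, stratify=ev.iloc[:,10].values, random_state=42)
tr_X.shape, te_X.shape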

(2) SMOTE (O)

test_index2 = ev2.groupby('Label').apply(lambda x: x.sample(frac=0.3)).index.levels[1]
train_index2 = set(np.arange(0,ev2.shape[0])) - set(test_index2)
train2 = ev2.loc[train_index2]
test2 = ev2.loc[test_index2]
train_X2 = np.array(train2.iloc[:,0:10])
train_y2 = np.array(train2.iloc[:,10]).flatten()
test_X2 = np.array(test2.iloc[:,0:10])
test_y2 = np.array(test2.iloc[:,10]).flatten()
train_X2.shape, test_X2.shape, train_y2.shape, test_y2.shape
((108, 10), (48, 10), (108,), (48,))

2. Define Functions

The following functions were written to implement OVR:

  • 1) matrix multiplication
  • 2) sigmoid
  • 3) standard scaler
  • 4) loss function
def mul(W,b,x):
    # linear transform: xW + b
    return np.dot(x,W)+b

def sigmoid(x):
    # expects x of shape (n, 1); returns a flat (n,) array
    k = 1 / (1 + np.exp(-x+0.0001))
    return k[:,0]

def standard_scaler(x):
    # standardize to zero mean / unit variance
    mean = np.mean(x)
    std = np.std(x)
    return (x-mean)/std

def loss_func(y_hat,y):
    # binary cross-entropy; the 0.0001 keeps the log away from zero
    total_loss = np.mean(y*np.log(y_hat+0.0001) + (1-y)*np.log(1-y_hat+0.0001))
    return -total_loss
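  • A quick sanity check of the helpers (illustrative values only): sigmoid expects a column vector and returns a flat array, and loss_func should be close to zero when the predictions match the labels.
z = np.array([[0.0],[2.0],[-2.0]])
sigmoid(z)                                             # roughly [0.5, 0.88, 0.12]
loss_func(np.array([0.99, 0.01]), np.array([1, 0]))    # roughly 0.01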

3. Train Model

(1) Logistic Regression

  • OVR is built from several binary Logistic Regression classifiers (one per community).
def predict(test_X,W,b):
    result = sigmoid(np.dot(test_X, W) + b)
    return result
def logreg(x,y,epoch,lr):
    # binary logistic regression trained with full-batch gradient descent
    W = np.random.rand(x.shape[1],1)
    b = np.random.rand(1)
    
    for ep in range(epoch+1):
        Z = mul(W,b,x)
        y_hat = sigmoid(Z)
        loss = loss_func(y_hat,y)
        dw = np.matmul(x.T,y_hat-y)/x.shape[0]   # gradient of the loss w.r.t. W
        db = np.sum(y_hat-y)                     # gradient w.r.t. b (summed, not averaged)
        
        W = W-lr*dw.reshape(-1,1)
        b = b-lr*db
        
        if ep>0 and ep % 10000 == 0:
            print('epoch :',ep,' loss :',loss)
    print('------------------------------------------ final loss :',loss,'---')   
    return W,b
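  • Before running the full OVR loop, a single binary classifier can be trained in isolation as a sanity check (a hypothetical example, not part of the original run), e.g. "community 7 vs. the rest":
y_bin = (train_y1 == 7).astype(int)          # one-vs-rest target for community 7
W7, b7 = logreg(train_X1, y_bin, 20000, 0.0025)
probs = predict(test_X1, W7, b7)             # predicted probability of belonging to community 7
((probs > 0.5).astype(int) == (test_y1 == 7).astype(int)).mean()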

OVR (One-Versus-Rest)

  • Since the run without SMOTE sampling turned out to perform better than the run with it, the version without SMOTE was chosen as the final model.
def OVR(train_x,train_y,test_x,test_y,epoch,lr):
    pred_result = []
    real_result = []
    for index in ev['Label'].unique():
        # build binary one-vs-rest targets for the current community
        train_y2 = (train_y == index).astype(int)
        test_y2 = (test_y == index).astype(int)
        
        
        ''' oversampling with SMOTE in OVR
        
        sm = SMOTE(random_state=42,k_neighbors=3)
        smote_x,smote_y = sm.fit_sample(train_x,train_y2)
        
        ind = np.arange(smote_x.shape[0])
        np.random.shuffle(ind)
        
        smote_x,smote_y = smote_x[ind],smote_y[ind]
        
        W,b = logreg(smote_x,smote_y,epoch,lr)
        print('------------------------------------------ Classifier ',index,'done---')
        
        '''
        W,b = logreg(train_x,train_y2,epoch,lr)
        y_pred = predict(test_x,W,b)
        pred_result.append(y_pred)
        real_result.append(test_y2)
    # one-hot prediction: for each test node, mark the classifier with the highest score
    pred_OH = (pred_result == np.amax(pred_result,axis=0)).astype('int')
    # stack the true one-vs-rest labels into the same (n_classes, n_test) shape
    act_OH = np.concatenate(real_result).ravel().reshape(ev.iloc[:,-1].nunique(),-1)
    return pred_OH,act_OH
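  • To see what the pred_OH step above does, here is a toy example (made-up scores): each row holds one binary classifier's scores for the test nodes, and the column-wise maximum marks the predicted community of each node.
scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.7, 0.2],
                   [0.1, 0.4, 0.8]])
(scores == np.amax(scores, axis=0)).astype('int')
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])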
  • Confusion Matrix
def confusion_matrix(actual,prediction):
    # rows = actual class, columns = predicted class,
    # both in the order of ev['Label'].unique() used inside OVR
    n = actual.shape[0]
    conf_mat = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            conf_mat[i][j] += len(np.intersect1d(np.nonzero(actual[i]),np.nonzero(prediction[j])))
    return conf_mat
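  • A tiny check of confusion_matrix with made-up one-hot inputs (2 classes, 3 samples): only the first sample is predicted correctly, so only one match lands on the diagonal.
act  = np.array([[1, 0, 1],
                 [0, 1, 0]])
pred = np.array([[1, 1, 0],
                 [0, 0, 1]])
confusion_matrix(act, pred)
array([[1., 1.],
       [1., 0.]])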

4. Result

(1) SMOTE (X)

prediction1,actual1 = OVR(train_X1,train_y1,test_X1,test_y1,20000,0.0025)
epoch : 10000  loss : 0.24007866286417195
epoch : 20000  loss : 0.23680118954105106
------------------------------------------ final loss : 0.23680118954105106 ---
epoch : 10000  loss : 0.1752521261784632
epoch : 20000  loss : 0.16416244141006825
------------------------------------------ final loss : 0.16416244141006825 ---
epoch : 10000  loss : 0.281384114717651
epoch : 20000  loss : 0.27806784014081265
------------------------------------------ final loss : 0.27806784014081265 ---
epoch : 10000  loss : 0.22297134652014933
epoch : 20000  loss : 0.21767287570497387
------------------------------------------ final loss : 0.21767287570497387 ---
epoch : 10000  loss : 0.22175818987175216
epoch : 20000  loss : 0.211198716837698
------------------------------------------ final loss : 0.211198716837698 ---
epoch : 10000  loss : 0.12114989191594872
epoch : 20000  loss : 0.11273809885920652
------------------------------------------ final loss : 0.11273809885920652 ---
epoch : 10000  loss : 0.2649164780890776
epoch : 20000  loss : 0.2519730950150898
------------------------------------------ final loss : 0.2519730950150898 ---
epoch : 10000  loss : 0.22101384088099474
epoch : 20000  loss : 0.2078649825932543
------------------------------------------ final loss : 0.2078649825932543 ---
epoch : 10000  loss : 0.1767867072945053
epoch : 20000  loss : 0.1425810221363643
------------------------------------------ final loss : 0.1425810221363643 ---
epoch : 10000  loss : 0.22525643967274248
epoch : 20000  loss : 0.2156514249701329
------------------------------------------ final loss : 0.2156514249701329 ---
epoch : 10000  loss : 0.11773905299477097
epoch : 20000  loss : 0.09568321094024893
------------------------------------------ final loss : 0.09568321094024893 ---
epoch : 10000  loss : 0.23128075596452985
epoch : 20000  loss : 0.22795853056835433
------------------------------------------ final loss : 0.22795853056835433 ---
confusion_without_smote = confusion_matrix(actual1, prediction1)
confusion_without_smote
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 0., 0., 2., 0., 0., 0., 0., 0.],
       [0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 2., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 2., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.]])

(2) SMOTE (O)

prediction2,actual2 = OVR(train_X2,train_y2,test_X2,test_y2,100000,0.0005)
epoch : 10000  loss : 0.2983699795044497
epoch : 20000  loss : 0.2611569455527149
epoch : 30000  loss : 0.2473366437738021
epoch : 40000  loss : 0.24164112733430088
epoch : 50000  loss : 0.23881470383005338
epoch : 60000  loss : 0.23719708232203154
epoch : 70000  loss : 0.23618992626970173
epoch : 80000  loss : 0.23553188933471786
epoch : 90000  loss : 0.23508881092170372
epoch : 100000  loss : 0.23478409834385966
------------------------------------------ final loss : 0.23478409834385966 ---
epoch : 10000  loss : 0.2400181946008694
epoch : 20000  loss : 0.18745882470144323
epoch : 30000  loss : 0.15880505308162365
epoch : 40000  loss : 0.14176660822950332
epoch : 50000  loss : 0.13120364338910293
epoch : 60000  loss : 0.12428912713895102
epoch : 70000  loss : 0.1194388832204144
epoch : 80000  loss : 0.1158095823084796
epoch : 90000  loss : 0.11295044593793865
epoch : 100000  loss : 0.11060850003748399
------------------------------------------ final loss : 0.11060850003748399 ---
epoch : 10000  loss : 0.2999803097390051
epoch : 20000  loss : 0.2639149114555777
epoch : 30000  loss : 0.2446725952739122
epoch : 40000  loss : 0.23379872998522944
epoch : 50000  loss : 0.22746777609393065
epoch : 60000  loss : 0.22369106581188636
epoch : 70000  loss : 0.2213713025379446
epoch : 80000  loss : 0.21989517760674207
epoch : 90000  loss : 0.21891950759605758
epoch : 100000  loss : 0.21825060262288792
------------------------------------------ final loss : 0.21825060262288792 ---
epoch : 10000  loss : 0.29455031245577284
epoch : 20000  loss : 0.26564345125146954
epoch : 30000  loss : 0.25095026384575914
epoch : 40000  loss : 0.24249639463859832
epoch : 50000  loss : 0.23716841034005542
epoch : 60000  loss : 0.23360593701021243
epoch : 70000  loss : 0.23113276614192327
epoch : 80000  loss : 0.22937279992261464
epoch : 90000  loss : 0.2280983187354759
epoch : 100000  loss : 0.2271630848001625
------------------------------------------ final loss : 0.2271630848001625 ---
epoch : 10000  loss : 0.3548198834773136
epoch : 20000  loss : 0.28671006912526875
epoch : 30000  loss : 0.249277348920404
epoch : 40000  loss : 0.228683775225805
epoch : 50000  loss : 0.21714141128012227
epoch : 60000  loss : 0.2104198099948162
epoch : 70000  loss : 0.20630635570095104
epoch : 80000  loss : 0.203656329801001
epoch : 90000  loss : 0.20186713234708797
epoch : 100000  loss : 0.20061007578940213
------------------------------------------ final loss : 0.20061007578940213 ---
epoch : 10000  loss : 0.13292111805959692
epoch : 20000  loss : 0.11483439477752716
epoch : 30000  loss : 0.10402357803414848
epoch : 40000  loss : 0.0969316477847122
epoch : 50000  loss : 0.09192692327305411
epoch : 60000  loss : 0.08818877266003355
epoch : 70000  loss : 0.08527045381252621
epoch : 80000  loss : 0.08291184267647564
epoch : 90000  loss : 0.08095258903322274
epoch : 100000  loss : 0.07928894046006787
------------------------------------------ final loss : 0.07928894046006787 ---
epoch : 10000  loss : 0.3481952743350057
epoch : 20000  loss : 0.2879064519008239
epoch : 30000  loss : 0.25728219983436634
epoch : 40000  loss : 0.24083588231068212
epoch : 50000  loss : 0.23140603463888046
epoch : 60000  loss : 0.22563980878392964
epoch : 70000  loss : 0.22190968000637687
epoch : 80000  loss : 0.21938406070985333
epoch : 90000  loss : 0.21761203854263705
epoch : 100000  loss : 0.21633418543777017
------------------------------------------ final loss : 0.21633418543777017 ---
epoch : 10000  loss : 0.21266317034523596
epoch : 20000  loss : 0.17930406028222048
epoch : 30000  loss : 0.16222107154326448
epoch : 40000  loss : 0.15248490782314048
epoch : 50000  loss : 0.14636718911043783
epoch : 60000  loss : 0.14219674031088794
epoch : 70000  loss : 0.13916115360760709
epoch : 80000  loss : 0.13683515684757225
epoch : 90000  loss : 0.1349811000740059
epoch : 100000  loss : 0.13345820228775343
------------------------------------------ final loss : 0.13345820228775343 ---
epoch : 10000  loss : 0.2834878544223111
epoch : 20000  loss : 0.24590509223846155
epoch : 30000  loss : 0.23321726562913037
epoch : 40000  loss : 0.22755168297689163
epoch : 50000  loss : 0.22456861010391957
epoch : 60000  loss : 0.22283217057783475
epoch : 70000  loss : 0.22175305410255083
epoch : 80000  loss : 0.22105140386565358
epoch : 90000  loss : 0.22058002097684232
epoch : 100000  loss : 0.2202555050137814
------------------------------------------ final loss : 0.2202555050137814 ---
epoch : 10000  loss : 0.38969972975605066
epoch : 20000  loss : 0.302637098419762
epoch : 30000  loss : 0.25944308523026194
epoch : 40000  loss : 0.2352142918299519
epoch : 50000  loss : 0.22007405152167986
epoch : 60000  loss : 0.20981538612630823
epoch : 70000  loss : 0.20241610880226202
epoch : 80000  loss : 0.19681442532338628
epoch : 90000  loss : 0.1924126924031136
epoch : 100000  loss : 0.18885366241819537
------------------------------------------ final loss : 0.18885366241819537 ---
epoch : 10000  loss : 0.2756251024544689
epoch : 20000  loss : 0.22718486773731691
epoch : 30000  loss : 0.20643472565578336
epoch : 40000  loss : 0.1960051361906093
epoch : 50000  loss : 0.1900684757832861
epoch : 60000  loss : 0.18636604290448683
epoch : 70000  loss : 0.1838893551023386
epoch : 80000  loss : 0.18213763841611183
epoch : 90000  loss : 0.18084148412847992
epoch : 100000  loss : 0.17984649603065841
------------------------------------------ final loss : 0.17984649603065841 ---
epoch : 10000  loss : 0.24759561724848939
epoch : 20000  loss : 0.21656257829847408
epoch : 30000  loss : 0.20394407013269233
epoch : 40000  loss : 0.19822450449413484
epoch : 50000  loss : 0.19510911766714892
epoch : 60000  loss : 0.19313480791405166
epoch : 70000  loss : 0.1917541088358366
epoch : 80000  loss : 0.19072929630772226
epoch : 90000  loss : 0.1899402062585381
epoch : 100000  loss : 0.18931780789395422
------------------------------------------ final loss : 0.18931780789395422 ---
confusion_with_smote = confusion_matrix(actual2, prediction2)
confusion_with_smote
array([[0., 0., 2., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0.],
       [2., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 2., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 2.],
       [0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 3., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 3.]])

5. Evaluation

def f1_scores(con,score): 
    # score = 0 : micro / score = 1 : macro / score = 2 : weighted macro
    
    # (1) Micro F1
    if score==0: 
        return np.diag(con).sum()/con.sum()
    rec,pre,f1 = [],[],[]
    
    for i in range(con.shape[0]):
        recall = con[i][i] / con[i].sum()
        precision = con[i][i] / con[:,i].sum()
        f1_score = 2*recall*precision / (recall+precision)
        rec.append(recall)
        pre.append(precision)
        f1.append(f1_score)
    
    # (2) Macro F1
    if score==1:
        return np.average(f1)
    
    # (3) Weighted Macro F1
    elif score==2:
        w = [con[x].sum() for x in range(con.shape[0])]
        return np.average(f1,weights=w)
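  • As a usage example (the resulting numbers are not reproduced here), the micro F1 of both models can be computed from their confusion matrices; micro F1 here is essentially accuracy, since each test node is assigned to the classifier with the highest score. Note that the macro variants of this implementation can return nan when a class has no correct predictions at all.
print('SMOTE (X) micro F1 :', f1_scores(confusion_without_smote, 0))
print('SMOTE (O) micro F1 :', f1_scores(confusion_with_smote, 0))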

Conclusion

  • Perhaps because there were too few data points, the performance does not seem to have turned out very well.