Logistic Regression (Embedded Karate Dataset Classification)

  • train ratio : 0.1, 0.3, 0.5, 0.7
  • metric : accuracy, precision, recall, F1-score
  • dataset : Karate

I took the Karate dataset embedded into two dimensions with DeepWalk, built a logistic regression model from scratch, and used it for classification.

1. Import Libraries & Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
ev = pd.read_csv('embedded_vector.csv')
data = ev[['X','Y','Color']]
data.columns = ['x1','x2','class']
data = data.sample(frac=1)  # shuffle rows; index labels 0-33 are kept and reused by train_test_split below

Plotted in two dimensions, the embedded dataset looks like this:

plt.scatter(data['x1'], data['x2'], c=data['class'])
plt.show()

[figure: scatter plot of the embedded nodes, colored by class]

data['class'].value_counts()
1    17
0    17
Name: class, dtype: int64



2. Define Functions

  • 1) train_test_split
  • 2) matrix multiplication
  • 3) sigmoid
  • 4) standard scaler
  • 5) loss function
def train_test_split(data, test_ratio):
    # standardize the two feature columns in place (idempotent, so repeated calls are safe)
    data.iloc[:, [0, 1]] = standard_scaler(data.iloc[:, [0, 1]])
    # sample index labels for the test set; relies on the labels being 0..len(data)-1
    test_index = np.random.choice(len(data), int(len(data)*test_ratio), replace=False)
    train = data[~data.index.isin(test_index)]
    test = data[data.index.isin(test_index)]
    
    train_X = np.array(train)[:, [0, 1]]
    train_y = np.array(train)[:, [2]].flatten()
    test_X = np.array(test)[:, [0, 1]]
    test_y = np.array(test)[:, [2]].flatten()
    return train_X, train_y, test_X, test_y
def mul(W, b, x):
    # linear part of the model: xW + b, shape (n, 1)
    return np.dot(x, W) + b

def sigmoid(x):
    # expects a column vector (n, 1); returns a flat (n,) array
    k = 1 / (1 + np.exp(-x))
    return k[:, 0]
def standard_scaler(x):
    # column-wise standardization to zero mean and unit variance
    mean = np.mean(x)
    std = np.std(x)
    return (x-mean)/std

def loss_func(y_hat, y):
    # binary cross-entropy: the negative mean log-likelihood
    total_loss = np.mean(y*np.log(y_hat) + (1-y)*np.log(1-y_hat))
    return -total_loss
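One caveat: if the sigmoid saturates, y_hat can reach exactly 0 or 1 and np.log returns -inf. A clipped variant avoids this; a minimal sketch (the name loss_func_stable and the epsilon value are my additions, not from the original notebook):

def loss_func_stable(y_hat, y, eps=1e-12):
    # clip predictions away from exact 0 and 1 so np.log never produces -inf
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y*np.log(y_hat) + (1-y)*np.log(1-y_hat))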



3. Train Model

Logistic Regression
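The training loop below is plain gradient descent on the BCE loss; the gradients it computes are the standard ones (with m training samples):

$$
\hat{y} = \sigma(XW + b), \qquad
\frac{\partial L}{\partial W} = \frac{1}{m}\, X^{\top}(\hat{y} - y), \qquad
\frac{\partial L}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)
$$

Note that the code sums the bias gradient instead of averaging it, which simply acts as a larger effective learning rate for b.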

def logreg(x, y, epoch, lr):
    # random initialization: W is (n_features, 1), b is a scalar
    W = np.random.rand(x.shape[1], 1)
    b = np.random.rand(1)
    
    for ep in range(epoch+1):   # +1 so the final epoch also gets logged
        Z = mul(W, b, x)
        y_hat = sigmoid(Z)
        loss = loss_func(y_hat, y)
        dw = np.matmul(x.T, y_hat-y)/x.shape[0]   # averaged gradient w.r.t. W
        db = np.sum(y_hat-y)                      # summed, not averaged (see note above)
        
        W = W - lr*dw.reshape(-1, 1)
        b = b - lr*db
        
        if ep % 10000 == 0:
            print('epoch :', ep, ' loss :', loss)
            
    return W, b



4. Prediction

TOO SMALL dataset! unstable model

  • With only 34 samples, the classifier cannot be very strong, and the metrics vary a lot from split to split (the repeated-split sketch at the end of this post shows one way to average out that variance).
train_X_10, train_y_10, test_X_10, test_y_10 = train_test_split(data,0.9)  # train 10% / test 90%
train_X_30, train_y_30, test_X_30, test_y_30 = train_test_split(data,0.7)  # train 30% / test 70%
train_X_50, train_y_50, test_X_50, test_y_50 = train_test_split(data,0.5)  # train 50% / test 50%
train_X_70, train_y_70, test_X_70, test_y_70 = train_test_split(data,0.3)  # train 70% / test 30%

4 cases

  • case 1) train 10%

  • case 2) train 30%

  • case 3) train 50%

  • case 4) train 70%

1) Weight & Bias

W_10,b_10 = logreg(train_X_10,train_y_10,40000,0.001)
epoch : 0  loss : 0.6593308365741114
epoch : 10000  loss : 0.16243315108234851
epoch : 20000  loss : 0.08608784835902628
epoch : 30000  loss : 0.057835378260770044
epoch : 40000  loss : 0.043373910298135046
W_30,b_30 = logreg(train_X_30,train_y_30,40000,0.001)
epoch : 0  loss : 0.595785050592402
epoch : 10000  loss : 0.5277547221980337
epoch : 20000  loss : 0.5264002682692209
epoch : 30000  loss : 0.5260746776716059
epoch : 40000  loss : 0.5259887552029092
W_50,b_50 = logreg(train_X_50,train_y_50,40000,0.001)
epoch : 0  loss : 0.48855425918653145
epoch : 10000  loss : 0.29181192527506544
epoch : 20000  loss : 0.2689832066669257
epoch : 30000  loss : 0.25930977959826224
epoch : 40000  loss : 0.2540498366830823
W_70,b_70 = logreg(train_X_70,train_y_70,40000,0.001)
epoch : 0  loss : 0.6410069002890594
epoch : 10000  loss : 0.5650531338346519
epoch : 20000  loss : 0.5616843341978129
epoch : 30000  loss : 0.5614450045147054
epoch : 40000  loss : 0.5614244991790818


2) Prediction Result

def predict(test_X, W, b):
    # threshold the sigmoid output at 0.5
    preds = []
    for i in sigmoid(np.dot(test_X, W) + b):
        if i > 0.5:
            preds.append(1)
        else:
            preds.append(0)
    return np.array(preds)
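The loop can also be collapsed into a single vectorized comparison; an equivalent sketch (predict_vectorized is my naming, not the original code):

def predict_vectorized(test_X, W, b):
    # (sigmoid > 0.5) is exactly the decision rule of the loop above
    return (sigmoid(np.dot(test_X, W) + b) > 0.5).astype(int)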
y_pred_10 = predict(test_X_10, W_10,b_10)
y_pred_30 = predict(test_X_30, W_30,b_30)
y_pred_50 = predict(test_X_50, W_50,b_50)
y_pred_70 = predict(test_X_70, W_70,b_70)
  • e.g. with the 50:50 split, these are the predictions on the 50% test data made by the model trained on the other 50%:
y_pred_50
array([1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0])

3) Metric

  • Computing accuracy, precision, recall, and F1-score gives the following.
def Metrics(pred, actual):
    TP, TN, FP, FN = 0, 0, 0, 0
    for i in range(len(pred)):
        if pred[i]*actual[i] == 1:    # predicted 1, actual 1
            TP += 1
        elif pred[i] > actual[i]:     # predicted 1, actual 0
            FP += 1
        elif pred[i] < actual[i]:     # predicted 0, actual 1
            FN += 1
        else:                         # predicted 0, actual 0
            TN += 1
    
    accuracy = (TP+TN) / (TP+TN+FP+FN)
    precision = TP / (TP+FP)   # raises ZeroDivisionError if the model never predicts 1
    recall = TP / (TP+FN)
    F1_score = 2*(precision*recall)/(precision+recall)
    return accuracy, precision, recall, F1_score
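As a sanity check, the hand-rolled metrics can be compared with scikit-learn's implementations; a minimal sketch (scikit-learn is an extra dependency, not used anywhere else in this post):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# should match Metrics(y_pred_50, test_y_50) up to rounding
print(accuracy_score(test_y_50, y_pred_50),
      precision_score(test_y_50, y_pred_50),
      recall_score(test_y_50, y_pred_50),
      f1_score(test_y_50, y_pred_50))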
  • training dataset 10%
print('Training Dataset 10%')
acc, pre, rec, f1 = Metrics(y_pred_10,test_y_10)
print('accuracy :', np.round(acc,3))
print('precision :', np.round(pre,3))
print('recall :', np.round(rec,3))
print('f1-score :', np.round(f1,3))
Training Dataset 10%
accuracy : 0.667
precision : 0.692
recall : 0.6
f1-score : 0.643
  • training dataset 30%
print('Training Dataset 30%')
acc, pre, rec, f1 = Metrics(y_pred_30,test_y_30)
print('accuracy :', np.round(acc,3))
print('precision :', np.round(pre,3))
print('recall :', np.round(rec,3))
print('f1-score :', np.round(f1,3))
Training Dataset 30%
accuracy : 0.478
precision : 0.421
recall : 0.889
f1-score : 0.571
  • training dataset 50%
print('Training Dataset 50%')
acc, pre, rec, f1 = Metrics(y_pred_50,test_y_50)
print('accuracy :', np.round(acc,3))
print('precision :', np.round(pre,3))
print('recall :', np.round(rec,3))
print('f1-score :', np.round(f1,3))
Training Dataset 50%
accuracy : 0.647
precision : 0.7
recall : 0.7
f1-score : 0.7
  • training dataset 70%
print('Training Dataset 70%')
acc, pre, rec, f1 = Metrics(y_pred_70,test_y_70)
print('accuracy :', np.round(acc,3))
print('precision :', np.round(pre,3))
print('recall :', np.round(rec,3))
print('f1-score :', np.round(f1,3))
Training Dataset 70%
accuracy : 0.8
precision : 0.75
recall : 0.75
f1-score : 0.75
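
All of the numbers above come from a single random split of just 34 points, so they are noisy. One way to get steadier estimates is to average accuracy over many repeated splits; a minimal sketch using the functions defined above (repeated_eval and n_repeats=30 are my additions, not part of the original experiment):

def repeated_eval(data, test_ratio, n_repeats=30, epoch=40000, lr=0.001):
    # average test accuracy over repeated random splits to smooth out
    # the split-to-split variance of a 34-node dataset
    accs = []
    for _ in range(n_repeats):
        train_X, train_y, test_X, test_y = train_test_split(data, test_ratio)
        W, b = logreg(train_X, train_y, epoch, lr)   # note: logreg prints its loss log each run
        accs.append(np.mean(predict(test_X, W, b) == test_y))
    return np.mean(accs), np.std(accs)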