[ 32. Actor-Critic 실습2 ]

( 아래 내용은 SKplanet Tacademy의 강화학습 강좌를 참고하여 학습한 내용입니다 )

1. Introduction

이전 포스트에서 우리는 위 그림을 확인했었다. Actor Critic은, Value-Based에서 사용하는 Value Function도 사용하고, Policy-Based에서 사용하는 Policy를 둘 다 사용한다. 최근에 많이 사용되는 RL 방법들도 모두 이 Actor Critic에 기반을 둔 것이다.

1 ) Actor

어떤 행동을 해야할지 결정해주는 역할
Policy가 이 역할을 한다

2 ) Critic

Actor가 내린 결정을 평가
Value Function이 이 역할을 한다

2. $\bigtriangledown_{\theta} U(\theta)$ of Actor Critic

우리는 이전에 $U(\theta)$의 기울기를 구했었다. 이것에 Value-Function을 사용하기 위해, 우리는 다음과 같이 $Q_{\pi_{\theta}}(s,a)$를 사용하여 다음과 같이 나타낼 수 있다.

$\begin{align*} \bigtriangledown_{\theta}U(\theta) &= E_{\tau}[\sum_{t=0}^{\tau}\bigtriangledown_{\theta}log\pi_{\theta}(a_t \mid s_t) G_t]\\ &= E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s,a)]\\ \end{align*}$,

그리고 위 식의 $Q_{\pi_{\theta}}(s,a)$를 우리는 잘 알지 못하기 때문에, 다음과 같이 근사할 것이다. (모델은 Neural Net으로 사용할 것이다)

$Q_{\pi_{\theta}}(s,a) \approx Q_{w}(s,a)$.

따라서, $\bigtriangledown_{\theta} U(\theta)$는 다음과 같다.

$\bigtriangledown_{\theta} U(\theta) \approx E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) Q_{w}(s,a)]$.

3. A2C ( Advantage Actor Critic)

How to reduce variance?

결론부터 이야기하자면, 우리는 다음과 같은 식을 통해 위 식의 분산을 줄일 수 있다.

$\begin{align*} \bigtriangledown_{\theta}U(\theta) &= E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s,a)]\\ &= E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) (Q_{\pi_{\theta}}(s,a)-V_{\pi_{\theta}}(s))]\\ &= E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) A_{\pi_{\theta}}(s,a)] \end{align*}$,

여기서 $A_{\pi_{\theta}}(s,a)$ 는 $Q_{\pi_{\theta}}(s,a)-V_{\pi_{\theta}}(s)$ 를 의미하는 것으로, Advantage를 의미한다.

“특정 정책 하에서, 어떠한 state에서 어떠한 action을 했을 때 얻게되는 value” ( $Q_{\pi_{\theta}}(s,a)$ ) 에서,

“그 state에서의 value” ( $V_{\pi_{\theta}}(s)$ )를 빼는 것으로, 그 행동을 했을 때의 “이득”(Advantage)라고 볼 수 있다.

근데, 어떻게 마음대로 $Q_{\pi_{\theta}}(s,a)$ 대신 $(Q_{\pi_{\theta}}(s,a)-V_{\pi_{\theta}}(s))$를 사용할 수 있을까?

Proof

state s에 관한 임의의 함수 $B(s)$에 대해서, 다음은 항상 성립한다!

$\begin{align*} E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(s,a) B(s)] &= \sum_{s \in S}d^{\pi_{\theta}}(s) \sum_{a}\pi_{\theta}(s,a) \bigtriangledown_{\theta} log \pi_{\theta}(s,a)B(s)\\ &= \sum_{s \in S}d^{\pi_{\theta}}(s) \sum_{a}\bigtriangledown_{\theta} \pi_{\theta}(s,a)B(s)\\ &= \sum_{s \in S}d^{\pi_{\theta}}(s) B(s)\sum_{a}\bigtriangledown_{\theta} \pi_{\theta}(s,a)\\ &= 0 \end{align*}$.

(위 식에서 $d^{\pi_{\theta}}(s)$ 는, 특정 정책 $\pi_{\theta}$ 하에서, 해당 state $s$에서 머물게 되는 (상대적) 점유율/시간이라고 보면 된다. 총 합은 1이된다 )

따라서 우리는 state에 관한 임의의 함수인 $V_{\pi_{\theta}}(s)$를 뺴줘도 되고, 따라서 $\bigtriangledown_{\theta}U(\theta)$를 다음과 같이 나타내도 되는 것이다!

$\bigtriangledown_{\theta}U(\theta) = E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) A_{\pi_{\theta}}(s,a)]$.

따라서 위의 gradient를 이용하여 학습하는 방법을 우리는 “Advantage Actor-Critic”, 혹은 A2C라고 부른다.

4. TD Actor-Critic

A2C를 훈련시킨다고 해보자. $\bigtriangledown_{\theta}U(\theta) = E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) A_{\pi_{\theta}}(s,a)]$를 보면 알겠지만, 우리는 총 3종류의 parameter를 train해야 한다.

1 ) Q function
2 ) Value Function
3 ) Policy

하지만 만약 우리가 TD-error를 사용한다면, 우리는 Q-function을 굳이 훈련시키지 않고 Value-Function ( $V(s)$ )만 훈련시키면 된다. 어떻게 그것이 가능한지 수식을 통해 확인해보자.

( 복습 )

TD-error : $\delta_{\pi_{\theta}} = ( r + \gamma\;V_{\pi_{\theta}}(s')) - V_{\pi_{\theta}}(s)$

TD-error $\delta_{\pi_{\theta}}$ 는 $A_{\pi_{\theta}}(s,a)$에 대한 unbiased estimate이기 때문에, 우리는 이를 이용하여 Policy Gradient를 계산할 수 있다.

$\begin{align*} E_{\pi_{\theta}}[\delta_{\pi_{\theta}} \mid s,a] &= E_{\pi_{\theta}}[r + \gamma\;V_{\pi_{\theta}}(s') - V_{\pi_{\theta}}(s)] \\ &= Q_{\pi_{\theta}}(s,a) -V_{\pi_{\theta}}(s))\\ &= A_{\pi_{\theta}}(s,a) \end{align*}$.

따라서, 다음과 같은 approximate TD-error를 사용할 경우, 우리는 value function $V(s)$만 학습하면 된다!

$\delta_{V} = r + \gamma V_{V}(s') - V_V(s)$.

정리하면, TD Actor-Critic에서 사용하는 $\bigtriangledown_{\theta}U(\theta)$ 는 다음과 같다.

$\bigtriangledown_{\theta}U(\theta) = E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) \delta]$.

5. Summary

지난 두 포스트에서 우리는 Policy Gradient에 대해서 배웠다. ( Policy Gradient의 다양한 형태인 REINFORCE, A2C, TD Actor-Critic 등 )

각각의 $\bigtriangledown_{\theta}U(\theta)$를 정리하면 다음과 같다.

1 ) REINFORCE : $E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) v_t]$
2 ) Q Actor-Critic : $E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) Q^{w}(s,a)]$
3 ) Advantage Actor-Critic : $E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) A^{w}(s,a)]$
4 ) TD Actor-Critic : $E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) \delta]$
5 ) REINFORCE : $E_{\pi_{\theta}}[\bigtriangledown_{\theta}log\pi_{\theta}(a \mid s) v_t]$

6. Implementation with Pytorch

( 아래 코드는 SKplanet Tacademy의 강화학습 강좌를 참고로 하여 작성된 코드입니다. )

1) Import Libraries

import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

2) Set Hyperparameters

lr = 0.0002
gamma = 0.98

3) Actor Critic

def main():
    env = gy.make('CartPole-v1')
    model = ActorCritic() # 이 부분이 Policy-Based와 다른 점이다
    n_rollout = 5 # 5번의 step마다 update 진행
    print_int = 20
    score = 0
    
    for episode in range(2000): 
        done = False
        state = env.reset()
        while not done:
            for t in range(n_rollout):
                prob = model.pi(torch.from_numpy(s).float()) # into probability
                actions = Categorical(prob).sample() # random sample ( prob에 따라 )
                state_, returns, done, info = env.step(actions.item())
                model.put_data((state,actions,returns,state_,done))
                state = state_ # 다음 state로 넘어감
                score += returns # return을 누적해서 더함

                if done:
                    break
                
            model.train()    
    
        if episode%print_int==0 & episode!=0:
            print('episode : {}, score : {}'.format(episode, score/print_int))
            score= 0
    env.close()
    
    

class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        self.data = []
        self.fc_common = nn.Linear(4,128) 
        self.fc_pi = nn.Linear(128,2)
        self.fc_v = nn.Linear(128,1) # hidden layer의 unit 개수 : 128개
        self.opt = optim.Adam(self.parameters(), lr=lr) # Adam Optimizer 사용
    
    ## REINFORCE와는 다르게, 훈련시켜야하는 network가 2개(pi & v)다
    def pi(sef,x,dim):
        x = F.relu(self.fc_common(x))
        x = self.fc_pi(x)
        pi = F.softmax(x,dim=dim) # 각 행동에 대한 probability 반환
        return pi
    
    def v(self,x):
        x = F.relu(self.fc_common(x))
        v = self.fc_v(x)        
        return v
    
    def put_data(self,item):
        self.data.append(item)
    
    def batch(self):
        S,A,R,S_,Done = [],[],[],[],[]
        
        for item in self.data:
            s,a,r,s_,done = item
            S.append(s)
            A.append([a])
            R.append([r/100.0])
            S_.append(s_)
            if done:
                d = 0
            else :
                d = 1
            D.append([d])
        
        s_batch = torch.tensor(S, dtype=torch.float)
        a_batch = torch.tensor(A, dtype=torch.float),
        r_batch = torch.tensor(R, dtype=torch.float),
        s2_batch = torch.tensor(S_, dtype=torch.float),
        d_batch = torch.tensor(D, dtype=torch.float),
        self.data= []
        
        return s_batch,a_batch,r_batch,s2_batch,d_batch
    
    def train(self):
        s,a,r,s_,done = self.batch()
        TD_error = (r+gamma*self.v(s_)*done) - self.v(s)
        
        pi = self.pi(s,dim=1)
        pi_a = pi.gather(1,a)
        loss = - torch.log(pi_a)*TD_error.detach() + F.smooth_l1_loss(self.v(s), TD_error.detach()) # detach : gradient 계산 안되는 상수 취급위해!
        
        self.opt.zero_grad()
        loss.mean().backward()
        self.opt.step()
       

Twitter Facebook LinkedIn

32.Actor-Critic 실습2

Seunghan Lee