( 참고 : Fastcampus 강의 )

[ 25. Value Function Approximation ]

Introduction
Function Approximation
Value-function approximation
1. State-Value Function
2. Action-Value Function
Q-learning with function approximation

1. Introduction

Planning(Dynamic) 혹은 Learning(MC, TD)을 통해 value-function을 추정해왔는데, 이때 우리는 아래 그림과 같은 tabular method를 활용하여 value-function을 구해왔었다

( * tabular method : grid 형식처럼, 각각의 state가 하나의 grid가 되고, action은 상/하/좌/우 중 하나로 움직여서 인접한 grid로 state가 넘어가는 단순한 형태 )

( https://cs.stanford.edu/people/karpathy/img/mdpdp.jpeg )

하지만 state & action의 경우가 훨씬 많아지고 복잡해지면, 이를 적용하기 쉽지 않다.

또한, 이를 일반화(generalization)할 수 있는 식을 찾기 위해 등장한 것이 “value-function approximation”이다.

2. Function Approximation

모든 값을 하나하나 저장하지 않고, 함수를 modeling해서 해당 함수(의 parameter)만을 저장하자!

value function : \(\hat{v}(s,w) \approx v_{\pi}(s)\)
action-value function : \(\hat{q}(s,a,w) \approx q_{\pi}(s,a)\)

위 식에서 parameter(혹은 weight) \(w\)를 잘 찾는 것(update)이 핵심이다. ( via NN )

https://encrypted-tbn0.gstatic.com/images

이를 통해, 아래의 3가지 function을 근사한다.

1 ) value function ( input : \(s\) )
2 ) action-value function ( input : \(s\) & \(a\) )
3 ) action-value function ( input : \(s\) )

장점

실제 존재하지 않는 data도 function을 통해 계산 가능
실제 data의 noise를 배제하고 학습 가능
고차원의 data도 효율적으로 저장 가능

Gradient Descent

주로 사용하는 모델은 NN로써, NN은 Gradient Descent(경사하강법)을 이용하여 적절한 parameter값을 찾아나간다. (기초적인 내용 생략) 이를 RL의 value-function에 적용하자면, update하는 대상은 true-value function이 되고, minimize해야 하는 objective function은 다음 식과 같이 true-value function과 approximated-value function의 MSE로 잡을 수 있다.

\(J(w) = E_{\pi}[\{v_{\pi}(s) - v(s,w)\}^2]\).

3. Value-function Approximation

(1) State-Value-Function

(a) Loss Function

\(J(w) =\mathrm{E}_{\pi}\left[\left\{v_{\pi}(s)-\hat{v}(s, \boldsymbol{w})\right\}^{2}\right]\).

(b) gradient

\(\Delta w=-\frac{1}{2} \alpha \nabla_{w} J(w) = \alpha \mathrm{E}_{\pi}\left[\left(v_{\pi}(s)-\hat{v}(s, w)\right) \nabla_{w} \hat{v}(s, w)\right]\).

MC : \(\Delta w=\alpha \mathrm{E}_{\pi}\left[\left(G_{t}-\hat{v}(s, w)\right) \nabla_{w}, \hat{v}(s, w)\right]\)
TD(0) : \(\Delta w=\alpha \mathrm{E}_{\pi}\left[\left(R_{t+1}+\gamma \hat{v}\left(S_{t+1}, w\right)-\hat{v}(s, \boldsymbol{w})\right) \nabla_{w} \hat{v}(s, w)\right]\)

(2) Action-value function

(a) Loss Function

\(J(w)=\mathrm{E}_{\pi}\left[\left\{q_{\pi}(S, A)-\hat{q}(S, A, \boldsymbol{w})\right\}^{2}\right]\).

(b) gradient

\(\Delta w=-\frac{1}{2} \alpha \nabla_{\text {wr }} J(w)=\alpha \mathrm{E}_{\pi}\left[\left(q_{\pi}(S, A)-\hat{q}(S, A, \boldsymbol{w})\right) \nabla_{w} \hat{q}(S, A, \boldsymbol{w})\right]\).

MC : \(\Delta w=\alpha \mathrm{E}_{\pi}\left[\left(G_{\mathrm{t}}-\hat{q}\left(S_{t}, A_{t}, \boldsymbol{w}\right)\right) \nabla_{w} \hat{q}\left(S_{t}, A_{t}, \boldsymbol{w}\right)\right]\)
TD(0) : \(\Delta w=\alpha \mathrm{E}_{\pi}\left[\left(R_{\mathrm{t}+1}+\gamma \tilde{q}\left(S_{t+1}, A_{t+1}, w\right)-\bar{q}\left(S_{t}, A_{\mathrm{t}}, \boldsymbol{w}\right)\right) \nabla_{w} \hat{q}\left(S_{t}, A_{t}, \boldsymbol{w}\right)\right]\)
TD(\(\lambda\)) :
- Forward : \(\Delta w= \alpha \mathrm{E}_{\pi}\left[\left(q_{\mathrm{t}}^{\lambda}-q\left(S_{\mathrm{t}}, A_{\mathrm{t}}, \boldsymbol{w}\right)\right) \nabla_{w} q\left(S_{\mathrm{t}}, A_{\mathrm{t}}, \boldsymbol{w}\right)\right]\).
- Backward : \(\Delta w =\alpha \mathrm{E}_{\pi}\left[\delta_{t} E_{\mathrm{t}}\right]\)
  - \(\delta_{t} =R_{\mathrm{t}+1}+\gamma \hat{q}\left(S_{t+1}, A_{\mathrm{t}+1}, w\right)-\hat{q}\left(S_{t}, A_{t} \boldsymbol{w}\right)\).
  - \(E_{\mathrm{t}} =\gamma \lambda E_{\mathrm{t-1}}+\nabla_{w} \hat{\mathcal{q}}\left(S_{t}, A_{t}, \boldsymbol{w}\right)\).

25.Value Function Approximation

Seunghan Lee

[ 25. Value Function Approximation ]

Contents

1. Introduction

2. Function Approximation

장점

Gradient Descent

3. Value-function Approximation

(1) State-Value-Function

(2) Action-value function

SGD

4. Q-learning with function approximation

(1) Function Approximation (X)

(2) Function Approximation (O)

You May Also Enjoy