( 참고 : Fastcampus 강의 )

[ 15. Time Difference Learning (1) ]

Introduction
Time Difference Learning
N-step TD
Experiment

1. Introduction

Time Difference Learning도 Monte Carlo Learning과 함께 model-free하다

( 즉, MDP transition과 reward에 대한 정보가 없는 상황에서도 풀 수 있는 방법이다 ).

TD Learning은 MC Learning과 비교했을 때 무엇이 다른지 한번 확인해보자!

MC vs TD(0)

차이점 : value function이 update되는 빈도

MC : 매 episode마다
- \(V(S_t) \leftarrow V(S_t) + \alpha(G_t-V(S_t))\).
- 하지만 episode가 길면 길수록 update가 늦게 이루어지게 된다
  
  \(\rightarrow\) 이를 보다 online update하기 위한 것이 TD
- UNBIASED, high variance
TD(0) : epsiode내의 매 step마다
- idea : \(G_t \approx R_{t+1} + \gamma \; V(S_{t+1})\)
- \(V(S_t) \leftarrow V(S_t) + \alpha(R_{t+1}+\gamma\;V(S_{t+1})-V(S_t))\).
- biased, LOW VARIANCE
  
  ( 고려하는 time step의 수가 적다 보니, 당연히 variance는 줄어들게 된다 )

2. Time Difference Learning

용어 정리

TD target : \(R_{t+1} + \gamma \; V(S_{t+1})\) ……………….. biased
TD true target : \(R_{t+1} + \gamma \; v_{\pi}(S_{t+1})\) ……….. unbiased
TD error : \(\delta_t = ( R_{t+1} + \gamma \;V(S_{t+1}) ) - V(S_t)\)

TD의 전체 process를 요약하면 다음과 같다.

( 매 step마다 업데이트가 이루어짐을 확인할 수 있다. )

3. N-step TD

TD(0) vs TD(N)

TD(0) : 1-step 마다 update
TD(N) : N-step 마다 update

\(\begin{aligned} n=1 & \text { (TD) } & G_{t}^{(1)}=R_{t+1}+\gamma V\left(S_{t+1}\right) \\ n=2 & & G_{t}^{(2)}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} V\left(S_{t+2}\right) \\ & \vdots & & \\ n=\infty &(M C) & G_{t}^{(\infty)}=R_{t+1}+\gamma R_{t+2}+\ldots+\gamma^{T-1} R_{T} \end{aligned}\).

Summary

\(G_{t}^{(n)}=R_{t+1}+\gamma R_{t+2}+\ldots+\gamma^{n-1} R_{t+n}+\gamma^{n} V\left(S_{t+n}\right)\).
Updating Equation :

\(V\left(S_{t}\right) \leftarrow V\left(S_{t}\right)+\alpha\left(G_{t}^{(n)}-V\left(S_{t}\right)\right)\).

4. Experiment

다양한 \(n\)값과 \(\alpha\)값(=learning rate)을 Random Walk를 통해 시험해 보았고, time step의 수(n) 마다 최적의 alpha값이 다름을 확인할 수 있었다.

절대적으로 좋은 time step을 찾기는 어렵다.

그래서 여러 time step를 고려하여 Time Difference Learing을 한 것이 바로 \(TD(\lambda)\)이다.

5. Summary

Time Difference Learning : model-free하다
MC vs TD
- MC : 매 episode마다 update …. ( UNBIASED, high variance )
- TD : 매 step마다 update…. ( biased, LOW VARIANCE )
TD(0) : \(G_t \approx R_{t+1} + \gamma \; V(S_{t+1})\)
N-step TD
- TD(0) : 1-step 마다 update
- TD(N) : N-step 마다 update

Twitter Facebook LinkedIn

15.Time Difference Learning (1)

Seunghan Lee