( Reference : Fastcampus lecture )

[ 26. Optimization ]

Contents

  1. Momentum
    1. Vanilla SGD
    2. Momentum SGD
  2. Adagrad
  3. RMSprop
  4. Adam


1. Momentum

(1) Vanilla SGD

$\begin{aligned} \theta_{t+1} \leftarrow \theta_{t}-\eta \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \end{aligned}$
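
As a quick illustration, below is a minimal NumPy sketch of this update on a toy quadratic loss. The loss, the step count, the value of `eta`, and the helper name `sgd_step` are assumptions for the example, not from the lecture.

```python
import numpy as np

def sgd_step(theta, grad, eta=0.1):
    # theta_{t+1} = theta_t - eta * dL/dtheta_t
    return theta - eta * grad

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, so dL/dtheta = theta.
theta = np.array([2.0, -1.0])
for _ in range(100):
    grad = theta                  # gradient of the toy loss at the current theta
    theta = sgd_step(theta, grad)
print(theta)                      # approaches the minimum at [0, 0]
```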


(2) Momentum SGD

$\theta_{t+1} \leftarrow \theta_{t}-v_{t}$.

where

$\begin{aligned} v_{t}&=\gamma v_{t-1}+\eta \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \\ &=\gamma\left(\gamma v_{t-2}+\eta \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t-1}}\right)+\eta \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \\ &=\eta\left(\frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}+\gamma \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t-1}}+\gamma^{2} \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t-2}}+\ldots\right) \end{aligned}$.
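
Below is a minimal sketch of the same loop with a velocity term, reusing the toy quadratic loss from the SGD example; the values of `eta` and `gamma` and the helper name `momentum_step` are illustrative assumptions.

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.1, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * dL/dtheta_t
    # theta_{t+1} = theta_t - v_t
    v = gamma * v + eta * grad
    return theta - v, v

theta, v = np.array([2.0, -1.0]), np.zeros(2)
for _ in range(100):
    grad = theta                  # toy quadratic loss: dL/dtheta = theta
    theta, v = momentum_step(theta, v, grad)
print(theta)                      # converges to [0, 0], with some oscillation
```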


2. Adagrad

$\begin{aligned}\theta_{t+1} \leftarrow \theta_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \end{aligned}$.

  • $G_{t}=G_{t-1}+\left(\frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}\right)^{2}$.
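
A minimal sketch of this accumulation on the same toy quadratic loss follows; the values of `eta` and `eps` and the helper name `adagrad_step` are illustrative assumptions.

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.5, eps=1e-8):
    # G_t = G_{t-1} + (dL/dtheta_t)^2          (elementwise accumulation)
    # theta_{t+1} = theta_t - eta / sqrt(G_t + eps) * dL/dtheta_t
    G = G + grad ** 2
    theta = theta - eta / np.sqrt(G + eps) * grad
    return theta, G

theta, G = np.array([2.0, -1.0]), np.zeros(2)
for _ in range(100):
    grad = theta                  # toy quadratic loss: dL/dtheta = theta
    theta, G = adagrad_step(theta, G, grad)
print(G)                          # G only grows, so the effective step size shrinks
```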

However, as iterations proceed, $G_t$ keeps growing, so the effective learning rate becomes very small.

$\rightarrow$ Training effectively stops progressing.

RMSprop was introduced to overcome this.


3. RMSprop

$\begin{aligned} \theta_{t+1} \leftarrow \theta_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \end{aligned}$.

  • $G_{t}=\gamma G_{t-1}+(1-\gamma)\left(\frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}\right)^{2}$.

When computing $G_t$, a moving average (weights $\gamma$ and $1-\gamma$) keeps $G_t$ from growing indefinitely.
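
The sketch below differs from the Adagrad one only in how $G_t$ is updated; the values of `eta`, `gamma`, and `eps` are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(theta, G, grad, eta=0.01, gamma=0.9, eps=1e-8):
    # G_t = gamma * G_{t-1} + (1 - gamma) * (dL/dtheta_t)^2
    # theta_{t+1} = theta_t - eta / sqrt(G_t + eps) * dL/dtheta_t
    G = gamma * G + (1 - gamma) * grad ** 2
    theta = theta - eta / np.sqrt(G + eps) * grad
    return theta, G

theta, G = np.array([2.0, -1.0]), np.zeros(2)
for _ in range(100):
    grad = theta                  # toy quadratic loss: dL/dtheta = theta
    theta, G = rmsprop_step(theta, G, grad)
print(G)                          # G tracks recent squared gradients instead of growing forever
```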


4. Adam

Adam = RMSprop + Momentum

$\begin{aligned} \theta_{t+1} \leftarrow \theta_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} v_{t} \end{aligned}$.

  • $v_{t}=\beta_{1} v_{t-1}+\left(1-\beta_{1}\right) \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}$ ...... momentum term
  • $G_{t}=\beta_{2} G_{t-1}+\left(1-\beta_{2}\right)\left(\frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}\right)^{2}$ ...... RMSprop term
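
A minimal sketch combining the two terms exactly as written above, using `beta1`/`beta2` for the two decay rates; the hyperparameter values are illustrative assumptions. Note that the published Adam optimizer additionally applies bias correction to $v_t$ and $G_t$, which these notes (and this sketch) omit.

```python
import numpy as np

def adam_step(theta, v, G, grad, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum term:  v_t = beta1 * v_{t-1} + (1 - beta1) * dL/dtheta_t
    # RMSprop term:   G_t = beta2 * G_{t-1} + (1 - beta2) * (dL/dtheta_t)^2
    v = beta1 * v + (1 - beta1) * grad
    G = beta2 * G + (1 - beta2) * grad ** 2
    # theta_{t+1} = theta_t - eta / sqrt(G_t + eps) * v_t
    # (the original Adam paper also bias-corrects v and G; omitted here)
    theta = theta - eta / np.sqrt(G + eps) * v
    return theta, v, G

theta = np.array([2.0, -1.0])
v, G = np.zeros(2), np.zeros(2)
for _ in range(100):
    grad = theta                  # toy quadratic loss: dL/dtheta = theta
    theta, v, G = adam_step(theta, v, G, grad)
print(theta)
```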