(Reference: Fastcampus lecture)
[ 26. Optimization ]
Contents
- Momentum
  - Vanilla SGD
  - Momentum SGD
- Adagrad
- RMSprop
- Adam
1. Momentum
(1) Vanilla SGD
$\begin{aligned} \theta_{t+1} \leftarrow \theta_{t}-\eta \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \end{aligned}$
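A minimal NumPy sketch of this update (not from the lecture; the function name, learning rate, and toy values are illustrative):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    """Vanilla SGD: theta <- theta - lr * dL/dtheta (elementwise)."""
    return theta - lr * grad

theta = np.zeros(3)
grad = np.array([0.5, -1.0, 2.0])  # stand-in for dL/dtheta at the current theta
theta = sgd_step(theta, grad)      # -> [-0.005, 0.01, -0.02]
```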
(2) Momentum SGD
$\theta_{t+1} \leftarrow \theta_{t}-v_{t}$,
where
$\begin{aligned}
v_{t}&=\gamma v_{t-1}+\eta \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \\
&=\gamma\left(\gamma v_{t-2}+\eta \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t-1}}\right)+\eta \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \\
&=\eta\left(\frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}+\gamma \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t-1}}+\gamma^{2} \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t-2}}+\ldots\right)
\end{aligned}$.
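A matching sketch of the momentum update ($\gamma=0.9$ is only a common default, not fixed by the lecture); the caller keeps the velocity $v$ between steps:

```python
def momentum_step(theta, grad, v, lr=0.01, gamma=0.9):
    """Momentum SGD: v_t = gamma * v_{t-1} + lr * grad_t, then theta <- theta - v_t.

    Works on scalars or NumPy arrays; start with v = 0 (or an array of zeros).
    """
    v = gamma * v + lr * grad
    return theta - v, v
```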
2. Adagrad
$\begin{aligned}\theta_{t+1} \leftarrow \theta_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \end{aligned}$.
- $G_{t}=G_{t-1}+\left(\frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}\right)^{2}$.
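A minimal sketch of one Adagrad step (illustrative names and defaults), showing how $G_t$ accumulates the squared gradients:

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.01, eps=1e-8):
    """Adagrad: G accumulates squared gradients, so the effective
    step size lr / sqrt(G + eps) can only shrink over time."""
    G = G + grad ** 2
    return theta - lr / np.sqrt(G + eps) * grad, G
```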
However, as iterations proceed, $G_t$ keeps growing, so the learning rate becomes extremely small.
$\rightarrow$ Training effectively stops.
RMSprop was introduced to overcome this.
3. RMSprop
$\begin{aligned} \theta_{t+1} \leftarrow \theta_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}} \end{aligned}$.
- $G_{t}=\gamma G_{t-1}+(1-\gamma)\left(\frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}\right)^{2}$.
When computing $G_t$, a moving average (weights $\gamma$ and $1-\gamma$) keeps $G_t$ from growing without bound.
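The same sketch with Adagrad's running sum replaced by an exponential moving average (again, names and defaults are illustrative):

```python
import numpy as np

def rmsprop_step(theta, grad, G, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: G is an exponential moving average of squared gradients,
    so it stays bounded instead of growing without limit."""
    G = gamma * G + (1 - gamma) * grad ** 2
    return theta - lr / np.sqrt(G + eps) * grad, G
```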
4. Adam
Adam = RMSprop + Momentum
$\begin{aligned} \theta_{t+1} \leftarrow \theta_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} v_{t} \end{aligned}$.
- $v_{t}=\beta_{1} v_{t-1}+\left(1-\beta_{1}\right) \frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}$ (momentum term)
- $G_{t}=\beta_{2} G_{t-1}+\left(1-\beta_{2}\right)\left(\frac{d \mathcal{L}(y, \hat{y})}{d \theta_{t}}\right)^{2}$ (RMSprop term)
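A sketch of one Adam step exactly as written above, i.e. without the bias-correction terms of the original Adam paper (function name and defaults are illustrative):

```python
import numpy as np

def adam_step(theta, grad, v, G, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam = RMSprop + Momentum (bias correction omitted, as in these notes)."""
    v = beta1 * v + (1 - beta1) * grad       # momentum term (first moment)
    G = beta2 * G + (1 - beta2) * grad ** 2  # RMSprop term (second moment)
    return theta - lr / np.sqrt(G + eps) * v, v, G
```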