Meta-learning with Latent Embedding Optimization (2019)

Abstract
Introduction
Methodology
1. Problem Definition
2. Model
3. Zero-shot Learning
4. Network Architecture

0. Abstract

Gradient-based Meta-Learning techniques의 어려움 :

HIGH-dimensional parameter space in LOW-data regimes에서 작동할 때!

이 논문에서는 위의 문제를 다음의 방식으로 극복함

learn a data-dependent “latent generative representation” of model parameters
perform gradient-based meta-learning in this “low-dimensional latent space”

\(\rightarrow\) LEO (Latent Embedding Optimization)을 제안한다

1. Introduction

LEO (Latent Embedding Optimization)의 2가지 장점

1) Initial parameters for a new task are “conditioned on the training data”,

which enables “task-specific starting point”
2) by optimizing in the LOWER-dimensional input, can be done more “EFFICIENTLY”

2. Model

2-1. Problem Definition

Settings & Notation

가정 : N-way K-shot
task instance \(T_i\)는 task distribution \(p(T)\)에서 sample됨
t개의 task : \(T_1\) ,…\(T_t\) ( 이들은 train/valid/test로 구분)
- \(T_1\) ~ \(T_m\) : train task ( training meta-set= \(S^{tr}\) )
- \(T_{m+1} \sim T_n\) : validation task ( validation meta-set= \(S^{val}\) )
- \(T_{n+1} \sim T_{t}\) : test task ( test meta-set= \(S^{test}\) )
위의 각 세 세트의 데이터들은, 다음으로 구분

ex) \(S^{tr} = (D^{tr}, D^{val}, D^{test})\)
- \(\mathcal{D}^{t r}=\left\{\left(\mathbf{x}_{n}^{k}, y_{n}^{k}\right) \mid k=1 \ldots K ; n=1 \ldots N\right\}\).
헷갈리지 말기!
- \(D^{val}\) : used to “optimize loss function”
- \(S^{val}\) : used for “model selection”

2-2. MAML ( MODEL-AGNOSTIC META-LEARNING )

Goal :

aims to find a “single set of params \(\theta\)“,

which uses a “few” optimization steps,

that can be successfully “adapted to any novel task”

Notation

task-specific model parameters : \(\theta_{i}^{\prime}\)
\(\theta_{i}^{\prime}=\mathcal{G}\left(\theta, \mathcal{D}^{t r}\right)\) ( 여기서 \(\mathcal{G}\) 는 gradient descent & \(D^{tr}\)는 task \(i\)에 해당하는 dataset )
task별 updating equation : ( = adaptation procedure )

\(\theta_{i}^{\prime}=\theta-\alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_{i}}^{t r}\left(f_{\theta}\right)\).
- \(\alpha\) : can be meta-learned ( ex. Meta-SGD, 2017 )
meta parameter의 updating equation :

\(\theta \leftarrow \theta-\eta \nabla_{\theta} \sum_{\mathcal{T}_{i} \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_{i}}^{v a l}\left(f_{\theta_{i}^{\prime}}\right)\).

2-3. LEO (Latent Embedding Optimization) for Meta-Learning

high-dimension에서 벗어나서 low-dimension에서 GD가 이루어지면 GOOD!

\(\rightarrow\) achieve this by learning a “STOCHASTIC LATENT SPACE” with an information bottleneck, conditioned on the input data

MAML vs LEO

MAML : learn unique set of model parameters \(\theta\)
LEO : learn a “generative distribution” of model parameters

(1) Model Overview

Parameter

Encoder 부분 : encoder (\(\phi_e\)) + relation network (\(\phi_r\))
Decoder 부분 : decoder (\(\phi_d\))
learning rate : \(\alpha\)
[종합] \(\phi = \phi_e, \phi_r, \phi_d, \alpha\)

Algorithm

Randomly initialize \(\phi_{e}, \phi_{r}, \phi_{d}\) Let \(\phi=\left\{\phi_{e}, \phi_{r}, \phi_{d}, \alpha\right\}\)

given task instance \(T_i\) ( = \((D_{tr}, D_{val}\) ) )

step 1) \(D^{tr}\)에 있는 data들 \(\left\{\mathbf{x}_{n}^{k}\right\}\) 는 stochastic encoder에 들어가서 latent code \(\mathbf{z}\)가 나온다
step 2) 이 latent code \(\mathbf{z}\)는 다시 decoder로 들어가서 \(\theta_i\)가 나온다 ( using “parameter generator, \(g_{\phi_{d}}\)” )
step 3) 이렇게 해서 나온 \(\theta_i\)로 loss 계산 후, \(\mathbf{z}\)의 latent space에서 gradient descent 진행
- \(\mathbf{z}^{\prime} \leftarrow \mathbf{z}^{\prime}-\alpha \nabla_{\mathbf{z}^{\prime}} \mathcal{L}_{\mathcal{T}_{i}}^{t r}\left(f_{\theta_{i}^{\prime}}\right)\).
step 4) 이렇게 update된 \(\mathbf{z}^{'}\)를 다시 decoder로 들어가서 \(\theta_i^{'}\)가 나온다 ( using “parameter generator, \(g_{\phi_{d}}\)” )
step 2) ~ step 4)를 여러 번 반복한다
여러번 반복 뒤, 나오게 된 \(\theta^{'}\)를 사용하여 validation loss \(\mathcal{L}_{\mathcal{T}_{i}}^{v a l}\left(f_{\theta_{i}^{\prime}}\right)\) 계산

이 loss를 사용하여 \(\phi\)를 update한다.

\(\phi \leftarrow \phi-\eta \nabla_{\phi} \sum_{\mathcal{T}_{i}} \mathcal{L}_{\mathcal{T}_{i}}^{v a l}\left(f_{\theta_{i}^{\prime}}\right)\).

(2) Initialization : Generating Parameters Conditioned on a few examples

stage 1) instantiate model parameters, that will be adapted to each task instance

MAML : single set of model params
LEO : data-dependent latent encoding ( 추후 decoded to generate actual initial params )

( = consider context, when producing parameter initial params )

Encoding

2개의 과정을 거쳐서 encoding 된다

encoder network : \(g_{\phi_{e}}: \mathcal{R}^{n_{x}} \rightarrow \mathcal{R}^{n_{h}}\)
relation network : \(g_{\phi_{r}}\)

\(K\)-shot training samples, corresponding to each class \(n\)

\(n: \mathcal{D}_{n}^{t r}=\left\{\left(\mathbf{x}_{n}^{k}, y_{n}^{k}\right) \mid k=\right.1 \ldots K\}\).

Encoding 과정

( class별로 생성한다! class-dependent codes : \(\mathbf{z}=\left[\mathbf{z}_{1}, \mathbf{z}_{2}, \ldots, \mathbf{z}_{N}\right]\) )

\(\begin{aligned} \boldsymbol{\mu}_{n}^{e}, \boldsymbol{\sigma}_{n}^{e}=& \frac{1}{N K^{2}} \sum_{k_{n}=1}^{K} \sum_{m=1}^{N} \sum_{k_{m}=1}^{K} g_{\phi_{r}}\left(g_{\phi_{e}}\left(\mathbf{x}_{n}^{k_{n}}\right), g_{\phi_{e}}\left(\mathbf{x}_{m}^{k_{m}}\right)\right) \\ & \mathbf{z}_{n} \sim q\left(\mathbf{z}_{n} \mid \mathcal{D}_{n}^{t r}\right)=\mathcal{N}\left(\boldsymbol{\mu}_{n}^{e}, \operatorname{diag}\left(\boldsymbol{\sigma}_{n}^{e 2}\right)\right) \end{aligned}\).

Decoding

decoder network : \(g_{\phi_{d}}: \mathcal{Z} \rightarrow \Theta\)
use class-specified latent codes to instantiate just the top layer weights of classifier
decoding된 결과 : \(\theta_{i}^{\prime}=\left\{\mathbf{w}_{n} \mid n=1 \ldots N\right\}\)
without requiring the generator to produce very high-dim params

Decoding 과정

( parameterize Gaussian distn with diagonal covariance )

sample class-dependent parameters \(\mathbf{w}_{n}\)
\(\begin{aligned} \boldsymbol{\mu}_{n}^{d}, \boldsymbol{\sigma}_{n}^{d} &=g_{\phi_{d}}\left(\mathbf{z}_{n}\right) \\ \mathbf{w}_{n} \sim p\left(\mathbf{w} \mid \mathbf{z}_{n}\right) &=\mathcal{N}\left(\boldsymbol{\mu}_{n}^{d}, \operatorname{diag}\left(\boldsymbol{\sigma}_{n}^{d^{2}}\right)\right) \end{aligned}\).

(3) Adaptation by LEO ( = INNER loop )

특정 task \(T_i\) 별로 수행 ( \(\mathbf{z}\)를 update)
Cross-Entropy loss 사용
\(\mathcal{L}_{\mathcal{T}_{i}}^{t r}\left(f_{\theta_{i}}\right)=\sum_{(\mathbf{x}, y) \in \mathcal{D}^{t r}}\left[-\mathbf{w}_{y} \cdot \mathbf{x}+\log \left(\sum_{j=1}^{N} e^{\mathbf{w}_{j} \cdot \mathbf{x}}\right)\right]\).
위 loss를 사용하여 다음의 GD를 수행

\(\mathbf{z}_{n}^{\prime}=\mathbf{z}_{n}-\alpha \nabla_{\mathbf{z}_{n}} \mathcal{L}_{\mathcal{T}_{i}}^{t r}\).

(4) Meta-Training strategy ( = OUTER loop )

모든 task \(T_i\)들에 대해서 전부 수행한 뒤 \(\phi\)를 update
\(\min _{\phi_{e}, \phi_{r}, \phi_{d}} \sum_{\mathcal{T}_{i} \sim p(\mathcal{T})}\left[\mathcal{L}_{\mathcal{T}_{i}}^{v a l}\left(f_{\theta_{i}^{\prime}}\right)+\beta D_{K L}\left(q\left(\mathbf{z}_{n} \mid \mathcal{D}_{n}^{t r}\right) \mid \mid p\left(\mathbf{z}_{n}\right)\right)+\gamma \mid \mid \operatorname{stopgrad}\left(\mathbf{z}_{n}^{\prime}\right)-\mathbf{z}_{n} \mid \mid _{2}^{2}\right]+R\).
- (2번째 term)
  
  weighted KL-div to regularize the latent space & encourage generative model to learn a disentangled embedding
- (3번째 term)
  
  encourage the encoder & relation net to output a parameter initialization that is close to the adapted code

Twitter Facebook LinkedIn

[meta] (paper 15) Meta-Learning with Latent Embedding Optimization

Seunghan Lee