( Reference : FastCampus course 「한번에 끝내는 컴퓨터비전 초격차 패키지」 )
Representation Learning (1)
1. Metric Learning
(1) Euclidean Distance
- \(D_{E}\left(x_{i}, x_{j}\right)=\sqrt{\left(x_{i}-x_{j}\right)^{\top}\left(x_{i}-x_{j}\right)}\).
(2) Mahalanobis Distance
- \(D_{M}\left(x_{i}, x_{j}\right)=\sqrt{\left(x_{i}-x_{j}\right)^{\top} M\left(x_{i}-x_{j}\right)}\).
- takes the data manifold ( the covariance structure of the data ) into account!
- Euclidean Distance = special case of Mahalanobis Distance, where \(M=I\)
Mahalanobis Distance in Multivariate Gaussian :
- \(M=\Sigma^{-1}\),
where \(\mathcal{N}(\mathbf{x})=\frac{1}{\sqrt{(2 \pi)^{k}|\Sigma|}} \exp \left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\).
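A minimal NumPy sketch of the two distances above, with \(M=\Sigma^{-1}\) estimated from data ( the toy covariance and variable names are my own choices ):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy correlated 2-D data
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

M = np.linalg.inv(np.cov(X, rowvar=False))   # M = Sigma^{-1}, symmetric p.s.d.

def euclidean(xi, xj):
    d = xi - xj
    return np.sqrt(d @ d)                    # special case M = I

def mahalanobis(xi, xj, M):
    d = xi - xj
    return np.sqrt(d @ M @ d)

xi, xj = X[0], X[1]
print(euclidean(xi, xj), mahalanobis(xi, xj, M))
```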
Learning Mahalanobis Distance
- estimate \(M\) from data! ( \(M\) must be positive semi-definite (p.s.d.) )
(3) A first approach to distance metric learning
Notation : 2 sets of data pairs
- \(S^{+}\): The set of similar pairs
- \(S^{-}\): The set of dissimilar pairs
Objective Function
- \(M^{*}=\underset{M}{\operatorname{argmin}} \sum_{\left(x_{i}, x_{j}\right) \in S^{+}}\left(x_{i}-x_{j}\right)^{\top} M\left(x_{i}-x_{j}\right)\) ……… (1)
- \(\text { s.t. } \sum_{\left(x_{i}, x_{j}\right) \in S^{-}}\left(x_{i}-x_{j}\right)^{\top} M\left(x_{i}-x_{j}\right) \geq 1\) ……… (2)
Interpretation
- (1) : distances between similar pairs should be small
- (2) : distances between dissimilar pairs should stay large ( \(\geq 1\) ), which rules out the trivial solution \(M=0\)
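A simplified NumPy sketch of this formulation on toy data ( the data, step size, and penalty weight are my own choices; constraint (2) is handled here with a soft penalty rather than an exact projection ):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 2

# toy pairs: similar pairs are small perturbations, dissimilar pairs are far apart
S_pos = [(x, x + 0.1 * rng.normal(size=dim)) for x in rng.normal(size=(50, dim))]
S_neg = [(x, x + rng.normal(size=dim) + 5.0) for x in rng.normal(size=(50, dim))]

def project_psd(M):
    """Project a symmetric matrix onto the p.s.d. cone by clipping eigenvalues at 0."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

M = np.eye(dim)
lr, lam = 1e-3, 10.0          # step size and penalty weight (hyperparameters I chose)

for _ in range(200):
    grad = np.zeros((dim, dim))
    for xi, xj in S_pos:                      # term (1): pull similar pairs together
        d = (xi - xj)[:, None]
        grad += d @ d.T
    neg_sum = sum((xi - xj) @ M @ (xi - xj) for xi, xj in S_neg)
    if neg_sum < 1.0:                         # soft penalty when constraint (2) is violated
        for xi, xj in S_neg:
            d = (xi - xj)[:, None]
            grad -= lam * d @ d.T
    M = project_psd(M - lr * grad)            # keep M p.s.d.
```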
(4) Large Margin Nearest Neighbor (LMNN)
Interpretation
- pull each sample's same-class nearest neighbors ( "target neighbors" ) closer
- push differently-labeled samples ( "impostors" ) farther away than the target neighbors, by a large margin
2. Deep Metric Learning
“Learning representation from data”
(1) Deep Metric Learning
\(D(f(x_i), f(x_j))\) ,
\(\rightarrow\) Learn function \(f\) that maps data to data representation with DNNs!
(2) Siamese Network
Siamese = a pair of NNs that share parameters
Contrastive Loss
- make similar pairs close
- make dissimilar pairs far away
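A minimal PyTorch sketch of a Siamese network with a contrastive loss ( the class name, encoder architecture, and margin value are placeholders ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseContrastive(nn.Module):
    def __init__(self, encoder, margin=1.0):
        super().__init__()
        self.encoder = encoder        # weight sharing: the same module embeds both inputs
        self.margin = margin

    def forward(self, x1, x2, y):     # y = 1 for similar pairs, 0 for dissimilar pairs
        d = F.pairwise_distance(self.encoder(x1), self.encoder(x2))
        loss_pos = y * d.pow(2)                                  # pull similar pairs close
        loss_neg = (1 - y) * F.relu(self.margin - d).pow(2)      # push dissimilar pairs apart
        return (loss_pos + loss_neg).mean()

# usage with a toy encoder
f = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 8))
loss = SiameseContrastive(f)(torch.randn(4, 32), torch.randn(4, 32), torch.tensor([1., 0., 1., 0.]))
```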
(3) Triplet Network
1) Notation
- anchor : \(x_i^a\)
- positive : \(x_i^{+}\)
- negative : \(x_i^{-}\)
2) Key idea
- given an anchor, the distance to the positive should be smaller than the distance to the negative in the feature space
- \(D\left(f\left(\boldsymbol{x}_{i}^{a}\right), f\left(\boldsymbol{x}_{i}^{+}\right)\right)+\delta<D\left(f\left(\boldsymbol{x}_{i}^{a}\right), f\left(\boldsymbol{x}_{i}^{-}\right)\right)\).
- \(\delta\) : margin
3) Triplet Rank loss ( for triplet \(\left(\boldsymbol{x}_{i}^{a}, \boldsymbol{x}_{i}^{+}, \boldsymbol{x}_{i}^{-}\right)\) )
- \(L\left(W ; \boldsymbol{x}_{i}^{a}, \boldsymbol{x}_{i}^{+}, \boldsymbol{x}_{i}^{-}\right)=\max \left[0,\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{+}\right)\right\|_{2}^{2}-\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{-}\right)\right\|_{2}^{2}+\delta\right]\).
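A minimal PyTorch sketch of this loss on already-computed embeddings ( the batch shape and \(\delta\) value are placeholders ):

```python
import torch
import torch.nn.functional as F

def triplet_rank_loss(f_a, f_p, f_n, delta=0.2):
    """max(0, ||f_a - f_p||^2 - ||f_a - f_n||^2 + delta), averaged over the batch."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)
    d_neg = (f_a - f_n).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + delta).mean()

# embeddings are L2-normalized first (see "Details" below)
emb = F.normalize(torch.randn(8, 3, 128), dim=-1)     # 8 triplets of (anchor, positive, negative)
loss = triplet_rank_loss(emb[:, 0], emb[:, 1], emb[:, 2])
```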
4) Details
After each sample is passed through the NN, its embedding is “L2-normalized”
Reason :
- without normalization, the scale of the embeddings can grow arbitrarily, so the margin becomes trivial!
- ex) \(D\left(f\left(\boldsymbol{x}_{i}^{a}\right), f\left(\boldsymbol{x}_{i}^{+}\right)\right)+\delta<D\left(f\left(\boldsymbol{x}_{i}^{a}\right), f\left(\boldsymbol{x}_{i}^{-}\right)\right)\)
- e.g., \(1{,}000{,}000+0.2<1{,}000{,}001\) : at this scale the margin \(\delta=0.2\) is negligible
“Weight sharing” between both networks
5) Sample Selection
infeasible to cover all pairs/triplets!
- (pairs) : \(O(N^2)\)
- (triplets) : \(O(N^3)\)
Difficulties
- (1) TOO EASY pairs/triplets do not contribute to training
- \(L\left(W ; \boldsymbol{x}_{i}^{a}, \boldsymbol{x}_{i}^{+}, \boldsymbol{x}_{i}^{-}\right)=\max \left[0,\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{+}\right)\right\|_{2}^{2}-\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{-}\right)\right\|_{2}^{2}+\delta\right]\),
- if \(\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{+}\right)\right\|_{2}^{2}-\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{-}\right)\right\|_{2}^{2}+\delta<0\)
\(\rightarrow\) loss is 0
\(\rightarrow\) gradient is 0
\(\rightarrow\) no training!
- (2) TOO HARD pairs/triplets could make training unstable
- since \(\frac{\partial L}{\partial f\left(x_{i}^{-}\right)} \propto \frac{f\left(x_{i}^{a}\right)-f\left(x_{i}^{-}\right)}{\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{-}\right)\right\|}\),
- for TOO HARD negatives, \(\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{-}\right)\right\|\) becomes very small
\(\rightarrow\) the denominator is tiny, so the direction of the gradient is NOT STABLE
Uniform sampling
- assuming an \(L_2\)-normalized embedding space,
\(\rightarrow\) the distribution of distances between uniformly sampled points is heavily “BIASED” ( in high dimensions it concentrates around \(\sqrt{2}\) )
\(\rightarrow\) need to select samples well!!
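A quick NumPy check of this bias ( dimension and sample count are arbitrary ): points drawn uniformly on a high-dimensional unit sphere have pairwise distances tightly concentrated around \(\sqrt{2}\), so uniform sampling keeps returning triplets of roughly the same difficulty.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                         # embedding dimension (placeholder)
z = rng.normal(size=(10000, d))
z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform points on the unit hypersphere

dist = np.linalg.norm(z[:5000] - z[5000:], axis=1)
print(dist.mean(), dist.std())                  # mean ~1.414, small std: distances are concentrated
```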
(4) Distance Weighted margin-based Loss
Solution of “Sample Selection problem” : Distance Weighted Sampling (DWS)
- correct the bias & control the variance
- sample negative examples uniformly with respect to their DISTANCE from the ANCHOR
\(P\left(n^{\prime}=n \mid a\right) \propto \min \left(\lambda, q^{-1}\left(D\left(f\left(x^{a}\right), f\left(x^{n^{\prime}}\right)\right)\right)\right)\),
- where \(q(D) \propto D^{d-2}\left[1-\frac{1}{4} D^{2}\right]^{\frac{d-3}{2}}\) ( \(d\) : embedding dimension ; \(q\) is the density of pairwise distances between uniform points on the unit sphere )
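A rough NumPy sketch of this sampling rule ( \(\lambda\), the embedding dimension, and the candidate distances are placeholder values ): each candidate negative is weighted by the clipped inverse of \(q(D)\) and then sampled.

```python
import numpy as np

def dws_weights(dist, dim, lam=1.0, eps=1e-8):
    """Distance-weighted sampling weights  min(lam, 1/q(D)),  q(D) ∝ D^(d-2) [1 - D^2/4]^((d-3)/2)."""
    log_q = (dim - 2) * np.log(dist + eps) + (dim - 3) / 2 * np.log(np.clip(1 - dist**2 / 4, eps, None))
    w = np.exp(np.minimum(np.log(lam), -log_q))     # min(lam, 1/q(D)), computed in log space
    return w / w.sum()

rng = np.random.default_rng(0)
dist_to_anchor = rng.uniform(0.3, 1.9, size=100)    # distances of candidate negatives to the anchor
p = dws_weights(dist_to_anchor, dim=128)
negative_idx = rng.choice(100, p=p)                 # index of the sampled negative
```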
Empirical Analysis of DWS
Margin-based loss
\(\ell^{\operatorname{margin}}(i, j):=\left(\alpha+y_{i j}\left(D_{i j}-\beta\right)\right)_{+}\).
- \(\beta\) : determines the boundary between positive and negative pairs
- \(\alpha\) : controls the margin of separation
- \(y_{i j} \in\{-1,1\}\) ( \(+1\) for positive pairs, \(-1\) for negative pairs )
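A minimal PyTorch sketch of this margin-based loss ( the \(\alpha, \beta\) values and distances here are placeholders; in practice \(\beta\) can also be treated as a learnable parameter ):

```python
import torch
import torch.nn.functional as F

def margin_based_loss(D, y, alpha=0.2, beta=1.2):
    """(alpha + y * (D - beta))_+ for pairwise distances D and labels y in {-1, +1}."""
    return F.relu(alpha + y * (D - beta)).mean()

D = torch.tensor([0.9, 1.6, 1.1, 1.8])     # pairwise distances (placeholder values)
y = torch.tensor([1., -1., 1., -1.])       # +1 = positive pair, -1 = negative pair
loss = margin_based_loss(D, y)
```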
(5) Quadruplet Network
When is it used?
- multi-object tracking : different relationships among positive samples, according to their time indices
- person re-identification : enlarging inter-class distances with additional constraints
1) Quadruplet Rank loss
linear combination of 2 Triplet Rank Losses
- \(L_{Q}\left(W ; \boldsymbol{x}_{i}^{a}, \boldsymbol{x}_{i}^{1}, \boldsymbol{x}_{i}^{2}, \boldsymbol{x}_{i}^{3}\right)=\alpha \cdot L_{T}^{1}\left(W ; \boldsymbol{x}_{i}^{a}, \boldsymbol{x}_{i}^{1}, \boldsymbol{x}_{i}^{2}\right)+\beta \cdot L_{T}^{2}\left(W ; \boldsymbol{x}_{i}^{a}, \boldsymbol{x}_{i}^{1}, \boldsymbol{x}_{i}^{3}\right)\).
\(\begin{aligned} &L_{Q}\left(W ; \boldsymbol{x}_{i}^{a}, \boldsymbol{x}_{i}^{t_{1}}, \boldsymbol{x}_{i}^{t_{2}}, \boldsymbol{x}_{i}^{-}\right) \\ &=\max \left[0,\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{t_{1}}\right)\right\|_{2}^{2}-\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{t_{2}}\right)\right\|_{2}^{2}+\delta_{1}\right] \\ &+\max \left[0,\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{t_{2}}\right)\right\|_{2}^{2}-\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{-}\right)\right\|_{2}^{2}+\delta_{2}\right] \end{aligned}\).
Notation
- \(\boldsymbol{x}_{i}^{a}\) : anchor
- \(\boldsymbol{x}_{i}^{t_{1}}, \boldsymbol{x}_{i}^{t_{2}}\) : positive ( time=1,2 respectively )
- \(\boldsymbol{x}_{i}^{-}\) : negative
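A minimal PyTorch sketch of the expanded quadruplet loss above on pre-computed embeddings ( batch shape and margins \(\delta_1, \delta_2\) are placeholders; the \(\alpha, \beta\) weights are taken as 1 ):

```python
import torch
import torch.nn.functional as F

def quadruplet_rank_loss(f_a, f_t1, f_t2, f_neg, delta1=0.1, delta2=0.2):
    """Two triplet terms: (anchor, t1, t2) ordered in time, then (anchor, t2, negative)."""
    d_t1 = (f_a - f_t1).pow(2).sum(dim=1)
    d_t2 = (f_a - f_t2).pow(2).sum(dim=1)
    d_neg = (f_a - f_neg).pow(2).sum(dim=1)
    return (F.relu(d_t1 - d_t2 + delta1) + F.relu(d_t2 - d_neg + delta2)).mean()

emb = F.normalize(torch.randn(8, 4, 128), dim=-1)   # 8 quadruplets (anchor, t1, t2, negative)
loss = quadruplet_rank_loss(emb[:, 0], emb[:, 1], emb[:, 2], emb[:, 3])
```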
2) Examples
\(\begin{aligned} &L_{Q}\left(W ; \boldsymbol{x}_{i}^{a}, \boldsymbol{x}_{i}^{t_{1}}, \boldsymbol{x}_{i}^{t_{2}}, \boldsymbol{x}_{i}^{-}\right) \\ &=\max \left[0,\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{t_{1}}\right)\right\|_{2}^{2}-\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{t_{2}}\right)\right\|_{2}^{2}+\delta_{1}\right] \\ &+\max \left[0,\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{t_{2}}\right)\right\|_{2}^{2}-\left\|f\left(\boldsymbol{x}_{i}^{a}\right)-f\left(\boldsymbol{x}_{i}^{-}\right)\right\|_{2}^{2}+\delta_{2}\right] \end{aligned}\).
[ multi-object tracking ] Interpretation
- 1st term : positive samples closer in time should be closer to the anchor
- 2nd term : negative samples should be farther from the anchor than positive samples
[ person re-identification ] Interpretation
- 1st term : positive samples closer in time should be closer to the anchor
- 2nd term : inter-class distance > intra-class distance
(6) Nearest Neighbor Search
finding the nearest sample among training examples ( in a latent space )
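A minimal NumPy sketch of brute-force nearest-neighbor search in the latent space ( embedding sizes are placeholders; large galleries usually use approximate-NN indexes instead ):

```python
import numpy as np

def nearest_neighbor(query_emb, gallery_emb):
    """Return the index of the gallery embedding closest to the query (Euclidean distance)."""
    dist = np.linalg.norm(gallery_emb - query_emb, axis=1)
    return int(np.argmin(dist))

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 128))   # embeddings of the training examples (placeholder)
query = rng.normal(size=128)             # embedding of the query sample
idx = nearest_neighbor(query, gallery)
```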
(7) Applications
1) Image Retrieval
Content-based image retrieval
- nearest neighbor search in latent space
- works for unseen classes
2) Face Verification
Task : decide whether 2 face images show the same person or different people
- apply a threshold to the distance ( in the latent space )
3) Person Re-identification
Task : identify people across different cameras
- works for unseen classes (=people)
4) Online Visual Tracking
Task : track an object, which is manually annotated only in the 1st frame
- done by particle filtering + metric learning
Particle Filtering?
Basic concept :
- draw random samples ( particles ) around the target location of the previous time step
\(\rightarrow\) choose the most similar one!
- similarity = inverse of distance ( in the learned latent space )
- the learned similarity should be invariant to the specific target object ( it has to work for targets unseen during training )
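A toy NumPy sketch of one tracking step in this scheme ( the `embed` function is a placeholder stand-in for the learned metric network, and the window size / particle spread are arbitrary ):

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(patch):
    """Stand-in for the learned encoder f: flatten and L2-normalize (a real tracker would use a DNN)."""
    v = patch.reshape(-1).astype(float)
    return v / (np.linalg.norm(v) + 1e-8)

def track_step(frame, prev_xy, template_emb, n_particles=100, patch=16, sigma=8.0):
    """Draw particles around the previous location and pick the one closest to the template."""
    h, w = frame.shape
    particles = prev_xy + rng.normal(scale=sigma, size=(n_particles, 2))
    particles = np.clip(particles, patch // 2, [h - patch // 2 - 1, w - patch // 2 - 1])
    dists = []
    for cy, cx in particles.astype(int):
        crop = frame[cy - patch // 2: cy + patch // 2, cx - patch // 2: cx + patch // 2]
        dists.append(np.linalg.norm(embed(crop) - template_emb))   # similarity = inverse of distance
    return particles[int(np.argmin(dists))]

frame = rng.normal(size=(240, 320))
template_emb = embed(frame[100:116, 150:166])   # target annotated in the first frame
new_xy = track_step(frame, prev_xy=np.array([108.0, 158.0]), template_emb=template_emb)
```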