( Reference : Fast Campus, All-in-One Computer Vision Package course )
Representation Learning (2)
3. Unsupervised Representation Learning
(1) Motivation
- Representation Learning : the key to success in visual recognition
- problem : requires labeled data ( = supervision )
\(\rightarrow\) solution : UNsupervised Representation Learning
Example )
- context prediction (2015)
- inpainting (2016)
(2) Face Recognition : faceNet (2015)
- Motivation : different pose & illumination \(\rightarrow\) but if it is the same person, the embeddings should have HIGH similarity !!
Loss Function
\(\sum_{i=1}^{N}\left[\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{p}\right)\right\|_{2}^{2}-\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{n}\right)\right\|_{2}^{2}+\alpha\right]_{+}\).
( Triplet Loss ) Code - TF
- https://github.com/davidsandberg/facenet/blob/master/src/facenet.py
def triplet_loss(anchor, positive, negative, alpha):
    with tf.variable_scope('triplet_loss'):
        # squared L2 distances : anchor-positive & anchor-negative
        pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), 1)
        neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), 1)
        basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)  # alpha = MARGIN
        loss = tf.reduce_mean(tf.maximum(basic_loss, 0.0), 0)
    return loss
( Triplet Loss ) Code - Pytorch (scratch)
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.modules.distance import PairwiseDistance
from torchvision.models import resnet18
class TripletLoss(nn.Module):
    def __init__(self, alpha):
        super().__init__()
        self.margin = alpha
        self.pdist = PairwiseDistance(p=2)

    def forward(self, anchor, pos, neg):
        pos_dist = self.pdist(anchor, pos)
        neg_dist = self.pdist(anchor, neg)
        hinge_loss = torch.clamp(self.margin + pos_dist - neg_dist, min=0.0)
        loss = torch.mean(hinge_loss)
        return loss

class FaceNet(nn.Module):
    def __init__(self, hidden_dim=128, pretrained=False):
        super().__init__()
        self.backbone = resnet18(pretrained=pretrained)
        # get input dimension of the original FC layer
        input_dim = self.backbone.fc.in_features
        # replace the classifier head with an embedding layer
        self.backbone.fc = nn.Linear(input_dim, hidden_dim)

    def forward(self, x):
        x = self.backbone(x)  # get embedding
        x = F.normalize(x, p=2, dim=1)  # L2 normalization
        return x
( Triplet Loss ) Code - Pytorch (built-in)
- https://pytorch.org/docs/stable/generated/torch.nn.TripletMarginLoss.html
loss_fn = nn.TripletMarginLoss(margin = 0.2)
loss = loss_fn(anchor_embed, pos_embed, neg_embed)
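Putting the two together, a minimal training-step sketch ( assumes the FaceNet class above; train_loader is a hypothetical loader yielding (anchor, positive, negative) image batches ) :

model = FaceNet(hidden_dim=128, pretrained=True)
loss_fn = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for anchor, pos, neg in train_loader:  # hypothetical triplet loader
    anchor_embed = model(anchor)  # (B, 128) L2-normalized embeddings
    pos_embed = model(pos)
    neg_embed = model(neg)
    loss = loss_fn(anchor_embed, pos_embed, neg_embed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()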
(3) Image Retrieval
Types of Image Retrieval
- Content-based Image Retrieval
  - ex) finding the closest content/object image from a DB ( a minimal retrieval sketch follows this list )
  - Datasets
    - Cars196 : http://ai.stanford.edu/~jkrause/cars/car_dataset.html
    - CUB-200 : https://paperswithcode.com/dataset/cub-200-2011
    - Stanford Online Products : https://cvgl.stanford.edu/projects/lifted_struct/
    - In-Shop : https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion/InShopRetrieval.html
- Instance-based Image Retrieval
  - ex) Landmark retrieval
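As referenced above, a minimal sketch of content-based retrieval with a trained embedding model ( model is any metric-learned embedder, e.g. the FaceNet class above; db_images and query are hypothetical tensors ) :

import torch

with torch.no_grad():
    db_embed = model(db_images)  # (N, D) L2-normalized DB embeddings
    query_embed = model(query)   # (1, D) query embedding

# with L2-normalized embeddings, the dot product IS the cosine similarity
sim = query_embed @ db_embed.t()      # (1, N)
topk = sim.topk(k=5, dim=1).indices   # indices of the 5 closest DB images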
[ Content-based Image Retrieval ]
1. Beyond Binary Supervision (2019)
( Deep Metric Learning Beyond Binary Supervision (Kim et al., CVPR 2019) )
- uses continuous labels ( not discrete labels )
- loss function : \(\ell_{\operatorname{lr}}(a, i, j)=\left\{\log \frac{D\left(f_{a}, f_{i}\right)}{D\left(f_{a}, f_{j}\right)}-\log \frac{D\left(y_{a}, y_{i}\right)}{D\left(y_{a}, y_{j}\right)}\right\}^{2}\)
- \(a\) : anchor
- \(i\) : similar
- \(j\) : dissimilar
Code
- https://github.com/tjddus9597/Beyond-Binary-Supervision-CVPR19/tree/master/code
( Naive version )
from torch.autograd import Function

class Naive_TripletLoss(Function):
    def __init__(self, mrg=0.2):
        super(Naive_TripletLoss, self).__init__()
        self.mrg = mrg

    def Squared_L2dist(self, x1, x2, norm=2):
        eps = 1e-4 / x1.size(0)
        diff = torch.abs(x1 - x2)
        out = torch.pow(diff, norm).sum(0)
        return out + eps

    def forward(self, input):
        a = input[0]  # anchor
        p = input[1]  # positive
        n = input[2]  # negative
        N = a.size(0)  # number of anchors
        Li = torch.FloatTensor(N)
        for i in range(N):
            # per-triplet hinge : d(a, p) - d(a, n) + margin
            Li[i] = (self.Squared_L2dist(a[i], p[i]) - self.Squared_L2dist(a[i], n[i]) + self.mrg).clamp(min=1e-12)
        loss = Li.sum().div(N)
        return loss
( Proposed version )
class Dense_TripletLoss(Function):
    """Dense triplet loss over an anchor and its m paired samples."""
    def __init__(self, mrg=0.03):
        super(Dense_TripletLoss, self).__init__()
        self.mrg = mrg
        self.pdist = Squared_L2dist(2)  # squared L2 distance helper from the repo

    def forward(self, input, gt_dist):
        # "CONSIDERS DISTANCE" : paired samples are ordered by distance to the anchor
        m = input.size()[0] - 1  # number of paired samples
        a = input[0]   # anchor
        p = input[1:]  # paired samples
        # auxiliary variables : indc[i][j] is True for valid (closer i, farther j) pairs
        idxs = torch.arange(1, m + 1).cuda()
        indc = idxs.repeat(m, 1).t() < idxs.repeat(m, 1)
        dist = self.pdist.forward(a, p)
        # uniform weight coefficients over the valid pairs
        wgt = indc.clone().float()
        wgt = wgt.div(wgt.sum())
        # dense triplet margins : d(a, i) - d(a, j) + margin
        loss = dist.repeat(m, 1).t() - dist.repeat(m, 1) + self.mrg
        loss = loss.clamp(min=1e-12)
        loss = loss.mul(wgt).sum()
        return loss
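Note that Dense_TripletLoss above still uses a fixed margin; the paper's proposed log-ratio loss replaces it with the ratio of ground-truth label distances. A minimal sketch of that loss for a single (a, i, j) triplet, written directly from the equation above ( log_ratio_loss and its argument names are my own, not the repo's ) :

import torch

def log_ratio_loss(f_a, f_i, f_j, y_a, y_i, y_j, eps=1e-12):
    # D : squared L2 distance ( eps avoids log(0) )
    def D(u, v):
        return (u - v).pow(2).sum(-1) + eps
    # match the log distance ratio in embedding space to that in label space
    log_ratio_embed = torch.log(D(f_a, f_i)) - torch.log(D(f_a, f_j))
    log_ratio_label = torch.log(D(y_a, y_i)) - torch.log(D(y_a, y_j))
    return (log_ratio_embed - log_ratio_label).pow(2)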
2. **Proxy Anchor Loss** (2020)
- https://arxiv.org/pdf/2003.13911.pdf
Existing metric learning losses
- (1) pair-based : exploits rich data-to-data relations, but converges slowly
- (2) proxy-based : faster convergence!
- Proxy-Anchor combines the advantages of both ( see the comparison with previous works in the paper )
Loss Function (Proxy-NCA)
\(\begin{aligned} \ell(X) &=\sum_{x \in X}-\log \frac{e^{s\left(x, p^{+}\right)}}{\sum_{p^{-} \in P^{-}} e^{s\left(x, p^{-}\right)}} \\ &=\sum_{x \in X}\left\{-s\left(x, p^{+}\right)+\underset{p^{-} \in P^{-}}{\operatorname{LSE}}\, s\left(x, p^{-}\right)\right\} \end{aligned}\).
- Notation
- \(X\) : batch of embedding vectors
- \(x\) : embedding vector of input
- \(p^{+}\) : positive proxy
- \(P^{-}\) : set of negative proxies
- \(p^{-}\) : negative proxy
- \(s(\cdot, \cdot)\) : cosine similarity
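A minimal PyTorch sketch of Proxy-NCA under this notation, assuming one learnable proxy per class and cosine similarity ( ProxyNCA, n_classes, embed_dim are my own names, not from the paper's code ) :

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxyNCA(nn.Module):
    def __init__(self, n_classes, embed_dim):
        super().__init__()
        # one learnable proxy per class
        self.proxies = nn.Parameter(torch.randn(n_classes, embed_dim))

    def forward(self, X, labels):
        # s : (B, C) cosine similarities between embeddings and proxies
        s = F.normalize(X, dim=1) @ F.normalize(self.proxies, dim=1).t()
        pos = s[torch.arange(len(X)), labels]  # s(x, p+)
        # mask out the positive proxy, leaving only s(x, p-)
        neg = s.masked_fill(F.one_hot(labels, s.size(1)).bool(), float('-inf'))
        # -s(x, p+) + LSE over negative proxies, summed over the batch
        return (-pos + torch.logsumexp(neg, dim=1)).sum()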
Loss Function (Proxy-Anchor)
\(\begin{aligned} \ell(X)=& \frac{1}{\left|P^{+}\right|} \sum_{p \in P^{+}}\left[\operatorname{Softplus}\left(\underset{x \in X_{p}^{+}}{\operatorname{LSE}}-\alpha(s(x, p)-\delta)\right)\right] \\ &+\frac{1}{|P|} \sum_{p \in P}\left[\operatorname{Softplus}\left(\underset{x \in X_{p}^{-}}{\operatorname{LSE}}\, \alpha(s(x, p)+\delta)\right)\right] \end{aligned}\).
- Notation
- \(\delta>0\) : margin
- \(\alpha>0\) : scaling factor
- \(P\) : set of ALL proxies
- \(P^{+}\) : set of POSITIVE proxies
- for each proxy \(p\), a batch of embedding vectors \(X\) is divided into…
- (1) \(X_{p}^{+}\) : positive embedding vectors of \(p\)
- (2) \(X_{p}^{-}=X-X_{p}^{+}\) : the remaining (negative) embedding vectors of \(p\)
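Likewise, a minimal Proxy-Anchor sketch following the equation above ( same assumptions and naming caveats as the ProxyNCA sketch ) :

class ProxyAnchor(nn.Module):
    def __init__(self, n_classes, embed_dim, alpha=32.0, delta=0.1):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(n_classes, embed_dim))
        self.alpha, self.delta = alpha, delta

    def forward(self, X, labels):
        # s : (B, C) cosine similarities between embeddings and proxies
        s = F.normalize(X, dim=1) @ F.normalize(self.proxies, dim=1).t()
        pos_mask = F.one_hot(labels, s.size(1)).bool()  # x in X_p^+
        # LSE runs over the batch dim, per proxy; -inf masks drop invalid entries
        pos_term = (-self.alpha * (s - self.delta)).masked_fill(~pos_mask, float('-inf'))
        neg_term = (self.alpha * (s + self.delta)).masked_fill(pos_mask, float('-inf'))
        with_pos = pos_mask.any(dim=0)  # P+ : proxies with a positive in this batch
        pos_loss = F.softplus(torch.logsumexp(pos_term, dim=0))[with_pos].mean()
        neg_loss = F.softplus(torch.logsumexp(neg_term, dim=0)).mean()  # over all P
        return pos_loss + neg_loss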
Comparison between..
- (1) Proxy-NCA
- (2) Proxy-Anchor