Universal Domain Adaptation ( 2019, 154 )

Contents

  0. Abstract
  1. Introduction
  2. Related Work
    1. Closed set DA
    2. Partial DA
    3. Open set DA (OSDA)
  3. Universal Domain Adaptation (UDA)
    1. Problem Setting
    2. Technical Challenges
    3. UAN (Universal Adaptation Network)
    4. Transferability Criterion


0. Abstract

UDA (Universal Domain Adaptation)

  • requires “no prior knowledge” on label sets

    • common label set
    • private label sets

    \(\rightarrow\) i.e., a category gap between domains is allowed

  • requires a model to either

    • (1) classify the target sample correctly ( if it belongs to the common label set )
    • (2) mark it as “unknown” ( otherwise )

\(\rightarrow\) propose UAN (Universal Adaptation Network)


1. Introduction

Existing DA algorithms tackle the domain gap by…

  • 1) learning domain invariant feature representation
  • 2) generating features/samples for target domains
  • 3) transforming samples between domains, through generative models


2 Challenges

  • 1) domain gap
  • 2) category gap


UDA (Universal Domain Adaptation)

  • given a “labeled” source domain,

  • classify target data correctly, if it belongs to a class in the source label set,

    or else mark it as “unknown”


2 Major challenges

  • 1) cannot decide which part of the source domain should be matched to which part of the target domain
  • 2) should be able to mark some target samples as “unknown”


UAN (Universal Adaptation Network)

  • quantify the transferability of each sample

  • criterion integrates both…

    • 1) domain similarity
    • 2) prediction uncertainty


2. Related Work

According to the “constraint on the label set” relationship between domains…

  • 1) closed set DA
  • 2) partial DA
  • 3) open set DA

( see Figure 2 of the paper )


(1) Closed set DA

  • deals only with the “domain gap”

  • fall into 2 categories

    • 1) feature adaptation
    • 2) generative model


a) feature adaptation

diminish the feature distribution discrepancy between 2 domains

examples )

  • minimize MMD of deep features across domains ( see the sketch after this list )

    ( MMD = Maximum Mean Discrepancy )

  • residual transfer structure & entropy minimization on target data

  • distribution alignment by optimizing CMD

    ( CMD = Central Moment Discrepancy )

  • construct bipartite graph to force feature distribution alignment within clusters

  • DA by minimizing EMD

    ( Earth Mover’s Distance )
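
For concreteness, a minimal sketch of the first example above: a (biased) RBF-kernel estimate of MMD² between two feature batches. The function name and `sigma` are my own illustrative choices:

```python
import torch

def rbf_mmd2(source_feat, target_feat, sigma=1.0):
    """Biased MMD^2 estimate between two (B, d) feature batches, RBF kernel."""
    def k(a, b):
        # pairwise squared Euclidean distances -> RBF kernel matrix
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (k(source_feat, source_feat).mean()
            + k(target_feat, target_feat).mean()
            - 2 * k(source_feat, target_feat).mean())
```

Minimizing this quantity w.r.t. the feature extractor pulls the two feature distributions together.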


b) generative model

learn a domain classifier to discriminate features from source & target domains

\(\rightarrow\) force the feature extractor to confuse the domain classifier ( adversarial learning paradigm )


(2) Partial DA

examples)

  • multiple domain discriminators with class-level & instance-level weighting mechanism
  • auxiliary domain discriminator to quantify the **probability of a source sample being similar to the target domain**
  • only one adversarial network & jointly applying class-level weighting on the source classifier


(3) Open set DA (OSDA)

classes private to each domain are unified into a single “unknown” class


3. Universal Domain Adaptation (UDA)

(1) Problem Setting

Notation

  • source domain : \(\mathcal{D}_{s}=\left\{\left(\mathbf{x}_{i}^{s}, \mathbf{y}_{i}^{s}\right)\right\}\)
    • distribution of source data : \(p\)
      • \(p_{\mathcal{C}_{s}}\) : distribution of source data with labels in \(C_s\)
      • \(p_{\mathcal{C}}\) : ~ in \(C\)
    • \(C_s\) : label set of source domain
  • target domain : \(\mathcal{D}_{t}=\left\{\mathbf{x}_{i}^{t}\right\}\)
    • distribution of target data : \(q\)
      • \(q_{\mathcal{C}_{t}}\) : distribution of target data with labels in \(C_t\)
      • \(q_{\mathcal{C}}\) : ~ in \(C\)
    • \(C_t\) : label set of target domain
  • Common label set : \(\mathcal{C}=\mathcal{C}_{s} \cap \mathcal{C}_{t}\)
  • Private label set :
    • \(\overline{\mathcal{C}}_{s}=\mathcal{C}_{s} \backslash \mathcal{C}\).
    • \(\overline{\mathcal{C}}_{t}=\mathcal{C}_{t} \backslash \mathcal{C}\).


Commonness between 2 domains

\(\xi=\frac{ \mid \mathcal{C}_{s} \cap \mathcal{C}_{t} \mid }{ \mid \mathcal{C}_{s} \cup \mathcal{C}_{t} \mid }\). ……… closed set DA : \(\xi=1\)
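
A toy illustration with made-up label sets:

```python
# hypothetical label sets ( class indices are made up )
C_s = set(range(0, 10))            # source label set
C_t = set(range(5, 15))            # target label set

C           = C_s & C_t            # common label set        {5, ..., 9}
C_s_private = C_s - C              # source-private labels   {0, ..., 4}
C_t_private = C_t - C              # target-private labels   {10, ..., 14}

xi = len(C_s & C_t) / len(C_s | C_t)
print(xi)                          # 5/15 = 0.333... ; closed set DA gives 1.0
```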


Task for UDA

= design a model that does not know \(\xi\), but works well across a wide spectrum of \(\xi\)


Goal : \(\min \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim q_{\mathcal{C}}}[f(\mathbf{x}) \neq \mathbf{y}]\)


(2) Technical Challenges

a) “Category Gap” = difference of label sets

( “Domain Gap” still exists! That is, \(p \neq q\) & \(p_{\mathcal{C}} \neq q_{\mathcal{C}}\) )


b) detecting “unknown” classes

  • target samples from the private label set \(\overline{\mathcal{C}}_{t}\) should be marked “unknown” ; they tend to show low classification confidence


(3) UAN (Universal Adaptation Network)

Consists of 4 parts

  • 1) feature extractor : \(F\)
  • 2) adversarial domain discriminator : \(D\)
  • 3) non-adversarial domain discriminator : \(D^{'}\)
  • 4) label classifier : \(G\)
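
A minimal PyTorch sketch of the four parts; the input size, feature dimension, and single-layer heads are placeholders, not the paper’s actual architecture:

```python
import torch
import torch.nn as nn

feat_dim, num_source_classes = 256, 10                 # illustrative sizes

F = nn.Sequential(nn.Flatten(),                        # 1) feature extractor F
                  nn.Linear(784, feat_dim), nn.ReLU())
D = nn.Sequential(nn.Linear(feat_dim, 1),              # 2) adversarial discriminator D
                  nn.Sigmoid())
D_prime = nn.Sequential(nn.Linear(feat_dim, 1),        # 3) non-adversarial discriminator D'
                        nn.Sigmoid())
G = nn.Linear(feat_dim, num_source_classes)            # 4) label classifier G over C_s
```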


Process

  • step 1) input \(\mathbf{x}\) is fed into \(F\) ….. \(\mathbf{z}=F(\mathbf{x})\)
  • step 2)
    • step 2-1) \(\mathbf{z}\) is fed into \(G\), to obtain “probability of class” over \(C_s\) ….. \(\mathbf{\hat{y}} = G(\mathbf{z})\)
    • step 2-2) \(\mathbf{z}\) is fed into \(D^{\prime}\), to obtain “domain similarity” ….. \(\hat{d}^{\prime} = D^{\prime}(\mathbf{z})\)
    • step 2-3) \(\mathbf{z}\) is fed into \(D\), to match feature distn of source & target ….. \(\hat{d} = D(\mathbf{z})\)
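
Continuing the sketch above, the whole forward process is then:

```python
x = torch.randn(32, 784)       # a dummy input batch

z = F(x)                       # step 1   : z = F(x)
y_hat = G(z).softmax(dim=1)    # step 2-1 : class probabilities over C_s
d_prime_hat = D_prime(z)       # step 2-2 : domain similarity d'
d_hat = D(z)                   # step 2-3 : adversarial domain prediction d
```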


Error (Loss Function) : \(E_G, E_{D^{'}}, E_D\)

  • \(E_{G} =\mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim p} L(\mathbf{y}, G(F(\mathbf{x})))\).
  • \(E_{D^{\prime}}=-\mathbb{E}_{\mathbf{x} \sim p} \log D^{\prime}(F(\mathbf{x})) -\mathbb{E}_{\mathbf{x} \sim q} \log \left(1-D^{\prime}(F(\mathbf{x}))\right)\).
  • \(E_{D}=-\mathbb{E}_{\mathbf{x} \sim p} w^{s}(\mathbf{x}) \log D(F(\mathbf{x})) -\mathbb{E}_{\mathbf{x} \sim q} w^{t}(\mathbf{x}) \log (1-D(F(\mathbf{x})))\).

where

  • \(L\) : cross-entropy loss
  • \(w^s(\mathbf{x})\) : probability of SOURCE sample \(\mathbf{x}\), belonging to “common label set \(C\)”
  • \(w^t(\mathbf{x})\) : probability of TARGET sample \(\mathbf{x}\), belonging to “common label set \(C\)”
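
Continuing the same sketch, the three errors on one labeled source batch and one unlabeled target batch; `eps` is my addition for numerical stability, and `w_s`, `w_t` are the per-sample weights from section (4):

```python
import torch
import torch.nn.functional as F_nn   # aliased to avoid clashing with the extractor F

eps = 1e-6

def uan_losses(z_s, y_s, z_t, w_s, w_t):
    """E_G, E_D', E_D ; w_s and w_t have shape (B, 1)."""
    # E_G : cross-entropy of G on labeled source features
    E_G = F_nn.cross_entropy(G(z_s), y_s)

    # E_D' : D' should output 1 on source, 0 on target ( no weighting )
    E_D_prime = -(torch.log(D_prime(z_s) + eps).mean()
                  + torch.log(1 - D_prime(z_t) + eps).mean())

    # E_D : same form, but weighted by w^s(x) and w^t(x)
    E_D = -((w_s * torch.log(D(z_s) + eps)).mean()
            + (w_t * torch.log(1 - D(z_t) + eps)).mean())
    return E_G, E_D_prime, E_D
```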


Training UAN

Min-max game

  • \(\max _{D} \min _{F, G} E_{G}-\lambda E_{D}\),

    \(\min _{D^{\prime}} E_{D^{\prime}}\),

  • where \(\lambda\) : hyperparameter to tradeoff “transferability & discriminability”
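
A gradient reversal layer ( GRL, from DANN ) is one standard way to realize such a max-min objective in a single backward pass; the sketch below assumes this trick and reuses the modules and `eps` defined earlier ( `w_s`, `w_t` are assumed precomputed without gradients ):

```python
from torch.autograd import Function

class GradReverse(Function):
    """Identity on forward ; multiplies the gradient by -lam on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def train_step(x_s, y_s, x_t, w_s, w_t, optimizer, lam=1.0):
    z_s, z_t = F(x_s), F(x_t)

    # min_{F,G} E_G : supervised loss on source
    E_G = F_nn.cross_entropy(G(z_s), y_s)

    # min_{D'} E_D' : features are detached so D' never influences F
    E_D_prime = -(torch.log(D_prime(z_s.detach()) + eps).mean()
                  + torch.log(1 - D_prime(z_t.detach()) + eps).mean())

    # E_D through the GRL : descending E_D updates D, while the reversed,
    # lam-scaled gradient makes F ascend it => max_D min_{F,G} E_G - lam * E_D
    r_s, r_t = GradReverse.apply(z_s, lam), GradReverse.apply(z_t, lam)
    E_D = -((w_s * torch.log(D(r_s) + eps)).mean()
            + (w_t * torch.log(1 - D(r_t) + eps)).mean())

    optimizer.zero_grad()
    (E_G + E_D + E_D_prime).backward()
    optimizer.step()
```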


Testing UAN

  • given input target sample \(\mathbf{x}\),
  • 2 predictions
    • 1) categorical prediction : \(\mathbf{\hat{y}}(\mathbf{x})\)
      • over source label set \(C_s\)
    • 2) domain prediction : \(\hat{d}^{\prime}(\mathbf{x})\)
  • final prediction : 1) + 2)
    • \(y(\mathbf{x})= \begin{cases}\text { unknown } & w^{t}(\mathbf{x})<w_{0} \\ \operatorname{argmax}(\hat{\mathbf{y}}) & w^{t}(\mathbf{x}) \geq w_{0}\end{cases}\).
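
As a sketch ( `weight_target` is defined under section (4) below ; the threshold value `w_0` and the label id for “unknown” are illustrative ):

```python
UNKNOWN = -1                     # id used here for the "unknown" class

def predict(x, w_0=0.0):
    with torch.no_grad():
        z = F(x)
        y_hat = G(z).softmax(dim=1)           # 1) categorical prediction over C_s
        w_t = weight_target(z)                # 2) domain prediction via d'(x)
        pred = y_hat.argmax(dim=1)
        pred[w_t.squeeze(1) < w_0] = UNKNOWN  # reject low-transferability samples
    return pred
```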


(4) Transferability Criterion

compute weighting

  • 1) \(w^{s}=w^{s}(\mathbf{x})\)
  • 2) \(w^{t}=w^{t}(\mathbf{x})\)

by sample-level transferability criterion


A well-established sample-level transferability criterion should satisfy…

  • 1) \(\mathbb{E}_{\mathbf{x} \sim p_{\mathcal{C}}} w^{s}(\mathbf{x})>\mathbb{E}_{\mathbf{x} \sim p_{\overline{\mathcal{C}}_{s}}} w^{s}(\mathbf{x})\)
  • 2) \(\mathbb{E}_{\mathbf{x} \sim q_{\mathcal{C}}} w^{t}(\mathbf{x})>\mathbb{E}_{\mathbf{x} \sim q_{\overline{\mathcal{C}}_{t}}} w^{t}(\mathbf{x})\)


a) Domain Similarity

objective of \(D^{'}\) :

  • predict samples from SOURCE as 1
  • predict samples from TARGET as 0


output \(d^{'}\)

  • high = more similar to SOURCE
  • low = more similar to TARGET


b) Prediction Uncertainty

Entropy

  • low = more confident
  • high = more uncertain


Hypothesize

  • \(\mathbb{E}_{\mathbf{x} \sim q_{\overline{\mathcal{C}}_{t}}} H(\hat{\mathbf{y}})> \mathbb{E}_{\mathbf{x} \sim q_{\mathcal{C}}} H(\hat{\mathbf{y}})>\mathbb{E}_{\mathbf{x} \sim p_{\mathcal{C}}} H(\hat{\mathbf{y}})>\mathbb{E}_{\mathbf{x} \sim p_{\overline{\mathcal{C}}_{s}}} H(\hat{\mathbf{y}})\).


Since the SOURCE is labeled & the TARGET is unlabeled, only the weaker grouping is guaranteed :

  • \(\mathbb{E}_{\mathbf{x} \sim q_{\overline{\mathcal{C}}_{t}}} H(\hat{\mathbf{y}}), \mathbb{E}_{\mathbf{x} \sim q_{\mathcal{C}}} H(\hat{\mathbf{y}})>\mathbb{E}_{\mathbf{x} \sim p_{\mathcal{C}}} H(\hat{\mathbf{y}}), \mathbb{E}_{\mathbf{x} \sim p_{\overline{\mathcal{C}}_{s}}} H(\hat{\mathbf{y}})\).


Define sample-level transferability criterion as…

  • 1) \(w^{s}(\mathbf{x})=\frac{H(\hat{\mathbf{y}})}{\log \mid \mathcal{C}_{s} \mid }-\hat{d}^{\prime}(\mathbf{x})\).
  • 2) \(w^{t}(\mathbf{x})=\hat{d}^{\prime}(\mathbf{x})-\frac{H(\hat{\mathbf{y}})}{\log \mid \mathcal{C}_{s} \mid }\).
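
Putting the two ingredients together, a sketch of both weights ( entropy is normalized by \(\log \mid \mathcal{C}_{s} \mid\), its maximum, so both terms are on comparable scales ); `G`, `D_prime`, and `num_source_classes` come from the earlier sketch:

```python
import math
import torch

def entropy(probs, eps=1e-6):
    """Per-sample Shannon entropy H(y_hat), shape (B, 1)."""
    return -(probs * torch.log(probs + eps)).sum(dim=1, keepdim=True)

def weight_source(z):
    """w^s(x) = H(y_hat)/log|C_s| - d'(x) : larger on source samples from C."""
    y_hat = G(z).softmax(dim=1)
    return entropy(y_hat) / math.log(num_source_classes) - D_prime(z)

def weight_target(z):
    """w^t(x) = d'(x) - H(y_hat)/log|C_s| : larger on target samples from C."""
    y_hat = G(z).softmax(dim=1)
    return D_prime(z) - entropy(y_hat) / math.log(num_source_classes)
```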