( Reference : Fast Campus, 한번에 끝내는 컴퓨터비전 초격차 패키지 (All-in-One Computer Vision package) )

Transformer for OD and HOI

1. DETR (Detection Transformer)

( Carion, Nicolas, et al. “End-to-end object detection with transformers.” European conference on computer vision. Springer, Cham, 2020. )

  • https://arxiv.org/pdf/2005.12872



  • for more details… https://seunghan96.github.io/cv/vision_22_DETR/


DETR = 3 main components

  • (1) CNN backbone
    • to extract compact feature representation
  • (2) encoder-decoder transformer
  • (3) simple FFN
    • makes final prediction


(1) CNN backbone

  • Input image : $x_{\mathrm{img}} \in \mathbb{R}^{3 \times H_{0} \times W_{0}}$
  • Lower-resolution activation map : $f \in \mathbb{R}^{C \times H \times W}$
    • $C=2048$
    • $H, W=\frac{H_{0}}{32}, \frac{W_{0}}{32}$.
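As a quick sanity check, here is a minimal PyTorch sketch of this step, assuming a torchvision ResNet-50 with its pooling / classification head removed ( the input resolution is just an example ) :

```python
import torch
import torchvision

# ResNet-50 up to the last conv stage (stride 32); drop avgpool & fc
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

x_img = torch.randn(1, 3, 800, 1216)           # x_img in R^{3 x H0 x W0} (resolution is just an example)
f = backbone(x_img)                            # f in R^{C x H x W}
print(f.shape)                                 # torch.Size([1, 2048, 25, 38]) : C=2048, H=H0/32, W=W0/32
```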


(2-1) Transformer Encoder

  • 1x1 convolution : reduce channel dimension of $f$

    • from $C$ to $d$

    • new feature map : $z_{0} \in \mathbb{R}^{d \times H \times W}$

  • squeeze : $d \times H \times W \rightarrow d \times HW$

    • (since the encoder expects a sequence as input)
    • i.e., a sequence of $HW$ tokens, each of dimension $d$
  • Encoder

    • multi-head self-attention module
    • feed forward network (FFN)
  • Fixed positional encodings

    • $\because$ transformer architecture = permutation-invariant
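A minimal sketch of the projection + flattening step with a stock PyTorch encoder, assuming $d=256$ ; the positional term below is a random placeholder added to the encoder input, whereas DETR uses fixed sinusoidal encodings added to the queries & keys of every attention layer :

```python
import torch
import torch.nn as nn

C, d, H, W = 2048, 256, 25, 38                 # channel reduction C -> d
f = torch.randn(1, C, H, W)                    # backbone activation map f

proj = nn.Conv2d(C, d, kernel_size=1)          # 1x1 convolution
z0 = proj(f)                                   # z0 : (1, d, H, W)

src = z0.flatten(2).permute(2, 0, 1)           # collapse spatial dims -> sequence of HW tokens: (HW, 1, d)
pos = torch.randn(H * W, 1, d)                 # placeholder positional encoding (DETR: fixed sinusoidal)

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(src + pos)                    # (HW, 1, d) encoder output
```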


(2-2) Transformer Decoder

  • transforms $N$ learned embeddings of size $d$ ( the object queries )
  • difference with original transformer :
    • (original) autoregressive model that predicts the output sequence one element at a time
    • (proposed) decodes the $N$ objects in parallel
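A sketch of this parallel decoding, assuming $d=256$, $N=100$ object queries, and a random tensor standing in for the encoder output :

```python
import torch
import torch.nn as nn

d, N = 256, 100                                # embedding size d, number of object queries N
memory = torch.randn(25 * 38, 1, d)            # stand-in for the encoder output (HW tokens, batch, d)

query_embed = nn.Embedding(N, d)               # N learned object queries
tgt = query_embed.weight.unsqueeze(1)          # (N, 1, d)

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# no causal mask: all N objects are decoded in parallel, not autoregressively
hs = decoder(tgt, memory)                      # (N, 1, d) output embeddings
```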


(3) simple FFN

  • $N$ object queries are transformed into output embeddings by the decoder

  • then, each is independently decoded by an FFN into

    • (1) box coordinates
    • (2) class labels
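A sketch of these heads, assuming $d=256$ and COCO's 91 class ids ; DETR uses a single linear layer for the class labels and a 3-layer MLP (with sigmoid) for the normalized box coordinates :

```python
import torch
import torch.nn as nn

d, N, num_classes = 256, 100, 91               # 91 = COCO class ids; "+1" below is the "no object" class
hs = torch.randn(N, 1, d)                      # decoder output embeddings

class_head = nn.Linear(d, num_classes + 1)     # (2) class labels
bbox_head = nn.Sequential(                     # (1) box coordinates: 3-layer FFN
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)

logits = class_head(hs)                        # (N, 1, num_classes + 1)
boxes = bbox_head(hs).sigmoid()                # (N, 1, 4), normalized (cx, cy, w, h)
```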




Hungarian Algorithm

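DETR matches its $N$ predictions one-to-one to the ground-truth objects by minimizing a pairwise matching cost with the Hungarian algorithm ; unmatched predictions are assigned the "no object" class. A toy sketch with scipy's `linear_sum_assignment` ( the cost values are made up ; in DETR the cost combines the class probability, an $\ell_1$ box distance, and a generalized IoU term ) :

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# toy cost matrix: rows = N predictions, columns = ground-truth objects
# (in DETR, cost[i, j] mixes -p_i(class_j), an L1 box distance and a generalized IoU term)
cost = np.array([
    [0.9, 0.2, 0.7],
    [0.1, 0.8, 0.6],
    [0.5, 0.4, 0.3],
    [0.7, 0.9, 0.1],
])

pred_idx, gt_idx = linear_sum_assignment(cost)  # optimal one-to-one matching
print(pred_idx, gt_idx)                         # [0 1 3] [1 0 2]
```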


2. HOI Detection Task?

HOI = Human-Object Interaction



3. InteractNet

( Gkioxari, Georgia, et al. “Detecting and recognizing human-object interactions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. )

  • https://arxiv.org/pdf/1704.07333




Goal : detect triplets of the form < human, verb, object >

  • (1) localize the …
    • box containing a human ( $b_h$ )
    • box for the associated object of interaction ( $b_o$ )
  • (2) identify the action ( selected from among $A$ actions )


(1) Object Detection branch

( identical to that of Faster R-CNN )

  • step 1) generate object proposals with an RPN
  • step 2) for each proposal box $b$…
    • extract features with RoIAlign
    • perform (1) object classification & (2) bounding-box regression
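A sketch of step 2 with torchvision's `roi_align` ; the feature map size, stride, and head dimensions are assumptions, not the paper's exact configuration :

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)                     # shared conv feature map (stride 16 assumed)
proposals = torch.tensor([[0, 10., 10., 200., 300.]])  # (batch_idx, x1, y1, x2, y2) from the RPN

# RoIAlign: a fixed-size feature for each proposal box b
roi_feat = roi_align(feat, proposals, output_size=(7, 7), spatial_scale=1 / 16)

head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 7 * 7, 1024), nn.ReLU())
cls_head = nn.Linear(1024, 81)                         # (1) object classification (80 classes + background)
box_head = nn.Linear(1024, 4 * 80)                     # (2) bounding-box regression (per-class deltas)

h = head(roi_feat)
scores, deltas = cls_head(h), box_head(h)
```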


(2) Human Centric branch

Role 1) Action Classification

assign an action classification score $s^a_h$ to each human box $b_h$ and action $a$

( just like Object Detection branch … )

  • extract features with RoIAlign from $b_h$


A human can simultaneously perform multiple actions…

  • output layer : binary sigmoid classifiers for multilabel action classification
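A minimal sketch of such a multilabel head ( the feature size and the number of actions $A$ are assumptions ) :

```python
import torch
import torch.nn as nn

A = 26                                          # number of actions (dataset-dependent; assumption)
human_feat = torch.randn(8, 1024)               # RoIAlign features from human boxes b_h

action_head = nn.Linear(1024, A)                # one binary classifier per action
logits = action_head(human_feat)
s_a_h = torch.sigmoid(logits)                   # independent sigmoids -> multilabel scores s^a_h

targets = torch.randint(0, 2, (8, A)).float()   # multi-hot action labels
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
```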


Role 2) Target Localization

predict the target object location based on a person’s appearance

  • predict a density over possible locations, and use this output together with the location of actual detected objects to precisely localize the target.
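A hedged sketch of this idea : per action, the branch predicts a mean target location $\mu_h^a$ relative to the human box, and each detected object box is scored by a Gaussian density around that prediction ( the box encoding and $\sigma$ below are assumptions of this sketch, not the paper's exact choices ) :

```python
import numpy as np

def encode_relative(b_o, b_h):
    """Encode the object box relative to the human box (Faster R-CNN style deltas).
    Boxes are (x1, y1, x2, y2); the exact encoding is an assumption of this sketch."""
    xh, yh = (b_h[0] + b_h[2]) / 2, (b_h[1] + b_h[3]) / 2
    wh, hh = b_h[2] - b_h[0], b_h[3] - b_h[1]
    xo, yo = (b_o[0] + b_o[2]) / 2, (b_o[1] + b_o[3]) / 2
    wo, ho = b_o[2] - b_o[0], b_o[3] - b_o[1]
    return np.array([(xo - xh) / wh, (yo - yh) / hh, np.log(wo / wh), np.log(ho / hh)])

def target_score(b_o, b_h, mu_a, sigma=0.3):
    """Gaussian density around the predicted target location mu_a (per action a)."""
    d = encode_relative(b_o, b_h) - mu_a
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2))

b_h = np.array([100., 100., 200., 300.])        # detected human box
b_o = np.array([180., 150., 260., 230.])        # candidate object box
mu_a = np.array([0.8, -0.1, -0.2, -0.8])        # predicted mean target location for action a
print(target_score(b_o, b_h, mu_a))             # higher when b_o is near where the person's appearance "points"
```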



(3) Interaction branch

the human-centric model

  • scores actions based on the HUMAN appearance

  • problem : it does not take into account the appearance of the TARGET object

$\rightarrow$ the Interaction branch therefore scores each action using the appearance of BOTH the human and the target object


4. iCAN

( Gao, Chen, Yuliang Zou, and Jia-Bin Huang. “ican: Instance-centric attention network for human-object interaction detection.” arXiv preprint arXiv:1808.10437 (2018). )

  • https://arxiv.org/pdf/1808.10437.pdf




(1) Notation

  • goal : predict HOI scores $S_{h, o}^{a}$, for each action $a \in \{1, \cdots, A\}$
  • human-object bounding box pair : $\left(b_{h}, b_{o}\right)$
  • $S_{h, o}^{a}=s_{h} \cdot s_{o} \cdot\left(s_{h}^{a}+s_{o}^{a}\right) \cdot s_{s p}^{a}$.
  • score $S_{h, o}^{a}$ depends on..
    • (1) confidence of the individual object detections ( $s_h$ & $s_o$ )
    • (2) interaction prediction based on the appearance of the person $s_{h}^{a}$ and of the object $s_{o}^{a}$
    • (3) score prediction based on the spatial relationship between the person and the object $s_{s p}^{a}$
  • for action classes without objects ( e.g. walk & smile )
    • final score : $s_{h} \cdot s_{h}^{a}$
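A tiny numeric example of this score fusion ( all numbers are made up ) :

```python
s_h, s_o = 0.95, 0.80                           # detection confidences for the human and the object
s_a_h, s_a_o = 0.70, 0.50                       # action scores from the human and object streams
s_a_sp = 0.60                                   # score from the pairwise (spatial) stream

S_a_ho = s_h * s_o * (s_a_h + s_a_o) * s_a_sp   # HOI score for action a
print(S_a_ho)                                   # ~ 0.547

# for object-less actions (e.g. walk, smile) only the human terms are used
print(s_h * s_a_h)                              # ~ 0.665
```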


(2) iCAN (Instance-Centric Attention Network) module

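A minimal sketch of one reading of this module : the instance-level feature acts as the query, the image-level convolutional features act as keys / values, and the softmax attention map yields a contextual feature ( all layer sizes are assumptions ) :

```python
import torch
import torch.nn as nn

class InstanceCentricAttention(nn.Module):
    """Sketch: an attention map conditioned on one instance's appearance feature."""
    def __init__(self, feat_dim=1024, conv_dim=1024, key_dim=512):
        super().__init__()
        self.q = nn.Linear(feat_dim, key_dim)        # embed the instance feature (query)
        self.k = nn.Conv2d(conv_dim, key_dim, 1)     # embed the feature map (keys)

    def forward(self, x_inst, feat_map):
        # x_inst: (B, feat_dim) instance-level appearance feature (from RoIAlign)
        # feat_map: (B, conv_dim, H, W) image-level convolutional features
        B, C, H, W = feat_map.shape
        q = self.q(x_inst)                           # (B, key_dim)
        k = self.k(feat_map).flatten(2)              # (B, key_dim, HW)
        attn = torch.softmax((q.unsqueeze(1) @ k).squeeze(1) / k.shape[1] ** 0.5, dim=-1)  # (B, HW)
        v = feat_map.flatten(2)                      # (B, conv_dim, HW)
        x_context = (v @ attn.unsqueeze(-1)).squeeze(-1)   # (B, conv_dim) contextual feature
        return x_context, attn.view(B, H, W)

m = InstanceCentricAttention()
ctx, attn_map = m(torch.randn(2, 1024), torch.randn(2, 1024, 25, 25))
```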


(3) Human / Object stream

extract both..

  • (1) instance-level appearance feature
    • for a person : $x_{\text{inst}}^h$
    • for an object : $x_{\text{inst}}^o$
  • (2) contextual features
    • for a person : $x_{\text{context}}^h$
    • for an object : $x_{\text{context}}^o$

based on an attentional map


with 2 feature vectors ( (1) instance-level appearance feature & (2) contextual features )..

  • step 1) concatenate them
  • step 2) pass it to 2 FC layers
  • step 3) get action scores $s_h^a$ & $s_o^a$
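The three steps above, as a sketch ( feature sizes follow the attention sketch and are assumptions ) :

```python
import torch
import torch.nn as nn

A = 26                                          # number of actions (assumption)
x_inst = torch.randn(2, 1024)                   # (1) instance-level appearance feature
x_context = torch.randn(2, 1024)                # (2) contextual feature from the iCAN module

head = nn.Sequential(                           # step 2) two FC layers
    nn.Linear(2048, 1024), nn.ReLU(),
    nn.Linear(1024, A),
)
x = torch.cat([x_inst, x_context], dim=1)       # step 1) concatenate
s_a = torch.sigmoid(head(x))                    # step 3) action scores s^a_h (or s^a_o)
```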


(4) Pairwise Stream

To encode the spatial relationship between person & object

$\rightarrow$ adopt a 2-channel binary image representation to characterize the interaction pattern
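A sketch of one common way to build this representation : channel 0 marks the human box & channel 1 the object box, both rasterized inside their joint bounding box at a fixed resolution ( resolution & rasterization details are assumptions ) :

```python
import numpy as np

def two_channel_map(b_h, b_o, size=64):
    """b_h, b_o: (x1, y1, x2, y2) boxes in image coordinates."""
    # joint bounding box of the human-object pair
    x1, y1 = min(b_h[0], b_o[0]), min(b_h[1], b_o[1])
    x2, y2 = max(b_h[2], b_o[2]), max(b_h[3], b_o[3])
    sx, sy = size / (x2 - x1), size / (y2 - y1)

    out = np.zeros((2, size, size), dtype=np.float32)
    for c, b in enumerate([b_h, b_o]):
        bx1, by1 = int((b[0] - x1) * sx), int((b[1] - y1) * sy)
        bx2, by2 = int((b[2] - x1) * sx), int((b[3] - y1) * sy)
        out[c, by1:by2, bx1:bx2] = 1.0           # fill the box region with ones
    return out

m = two_channel_map((100, 100, 200, 300), (180, 150, 260, 230))
print(m.shape, m[0].sum(), m[1].sum())           # (2, 64, 64) ...
```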


5. UnionDet

  • previous works : sequential HOI detectors

    • detect objects first, then associate human-object pairs & predict their interactions

  • proposed method : a parallel HOI detector

    • directly detects the union region of an interacting human-object pair, in parallel with instance detection


6. HOTR

HOTR applies a DETR-style transformer encoder-decoder to HOI detection, predicting < human, object, interaction > triplets in parallel as a set.

