( Reference : Fast Campus, "한번에 끝내는 컴퓨터비전 초격차 패키지" (Computer Vision All-in-One Package) course )
Transformer for OD and HOI
1. DETR (Detection Transformer)
( Carion, Nicolas, et al. “End-to-end object detection with transformers.” European conference on computer vision. Springer, Cham, 2020. )
- https://arxiv.org/pdf/2005.12872
- for more details… https://seunghan96.github.io/cv/vision_22_DETR/
DETR = 3 main components
- (1) CNN backbone
- to extract compact feature representation
- (2) encoder-decoder transformer
- (3) simple FFN
- makes final prediction
(1) CNN backbone
- Input image : $x_{\mathrm{img}} \in \mathbb{R}^{3 \times H_{0} \times W_{0}}$
- Lower-resolution activation map : $f \in \mathbb{R}^{C \times H \times W}$
- $C=2048$
- $H, W=\frac{H_{0}}{32}, \frac{W_{0}}{32}$.
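As a quick sanity check of these shapes, here is a minimal sketch using a torchvision ResNet-50 as the backbone (DETR's default backbone is ResNet-50, but this wrapper and the example input size are illustrative, not the paper's code):

```python
import torch
import torchvision

# ResNet-50 up to (and including) its last conv stage: stride 32, 2048 output channels
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2])

x_img = torch.randn(1, 3, 800, 1066)   # x_img in R^{3 x H0 x W0}
f = backbone(x_img)                    # f in R^{C x H x W}, C = 2048
print(f.shape)                         # (1, 2048, 25, 34) ~= (C, H0/32, W0/32)
```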
(2-1) Transformer Encoder
- 1x1 convolution : reduce channel dimension of $f$
  - from $C$ to $d$
  - new feature map : $z_{0} \in \mathbb{R}^{d \times H \times W}$
- squeeze the spatial dimensions : $d \times H \times W \rightarrow d \times HW$
  - ( since the encoder expects a sequence as input )
  - i.e., a sequence of $HW$ elements, each of dimension $d$
- Encoder
  - multi-head self-attention module
  - feed forward network (FFN)
- Fixed positional encodings
  - $\because$ transformer architecture = permutation-invariant
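A shape-level sketch of the 1x1 projection, flattening, and encoder call (here $d=256$, and the random tensor standing in for the fixed positional encodings is an assumption, not a faithful reimplementation of DETR):

```python
import torch
import torch.nn as nn

d, C, H, W = 256, 2048, 25, 34
proj = nn.Conv2d(C, d, kernel_size=1)           # 1x1 conv: C -> d
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=6)

f = torch.randn(1, C, H, W)                     # backbone feature map
z0 = proj(f)                                    # (1, d, H, W)
seq = z0.flatten(2).permute(0, 2, 1)            # (1, HW, d): sequence of HW tokens
pos = torch.randn(1, H * W, d)                  # stand-in for fixed positional encodings
memory = encoder(seq + pos)                     # (1, HW, d)
print(memory.shape)
```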
(2-2) Transformer Decoder
- transforms $N$ learned embeddings of size $d$ ( object queries )
- difference from the original transformer :
  - (original) autoregressive model that predicts the output sequence one element at a time
  - (proposed) decodes the $N$ objects in parallel
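A sketch of parallel decoding with $N$ learned object queries, using PyTorch's generic `nn.TransformerDecoder` as a stand-in (DETR's actual decoder differs, e.g. in where the positional encodings are injected):

```python
import torch
import torch.nn as nn

d, N, HW = 256, 100, 25 * 34
query_embed = nn.Embedding(N, d)                # N learned object queries
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=6)

memory = torch.randn(1, HW, d)                  # encoder output
queries = query_embed.weight.unsqueeze(0)       # (1, N, d) -- all N fed at once
hs = decoder(tgt=queries, memory=memory)        # (1, N, d): no autoregressive loop
print(hs.shape)
```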
(3) simple FFN
- $N$ object queries are transformed into an output embedding by the decoder
- then, each is independently decoded by a FFN into :
  - (1) box coordinates
  - (2) class labels
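A sketch of the prediction heads on top of the $N$ output embeddings: a linear class head (with an extra "no object" class) and a 3-layer FFN producing normalized box coordinates, as described in the paper (the dimensions below are illustrative):

```python
import torch
import torch.nn as nn

d, N, num_classes = 256, 100, 91
class_head = nn.Linear(d, num_classes + 1)      # class labels incl. "no object"
box_head = nn.Sequential(                       # 3-layer FFN -> (cx, cy, w, h)
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4), nn.Sigmoid())              # normalized box coordinates

hs = torch.randn(1, N, d)                       # decoder output embeddings
logits = class_head(hs)                         # (1, N, num_classes + 1)
boxes = box_head(hs)                            # (1, N, 4)
print(logits.shape, boxes.shape)
```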
Hungarian Algorithm
- DETR matches its $N$ predictions to the ground-truth objects via bipartite (Hungarian) matching, and computes the set-prediction loss on the matched pairs
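A minimal sketch of the matching step with scipy's Hungarian solver; the cost here only combines class probability and an L1 box distance (DETR's full matching cost also includes a generalized IoU term, and the 5.0 weight is just a placeholder):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    """Match N predictions to M ground-truth objects (simplified cost)."""
    prob = pred_logits.softmax(-1)                          # (N, num_classes + 1)
    cost_class = -prob[:, tgt_labels]                       # (N, M): -p(target class)
    cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)     # (N, M): L1 box distance
    cost = (cost_class + 5.0 * cost_bbox).cpu().numpy()
    pred_idx, tgt_idx = linear_sum_assignment(cost)         # optimal 1-to-1 assignment
    return pred_idx, tgt_idx

# toy example: 100 predictions, 3 ground-truth objects
pred_idx, tgt_idx = hungarian_match(
    torch.randn(100, 92), torch.rand(100, 4),
    torch.tensor([3, 17, 52]), torch.rand(3, 4))
print(pred_idx, tgt_idx)
```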
2. HOI Detection Task?
HOI = Human-Object Interaction
3. InteractNet
( Gkioxari, Georgia, et al. “Detecting and recognizing human-object interactions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. )
- https://arxiv.org/pdf/1704.07333
Goal : detect triplets of the form < human, verb, object >
- (1) localize the …
- box containing a human ( $b_h$ )
- box for the associated object of interaction ( $b_o$ )
- (2) identify the action ( selected from among $A$ actions )
(1) Object Detection branch
( identical to that of Faster RCNN )
- step 1) generate object proposals with RPN
- step 2) for each proposal box $b$…
- extract features with RoIAlign
- perform (1) object classification & (2) bounding-box regression
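A sketch of step 2 using torchvision's `roi_align` op on a single proposal box; the 7x7 output size and 1/16 spatial scale are common Faster R-CNN settings assumed here for illustration:

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)                  # backbone feature map (stride 16 assumed)
# one proposal box b = (x1, y1, x2, y2) in image coordinates, prefixed by batch index 0
boxes = torch.tensor([[0.0, 120.0, 80.0, 360.0, 240.0]])
pooled = roi_align(feat, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                                 # (1, 256, 7, 7) -> fed to cls / box heads
```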
(2) Human Centric branch
Role 1) Action Classification
assign an action classification score $s^a_h$ to each human box $b_h$ and action $a$
( just like Object Detection branch … )
- extract features with RoIAlign from $b_h$
A human can simultaneously perform multiple actions…
- output layer : binary sigmoid classifiers for multilabel action classification
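A sketch of this multi-label output: one independent sigmoid per action, trained with binary cross-entropy (the feature dimension and the number of actions $A$ are placeholders):

```python
import torch
import torch.nn as nn

A, feat_dim = 26, 1024                               # A actions, pooled human feature size
action_head = nn.Linear(feat_dim, A)                 # one binary classifier per action

x_h = torch.randn(4, feat_dim)                       # RoIAlign features for 4 human boxes
s_a_h = torch.sigmoid(action_head(x_h))              # (4, A): independent per-action scores
targets = torch.randint(0, 2, (4, A)).float()        # a person may have several positive actions
loss = nn.functional.binary_cross_entropy_with_logits(action_head(x_h), targets)
print(s_a_h.shape, loss.item())
```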
Role 2) Target Localization
predict the target object location based on a person’s appearance
- predict a density over possible locations, and use this output together with the location of actual detected objects to precisely localize the target.
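A sketch of the target-localization idea: predict a mean target-box offset from the human feature, then score each detected object by a Gaussian density around that mean (the 4-d offset parameterization follows the paper's spirit; the fixed unit variance is a simplifying assumption):

```python
import torch
import torch.nn as nn

feat_dim = 1024
mu_head = nn.Linear(feat_dim, 4)                     # predicted mean offset of target box w.r.t. b_h

x_h = torch.randn(1, feat_dim)                       # human appearance feature
mu = mu_head(x_h)                                    # (1, 4)
candidate_offsets = torch.randn(5, 4)                # offsets of 5 detected objects w.r.t. b_h
# Gaussian density (unit variance assumed) used to weight each candidate target object
g = torch.exp(-0.5 * ((candidate_offsets - mu) ** 2).sum(dim=1))
print(g)                                             # higher = more likely interaction target
```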
(3) Interaction branch
human-centric model
- scores actions based on the HUMAN appearance
- problem : does not take into account the appearance of the TARGET object
4. iCAN
( Gao, Chen, Yuliang Zou, and Jia-Bin Huang. “ican: Instance-centric attention network for human-object interaction detection.” arXiv preprint arXiv:1808.10437 (2018). )
- https://arxiv.org/pdf/1808.10437.pdf
(1) Notation
- goal : predict HOI scores $S_{h, o}^{a}$, for each action $a \in \{1, \cdots, A\}$
- human-object bounding box pair : $\left(b_{h}, b_{o}\right)$
- $S_{h, o}^{a}=s_{h} \cdot s_{o} \cdot\left(s_{h}^{a}+s_{o}^{a}\right) \cdot s_{s p}^{a}$.
- score $S_{h, o}^{a}$ depends on..
- (1) confidence for the individual object detections ( $s_h$ & $s_o$ )
- (2) interaction prediction based on the appearance of the person $s_{h}^{a}$ and the object $s_{o}^{a}$
- (3) score prediction based on the spatial relationship between the person and the object $s_{s p}^{a}$.
- For some action classes without objects ( e.g., walk & smile )
- final scores : $s_{h} \cdot s_{h}^{a}$
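A tiny arithmetic sketch of the score fusion above (no learned parts; the numbers are made up for illustration):

```python
import torch

s_h, s_o = 0.9, 0.8                  # detection confidences for the person / object
s_a_h = torch.tensor([0.7, 0.1])     # per-action scores from the human stream (A = 2 here)
s_a_o = torch.tensor([0.6, 0.2])     # per-action scores from the object stream
s_a_sp = torch.tensor([0.8, 0.5])    # per-action scores from the pairwise (spatial) stream

S = s_h * s_o * (s_a_h + s_a_o) * s_a_sp       # S^a_{h,o} for every action a
S_no_object = s_h * s_a_h                      # for object-free actions (e.g., walk, smile)
print(S, S_no_object)
```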
(2) iCAN (Instance-Centric Attention Network) module
(3) Human / Object stream
extract both..
- (1) instance-level appearance feature
- for a person : $x_{\text{inst}}^h$
- for an object : $x_{\text{inst}}^o$
- (2) contextual features
- for a person : $x_{\text{context}}^h$
- for an object : $x_{\text{context}}^o$
based on an attention map
with 2 feature vectors ( (1) instance-level appearance feature & (2) contextual features )..
- step 1) concatenate them
- step 2) pass it to 2 FC layers
- step 3) get action scores $s_h^a$ & $s_o^a$
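A shape-level sketch of steps 1-3 for the human stream (the feature dimensions, hidden size, and $A$ are illustrative assumptions):

```python
import torch
import torch.nn as nn

A, d_inst, d_ctx = 26, 2048, 512
fc = nn.Sequential(                               # step 2) two FC layers
    nn.Linear(d_inst + d_ctx, 1024), nn.ReLU(),
    nn.Linear(1024, A))

x_inst_h = torch.randn(1, d_inst)                 # instance-level appearance feature
x_context_h = torch.randn(1, d_ctx)               # attention-weighted contextual feature
x = torch.cat([x_inst_h, x_context_h], dim=1)     # step 1) concatenate
s_a_h = torch.sigmoid(fc(x))                      # step 3) per-action scores s^a_h
print(s_a_h.shape)                                # (1, A)
```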
(4) Pairwise Stream
To encode spatial relationship between person & object
$\rightarrow$ adopt the 2-channel binary image representation, to characterize the interaction patterns
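A sketch of building such a 2-channel binary pattern: take the union box of $(b_h, b_o)$, rasterize it to a fixed grid, and fill channel 0 where the human box lies and channel 1 where the object box lies (the 64x64 resolution is an assumption):

```python
import torch

def interaction_pattern(b_h, b_o, size=64):
    """2-channel binary map over the union box of a human/object pair."""
    x1 = min(b_h[0], b_o[0]); y1 = min(b_h[1], b_o[1])
    x2 = max(b_h[2], b_o[2]); y2 = max(b_h[3], b_o[3])
    sx, sy = size / (x2 - x1), size / (y2 - y1)          # union box -> size x size grid
    pattern = torch.zeros(2, size, size)
    for ch, (bx1, by1, bx2, by2) in enumerate([b_h, b_o]):
        c1 = int((bx1 - x1) * sx); r1 = int((by1 - y1) * sy)
        c2 = int((bx2 - x1) * sx); r2 = int((by2 - y1) * sy)
        pattern[ch, r1:r2, c1:c2] = 1.0                  # channel 0: human, channel 1: object
    return pattern

p = interaction_pattern((10, 20, 60, 120), (50, 80, 140, 160))
print(p.shape, p[0].sum(), p[1].sum())
```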
5. UnionDet
- previous works : Sequential HOI detectors
- proposed method : Parallel HOI detector
6. HOTR