( 참고 : 패스트 캠퍼스 , 한번에 끝내는 컴퓨터비전 초격차 패키지 )

Object Detection - RCNN

1. Visual Recognition Tasks

Image Classification
Object Detection : “bounding box”
Semantic Segmentation : “pixel-level”

2. Naive Approach

Object Detection (OD) = (1) + (2)

(1) Box Localization ( via object proposals )
- ex 1) Selective Search for Object Detection ( IJCV 2013 )
- ex 2) Edge Boxes : Locating Object Proposals from Edges ( ECCV 2014 )
(2) Box Classification ( via CNN )

3. RCNN (2014)

( Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014 )

RCNN = Region-based CNN

(1) Summary

2 stage detector
- stage # 1 : region proposal
- stage # 2 : classification
a) Region proposal :
- 2,000 proposed regions per image
- crop 2,000 regions into same size
  
  ( for a fixed-length vector, for regression & classification )
- put 2,000 regions to CNN ( $\rightarrow$ very slow )
b) Regression & Classification :
- with 2,000 fixed-length vectors, do regression & classification
- Classification
  - binary SVM x (num_classes+1) … for background
- Regression
  - IoU >= 0.3 : positive
  - IoU < 0.3 : negative

(2) Regression

Regression = Learning a transformation of bounding box

(3) Loss Function

$\begin{aligned} t_{x} &=\left(g_{x}-p_{x}\right) / p_{w} \\ t_{y} &=\left(g_{y}-p_{y}\right) / p_{h} \\ t_{w} &=\log \left(g_{w} / p_{w}\right) \\ t_{h} &=\log \left(g_{h} / p_{h}\right) \end{aligned}$.

$\mathcal{L}_{\mathrm{reg}}=\sum_{i \in\{x, y, w, h\}}\left(t_{i}-d_{i}(\mathbf{p})\right)^{2}+\lambda \mid \mid \mathbf{w} \mid \mid ^{2}$.

4. Fast RCNN

( Girshick, Ross. “Fast R-CNN: Fast Region-based Convolutional Networks for object detection.” Internation Conference on Computer Vision,(ICCV). 2015. )

Solve the problem of slow speed of RCNN

RCNN vs Fast RCNN

common ) Region proposal using selective search
difference # 1 ) SPEED
- RCNN :
  - step 1) 2,000 region proposals
  - step 2) crop 2,000 region proposals
  - step 3) CNN 2,000 times ( with proposed regions )
- Fast-RCNN
  - step 1-1) 2,000 region proposals
    
    step 1-2) CNN 1 time ( with original image )
  - step 2) crop regions from feature map ( RoI projection )
difference # 2 ) RoI Pooling
- Why RoI Pooling??
  - no need for fixed length vector for CNN Input!
    
    ( just need fixed length for FC input )
  - Thus, RoI Pooling after RoI Projection!
difference # 3 ) Classification
- instead of SVM, use FC layer

(1) Roi Pooling

(2) Loss Function

$\begin{aligned} \mathcal{L}\left(p, u, t^{u}, v\right) &=\mathcal{L}_{\mathrm{cls}}(p, u)+\mathbb{1}[u \geq 1] \mathcal{L}_{\mathrm{box}}\left(t^{u}, v\right) \\ \mathcal{L}_{\mathrm{cls}}(p, u) &=-\log p_{u} \\ \mathcal{L}_{\mathrm{box}}\left(t^{u}, v\right) &=\sum_{i \in\{x, y, w, h\}} L_{1}^{\text {smooth }}\left(t_{i}^{u}-v_{i}\right) \end{aligned}$.

where $L_{1}^{\mathrm{smooth}}(x)= \begin{cases}0.5 x^{2} & \text { if } \mid x \mid <1 \\ \mid x \mid -0.5 & \text { otherwise }\end{cases}$.

5. Faster RCNN

Solve the problem of selective search of RCNN/Fast RCNN

selective search : on CPU -> bottleneck!

Faster RCNN : end-to-end

region proposal inside CNN, using GPU
Process
- step 1) CNN with original image
- step 2) input feature map to RPN (Region Proposal Network)
  
  ( use these output region proposals, instead of selective search )
- below are same

Two Additional Loss ( of RPN )

(Loss 1) Classification Loss
(Loss 2) Box regression Loss

(1) RPN (Region Proposal Network)

Input : 7x7xC Feature Map
Kernel Size : 3 x 3 filter
Number of anchor boxes : K (=9)
Output : 7x7x6K Feature Map
- 6k = 2k + 4k
  - 2K : whether object is foreground/background
  - 4K : (X,Y,W,H)
$p^{*}=f(x)=\left{\begin{aligned} -1, & \text { if IoU }<0.3
1, & \text { if IoU }>0.7
0, & \text { otherwise } \end{aligned}\right.$.

(2) Loss Function

$\begin{aligned} \mathcal{L} &=\mathcal{L}_{\mathrm{cls}}+\mathcal{L}_{\mathrm{box}} \\ \mathcal{L}\left(\left\{p_{i}\right\},\left\{t_{i}\right\}\right) &=\frac{1}{N_{\mathrm{cls}}} \sum_{i} \mathcal{L}_{\mathrm{cls}}\left(p_{i}, p_{i}^{*}\right)+\frac{\lambda}{N_{\mathrm{box}}} \sum_{i} p_{i}^{*} \cdot L_{1}^{\text {smooth }}\left(t_{i}-t_{i}^{*}\right) \end{aligned}$.

where $\mathcal{L}_{\mathrm{cls}}\left(p_{i}, p_{i}^{*}\right)=-p_{i}^{*} \log p_{i}-\left(1-p_{i}^{*}\right) \log \left(1-p_{i}\right)$.

Twitter Facebook LinkedIn

(CV summary) 18. Object Detection - RCNN/Fast RCNN/Faster-RCNN

Seunghan Lee

Object Detection - RCNN

1. Visual Recognition Tasks

2. Naive Approach

3. RCNN (2014)

(1) Summary

(2) Regression

(3) Loss Function

4. Fast RCNN

(1) Roi Pooling

(2) Loss Function

5. Faster RCNN

(1) RPN (Region Proposal Network)

(2) Loss Function

You May Also Enjoy