Key Ideas

  • An end-to-end approach to object detection in images, replacing single-stage and two-stage methods that rely on anchors and proposals
  • Uses a bipartite matching loss, solved with the Hungarian algorithm, to enforce permutation invariance and unique matches between predictions and ground truth
  • Parallel decoding with Transformers instead of autoregressive models such as RNNs
  • Learns positional encodings (object queries) in the Transformer; these are responsible for detecting bounding boxes in different areas of an image
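
The "unique matches" point can be illustrated with a small sketch. The cost matrix below is made up, and the brute-force search over permutations is only a stand-in for the Hungarian algorithm: it reaches the same optimum but takes O(N!) time. A polynomial-time routine such as SciPy's `scipy.optimize.linear_sum_assignment` could be used in its place. The function names are hypothetical.

```python
from itertools import permutations

# Hypothetical cost matrix: cost[i][j] is the cost of assigning
# ground-truth object i to prediction j (lower is better).
cost = [
    [0.1, 0.9, 0.8],
    [0.2, 0.7, 0.9],
    [0.9, 0.8, 0.3],
]

def total_cost(cost, sigma):
    """Sum of the costs selected by the assignment sigma."""
    return sum(cost[i][sigma[i]] for i in range(len(cost)))

def greedy_match(cost):
    """Each ground truth independently picks its cheapest prediction."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost]

def bipartite_match(cost):
    """One-to-one assignment with minimal total cost.

    Brute-force stand-in for the Hungarian algorithm (O(N!) time).
    """
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda sigma: total_cost(cost, sigma)))

greedy = greedy_match(cost)      # prediction 0 is claimed twice
matched = bipartite_match(cost)  # every prediction is used exactly once

# Permutation invariance: reordering the predictions (columns)
# leaves the optimal total cost unchanged.
order = [2, 0, 1]
shuffled = [[row[j] for j in order] for row in cost]
same_total = abs(total_cost(cost, matched)
                 - total_cost(shuffled, bipartite_match(shuffled))) < 1e-12

print(greedy, matched, same_total)
```

Greedy assignment lets two ground-truth objects share one prediction, which is what produces duplicate detections; the one-to-one constraint rules that out by construction.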

Background Reading

  • Bipartite Matching Loss
  • Hungarian Algorithm
  • Transformer Architecture
  • Positional Encoding
  • IoU loss


Hungarian Loss:

$$L_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{I}_{\{c_i \ne \phi\}} L_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

$y$ - ground truth set of objects

$\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ - set of $N$ predictions

$y_i = (c_i, b_i)$

$c_i$ - ground truth class

$\hat{p}_i(c_i)$ - probability that prediction $i$ assigns to class $c_i$

$b_i \in [0,1]^4$ - vector defining $[center_x, center_y, height, width]$

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \left[ -\mathbb{I}_{\{c_i \ne \phi\}} \hat{p}_{\sigma(i)}(c_i) + \mathbb{I}_{\{c_i \ne \phi\}} L_{box}(b_i, \hat{b}_{\sigma(i)}) \right]$$

$\mathfrak{S}_N$ - set of permutations of $N$ elements

$\mathbb{I}$ - indicator function, equal to 1 when $c_i \ne \phi$ and 0 otherwise; $\phi$ denotes the "no object" class
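
As a rough sketch of how the matching $\hat{\sigma}$ could be computed for a toy example, the snippet below builds the pairwise matching cost and picks the cheapest permutation. All numbers are made up, `None` plays the role of $\phi$, and `box_cost` is a hypothetical L1-only stand-in for $L_{box}$. The exhaustive search stands in for the Hungarian algorithm and is only practical for very small $N$.

```python
from itertools import permutations

NO_OBJECT = None  # stands for the "no object" class (phi)

# Ground truth padded with "no object" up to N = 3 slots: (class, box).
# Boxes are [center_x, center_y, height, width] in [0, 1].
targets = [
    ("cat", (0.30, 0.30, 0.20, 0.20)),
    ("dog", (0.70, 0.60, 0.40, 0.30)),
    (NO_OBJECT, None),
]

# Predictions: (class probabilities, box). Values are illustrative only.
preds = [
    ({"cat": 0.1, "dog": 0.8, NO_OBJECT: 0.1}, (0.68, 0.62, 0.38, 0.30)),
    ({"cat": 0.1, "dog": 0.1, NO_OBJECT: 0.8}, (0.50, 0.50, 0.10, 0.10)),
    ({"cat": 0.7, "dog": 0.2, NO_OBJECT: 0.1}, (0.32, 0.29, 0.20, 0.22)),
]

def box_cost(b, b_hat):
    """L1 distance only: a simplified stand-in for L_box."""
    return sum(abs(x - y) for x, y in zip(b, b_hat))

def match_cost(target, pred):
    """Cost of pairing one ground-truth slot with one prediction."""
    c, b = target
    probs, b_hat = pred
    if c is NO_OBJECT:
        return 0.0  # the indicator is 0, so padded slots cost nothing
    return -probs[c] + box_cost(b, b_hat)

n = len(targets)
cost = [[match_cost(t, p) for p in preds] for t in targets]

# sigma_hat[i] is the index of the prediction matched to ground truth i.
sigma_hat = min(permutations(range(n)),
                key=lambda sigma: sum(cost[i][sigma[i]] for i in range(n)))
print(sigma_hat)
```

Here the "cat" slot is paired with the third prediction, the "dog" slot with the first, and the padded slot takes whichever prediction is left over.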

$$L_{box} = \lambda_{iou} L_{iou}(b_i, \hat{b}_{\hat{\sigma}(i)}) + \lambda_{L1} \|b_i - \hat{b}_{\hat{\sigma}(i)}\|_1$$
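
The box loss and the Hungarian loss can be sketched in a few lines once a matching is known. This is a minimal illustration under stated assumptions: $L_{iou}$ is taken to be $1 - IoU$ (a generalized IoU variant would slot into the same place), the weights `lambda_iou` and `lambda_l1` are illustrative values rather than ones taken from these notes, `None` plays the role of $\phi$, and `sigma_hat` is assumed to have been computed already. All function names are hypothetical.

```python
import math

def to_corners(box):
    """[center_x, center_y, height, width] -> (x0, y0, x1, y1)."""
    cx, cy, h, w = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def iou(b, b_hat):
    """Intersection over union of two center-format boxes."""
    ax0, ay0, ax1, ay1 = to_corners(b)
    bx0, by0, bx1, by1 = to_corners(b_hat)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def box_loss(b, b_hat, lambda_iou=2.0, lambda_l1=5.0):
    """Weighted sum of an IoU loss and the L1 distance between boxes."""
    l_iou = 1.0 - iou(b, b_hat)  # assumed form of L_iou
    l1 = sum(abs(x - y) for x, y in zip(b, b_hat))
    return lambda_iou * l_iou + lambda_l1 * l1

def hungarian_loss(targets, preds, sigma_hat):
    """Sum of class and box terms over the matched pairs."""
    total = 0.0
    for i, (c, b) in enumerate(targets):
        probs, b_hat = preds[sigma_hat[i]]
        total += -math.log(probs[c])   # class term, also for "no object"
        if c is not None:              # box term only for real objects
            total += box_loss(b, b_hat)
    return total

box = (0.5, 0.5, 0.2, 0.4)
targets = [("cat", box), (None, None)]
preds = [
    ({"cat": 0.1, None: 0.9}, (0.1, 0.1, 0.1, 0.1)),
    ({"cat": 0.9, None: 0.1}, box),
]
loss = hungarian_loss(targets, preds, sigma_hat=(1, 0))
print(loss)
```

In this toy case the matched box is exact, so the box term vanishes and only the two class terms contribute to the loss.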