Paper Notes: End-to-End Object Detection with Transformers

Key Ideas

An end-to-end object detection approach with respect to images instead by single or double stage methods using anchors and proposals
Uses bipartite matching loss function using Hungarian Algorithm to enforce permutation-invariance and unique matches
Parallel decoding with Transformers instead of auto-regressive models like RNN
Learn positional encoding using object queries in Transformers - these are responsible to detect bounding boxes in different areas of an image

[TODO: Add details about these]

Hungarian Loss:

L_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N}{[-\log{\hat{p}_{\hat{\sigma}(i)}(c_i)} + \mathbb{I}_{\{c_i \ne\phi\}} L_{box}(b_i, \hat{b}_i) ]}

$y$ - ground truth set of objects

$\hat{y}$ - set of predictions from 1 to $N$

$y_i = (c_i, b_i)$

$c_i$ - ground truth class

$\hat{p}_i$ - predicted class probability

$b_i \in [0,1]^4$ - vector defining $[center_x, center_y, height, width]$

\hat{\sigma} = \arg \min_{\sigma \in \mathbb{N}} \sum_{i}^{N}{-\mathbb{I}_{c_i\ne\phi} \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{I}_{c_i\ne\phi}{L_{box}(b_i, \hat{b}_i)}}

$\mathbb{I}$ - identity function equals to 1 when $c_i \ne \phi$ else 0

L_{box} = \lambda_{iou}L_{iou}(b_i, \hat{b}_i) + \lambda_{L1}||b_i - \hat{b}_{\hat{\sigma}(i}||