Abstract
We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves 48.3AP in 12 epochs and 51.0AP in 36 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +4.9AP and +2.4AP, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2AP) and test-dev (63.3AP). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at https://github.com/IDEACVR/DINO.
Summary
Proposed innovations
- Extensive ablation studies are conducted to validate the effectiveness of the different design choices in DINO
- Following DAB-DETR, decoder queries are formulated as dynamic anchor boxes; following DN-DETR, noised ground-truth boxes and labels are fed to the decoder to stabilize bipartite matching during training; deformable attention is adopted for better computational efficiency
- Uses contrastive denoising training, mixed query selection, and a look forward twice scheme for box prediction
- To use the refined box information from later layers to optimize the parameters of their adjacent early layers, a new look forward twice scheme is proposed, i.e., gradients from later layers help correct the already-updated parameters
Problems addressed
- Early one-stage methods such as YOLOv2/v3 and two-stage methods such as Faster R-CNN all require an anchor-generation step and rely on hand-designed components such as NMS to remove duplicate boxes, which increases model complexity and design difficulty
- DETR-like models suffer from slow convergence caused by the instability of bipartite matching
Architecture
Preliminary
- Each positional query is a 4D anchor box $(x, y, w, h)$
- To address the slow convergence, noise is added to GT labels and anchor boxes (DN, denoising DETR). The noise is $(\Delta x, \Delta y, \Delta w, \Delta h)$, constrained by $|\Delta x| < \frac{\lambda w}{2}$, $|\Delta y| < \frac{\lambda h}{2}$, $|\Delta w| < \lambda w$, $|\Delta h| < \lambda h$ (a minimal sketch appears after this list)
- DINO formulates positional queries as dynamic anchor boxes and additionally computes a DN loss
- DINO also uses techniques from Deformable DETR:
- deformable attention
- iterative bounding box refinement ("look forward once") → used in each layer's parameter update
- query selection → better initialization of positional queries
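To make the DN noise constraints above concrete, here is a minimal sketch of box noising in PyTorch. It assumes normalized $(cx, cy, w, h)$ boxes; `lam` plays the role of $\lambda$, and the function name is ours, not DN-DETR's API.

```python
import torch

def dn_noise_boxes(gt_boxes: torch.Tensor, lam: float = 0.4) -> torch.Tensor:
    """Jitter GT boxes (cx, cy, w, h) under the DN-DETR constraints:
    |dx| < lam*w/2, |dy| < lam*h/2, |dw| < lam*w, |dh| < lam*h.
    A minimal sketch, not the reference implementation."""
    cx, cy, w, h = gt_boxes.unbind(-1)
    r = 2 * torch.rand_like(gt_boxes) - 1  # uniform noise in [-1, 1)
    dx = r[..., 0] * lam * w / 2           # center jitter, bounded by box size
    dy = r[..., 1] * lam * h / 2
    dw = r[..., 2] * lam * w               # size jitter, bounded by lam
    dh = r[..., 3] * lam * h
    return torch.stack([cx + dx, cy + dy, w + dw, h + dh], dim=-1)
```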
Model Overview
- backbone: ResNet/Swin Transformer → multi-scale features
- features + corresponding positional embeddings → Transformer encoder
- mixed query selection strategy → initialize anchors as positional queries for the decoder
- content queries are not initialized from encoder features; they are kept learnable
- initialized anchors & learnable content queries $\stackrel{\text{deformable attention}}{\longrightarrow}$ combine features of the encoder outputs, updating the queries layer by layer
- (refined content features $\stackrel{\text{predict}}{\longrightarrow}$ classification results) + refined anchor boxes → final outputs
- CDN (contrastive denoising) training → considers hard negative samples
- look forward twice, which passes gradients between adjacent layers → uses refined box information from later layers to optimize the parameters of their adjacent early layers
CDN (contrastive denoising)
- DN-DETR: effective in stabilizing training and accelerating convergence. However, it lacks the ability to predict "no object" for anchors with no object nearby: DN-DETR mainly focuses on recovering boxes from noise and is not explicitly trained to distinguish anchors close to a target from anchors far from any target (a sketch of the pos/neg query construction appears after this list)
- Objective: rejecting useless anchors
- two hyper-parameters $\lambda_1, \lambda_2$ with $\lambda_1 < \lambda_2$, controlling the noise scale
- a small $\lambda_2$ is adopted, since hard negative samples closer to GT boxes are more helpful for improving performance
- multiple noise pairs are used per sample, i.e., multiple CDN groups. For each CDN group, if an image has $n$ GT boxes, it has $2 \times n$ queries, with each GT box generating a positive and a negative query
- loss: $l_1$ + GIoU losses for box regression and focal loss for classification (a direct implementation of the GIoU loss follows this list)
- $L_{l1} = \sum_{i=1}^{4} |b_i - \hat{b}_i|$
- GIoU loss: $L_{GIoU} = 1 - \frac{|B \cap \hat{B}|}{|B \cup \hat{B}|} + \frac{|C - (B \cup \hat{B})|}{|C|}$, where $C$ is the smallest box enclosing both $B$ and $\hat{B}$
- focal loss (addresses class imbalance): $L_{focal} = -\alpha (1 - p_t)^{\gamma} \log(p_t)$
- the loss for classifying negative samples as background is also focal loss
- ATD(k) (average top-k distance): evaluates how far anchors are from their target GT boxes in the matching part (a small sketch follows this list)
- given $N$ GT bounding boxes $b_0, \cdots, b_{N-1}$ in a validation set, where $b_i = (x_i, y_i, w_i, h_i)$
- for each box, the corresponding anchor is denoted $a_i = (x_i, y_i, w_i, h_i)$. This $a_i$ is the initial box of the decoder whose refined box after the last decoder layer is assigned to $b_i$ during matching, i.e., the non-refined version
- $ATD(k) = \frac{1}{k} \sum \left\{ \mathrm{topK}\left(\{\|b_0 - a_0\|_1, \|b_1 - a_1\|_1, \cdots, \|b_{N-1} - a_{N-1}\|_1\}, k\right) \right\}$
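To make the positive/negative query construction concrete, here is a minimal sketch (not the authors' implementation) that builds one CDN group: each GT box yields a positive query with noise scale below $\lambda_1$ and a negative query with scale between $\lambda_1$ and $\lambda_2$. The default values are assumptions, not the paper's settings.

```python
import torch

def cdn_group(gt_boxes: torch.Tensor, lambda1: float = 1.0, lambda2: float = 2.0):
    """Build one CDN group from n GT boxes (cx, cy, w, h), shape (n, 4).
    Positives get noise scale < lambda1, negatives in (lambda1, lambda2).
    A minimal sketch, not the reference implementation."""
    def jitter(boxes: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
        scale = lo + (hi - lo) * torch.rand_like(boxes)  # magnitude ratio in [lo, hi)
        sign = torch.rand_like(boxes).round() * 2 - 1    # random sign +/-1
        # per-coordinate bounds (w/2, h/2, w, h), matching the DN constraints
        limits = torch.stack([boxes[:, 2] / 2, boxes[:, 3] / 2,
                              boxes[:, 2], boxes[:, 3]], dim=-1)
        return boxes + sign * scale * limits
    pos = jitter(gt_boxes, 0.0, lambda1)      # trained to reconstruct the GT box
    neg = jitter(gt_boxes, lambda1, lambda2)  # trained to predict "no object"
    return torch.cat([pos, neg], dim=0)       # 2*n queries for this group
```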
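For reference, a direct implementation of the GIoU loss formula above, assuming matched box pairs in corner $(x_1, y_1, x_2, y_2)$ form:

```python
import torch

def giou_loss(b: torch.Tensor, b_hat: torch.Tensor) -> torch.Tensor:
    """GIoU loss for (n, 4) boxes in (x1, y1, x2, y2) form, per the formula above."""
    lt = torch.max(b[:, :2], b_hat[:, :2])            # intersection top-left
    rb = torch.min(b[:, 2:], b_hat[:, 2:])            # intersection bottom-right
    inter = (rb - lt).clamp(min=0).prod(-1)           # |B ∩ B̂|
    area = (b[:, 2:] - b[:, :2]).prod(-1)
    area_hat = (b_hat[:, 2:] - b_hat[:, :2]).prod(-1)
    union = area + area_hat - inter                   # |B ∪ B̂|
    clt = torch.min(b[:, :2], b_hat[:, :2])           # smallest enclosing box C
    crb = torch.max(b[:, 2:], b_hat[:, 2:])
    c_area = (crb - clt).prod(-1)                     # |C|
    return 1 - inter / union + (c_area - union) / c_area
```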
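And the ATD(k) metric from the formula above, as a few lines of PyTorch (matched `gt_boxes` and `anchors` of shape $(N, 4)$ are assumed):

```python
import torch

def atd_k(gt_boxes: torch.Tensor, anchors: torch.Tensor, k: int) -> torch.Tensor:
    """Average top-k L1 distance between matched GT boxes and their initial anchors."""
    dists = (gt_boxes - anchors).abs().sum(-1)  # ||b_i - a_i||_1 per matched pair
    k = min(k, dists.numel())                   # guard against k > N
    return dists.topk(k).values.mean()          # mean of the k largest distances
```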
MQS (mixed query selection)
- DETR/DN-DETR both use static embeddings as decoder queries, without taking any encoder features from the images; they set the content queries to all-zero vectors
- Deformable DETR learns both positional and content queries. It selects the top-K encoder features from the last encoder layer to enhance the decoder queries; both positional and content queries are linear transformations of the selected features. Meanwhile, these selected features are fed into an auxiliary detection head to get predicted boxes
- DINO: only initializes anchor boxes using the position information associated with the top-K features, leaving the content queries static as before (see the sketch after this list)
- Why are the selected features not also used as content queries? Because they are preliminary content features without any refinement, they can be ambiguous and misleading to the decoder. For example, before refinement a feature may contain parts of multiple objects, or only part of one object, which would make it hard for a content query to represent a complete object
- this helps the model use better positional information to pool more comprehensive content features from the encoder
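A minimal sketch of mixed query selection under assumed names (`class_head` and `box_head` are stand-in heads scoring encoder tokens and predicting a box per token; this illustrates the idea rather than reproducing DINO's code). Only the positions of the top-K features initialize anchors; the content queries stay learnable.

```python
import torch
from torch import nn

def mixed_query_selection(memory: torch.Tensor, class_head: nn.Module,
                          box_head: nn.Module, content_embed: nn.Embedding, k: int):
    """memory: (B, HW, d) encoder output; content_embed holds k learnable vectors.
    Returns anchors initialized from top-k encoder features and untouched
    learnable content queries. A sketch, not DINO's actual implementation."""
    scores = class_head(memory).max(-1).values   # (B, HW) per-token confidence
    topk = scores.topk(k, dim=1).indices         # indices of the k best tokens
    sel = torch.gather(memory, 1,
                       topk.unsqueeze(-1).expand(-1, -1, memory.size(-1)))
    anchors = box_head(sel).sigmoid().detach()   # positional queries: 4D anchor boxes
    # content queries are NOT derived from `sel`; they remain learnable embeddings
    content = content_embed.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
    return anchors, content
```

In Deformable DETR's two-stage variant, `sel` would additionally be linearly transformed into content queries; DINO deliberately skips that step because the unrefined features can be ambiguous.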
LFT (look forward twice)
- conjecture: parameters of layer-i are influenced by losses of both layer-i and layer-i+1.
- each predicted offset $\Delta b_i$ is used to update a box twice: once for $b'_i$ and once for $b_i^{(pred)}$ (see the equations below)
- Given an input box $b_{i-1}$, we get the final prediction box $b_i^{(pred)}$ by:
  $$\Delta b_i = \mathrm{Layer}_i(b_{i-1}), \quad b'_i = \mathrm{Update}(b_{i-1}, \Delta b_i)$$
  $$b_i = \mathrm{Detach}(b'_i), \quad b_i^{(pred)} = \mathrm{Update}(b'_{i-1}, \Delta b_i)$$
- where $b'_i$ is the undetached version of $b_i$
- $\mathrm{Update}(\cdot, \cdot)$ refines the box $b_{i-1}$ by the predicted box offset $\Delta b_i$
- the same box update method as in Deformable DETR is adopted: boxes are kept in normalized form, so every box value is a float between 0 and 1; the box and the predicted offset are summed in inverse-sigmoid space, and the sum is mapped back through a sigmoid (a minimal sketch follows)
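Putting the equations and the box update together, here is a minimal sketch of look forward twice. `layers[i]` stands for the $i$-th decoder layer's offset prediction (a placeholder callable, not DINO's real module), and `delta` is the raw offset added in inverse-sigmoid (logit) space as in Deformable DETR.

```python
import torch

def inverse_sigmoid(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Numerically safe logit of a normalized box value in (0, 1)."""
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def update(box: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Deformable DETR-style update: add the predicted offset in logit space,
    then squash back so every box value stays in (0, 1)."""
    return torch.sigmoid(inverse_sigmoid(box) + delta)

def look_forward_twice(layers, b0: torch.Tensor):
    """Look forward twice: b'_i flows undetached into layer (i+1)'s prediction,
    so layer (i+1)'s loss also reaches layer i's parameters, while the detached
    b_i keeps the layer-by-layer refinement stable."""
    b_prev, b_prev_undet, preds = b0, b0, []
    for layer in layers:
        delta = layer(b_prev)                      # Delta b_i = Layer_i(b_{i-1})
        b_undet = update(b_prev, delta)            # b'_i (gradient kept)
        preds.append(update(b_prev_undet, delta))  # b_i^(pred) = Update(b'_{i-1}, Delta b_i)
        b_prev = b_undet.detach()                  # b_i = Detach(b'_i)
        b_prev_undet = b_undet
    return preds                                   # one prediction per decoder layer

# usage with toy layers: preds = look_forward_twice(
#     [torch.nn.Linear(4, 4) for _ in range(3)], torch.rand(10, 4))
```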