Abstract
In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training. It can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse the language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. After fine-tuning with COCO data, Grounding DINO reaches 63.0 AP. It sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
Summary
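Proposed innovations
- Extends the closed-set detector DINO to open-set detection by performing vision-language modality fusion at multiple stages. These include: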