1. BaseInfo
Title | Contrastive Grouping with Transformer for Referring Image Segmentation |
---|---|
Address | https://arxiv.org/pdf/2309.01017 |
Journal/Time | CVPR 2023 |
Author | ShanghaiTech University |
Code | https://github.com/SooLab/CGFormer |
Read | 202408013 |
Tags | #VisionLanguage #RIS |
2. Creative Q&A
Q1: Prior single-stage methods reason only at the pixel level; how can object-level understanding be introduced?
A1:
- CGFormer introduces object-level information through a grouping strategy.
- Learnable query tokens (Q), updated by alternately querying the vision and language features.
- Contrastive learning to separate the referent token from distractor tokens.
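The contrastive idea can be sketched as a token-level softmax cross-entropy: the referent token is the positive match for the sentence feature, and the distractor tokens act as negatives. A minimal numpy sketch (the function name, temperature value, and single-positive setup are illustrative assumptions, not the paper's exact loss):

```python
import numpy as np

def contrastive_loss(tokens, text_feat, referent_idx=0, tau=0.07):
    """Hypothetical sketch of a token-text contrastive objective: the
    referent token should match the sentence feature, while the other
    (distractor) tokens serve as in-batch negatives."""
    # Cosine similarity between each query token and the text feature
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    s = text_feat / np.linalg.norm(text_feat)
    logits = t @ s / tau                      # (num_tokens,)
    # Softmax cross-entropy with the referent token as the positive
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[referent_idx])
```

Pointing `referent_idx` at the token most similar to the text yields a lower loss than pointing it at a distractor, which is the pressure that makes the referent token discriminative.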
3. Concrete
3.1. Model
3.1.1. Input
Image + text.
The input image size is 480 × 480.
3.1.2. Backbone
Swin Transformer + BERT
The visual encoder is pre-trained on ImageNet-22K.
The text encoder is initialized with weights from HuggingFace.
Visual feature dimensions (four Swin stages): [128, 256, 512, 1024]
Language feature dimension: 768
Dimension of the tokens representing the referent and the other distractor objects/stuff: 512
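The dimension bookkeeping above can be sketched as linear projections of each modality into the shared 512-d token space (the projection matrices here are random stand-ins; the real model uses learned neck/projection layers):

```python
import numpy as np

# Channel widths from the note: Swin stage outputs and BERT hidden size
VIS_DIMS = [128, 256, 512, 1024]   # four Swin Transformer stages
TXT_DIM = 768                      # BERT hidden size
TOKEN_DIM = 512                    # shared referent/distractor token dim

rng = np.random.default_rng(0)
# Hypothetical per-stage linear projections into the shared 512-d space
vis_proj = [rng.normal(0, 0.02, (d, TOKEN_DIM)) for d in VIS_DIMS]
txt_proj = rng.normal(0, 0.02, (TXT_DIM, TOKEN_DIM))

# Example: project a flattened 60x60 stage-1 feature map and a 20-word sentence
feat = rng.normal(size=(60 * 60, VIS_DIMS[0])) @ vis_proj[0]   # (3600, 512)
words = rng.normal(size=(20, TXT_DIM)) @ txt_proj              # (20, 512)
```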
3.1.3. Neck
3.1.4. Decoder
CGFormer decoder.
Query tokens of dimension `token_dim`.
Two core CGAttention layers.
3.1.5. Loss
3.1.6. Optimizer
AdamW
3.2. Training
Name | Value |
---|---|
batch size | 64 |
Learning rate | 1e-4 |
epoch | 50 |
3.2.1. Resource
NVIDIA Tesla A40 GPUs.
3.2.2. Dataset
Name | Images | Note |
---|---|---|
RefCOCO | 19,994 | short expressions, ~3.5 words on average |
RefCOCO+ | 19,992 | no absolute-location words |
G-Ref | 26,711 | longer expressions, ~8.4 words on average |
ReferIt | 19,894 | |
3.3. Eval
Overall IoU (oIoU), mean IoU (mIoU), and precision at IoU thresholds 0.5, 0.7, and 0.9 (Pr@X).
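The three metrics differ in how they aggregate: oIoU pools intersections and unions over the whole dataset (so large objects dominate), mIoU averages per-image IoU, and Pr@X counts the fraction of images whose IoU exceeds the threshold. A small numpy sketch:

```python
import numpy as np

def iou(pred, gt):
    """IoU of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def metrics(preds, gts, thresholds=(0.5, 0.7, 0.9)):
    """oIoU: dataset-pooled intersection/union; mIoU: per-image average;
    Pr@X: fraction of images with IoU above each threshold."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    oiou = inter / union
    miou = float(np.mean(ious))
    prec = {f"Pr@{t}": float(np.mean([i > t for i in ious])) for t in thresholds}
    return oiou, miou, prec
```

On a perfect prediction plus a one-third-overlap prediction, oIoU (0.6) sits below mIoU (2/3) because the pooled union grows faster than the pooled intersection.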
3.4. Ablation
- Grouping plus the contrastive loss brings roughly a 4-point gain.
- Multi-scale decoding alone gains less than 1 point; combining multi-scale decoding with the grouping connections gains 1.47%.
4. Reference
5. Additional
The main takeaway is the idea of grouping the feature map; the innovation lies in the decoder head.