
[Paper Notes] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Date: 2022-11-07 15:38:45
Tags: Real, RoI, Towards, Faster, network, region, CNN, anchor



Contents

  • I. The idea behind Faster R-CNN
  • 1.1. R-CNN, Fast R-CNN, and Faster R-CNN compared
  • 1.2. Network structure of Faster R-CNN
  • II. The Region Proposal Network (RPN) in detail
  • 2.1. Feature extraction
  • 2.2. Candidate regions (anchors)
  • 2.3. Bounding-box regression
  • 2.4. Proposal refinement
  • III. The RoI Pooling layer
  • 3.1. Why RoI Pooling is needed
  • 3.2. How RoI Pooling works
  • IV. Classification and box regression
  • V. Training Faster R-CNN
  • Reference blogs

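I. The idea behind Faster R-CNN

1.1. R-CNN, Fast R-CNN, and Faster R-CNN compared

From R-CNN to Fast R-CNN and then to the Faster R-CNN discussed here, the four basic steps of object detection (1. region proposal generation, 2. feature extraction, 3. classification, 4. location refinement) are finally unified in a single deep network. No computation is repeated and everything runs on the GPU, which greatly improves speed.

  • The relationship among the three is shown in the figure below:


[Figure: relationship among R-CNN, Fast R-CNN, and Faster R-CNN]

  • The three are compared in the table below:


[Table image: comparison of R-CNN, Fast R-CNN, and Faster R-CNN]

  • Faster R-CNN can be viewed simply as "Region Proposal Network (RPN) + Fast R-CNN", which raises three questions:
  1. How to design the region proposal network
  2. How to train the region proposal network
  3. How to let the region proposal network and the Fast R-CNN network share the feature extraction network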

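1.2. Network structure of Faster R-CNN

The unified network structure of Faster R-CNN is shown in the figure below; it can be viewed simply as an RPN plus a Fast R-CNN network.



[Figure: Faster R-CNN network structure]

Note: the Fast R-CNN part of the figure above contains its own convolutional layers, so in my reading not all convolutional layers are shared. The steps are:

  • 1. Feed an image of arbitrary size into the CNN (ZF or VGG-16);
  • 2. Forward-propagate through the CNN to the last shared convolutional layer. This yields the feature map that is fed to the RPN; the forward pass also continues through the Fast R-CNN-specific convolutional layers to produce a higher-dimensional feature map;
  • 3. The RPN turns its input feature map into region proposals with region scores, applies non-maximum suppression (threshold 0.7) to the scored regions, and passes the Top-N (300 in the paper) scoring proposals to the RoI pooling layer;
  • 4. The higher-dimensional feature map from step 2 and the region proposals from step 3 are fed together into the RoI pooling layer, which extracts the features for each proposal;
  • 5. The proposal features from step 4 pass through fully connected layers, which output the classification scores and the regressed bounding box for each region.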

II. The Region Proposal Network (RPN) in detail

  • The basic idea: on the extracted feature map, classify all possible candidate boxes. Because a location refinement step follows, the candidate boxes can actually be fairly sparse.


[Figure: region proposal network]

2.1. Feature extraction

  • The RPN still needs a CNN to extract features from the original image. For ease of understanding, suppose this preceding CNN produces a feature of size $51 \times 39 \times 256$, i.e. height 51, width 39, and 256 channels. One more convolution is applied to this feature, keeping the width, height, and number of channels unchanged, which again gives a $51 \times 39 \times 256$ feature.
  • For convenience, define a "position": a $51 \times 39 \times 256$ convolutional feature is said to have $51 \times 39$ "positions". Each "position" of the new convolutional feature is "responsible" for detecting 9 boxes of different sizes at the corresponding location in the original image, where the detection task is to decide whether a box contains an object. This gives $51 \times 39 \times 9$ "boxes" in total. The original Faster R-CNN paper calls all of these boxes "anchors".

2.2. Candidate regions (anchors)

  • The feature can be viewed as a 256-channel image of size $51 \times 39$. At each position of this image, 9 candidate windows are considered: three areas, $128^2$, $256^2$, and $512^2$, each in three aspect ratios, $1:1$, $1:2$, and $2:1$, so each position has $3 \times 3 = 9$ anchors. These candidate windows are called anchors, and they bring in the multi-scale approach commonly used in detection (detecting objects of various sizes). The figures below show the $51 \times 39$ anchor centers and examples of the 9 anchor shapes. Note: in the first figure, each group of 3 anchors with the same aspect ratio is drawn at a different position to make them easier to see; in practice every position has all 9, as the second figure shows.

[Figure: the $51 \times 39$ anchor centers and examples of the 9 anchor shapes]

[Figure: all 9 anchors at a single position]
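The anchor layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `base_anchors` and `all_anchors` are made up for this note, the stride of 16 pixels between neighbouring feature-map positions and the cell-centre placement are assumptions (the text does not state them), and an aspect ratio is interpreted here as height divided by width.

```python
import math

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 (w, h) shapes: each scale s has area s*s, each ratio is h / w (assumed)."""
    shapes = []
    for s in scales:
        area = float(s * s)
        for r in ratios:
            w = math.sqrt(area / r)  # chosen so that w * h == area and h / w == r
            h = w * r
            shapes.append((w, h))
    return shapes

def all_anchors(feat_h=51, feat_w=39, stride=16):
    """Place the 9 base shapes at every feature-map position.

    Boxes are (cx, cy, w, h) in input-image pixels; stride=16 is an assumption.
    """
    shapes = base_anchors()
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride  # centre of the cell, mapped back to the image
            cy = (row + 0.5) * stride
            for w, h in shapes:
                anchors.append((cx, cy, w, h))
    return anchors

anchors = all_anchors()
print(len(anchors))  # 51 * 39 * 9 = 17901
```

The total of 17901 anchors matches the $51 \times 39 \times 9$ count given in the text.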

For these $51 \times 39$ positions and $51 \times 39 \times 9$ anchors, the figure below shows the computation carried out at each position:

  • $k$ is the number of anchors per position, here $k = 9$. Region proposal is implemented by adding one $3 \times 3$ sliding-window operation and two convolutional layers;
  • A small network slides over the feature map produced by the last convolution. At each step it is fully connected to an $n \times n$ window of the feature map ($n = 3$ in the paper; the effective receptive field on the image is large, 171 pixels for ZF and 228 pixels for VGG) and maps it to a low-dimensional vector (256-d for ZF, 512-d for VGG). This vector is then fed into two sibling fully connected layers, the box-regression layer (reg) and the box-classification layer (cls). Because the window slides over every position, the reg and cls layers cover the whole conv5-3 feature space.
  • For each sliding-window position the network outputs $k$ region scores, giving the probability that each anchor at that position is an object; this part has total output length $2 \times k$ (two outputs per anchor: the probability of being an object and the probability of not being one). It also outputs $k$ regressed region proposals (box regression); each anchor has 4 box-regression parameters, so the box-regression part has total output length $4 \times k$. After non-maximum suppression on the scored regions, the Top-N (300 in the paper) scoring regions are output, telling the detection network which regions to attend to. In essence this takes over the role of methods such as Selective Search and EdgeBoxes.


[Figure: per-position computation in the RPN]

  • reg layer: predicts the $(x, y, w, h)$ of the proposal associated with each anchor;
  • cls layer: decides whether the proposal is foreground (object) or background (non-object);
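As a quick check of the output sizes mentioned above, the following sketch counts the cls and reg outputs for the running $51 \times 39$ example with $k = 9$. The helper name `rpn_output_sizes` is hypothetical and exists only for this illustration.

```python
def rpn_output_sizes(feat_h, feat_w, k):
    """Count RPN outputs for a feat_h x feat_w feature map with k anchors per position."""
    positions = feat_h * feat_w
    cls_per_position = 2 * k  # object / not-object score for each anchor
    reg_per_position = 4 * k  # four box-regression parameters for each anchor
    return {
        "anchors": positions * k,
        "cls_per_position": cls_per_position,
        "reg_per_position": reg_per_position,
        "cls_total": positions * cls_per_position,
        "reg_total": positions * reg_per_position,
    }

sizes = rpn_output_sizes(51, 39, 9)
print(sizes["cls_per_position"], sizes["reg_per_position"])  # 18 36
```

So each position carries 18 classification outputs and 36 regression outputs before any filtering.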

2.3. Bounding-box regression

In the figure below, the green box is the ground truth (GT) of the airplane and the red box is an extracted foreground anchor. Even if the classifier labels the red box as an airplane, its localization is poor, so the airplane has effectively not been detected correctly. We therefore want a way to fine-tune the red box so that the foreground anchor moves closer to the GT.



[Figure: ground-truth box (green) and foreground anchor (red) on an airplane image]

A window is usually represented by a four-dimensional vector $(x, y, w, h)$: the coordinates of its center, and its width and height.
Let $A$ denote the original foreground anchor and $G$ the ground truth of the target. The goal is to find a mapping that takes the original anchor $A$ to a regressed window $G'$ that is closer to the ground-truth box $G$, that is:

  • Given: anchor $A = (A_x, A_y, A_w, A_h)$ and $GT = (G_x, G_y, G_w, G_h)$
  • Find a transform $F$ such that $F(A_x, A_y, A_w, A_h) = (G'_x, G'_y, G'_w, G'_h)$, where $(G'_x, G'_y, G'_w, G'_h) \approx (G_x, G_y, G_w, G_h)$


[Figure: anchor $A$, regressed window $G'$, and ground truth $G$]

So what transform $F$ turns anchor $A$ in the figure above into $G'$?

  • First translate:
    $G'_x = A_w \cdot d_x(A) + A_x$, $G'_y = A_h \cdot d_y(A) + A_y$
  • Then scale:
    $G'_w = A_w \cdot \exp(d_w(A))$, $G'_h = A_h \cdot \exp(d_h(A))$
  • The four formulas above involve four quantities to learn: $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$. When the input anchor $A$ differs only slightly from the GT, the transform can be treated as approximately linear, so the window can be fine-tuned with a linear regression model (note that linear regression applies only when anchor $A$ and the GT are close; otherwise the problem is a complicated nonlinear one).
  • The next question is how to obtain $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$ by linear regression. Linear regression means: given an input feature vector $X$, learn a set of parameters $W$ so that the regressed value is very close to the true value $Y$, i.e. $Y \approx WX$. In this problem the input $X$ is the CNN feature map, denoted $\phi$; training also takes the transform between $A$ and the GT, $(t_x, t_y, t_w, t_h)$. The outputs are the four transforms $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$. The objective function can then be written as:
    $d_*(A) = W_*^T \cdot \phi(A)$
    where $\phi(A)$ is the feature vector formed from the feature map of the corresponding anchor, $W_*$ is the parameter to learn, and $d_*(A)$ is the predicted value ($*$ stands for $x$, $y$, $w$, $h$, so each transform has its own objective function of this form). To minimize the gap between the prediction $d_*(A)$ and the true value $t_*$, define the loss function:
    $\text{Loss} = \sum_{i=1}^{N} \left( t_*^i - W_*^T \cdot \phi(A^i) \right)^2$
  • The optimization objective is:
    $\hat{W}_* = \arg\min_{W_*} \sum_{i=1}^{N} \left( t_*^i - W_*^T \cdot \phi(A^i) \right)^2 + \lambda \lVert W_* \rVert^2$
  • Note that the linear transform above holds approximately only when the GT is close to the box being regressed. With the principle covered, in the Faster R-CNN paper the translation $(t_x, t_y)$ and the scale factors $(t_w, t_h)$ between a foreground anchor and the ground truth are:
    $t_x = (G_x - A_x) / A_w$, $t_y = (G_y - A_y) / A_h$, $t_w = \log(G_w / A_w)$, $t_h = \log(G_h / A_h)$

To train the bounding-box regression branch of the network, the input is the CNN feature $\phi$ and the supervision signal is the offset between the anchor and the GT, $(t_x, t_y, t_w, t_h)$; the training goal is for the network output, given $\phi$, to be as close as possible to the supervision signal. When bounding-box regression is then run on a new input $\phi$, the output of the regression branch is the translation and scale $(t_x, t_y, t_w, t_h)$ for each anchor, which can be used directly to correct the anchor position.
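The relationship between the regression targets and the corrected box can be illustrated with a small round trip: compute the targets from an anchor and a ground-truth box, then apply them back to the anchor. This is a sketch, not the paper's code; the names `encode` and `decode` are chosen for this note, and boxes are assumed to be `(center_x, center_y, width, height)` tuples.

```python
import math

def encode(anchor, gt):
    """Regression targets (t_x, t_y, t_w, t_h) that move `anchor` onto `gt`."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw,        # shift in x, in units of anchor width
            (gy - ay) / ah,        # shift in y, in units of anchor height
            math.log(gw / aw),     # log width ratio
            math.log(gh / ah))     # log height ratio

def decode(anchor, deltas):
    """Apply (d_x, d_y, d_w, d_h) to `anchor`: translate first, then scale."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (aw * dx + ax,
            ah * dy + ay,
            aw * math.exp(dw),
            ah * math.exp(dh))

anchor = (100.0, 100.0, 128.0, 128.0)   # made-up example values
gt = (116.0, 92.0, 256.0, 64.0)
targets = encode(anchor, gt)
restored = decode(anchor, targets)
print(targets[:2])   # (0.125, -0.0625)
```

Decoding with the exact targets recovers the ground-truth box up to floating-point error, which is why a branch trained to output values close to $(t_x, t_y, t_w, t_h)$ can correct the anchor.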

2.4. Proposal refinement

Once the correction parameters $(d_x(A), d_y(A), d_w(A), d_h(A))$ have been obtained for every candidate anchor $A$, the refined anchors can be computed. The anchors are then sorted by region score in descending order, anchors with very small width or height are removed (or other filtering conditions are applied), non-maximum suppression is performed, and the Top-N anchors are kept and output as proposals.
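The filtering steps above (sort by score, drop tiny boxes, non-maximum suppression, keep Top-N) might look roughly like the following pure-Python sketch. Several details are assumptions made for illustration: boxes are taken as corner coordinates `(x1, y1, x2, y2)`, the minimum side length `min_size=16` is an arbitrary value, and the function names are hypothetical. The IoU threshold of 0.7 and Top-N of 300 come from the text.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_proposals(boxes, scores, min_size=16, iou_thresh=0.7, top_n=300):
    """Sort by score, drop small boxes, run greedy NMS, keep at most top_n."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = [i for i in order
             if boxes[i][2] - boxes[i][0] >= min_size
             and boxes[i][3] - boxes[i][1] >= min_size]
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with a better one
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
        if len(keep) == top_n:
            break
    return [boxes[i] for i in keep]

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300), (0, 0, 4, 4)]
scores = [0.9, 0.8, 0.7, 0.95]
print(select_proposals(boxes, scores))
# [(0, 0, 100, 100), (200, 200, 300, 300)]
```

In the example the tiny box is removed despite its high score, and the second box is suppressed because its IoU with the first is about 0.82.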

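III. The RoI Pooling layer

The RoI Pooling layer collects all proposals, computes the feature map for each one, and passes the result to the subsequent network. As the Faster R-CNN structure diagram shows, the RoI Pooling layer has two inputs:

  • the original feature map;
  • the proposals output by the RPN;

3.1. Why RoI Pooling is needed

Consider a problem first: for a traditional CNN (such as AlexNet or VGG), once the network is trained the input image size must be a fixed value (a constraint imposed by the fully connected layers; for details see my article: Paper reading notes: (SPPNet)), and the network output is likewise a fixed-size vector or matrix. Input images of varying size make this awkward. There are two workarounds:

  • Crop a part of the image and feed it to the network, but cropping destroys the complete structure of the image
  • Warp (resize) the image to the required size and feed it to the network, but warping destroys the original shape information of the image


[Figure: cropping versus warping]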

3.2. How RoI Pooling works

The feature map of each proposal is divided into pooled_w (7 in the paper) parts horizontally and pooled_h (7) parts vertically, and max pooling is applied to each part. After this, proposals of different sizes all produce outputs of the same size, which gives a fixed-length output:



[Figure: RoI pooling to a fixed-size output]

  • The Top-N fixed-size outputs ($7 \times 7$) are then concatenated into a feature matrix of size $\text{Top-N} \times 49$, where Top-N can be read as the number of samples and 49 as the feature dimension of each sample (per channel), and fed into the fully connected layers.
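To make the fixed-size behaviour concrete, here is a minimal single-channel sketch of RoI max pooling on a nested Python list. It is an illustration under assumptions: the function name `roi_max_pool` is made up, the RoI is given in feature-map coordinates with inclusive corners, and bins are split with floor/ceil boundaries, which is one common convention rather than something the text specifies.

```python
def roi_max_pool(feature, roi, pooled_h=7, pooled_w=7):
    """Max-pool one RoI of a 2D feature map into a pooled_h x pooled_w grid.

    feature: list of rows (H x W); roi: (x1, y1, x2, y2), inclusive, in
    feature-map coordinates. Bin edges use floor for the start and ceil for
    the end, so every bin contains at least one cell.
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    out = []
    for ph in range(pooled_h):
        hs = y1 + (ph * roi_h) // pooled_h
        he = y1 + -(-((ph + 1) * roi_h) // pooled_h)  # ceiling division
        row = []
        for pw in range(pooled_w):
            ws = x1 + (pw * roi_w) // pooled_w
            we = x1 + -(-((pw + 1) * roi_w) // pooled_w)
            row.append(max(feature[y][x]
                           for y in range(hs, he)
                           for x in range(ws, we)))
        out.append(row)
    return out

# 4 x 6 feature map whose value at (y, x) is y * 6 + x
feature = [[y * 6 + x for x in range(6)] for y in range(4)]
print(roi_max_pool(feature, (0, 0, 5, 3), pooled_h=2, pooled_w=3))
# [[7, 9, 11], [19, 21, 23]]
```

Calling the same function with a smaller RoI such as `(1, 0, 3, 2)` and the default $7 \times 7$ grid still returns 7 rows of 7 values, which is the property the fully connected layers rely on.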

IV. Classification and box regression

The RoI Pooling layer gives the feature vectors of all proposals. These are fed into fully connected layers and a softmax to compute which class each proposal belongs to, producing class scores; box regression is applied once more to predict each proposal's offset from the true location, which is used to correct the proposal into a more accurate detection box.



[Figure: classification and box regression head]

  • Consider the fully connected layer: its parameters w and b have fixed sizes, so the input must have a matching fixed dimension, $\text{Top-N} \times 49$ here. This is why RoI Pooling matters.

From: https://blog.51cto.com/u_15866474/5829900
