Tags: SqueezeExcite, mathbf, nn, 512, self, Excite, Squeeze, channel

Squeeze and Excite

https://github.com/titu1994/keras-squeeze-excite-network

Convolutional Neural Networks (CNNs) are the workhorses of deep learning. A popular CNN architecture is the Residual Network (ResNet), which learns a residual mapping rather than fitting the input directly to the output. Building on ResNet, the Squeeze-and-Excitation network (SENet) added a squeeze-and-excitation block (SE block) to every residual mapping in ResNet to improve its performance. The SE block quantifies the importance of each feature map and weights the maps accordingly.

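What is the Squeeze-and-Excite used in MobileNet v3?

Squeeze-and-Excite comes from the paper Squeeze-and-Excitation Networks.
What is Squeeze-and-Excitation?
The Squeeze-and-Excitation (SE) block is a sub-module that can be embedded in other models. The authors combined the SE block with ResNeXt and took first place in the classification task of ILSVRC 2017.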

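Layer structure

The layers of a Squeeze-and-Excitation block are:
1. AdaptiveAvgPool2d
2. Linear
3. ReLU
4. Linear
5. Sigmoid
Split them into two parts, Squeeze and Excitation.
The Squeeze part is AdaptiveAvgPool2d.
The Excitation part is layers 2 to 5.
Squeeze is a vivid word: think of squeezing juice out of a lemon. The squeezing function is AdaptiveAvgPool2d(1).
After the squeeze comes Excitation, which works like carrot and stick: channels that yield little juice (weak features) get the stick and are suppressed, while channels that yield a lot (strong features) get the carrot and receive more attention.
Start with Pool. Pooling means merging several numbers into one.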

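How does pooling turn several numbers into one? It can take the average (Avg) or the maximum (Max).
Below, every 4 numbers become one.
Max pooling:

The same thing, drawn differently:

Average pooling: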
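
To make "every 4 numbers become one" concrete, here is a minimal plain-Python sketch of 2x2 pooling with stride 2. The helper names `pool2x2` and `mean` and the sample grid are made up for illustration; they are not part of PyTorch.

```python
def pool2x2(grid, reduce_fn):
    """Slide a 2x2 window with stride 2 over `grid`; reduce each window to one number."""
    out = []
    for i in range(0, len(grid), 2):
        row = []
        for j in range(0, len(grid[0]), 2):
            window = [grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1]]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature map.
grid = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [4, 6, 3, 5],
]
max_pooled = pool2x2(grid, max)   # keep the largest value of each window
avg_pooled = pool2x2(grid, mean)  # keep the average of each window
print(max_pooled)  # [[7, 8], [9, 5]]
print(avg_pooled)  # [[4.0, 5.0], [5.25, 2.25]]
```

Both variants shrink the 4x4 grid to 2x2; only the rule for collapsing a window differs.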

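The prefix Adaptive means that you specify the output size and the layer works out the pooling window for whatever input size it receives.
Here we want global average pooling, so AdaptiveAvgPool2d takes the argument 1: AdaptiveAvgPool2d(1).
What does AdaptiveAvgPool2d(1) do? It averages each channel's entire H×W map down to a single number.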

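Where there is avg there is also max (AdaptiveMaxPool2d).

A 3-channel image passed through AdaptiveAvgPool2d(1) comes out as 1×1×channel.
Even if the term global average pooling is unfamiliar, the idea is simply this: one average per channel.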
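
As a rough illustration of the "one average per channel" idea, the sketch below applies global average pooling to a tiny 3-channel image in plain Python. The function name `global_avg_pool` and the pixel values are assumptions for this example only.

```python
def global_avg_pool(feature_maps):
    """feature_maps: list of C channels, each an H x W nested list. Returns C averages."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Hypothetical 3-channel, 2x2 image.
image = [
    [[1, 2], [3, 4]],      # channel 0
    [[10, 10], [10, 10]],  # channel 1
    [[0, 0], [0, 8]],      # channel 2
]
pooled = global_avg_pool(image)
print(pooled)  # [2.5, 10.0, 2.0]
```

Three channels go in and three numbers come out, whatever the height and width of the input.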

Global average pooling comes from the paper "Network In Network":
"we propose another strategy called global average pooling to replace the traditional fully connected layers in CNN."
Network In Network uses global average pooling in place of the final fully connected layers.
Implementing the SE block in PyTorch

Referred to below as the SE block.

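import torch
import torch.nn as nn


class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        # Squeeze: global average pooling down to 1x1 per channel
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: FC -> ReLU -> FC -> Sigmoid
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)   # (b, c, h, w) -> (b, c)
        y = self.fc(y).view(b, c, 1, 1)   # (b, c) -> (b, c, 1, 1)
        return x * y.expand_as(x)         # scale every channel by its gate

dummy_input = torch.randn(1, 30, 300, 300)
se = SELayer(30)
out = se(dummy_input)  # same shape as the input: (1, 30, 300, 300)
print(se)              # prints the module structure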

SELayer(
(avg_pool): AdaptiveAvgPool2d(output_size=1)
(fc): Sequential(
(0): Linear(in_features=30, out_features=1, bias=False)
(1): ReLU(inplace)
(2): Linear(in_features=1, out_features=30, bias=False)
(3): Sigmoid()
)
)

About //
// is floor division.
It returns the integral part of the quotient, rounded down.

>>> 5.0 / 2
2.5
>>> 5.0 // 2
2.0
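
A short check, in plain Python, of how floor division produces the bottleneck width. It uses the same numbers as the SELayer(30) printout above (channel=30, reduction=16); the variable names are only for illustration.

```python
channel, reduction = 30, 16
hidden = channel // reduction
print(hidden)  # 1, which is why the printout shows out_features=1

# Floor division rounds toward negative infinity, not toward zero.
print(7 // 2, -7 // 2)  # 3 -4

# A channel count smaller than the reduction ratio gives a zero-width bottleneck.
print(8 // 16)  # 0
```

The last case is worth keeping in mind when choosing a reduction ratio for layers with few channels.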

Printing the shape of y after each of the two steps in forward gives:
torch.Size([1, 30])
torch.Size([1, 30, 1, 1])

Other ways to use AdaptiveAvgPool2d
torch.nn.AdaptiveAvgPool2d(output_size)

For any input size, the output has size H x W. The number of output features equals the number of input planes.

import torch
import torch.nn as nn

# target output size of 5x7
m = nn.AdaptiveAvgPool2d((5,7))
input = torch.randn(1, 64, 8, 9)
output = m(input)
print(output.shape)
# target output size of 7x7 (square)
m = nn.AdaptiveAvgPool2d(7)
input = torch.randn(1, 64, 10, 9)
output = m(input)
print(output.shape)
# target output size of 10x7 (None keeps the input size along that dimension)
m = nn.AdaptiveAvgPool2d((None, 7))
input = torch.randn(1, 64, 10, 9)
output = m(input)
print(output.shape)



Output

torch.Size([1, 64, 5, 7])
torch.Size([1, 64, 7, 7])
torch.Size([1, 64, 10, 7])



About expand_as
Expand this tensor to the same size as other. self.expand_as(other) is equivalent to self.expand(other.size()).
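
To show what this expansion means for the SE block, here is a plain-Python sketch that repeats one gate per channel over an H x W grid and multiplies it into the feature map. The helpers `expand_gate` and `scale` are hypothetical names; PyTorch's expand returns a view without copying data, whereas this sketch builds real copies for clarity.

```python
def expand_gate(gate, height, width):
    """Repeat each per-channel scalar over an H x W grid: (C, 1, 1) -> (C, H, W)."""
    return [[[g] * width for _ in range(height)] for g in gate]


def scale(feature_maps, gate):
    """Element-wise product of a (C, H, W) feature map with the expanded gate."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    expanded = expand_gate(gate, h, w)
    return [
        [[feature_maps[c][i][j] * expanded[c][i][j] for j in range(w)] for i in range(h)]
        for c in range(len(feature_maps))
    ]


x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 channels, 2x2 each
y = [0.5, 2.0]                            # one gate per channel
scaled = scale(x, y)
print(scaled)  # [[[0.5, 1.0], [1.5, 2.0]], [[10.0, 12.0], [14.0, 16.0]]]
```

Every value in a channel is multiplied by that channel's single gate, which is the job of the final multiplication in the SE block's forward pass.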
Adding h-sigmoid to the SE block

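import torch
import torch.nn as nn
import torch.nn.functional as F


class h_sigmoid(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.inplace = inplace

    def forward(self, x):
        # hard sigmoid: ReLU6(x + 3) / 6, a piecewise-linear stand-in for sigmoid
        return F.relu6(x + 3., inplace=self.inplace) / 6.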

class SELayer2(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer2, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            h_sigmoid()
        )

    def forward(self, x):
        batch, channels, height, width = x.size()
        out = self.avg_pool(x).view(batch, channels)
        out = self.fc(out)
        out = out.view(batch, channels, 1, 1)

        return out * x

print(SELayer2(30))

Output

SELayer2(
(avg_pool): AdaptiveAvgPool2d(output_size=1)
(fc): Sequential(
    (0): Linear(in_features=30, out_features=1, bias=True)
    (1): ReLU(inplace)
    (2): Linear(in_features=1, out_features=30, bias=True)
    (3): h_sigmoid()
)
)
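
For intuition about the h_sigmoid gate used above, the following plain-Python sketch compares ReLU6(x + 3) / 6 with the ordinary sigmoid at a few points. The scalar functions here are illustrative re-implementations, not the PyTorch modules.

```python
import math


def relu6(x):
    return min(max(x, 0.0), 6.0)


def h_sigmoid(x):
    """Hard sigmoid: piecewise linear, exactly 0 below -3 and exactly 1 above 3."""
    return relu6(x + 3.0) / 6.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


for x in (-4.0, -3.0, 0.0, 1.5, 3.0, 4.0):
    print(f"x={x:+.1f}  h_sigmoid={h_sigmoid(x):.3f}  sigmoid={sigmoid(x):.3f}")
```

Both functions map to the range 0 to 1 and agree at x = 0; the hard version avoids the exponential, which is the usual motivation for it on mobile hardware.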



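In the paper's notation, the block looks like this:

Squeeze-and-Excitation

The overall diagram:

The figure comes from the SENet slides (PPT), listed in the references below.

Again, split the block into a Squeeze part and an Excitation part.
Before Squeeze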

$$\mathbf{F}_{tr} : \mathbf{X} \rightarrow \mathbf{U}, \quad \mathbf{X} \in \mathbb{R}^{H' \times W' \times C'}, \quad \mathbf{U} \in \mathbb{R}^{H \times W \times C}$$

tr stands for transformation: the transformation F_tr maps an input X to feature maps U.
It corresponds to whatever convolution code we already have.
There are several formulas, but together they describe an ordinary convolution:
$$\mathbf{u}_{c}=\mathbf{v}_{c} * \mathbf{X}=\sum_{s=1}^{C'} \mathbf{v}_{c}^{s} * \mathbf{x}^{s}$$

$$\mathbf{v}_{c}=\left[\mathbf{v}_{c}^{1}, \mathbf{v}_{c}^{2}, \ldots, \mathbf{v}_{c}^{C'}\right]$$

$$\mathbf{X}=\left[\mathbf{x}^{1}, \mathbf{x}^{2}, \ldots, \mathbf{x}^{C'}\right]$$

$$\mathbf{u}_{c} \in \mathbb{R}^{H \times W}$$

**How the paper introduces Squeeze**

The formula:

$$z_{c}=\mathbf{F}_{sq}\left(\mathbf{u}_{c}\right)=\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_{c}(i, j)$$

Squeeze means compression. What does it compress?
X goes through a convolution to produce U (X -> conv -> U).
U then goes through some operation to produce A, of size 1 × 1 × C (A is a name made up here for the intermediate result, to aid understanding).

U -> Global Average Pooling -> A
U -> Fsq(.) -> A
U -> Squeeze -> A
The process that turns U into A is the Squeeze.
Squeeze, Fsq(.) in the figure, and Global Average Pooling in the paper all mean the same thing.

The Squeeze formula is
$$z_{c}=\mathbf{F}_{sq}\left(\mathbf{u}_{c}\right)=\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_{c}(i, j)$$

In code this is nn.AdaptiveAvgPool2d(1).

What gets compressed is the spatial dimension (H × W).
H × W is what the paper calls the spatial dimension.

Squeeze averages all points across the spatial dimension of each channel into a single value.

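What is Excitation?
It is implemented with two fully connected layers.
The first reduces the C channels to C/r channels and is followed by ReLU.
The second restores the C channels and is followed by Sigmoid.
r is the reduction ratio; the paper's setting is quoted as "In the aggregate, when setting the reduction ratio r to 16".
r is a hyperparameter. The default is 16, which gives a good balance between accuracy and complexity.

Continuing from A, the result of Squeeze:
A -> FC -> ReLU (the channel count goes from C to $\frac{C}{r}$)
-> FC -> Sigmoid (the channel count goes from $\frac{C}{r}$ back to C)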

$$\mathbf{s}=\mathbf{F}_{ex}(\mathbf{z}, \mathbf{W})=\sigma(g(\mathbf{z}, \mathbf{W}))=\sigma\left(\mathbf{W}_{2}\, \delta\left(\mathbf{W}_{1} \mathbf{z}\right)\right)$$

where $\sigma$ denotes Sigmoid and $\delta$ denotes ReLU.

The output is

$$\widetilde{\mathbf{x}}_{c}=\mathbf{F}_{scale}\left(\mathbf{u}_{c}, s_{c}\right)=s_{c}\, \mathbf{u}_{c}$$
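
The three formulas (squeeze, excitation, scale) can be traced end to end on a toy input. The sketch below is plain Python under made-up assumptions: C = 2 channels, r = 2, and hand-picked weights W1 and W2 that are not learned values. The names `matvec` and `se_forward` are hypothetical.

```python
import math


def matvec(W, v):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def se_forward(U, W1, W2):
    """U: C channels of H x W values. W1: (C/r) x C. W2: C x (C/r)."""
    # Squeeze: z_c is the mean over all spatial positions of channel c.
    z = [sum(v for row in u_c for v in row) / (len(u_c) * len(u_c[0])) for u_c in U]
    # Excitation: s = sigmoid(W2 * relu(W1 * z)).
    hidden = [max(0.0, h) for h in matvec(W1, z)]
    s = [1.0 / (1.0 + math.exp(-a)) for a in matvec(W2, hidden)]
    # Scale: every value of channel c is multiplied by s_c.
    x_tilde = [[[s_c * v for v in row] for row in u_c] for s_c, u_c in zip(s, U)]
    return z, s, x_tilde


U = [[[1.0, 3.0], [5.0, 7.0]],  # channel 0, mean 4.0
     [[0.0, 0.0], [0.0, 4.0]]]  # channel 1, mean 1.0
W1 = [[0.5, -1.0]]              # 2 channels -> 1 hidden unit
W2 = [[1.0], [-1.0]]            # 1 hidden unit -> 2 channels
z, s, x_tilde = se_forward(U, W1, W2)
print(z)  # [4.0, 1.0]
print(s)  # roughly [0.731, 0.269]
```

With these weights channel 0 is kept at about 73% of its value and channel 1 at about 27%, which is the recalibration the paper describes.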
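Implementation that replaces the fully connected layers with 1×1 convolutions

The code above already works, but there is a "but": the same thing can be written another way.
Here is what Yann LeCun has said: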

Yann LeCun

In Convolutional Nets, there is no such thing as “fully-connected layers”. There are only convolution layers with 1x1 convolution kernels and a full connection table. It’s a too-rarely-understood fact that ConvNets don’t need to have a fixed-size input. You can train them on inputs that happen to produce a single output vector (with no spatial extent), and then apply them to larger images. Instead of a single output vector, you then get a spatial map of output vectors. Each vector sees input windows at different locations on the input. In that scenario, the “fully connected layers” really act as 1x1 convolutions.

So the fully connected layers can be replaced with 1×1 convolutions,
and the code becomes the following.

class SeModule(nn.Module):
    def __init__(self, in_size, reduction=4):
        super(SeModule, self).__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_size, in_size // reduction, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(in_size // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_size // reduction, in_size, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(in_size),
            h_sigmoid()  # the h_sigmoid module defined earlier
        )

    def forward(self, x):
        return x * self.se(x)
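
The equivalence between a fully connected layer and a 1×1 convolution on a 1×1 map can be checked by hand. The plain-Python sketch below uses made-up weights and the hypothetical helpers `linear` and `conv1x1`; both omit the bias.

```python
def linear(W, v):
    """Fully connected layer without bias: out[o] = sum_i W[o][i] * v[i]."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def conv1x1(W, fmap):
    """1x1 convolution without bias on a C x H x W nested list, weights W of shape out x in."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(W[o][i] * fmap[i][y][x] for i in range(c_in)) for x in range(w)] for y in range(h)]
        for o in range(len(W))
    ]


W = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]  # 3 input channels -> 2 output channels
pooled = [2.0, 4.0, 6.0]                 # squeezed descriptor, one value per channel

fc_out = linear(W, pooled)
conv_out = conv1x1(W, [[[v]] for v in pooled])  # same values viewed as a 3 x 1 x 1 map
print(fc_out)    # [28.0, -1.0]
print(conv_out)  # [[[28.0]], [[-1.0]]]
```

The numbers match; only the shape differs, which is why the convolutional version needs no view or reshape around the excitation layers.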

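MobileNet v3 uses reduction=4, whereas the original SENet paper uses reduction=16.
The MobileNetV3 paper says: "we replace them all to fixed to be 1/4 of the number of channels in expansion layer."
Because the SE block is a sub-module, it can be embedded in other network architectures.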
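
To get a feel for what the reduction ratio costs, this plain-Python sketch counts the weights of the two excitation layers for r = 4 and r = 16. The function `se_fc_params` is a hypothetical helper, the 512-channel width is just an example, and BatchNorm parameters are not counted.

```python
def se_fc_params(channels, reduction, bias=False):
    """Weights of the two excitation layers: C -> C//r and C//r -> C."""
    hidden = channels // reduction
    weights = channels * hidden + hidden * channels
    biases = (hidden + channels) if bias else 0
    return weights + biases


params_r4 = se_fc_params(512, 4)
params_r16 = se_fc_params(512, 16)
print(params_r4)   # 131072
print(params_r16)  # 32768
```

Going from r = 16 to r = 4 multiplies the excitation weights by four, so a smaller ratio buys capacity at the price of more parameters.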

References
Network In Network

Global Average Pooling Layers for Object Localization

Deep Learning Cage Match: Max Pooling vs Convolutions

1D Global average pooling

2D Global average pooling

CNN | Introduction to Pooling Layer

1x1 Convolutions Demystified

How to understand the mlpconv layer in the NIN network

Squeeze-and-Excitation Networks

SENet slides (PPT)

Searching for MobileNetV3

The original authors' code implementation

Squeeze-and-Excitation Networks paper
————————————————
Links:

https://blog.csdn.net/flyfish1986/article/details/96488227

https://blog.csdn.net/weixin_34910922/article/details/107443739

===================================================================

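SENet won the Image Classification task of ImageNet 2017, the last edition of the competition, and is a staple backbone for fine-grained classification tasks. A CNN extracts informative features by using local receptive fields to fuse spatial information channel by channel, which has been very successful for image data.

To strengthen the representational power of CNNs, much existing work focuses on improving spatial encoding, for example ResNet and DenseNet. SENet instead looks at what can be done along the channel dimension: it improves representational power by explicitly modeling the interdependencies between the channels of convolutional features. On that basis it proposes a feature recalibration mechanism, which uses global information to selectively emphasize informative features and suppress less useful ones.

The SE module can be embedded in almost any current network architecture. Embedding it in the building block of an existing network yields different kinds of SENet, such as SE-BN-Inception, SE-ResNet, SE-ResNeXt, and SE-Inception-ResNet-v2.
1. Further discussion of convolution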

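In recent years convolutional neural networks have made major breakthroughs in many fields. The convolution kernel, the core of a CNN, is usually viewed as an aggregator that combines spatial information and channel-wise information within a local receptive field. A CNN consists of a series of convolutional layers, non-linear layers, and downsampling layers, so that it can capture image features over a global receptive field to describe the image.

However, learning a very powerful network is quite difficult, and the difficulty comes from many directions. Much recent work improves network performance along the spatial dimension: the Inception architecture embeds multi-scale information and aggregates features from several receptive fields to gain performance; the Inside-Outside network takes spatial context into account; and other work brings the attention mechanism into the spatial dimension. All of these have achieved quite good results.
2. Building the SE module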

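An SE network is built by stacking this SE module repeatedly.

Suppose a tensor $X \in \mathbb{R}^{H' \times W' \times C'}$ and a convolution $F_{tr}$ that produces a new tensor $U \in \mathbb{R}^{H \times W \times C}$. Up to this point it is just an ordinary convolution. Squeeze and excitation are then applied to U:

Squeeze: keeping the channel dimension of U fixed, process each feature map to obtain a channel descriptor of size 1×1×C, that is, one scalar describing each map.

What the authors call squeeze is a GAP (global average pooling) over each channel's feature map:

$$z_{c}=\mathbf{F}_{sq}\left(\mathbf{u}_{c}\right)=\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_{c}(i, j)$$

That is, sum all values of the matrix that represents the feature map and take their mean.

Excitation: turn the 1×1×C channel descriptor from the squeeze into one weight per channel, and regenerate $\widetilde{X}$ from U with those weights.

The 1×1×C vector from the squeeze first goes through an FC layer, then a ReLU activation, then another FC layer, and finally a sigmoid activation. The sigmoid imitates the gate concept in LSTM and controls how much information flows through:

$$\mathbf{s}=\mathbf{F}_{ex}(\mathbf{z}, \mathbf{W})=\sigma\left(\mathbf{W}_{2}\, \delta\left(\mathbf{W}_{1} \mathbf{z}\right)\right)$$

Here δ is the ReLU function, $\mathbf{W}_{1} \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_{2} \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and improve generalization, the two FC layers around the non-linearity form a "bottleneck", with r set to 16 by the authors. Once the gates are obtained, multiplying each channel's gate by the corresponding original feature map controls how much of each feature map passes through:

$$\widetilde{\mathbf{x}}_{c}=\mathbf{F}_{scale}\left(\mathbf{u}_{c}, s_{c}\right)=s_{c}\, \mathbf{u}_{c}$$

As the description shows, this is really a method for building network blocks. It can be applied to Inception, ResNet, and other networks, so it is broadly applicable.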

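To summarize, an SE module has three parts.

Given an input x with c_1 feature channels, a series of convolutions transforms it into a feature with c_2 channels; that is, the input has c_1 channels and the output has c_2 channels. Three operations then recalibrate this feature:

1) Squeeze: features are compressed along the spatial dimensions, turning each two-dimensional feature channel into a single real number. This number has, in a sense, a global receptive field, and the output dimension matches the number of input feature channels. It characterizes the global distribution of responses over the feature channels and gives even layers close to the input a global receptive field, which is very useful in many tasks.

2) Excitation: a mechanism similar to the gates in recurrent neural networks. A parameter w generates a weight for each feature channel, and w is learned to explicitly model the correlations between feature channels.

3) Reweight: the weights output by Excitation are treated as the importance of each feature channel after feature selection. They are multiplied channel by channel onto the earlier features, completing the recalibration of the original features along the channel dimension.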

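3. Keras implementation

The convolution before the block takes feature maps with c_1 channels and outputs c_2 channels; the SE block itself receives those c_2-channel feature maps and returns c_2 channels. The SE block acts like an enhanced Dense(filters) layer. ratio is the channel reduction ratio. The squeeze_excite_block function maps the c_2-channel feature on the left of the diagram to the recalibrated c_2-channel feature on the right.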

    from keras import backend as K
    from keras.layers import Dense, GlobalAveragePooling2D, Permute, Reshape, multiply

    def squeeze_excite_block(input, ratio=16):
        # 1. Build se_shape
        channel_axis = 1 if K.image_data_format() == "channels_first" else -1
        # channel count of the input feature map (_keras_shape is an attribute of older Keras tensors)
        filters = input._keras_shape[channel_axis]
        se_shape = (1, 1, filters)

        # 2. Squeeze: global average pooling, then reshape to (1, 1, filters)
        se = GlobalAveragePooling2D()(input)
        se = Reshape(se_shape)(se)

        # 3. Excitation: reduce the channel count, then restore it
        se = Dense(filters // ratio, activation='relu', kernel_initializer='he_normal', use_bias=False)(se)
        se = Dense(filters, activation='sigmoid', kernel_initializer='he_normal', use_bias=False)(se)  # sigmoid gate

        # 4. Reweight: multiply the weights onto the input
        if K.image_data_format() == 'channels_first':
            se = Permute((3, 1, 2))(se)  # (1, 1, filters) -> (filters, 1, 1)
        x = multiply([input, se])
        return x

————————————————

Original link: https://blog.csdn.net/weixin_34910922/article/details/107443739

https://github.com/titu1994/keras-squeeze-excite-network

===========================================================================

SEBlock 2017

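Models the relationships between channels by adding an attention mechanism along the channel dimension.
Paper: https://arxiv.org/abs/1709.01507
Code: https://github.com/hujie-frank/SENet

Given an input feature map C2, the SE attention module is appended after it.
1.1.1 Steps
There are three main steps:
    squeeze: compress the spatial dimensions. In code, global average pooling averages each channel to a single value, which has a global receptive field. (Intuition: think of one channel as a large flatbread and several channels as flatbreads stacked up. Global average pooling averages each flatbread down to a single point; taken together, the result looks like a stack of dice.)
    Shape change: (2,512,8,8) -> (2,512,1,1) ==> (2,512); 512 channels, 8*8 becomes 1*1
    excitation: learn weights that capture the correlations between the channels above. It can be implemented with fully connected layers or with convolutions of kernel size 1.
    Shape change: (2,512) -> (2,512//reduction) -> (2,512) ==> (2,512,1,1)
    Note: this step first reduces and then restores the dimension. The reduction factor is set by the reduction parameter and lowers the computational cost; the activation functions add non-linearity.
    scale: the excitation step outputs the importance of each channel. Multiplying the input data C2 by these weights boosts important features and suppresses unimportant ones.
    Shape change: (2,512,8,8) * (2,512,1,1) -> (2,512,8,8)

Summary: the input shape is (2,512,8,8) and the output shape is (2,512,8,8).
Note: in the steps above, "->" marks a shape change and "==>" marks a shape change made with the view method.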
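
The shape changes listed above can be reproduced with a small bookkeeping function. This is a plain-Python sketch that only computes shapes, not values; `se_shape_trace` is a hypothetical helper and reduction=4 is an assumed setting.

```python
def se_shape_trace(batch, channels, height, width, reduction):
    """Return the tensor shape after each step of a fully connected SE block."""
    hidden = channels // reduction
    return [
        ("input", (batch, channels, height, width)),
        ("squeeze: avg_pool", (batch, channels, 1, 1)),
        ("squeeze: view", (batch, channels)),
        ("excitation: fc1 + ReLU", (batch, hidden)),
        ("excitation: fc2 + Sigmoid", (batch, channels)),
        ("excitation: view", (batch, channels, 1, 1)),
        ("scale: x * y", (batch, channels, height, width)),
    ]


trace = se_shape_trace(2, 512, 8, 8, reduction=4)
for name, shape in trace:
    print(f"{name:28s} {shape}")
```

The first and last rows are identical, which is the point of the summary: an SE block never changes the shape of the feature map it recalibrates.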

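1.1.2 A clearer diagram

Notes:

    Fully connected layers and 1 × 1 convolutions have a similar effect. The diagram shows fully connected layers, but 1*1 convolutions would do as well; the same applies below and will not be repeated.
    For the position of the activation functions, see the code; the same applies below.

1.1.3 Code
1. PyTorch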

import torch
import torch.nn as nn


class SELayer(nn.Module):
    def __init__(self, channel, reduction=4):
        """ SE attention. The output feature map has the same shape as the input x.
            1. squeeze: global pooling (batch,channel,height,width) -> (batch,channel,1,1) ==> (batch,channel)
            2. excitation: fully connected layers or kernel-size-1 convolutions, (batch,channel) -> (batch,channel//reduction) -> (batch,channel) ==> (batch,channel,1,1), giving y
            3. scale: recalibrate the original features along the channel dimension, y = x*y; output shape equals input shape

        :param channel: number of channels of the input feature map
        :param reduction: factor by which the channel count is reduced
        """
        super(SELayer, self).__init__()
        # Adaptive global average pooling: average each channel so the output map is 1x1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)

        # Excitation with fully connected layers
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            nn.Sigmoid()
        )
        # Excitation with convolutions (alternative; not used by forward below)
        # Feature map shapes:
        # (2,512,1,1) -> (2,512//reduction,1,1) -> (2,512,1,1)
        self.fc2 = nn.Sequential(
            nn.Conv2d(channel, channel // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channel // reduction, channel, 1, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        # (batch,channel,height,width) (2,512,8,8)
        b, c, _, _ = x.size()
        # global average pooling (2,512,8,8) -> (2,512,1,1) -> (2,512)
        y = self.avg_pool(x).view(b, c)
        # (2,512) -> (2,512//reduction) -> (2,512) -> (2,512,1,1)
        y = self.fc(y).view(b, c, 1, 1)
        # (2,512,8,8) * (2,512,1,1) -> (2,512,8,8)
        return x * y



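2. TensorFlow/Keras

import tensorflow as tf
from tensorflow.keras.layers import Dense

# SEBlock; input_x is a channels-last feature map of shape (batch, height, width, channel)
feature_map_shape = input_x.shape # input feature map shape
# squeeze: average over height and width (axes 1 and 2); keepdims gives (batch, 1, 1, channel)
x = tf.reduce_mean(input_x, [1, 2], keepdims=True)
x = Dense(feature_map_shape[-1] // 16, activation=tf.nn.relu)(x) # (batch,1,1,channel) -> (batch,1,1,channel//16)
x = Dense(feature_map_shape[-1], activation=tf.nn.sigmoid)(x) # (batch,1,1,channel//16) -> (batch,1,1,channel)
x = tf.multiply(input_x, x) # broadcast multiply along the channel axis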

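Notes:

    With the fully connected variant, the squeeze step in forward is:

# global average pooling (2,512,8,8) -> (2,512,1,1) -> (2,512)
y = self.avg_pool(x).view(b, c)

    With the 1*1 convolution variant, the squeeze step in forward keeps the 1x1 spatial dimensions, so self.fc2 can be applied directly without any view:

# global average pooling (2,512,8,8) -> (2,512,1,1)
y = self.avg_pool(x)

Original link: https://blog.csdn.net/weixin_39190382/article/details/117711239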

From： https://www.cnblogs.com/emanlee/p/17114707.html
