This paper is a 2019 contribution to semantic segmentation. It uses a two-stream CNN together with ASPP and brings a large improvement on thin and small objects.
paper: https://arxiv.org/abs/1907.05740
code: https://github.com/nv-tlabs/GSCNN
As for the paper itself, I will only explain the model and the loss; for everything else, just read the original paper. OK, let's go straight to the architecture diagram.
This is the overall architecture diagram, which gives a very concise picture of how the model works. Of course, for a deeper analysis this figure alone is far from enough, so we also need the next, more detailed figure:
Highlights of the paper:
- A two-stream structure ("stream" is the authors' term): the model is split into two parts, a conventional semantic-segmentation branch (the regular stream) and a shape stream dedicated to processing edge (boundary) information.
- The shape stream uses gated convolutional layers (GCL), which help it keep only boundary-relevant information and filter out everything else.
As you can see in the figure, the two streams work separately; there is even a dedicated auxiliary loss, the edge BCE loss, that supervises the shape stream on its own.
In the figure, the GCLs are the modules marked with an asterisk; there are three of them in total. In the paper, a GCL first concatenates feature maps from the regular stream and the shape stream at different layers, then applies a standard 1x1 convolution, and finally a sigmoid to obtain an attention map. Expressed as a formula:
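Reconstructing that equation from the description above (with $r_t$ the regular-stream feature map and $s_t$ the shape-stream feature map at the $t$-th layer):

$$\alpha_t = \sigma\left(C_{1\times1}(s_t \,\|\, r_t)\right)$$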
Here $\|$ denotes the concatenation operation and $\sigma$ the sigmoid.
Once the attention map is obtained, the GCL takes the element-wise product of the shape-stream feature map $s_t$ and the attention map $\alpha_t$, adds $s_t$ back as a residual, and applies the channel-wise weights $w_t$ via a convolution. I can't type all of the symbols of this formula, but it is easy to find in the original paper!
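For reference, that gated update can be written roughly as follows (it matches the `input_features * (alphas + 1)` followed by a convolution with the kernel $w_t$ in the GCL code further down), applied at every pixel $(i, j)$:

$$\hat{s}_t^{(i,j)} = \left(s_t^{(i,j)} \odot \alpha_t^{(i,j)} + s_t^{(i,j)}\right)^{T} w_t$$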
Model and Code
OK, let's analyze the model thoroughly through the code. The key layers of each part are annotated directly in the comments:
```python
class GSCNN(nn.Module):
    def __init__(self, num_classes, trunk=None, criterion=None):
        super(GSCNN, self).__init__()
        self.criterion = criterion
        self.num_classes = num_classes

        # Feature extraction backbone, with pretrained weights and dilated convolutions
        wide_resnet = wider_resnet38_a2(classes=1000, dilation=True)
        wide_resnet = torch.nn.DataParallel(wide_resnet)

        # ------------------
        # This part is the regular stream; the figure shows 4 feature maps,
        # which here come from mod1, mod3, mod4 and mod7
        wide_resnet = wide_resnet.module
        self.mod1 = wide_resnet.mod1
        self.mod2 = wide_resnet.mod2
        self.mod3 = wide_resnet.mod3
        self.mod4 = wide_resnet.mod4
        self.mod5 = wide_resnet.mod5
        self.mod6 = wide_resnet.mod6
        self.mod7 = wide_resnet.mod7
        self.pool2 = wide_resnet.pool2
        self.pool3 = wide_resnet.pool3
        self.interpolate = F.interpolate
        del wide_resnet
        # ------------------

        # ------------------
        # The four 1x1 convolutions shown in the figure
        self.dsn1 = nn.Conv2d(64, 1, 1)
        self.dsn3 = nn.Conv2d(256, 1, 1)
        self.dsn4 = nn.Conv2d(512, 1, 1)
        self.dsn7 = nn.Conv2d(4096, 1, 1)
        # ------------------

        # ------------------
        # This should be self-explanatory: the res1, res2, res3 blocks labeled in the figure
        self.res1 = Resnet.BasicBlock(64, 64, stride=1, downsample=None)
        self.d1 = nn.Conv2d(64, 32, 1)
        self.res2 = Resnet.BasicBlock(32, 32, stride=1, downsample=None)
        self.d2 = nn.Conv2d(32, 16, 1)
        self.res3 = Resnet.BasicBlock(16, 16, stride=1, downsample=None)
        self.d3 = nn.Conv2d(16, 8, 1)
        self.fuse = nn.Conv2d(8, 1, kernel_size=1, padding=0, bias=False)
        # ------------------

        # This layer is the 1x1 convolution that follows the image gradients in the figure
        self.cw = nn.Conv2d(2, 1, kernel_size=1, padding=0, bias=False)

        # ------------------
        # The three gated convolutions
        self.gate1 = gsc.GatedSpatialConv2d(32, 32)
        self.gate2 = gsc.GatedSpatialConv2d(16, 16)
        self.gate3 = gsc.GatedSpatialConv2d(8, 8)
        # -----------------

        self.aspp = _AtrousSpatialPyramidPoolingModule(4096, 256,
                                                       output_stride=8)

        self.bot_fine = nn.Conv2d(128, 48, kernel_size=1, bias=False)
        self.bot_aspp = nn.Conv2d(1280 + 256, 256, kernel_size=1, bias=False)

        self.final_seg = nn.Sequential(
            nn.Conv2d(256 + 48, 256, kernel_size=3, padding=1, bias=False),
            Norm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
            Norm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1, bias=False))

        self.sigmoid = nn.Sigmoid()
        initialize_weights(self.final_seg)

    def forward(self, inp, gts=None):
        x_size = inp.size()

        # ----------------------------
        # Feature extraction
        # res 1
        m1 = self.mod1(inp)
        # res 2
        m2 = self.mod2(self.pool2(m1))
        # res 3
        m3 = self.mod3(self.pool3(m2))
        # res 4-7
        m4 = self.mod4(m3)
        m5 = self.mod5(m4)
        m6 = self.mod6(m5)
        m7 = self.mod7(m6)
        # ----------------------------

        # ----------------------------
        # Feature maps from the regular stream go through a 1x1 conv and bilinear
        # upsampling before being fused with the shape stream.
        # Four feature maps are produced here, but s1 is never used afterwards --
        # which seems to differ from the figure?
        s1 = F.interpolate(self.dsn1(m1), x_size[2:],
                           mode='bilinear', align_corners=True)
        s3 = F.interpolate(self.dsn3(m3), x_size[2:],
                           mode='bilinear', align_corners=True)
        s4 = F.interpolate(self.dsn4(m4), x_size[2:],
                           mode='bilinear', align_corners=True)
        s7 = F.interpolate(self.dsn7(m7), x_size[2:],
                           mode='bilinear', align_corners=True)
        # ----------------------------

        # The input of res1 in the figure: upsampled directly, without a 1x1 conv
        m1f = F.interpolate(m1, x_size[2:], mode='bilinear', align_corners=True)
        # ----------------------------

        # Compute the image gradients (Canny edges of the input image)
        im_arr = inp.cpu().numpy().transpose((0, 2, 3, 1)).astype(np.uint8)
        canny = np.zeros((x_size[0], 1, x_size[2], x_size[3]))
        for i in range(x_size[0]):
            canny[i] = cv2.Canny(im_arr[i], 10, 100)
        canny = torch.from_numpy(canny).cuda().float()
        # ----------------------------

        cs = self.res1(m1f)
        cs = F.interpolate(cs, x_size[2:],
                           mode='bilinear', align_corners=True)
        cs = self.d1(cs)
        cs = self.gate1(cs, s3)
        cs = self.res2(cs)
        cs = F.interpolate(cs, x_size[2:],
                           mode='bilinear', align_corners=True)
        cs = self.d2(cs)
        cs = self.gate2(cs, s4)
        cs = self.res3(cs)
        cs = F.interpolate(cs, x_size[2:],
                           mode='bilinear', align_corners=True)
        cs = self.d3(cs)
        cs = self.gate3(cs, s7)
        cs = self.fuse(cs)
        cs = F.interpolate(cs, x_size[2:],
                           mode='bilinear', align_corners=True)
        edge_out = self.sigmoid(cs)
        cat = torch.cat((edge_out, canny), dim=1)
        acts = self.cw(cat)
        acts = self.sigmoid(acts)

        # aspp
        x = self.aspp(m7, acts)
        dec0_up = self.bot_aspp(x)
        dec0_fine = self.bot_fine(m2)
        dec0_up = self.interpolate(dec0_up, m2.size()[2:], mode='bilinear', align_corners=True)
        dec0 = [dec0_fine, dec0_up]
        dec0 = torch.cat(dec0, 1)
        dec1 = self.final_seg(dec0)
        seg_out = self.interpolate(dec1, x_size[2:], mode='bilinear')

        if self.training:
            return self.criterion((seg_out, edge_out), gts)
        else:
            return seg_out, edge_out
```
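A quick usage sketch (hypothetical, not from the repo's training scripts; it assumes the repo's dependencies such as `wider_resnet38_a2`, `gsc`, and `mynn` are importable, and uses 19 classes as an example, matching Cityscapes):

```python
import torch

model = GSCNN(num_classes=19).cuda().eval()   # eval mode: forward returns (seg_out, edge_out)
image = torch.randn(1, 3, 512, 512).cuda()    # NCHW input

with torch.no_grad():
    seg_out, edge_out = model(image)

print(seg_out.shape)    # [1, 19, 512, 512] -- per-class segmentation logits at input resolution
print(edge_out.shape)   # [1, 1, 512, 512]  -- boundary probability map from the shape stream
```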
The uncommented parts are straightforward, so I won't go over them. Now let's look at the GCL:
```python
class GatedSpatialConv2d(_ConvNd):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 padding=0, dilation=1, groups=1, bias=False):
        kernel_size = _pair(kernel_size)
        stride = _pair(stride)
        padding = _pair(padding)
        dilation = _pair(dilation)
        super(GatedSpatialConv2d, self).__init__(
            in_channels, out_channels, kernel_size, stride, padding, dilation,
            False, _pair(0), groups, bias, 'zeros')

        # The core of the GCL: the authors' own norm layer, a 1x1 conv, ReLU,
        # another 1x1 conv producing a single-channel feature map, then a sigmoid.
        # The result is the attention map, which uses boundary information to
        # weight the regions of the input features.
        self._gate_conv = nn.Sequential(
            mynn.Norm2d(in_channels + 1),
            nn.Conv2d(in_channels + 1, in_channels + 1, 1),
            nn.ReLU(),
            nn.Conv2d(in_channels + 1, 1, 1),
            mynn.Norm2d(1),
            nn.Sigmoid()
        )

    def forward(self, input_features, gating_features):
        """
        :param input_features:  [NxCxHxW] features coming from the shape branch (canny branch).
        :param gating_features: [Nx1xHxW] features coming from the texture branch (resnet). Only one channel feature map.
        :return:
        """
        alphas = self._gate_conv(torch.cat([input_features, gating_features], dim=1))
        input_features = (input_features * (alphas + 1))
        return F.conv2d(input_features, self.weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```
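A small sketch of how one of these gates is called (shapes follow the docstring above; it assumes `mynn.Norm2d` from the repo is available):

```python
import torch

gate = GatedSpatialConv2d(32, 32)      # e.g. self.gate1 in the model above
feat = torch.randn(2, 32, 128, 128)    # shape-stream features (input_features)
edge = torch.randn(2, 1, 128, 128)     # 1-channel gating map from the regular stream
out = gate(feat, edge)                 # -> torch.Size([2, 32, 128, 128])
```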
As for the edge map and the image gradients, I grabbed a random VOC sample and its mask and visualized them; the result looks roughly like this:
From top-left to bottom-right: the original image, the edge map, and the image gradients.
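A minimal sketch of how such a visualization might be produced (assuming OpenCV and Matplotlib; the Canny thresholds 10/100 mirror those in `GSCNN.forward` above, the file paths are placeholders, and using the morphological gradient of the mask as the edge map is just an illustrative choice):

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

image = cv2.imread('sample.jpg')           # a VOC image (placeholder path)
mask = cv2.imread('sample_mask.png', 0)    # its segmentation mask, single channel

# Edge map: boundaries of the ground-truth mask (morphological gradient of the mask)
kernel = np.ones((3, 3), np.uint8)
edgemap = cv2.morphologyEx(mask, cv2.MORPH_GRADIENT, kernel)

# Image gradients: Canny edges of the raw image, same thresholds as in the model code
gradients = cv2.Canny(image, 10, 100)

for i, (title, img) in enumerate([('image', image[:, :, ::-1]),
                                  ('edgemap', edgemap),
                                  ('image gradients', gradients)]):
    plt.subplot(1, 3, i + 1)
    plt.imshow(img, cmap=None if img.ndim == 3 else 'gray')
    plt.title(title)
    plt.axis('off')
plt.show()
```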
Loss
The multi-task learning loss:
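Reconstructing it from the description below, the joint loss is roughly:

$$\mathcal{L} = \lambda_1\,\mathcal{L}_{BCE}(\hat{s}, s) + \lambda_2\,\mathcal{L}_{CE}(\hat{y}, y)$$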
This loss combines BCE (binary cross-entropy) on the boundary map and CE (cross-entropy) on the segmentation. $\lambda_1, \lambda_2$ are their weights, $\hat s$ is the predicted boundary map, and $\hat y$ is the predicted semantic labeling (with $s$ and $y$ the corresponding ground truths).
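A minimal sketch of that joint loss (not the repo's exact criterion, which does more than this; all names here are hypothetical):

```python
import torch.nn.functional as F

def joint_seg_edge_loss(seg_out, edge_out, seg_gt, edge_gt, lambda_1=1.0, lambda_2=1.0):
    """seg_out: [N, C, H, W] logits; edge_out: [N, 1, H, W] probabilities (after sigmoid);
    seg_gt: [N, H, W] class indices; edge_gt: [N, 1, H, W] binary boundary map."""
    edge_loss = F.binary_cross_entropy(edge_out, edge_gt)   # supervises the shape stream
    seg_loss = F.cross_entropy(seg_out, seg_gt)             # supervises the regular stream
    return lambda_1 * edge_loss + lambda_2 * seg_loss
```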
dual task regularizer:
The final training objective is this two-stream loss; to keep the model from overfitting, a regularization term (the dual-task regularizer, which ties the predicted boundaries and the predicted segmentation together) is also added to the loss.