Figure 2: Detailed illustration of our end-to-end
panoptic segmentation network with task interrela-
tions. We internally merge predictions from our se-
mantic and instance segmentation branches in a dif-
ferentiable way. In particular, we concatenate stuff
class predictions from our semantic segmentation
branch with things class predictions in the form of
canvas collections from our instance segmentation
branch. Our instance canvas collections can also be
transformed into an initial segmentation image (ISI)
which serves as additional feature input for our se-
mantic segmentation branch.
sists of 3×3 convolutions, batch normalization [10], ReLU [7], and 2× bilinear upsampling. Because the individual stages have different spatial dimensions, we process each stage with a different number of upsampling modules to generate H/4 × W/4 × 128 feature maps, where H and W are the input image dimensions. The resulting outputs of all stages are concatenated and processed using a final 1×1 convolution to reduce the channel dimension to the desired number of classes.
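As a minimal PyTorch sketch of this head, the following builds the described upsampling modules and stage fusion; the stage strides (4, 8, 16, 32), channel counts, and class count are our assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplingModule(nn.Module):
    """One module: 3x3 conv, batch norm, ReLU, 2x bilinear upsampling."""

    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))
        return F.interpolate(x, scale_factor=2, mode="bilinear",
                             align_corners=False)

def make_stage_path(in_ch, stride, out_ch=128):
    """Stack enough 2x modules to bring a stage from `stride` to stride 4."""
    blocks, cur = [], in_ch
    while stride > 4:
        blocks.append(UpsamplingModule(cur, out_ch))
        cur, stride = out_ch, stride // 2
    if not blocks:
        # Stride-4 stage: project channels without changing resolution.
        blocks = [nn.Conv2d(cur, out_ch, 3, padding=1, bias=False),
                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*blocks)

class SemanticHead(nn.Module):
    """Merge multi-scale stages into H/4 x W/4 class logits."""

    def __init__(self, stage_channels=(256, 256, 256, 256),
                 stage_strides=(4, 8, 16, 32), num_classes=19):
        super().__init__()
        self.paths = nn.ModuleList(
            make_stage_path(c, s)
            for c, s in zip(stage_channels, stage_strides))
        # Final 1x1 convolution reduces channels to the number of classes.
        self.classifier = nn.Conv2d(128 * len(stage_channels), num_classes, 1)

    def forward(self, stages):
        feats = [path(s) for path, s in zip(self.paths, stages)]
        return self.classifier(torch.cat(feats, dim=1))

# Example: four backbone stages for a 256x256 input.
stages = [torch.randn(1, 256, 256 // s, 256 // s) for s in (4, 8, 16, 32)]
print(SemanticHead()(stages).shape)  # torch.Size([1, 19, 64, 64])
```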
For the instance segmentation branch, we imple-
mented a Mask R-CNN [8]. We use a region pro-
posal network to detect regions of interest, perform
non-maximum suppression, execute ROI alignment, and predict 28×28 binary masks as well as class probabilities for each detected instance.
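The paper uses its own implementation; as a rough stand-in for experimentation, the torchvision reference Mask R-CNN exposes the same components (RPN, NMS, RoIAlign, a 28×28 mask head). The class count below is hypothetical.

```python
import torch
import torchvision

# Hypothetical: 8 things classes + background (not from the paper).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=9)
model.eval()
with torch.no_grad():
    preds = model([torch.rand(3, 512, 512)])
# preds[0]["boxes"], ["labels"], ["scores"], ["masks"]; torchvision pastes
# the raw 28x28 mask-head outputs back to image resolution.
```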
In order to combine the semantic and instance segmentation outputs, we use an internal differentiable fusion instead of external heuristics. For this purpose, we first select the most likely class label for each detected instance using a differentiable
$$\operatorname{soft\,argmax} = \sum_{i}^{N} \left\lfloor \frac{e^{z_i \cdot \beta}}{\sum_{k}^{N} e^{z_k \cdot \beta}} \right\rceil \cdot i \qquad (1)$$
operation [2], where N is the number of things classes, β is a large constant, and z is the predicted class logit. Using β in the exponent in combination with the round function allows us to squash all non-maximum values to zero. In this way, we approximate the non-differentiable argmax function, allowing us to backpropagate gradients.
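A minimal sketch of Eq. (1), assuming a PyTorch tensor of class logits; since round() has zero gradient almost everywhere, we pass gradients straight through the softmax, which is a common trick and our assumption, not something the paper spells out.

```python
import torch

def soft_argmax(logits: torch.Tensor, beta: float = 100.0) -> torch.Tensor:
    """Differentiable argmax over the last dimension, following Eq. (1).

    Scaling by a large beta pushes the softmax towards a one-hot vector;
    rounding squashes the remaining non-maximum entries to zero.
    """
    probs = torch.softmax(beta * logits, dim=-1)
    # Straight-through estimator: forward pass uses the rounded one-hot,
    # backward pass uses the softmax gradient (our assumption).
    one_hot = probs + (torch.round(probs) - probs).detach()
    indices = torch.arange(logits.shape[-1], dtype=logits.dtype,
                           device=logits.device)
    return (one_hot * indices).sum(dim=-1)

# Example: logits for N = 4 things classes.
print(soft_argmax(torch.tensor([0.1, 2.3, 0.7, -1.0])))  # tensor(1.)
```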
We then resize the predicted 28×28 mask logits for each detected instance according to its predicted 2D bounding box size and place them in empty canvas layers at the predicted 2D location, as shown in Figure 2 (top right). Additionally, we merge the canvas layers for regions of interest with the same class ID and high mask IoU. The resulting canvas collection from the instance segmentation branch is then concatenated with the stuff class logits of the semantic segmentation branch to generate our panoptic output, as illustrated in Figure 2 (bottom). The pixel-wise panoptic segmentation output is attained by applying a softmax layer on top of the stacked semantic and instance segmentation information. The shape of the final output is H × W × (#stuff classes + #detected instances). For stuff classes, the output is a class ID; for things classes, it is an instance ID. The corresponding class ID for each instance can be gathered from our semantic or instance segmentation output.
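A rough sketch of the canvas construction and fusion step; the function names, the integer box format, and the very negative fill value for empty canvas pixels are our assumptions.

```python
import torch
import torch.nn.functional as F

def build_canvases(mask_logits, boxes, out_hw, fill=-1e4):
    """Paste per-instance 28x28 mask logits into empty canvas layers.

    mask_logits: (N, 28, 28) logits from the mask head
    boxes:       (N, 4) integer (x1, y1, x2, y2) boxes, assumed clipped
                 to the output resolution
    Returns a (N, H, W) canvas collection, one layer per instance.
    """
    H, W = out_hw
    # A very negative fill keeps off-instance pixels near zero after softmax.
    canvases = torch.full((mask_logits.shape[0], H, W), fill,
                          dtype=mask_logits.dtype, device=mask_logits.device)
    for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        h, w = max(y2 - y1, 1), max(x2 - x1, 1)
        patch = F.interpolate(mask_logits[i][None, None], size=(h, w),
                              mode="bilinear", align_corners=False)
        canvases[i, y1:y1 + h, x1:x1 + w] = patch[0, 0]
    return canvases

def panoptic_output(stuff_logits, canvases):
    """Stack stuff logits and instance canvases, softmax across channels."""
    stacked = torch.cat([stuff_logits, canvases], dim=0)  # (C_stuff + N, H, W)
    return torch.softmax(stacked, dim=0)

# Example: 5 stuff classes, 3 detected instances on a 64x64 output.
out = panoptic_output(torch.randn(5, 64, 64),
                      build_canvases(torch.randn(3, 28, 28),
                                     torch.tensor([[0, 0, 20, 20],
                                                   [10, 5, 40, 30],
                                                   [30, 30, 64, 64]]),
                                     (64, 64)))
print(out.shape)  # torch.Size([8, 64, 64])
```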
During training, it is important to reorder the de-
tected instances to match the order of the ground
truth instances. For this purpose, we use a ground
truth instance ID lookup table. All parameters of our network are optimized jointly.
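The paper only states that a lookup table is used; a small sketch of one plausible reordering, where each detection has already been matched to a ground-truth instance ID (the matching criterion is our assumption):

```python
import torch

def reorder_to_ground_truth(canvases, matched_gt_ids, gt_order):
    """Permute canvas layers so they follow the ground-truth ordering.

    canvases:       (N, H, W) predicted instance layers
    matched_gt_ids: length-N list of ground-truth instance IDs matched
                    to each detection (matching criterion assumed)
    gt_order:       lookup table mapping a ground-truth ID to its index
    """
    target = torch.tensor([gt_order[g] for g in matched_gt_ids])
    return canvases[torch.argsort(target)]

# Example: detections matched to GT instances 7, 3, 5.
layers = torch.randn(3, 64, 64)
reordered = reorder_to_ground_truth(layers, [7, 3, 5], {3: 0, 5: 1, 7: 2})
# reordered[0] is now the detection matched to GT id 3, and so on.
```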
3.2. Inter-task Relations
Our differentiable fusion of semantic and instance
segmentation predictions allows us to join the out-
puts of our two branches internally for end-to-end
training. However, it also allows us to provide in-
stance predictions as additional feature input to our
semantic segmentation branch, as shown in Figure 3.
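One way this feedback could look in code: the canvas collection is collapsed into a single initial segmentation image (ISI) channel and concatenated with the semantic-branch input. Collapsing via a channel-wise max is our assumption; the paper only states that the ISI is derived from the canvas collection.

```python
import torch
import torch.nn.functional as F

def append_isi(features, canvases):
    """Concatenate an ISI channel to the semantic-branch input features.

    features: (C, H, W) feature maps; canvases: (N, H', W') instance layers.
    """
    isi = canvases.max(dim=0, keepdim=True).values  # (1, H', W')
    isi = F.interpolate(isi[None], size=features.shape[-2:],
                        mode="bilinear", align_corners=False)[0]
    return torch.cat([features, isi], dim=0)        # (C + 1, H, W)
```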