Page 72 in Joint Austrian Computer Vision and Robotics Workshop 2020
tasks, we provide instance segmentation predictions as additional feature input to our semantic segmentation branch. In particular, we gather predicted instance masks into an initial segmentation image (ISI), which represents a coarse semantic segmentation for things classes. In this way, we exploit a segmentation prior which increases the overall panoptic performance of our system by leveraging similarities between the two previously disjoint subtasks.
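A minimal pure-Python sketch can make the ISI idea concrete. It assumes one channel per things class and averages, at each pixel, the soft scores of the same-class instances that cover that pixel (averaging rather than a max is suggested by the comparison with IMP in Sec. 2). The name `build_isi`, the input format, and the rule of averaging only over covering instances are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Grid = List[List[float]]  # H x W soft scores in [0, 1]


def build_isi(
    instances: List[Tuple[int, Grid]],  # (things-class index, soft instance mask)
    num_thing_classes: int,
    height: int,
    width: int,
) -> List[Grid]:
    """Return one H x W channel per things class (hypothetical ISI layout)."""
    sums = [[[0.0] * width for _ in range(height)] for _ in range(num_thing_classes)]
    counts = [[[0] * width for _ in range(height)] for _ in range(num_thing_classes)]
    for cls, mask in instances:
        for y in range(height):
            for x in range(width):
                if mask[y][x] > 0.0:  # only instances covering this pixel contribute
                    sums[cls][y][x] += mask[y][x]
                    counts[cls][y][x] += 1
    # Average overlapping instances of the same class; uncovered pixels stay 0.
    return [
        [
            [sums[c][y][x] / counts[c][y][x] if counts[c][y][x] else 0.0 for x in range(width)]
            for y in range(height)
        ]
        for c in range(num_thing_classes)
    ]
```

For example, two overlapping class-0 instances with scores 0.8 and 0.4 at the same pixel yield an ISI value of about 0.6 there, while a pixel covered by no instance stays 0.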
We evaluate our method on the challenging Cityscapes dataset [4] for semantic understanding of urban street scenes using the recently introduced panoptic quality [12] metric. We provide an unbiased evaluation and compare four different approaches with an increasing level of entanglement between semantic and instance segmentation. Our experiments show that both end-to-end training and inter-task relations improve panoptic performance in practice.
2. Related Work
Fusing semantic and instance information has a rich history in computer vision [25, 26]. However, the task of panoptic segmentation was only recently formalized in [12], which also introduced the panoptic quality (PQ) metric to assess the performance of complete 2D scene segmentation in an interpretable and unified manner. This formalization and the availability of large datasets with corresponding annotations [19] motivated research on panoptic segmentation.
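For a single class, PQ is defined as $\mathrm{PQ} = \sum_{(p,g) \in TP} \mathrm{IoU}(p,g) \,/\, (|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|)$, where a predicted segment $p$ and a ground-truth segment $g$ count as a match (true positive) if their IoU exceeds 0.5; the dataset score averages PQ over classes. The pure-Python sketch below computes the per-class value, with segments represented as sets of pixel indices. It is a simplification: void and crowd regions, which the official evaluation treats specially, are ignored, and returning 0 for an empty class is our own convention.

```python
from typing import List, Set

Segment = Set[int]  # flat pixel indices belonging to one segment


def iou(a: Segment, b: Segment) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(preds: List[Segment], gts: List[Segment]) -> float:
    """PQ for one class; segments within preds (and within gts) must not overlap."""
    matched_gts = set()
    iou_sum = 0.0
    tp = 0
    for p in preds:
        for j, g in enumerate(gts):
            if j in matched_gts:
                continue
            value = iou(p, g)
            # With non-overlapping segments, IoU > 0.5 allows at most one match.
            if value > 0.5:
                matched_gts.add(j)
                iou_sum += value
                tp += 1
                break
    fp = len(preds) - tp  # unmatched predictions
    fn = len(gts) - tp    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0
```

With one match at IoU 0.75, one spurious prediction, and one missed segment, PQ is 0.75 / (1 + 0.5 + 0.5) = 0.375.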
Early approaches to panoptic segmentation use two highly specialized networks for semantic segmentation [22, 24, 3] and instance segmentation [21, 8, 17, 27] and combine their predictions heuristically [1]. In contrast, recent methods address both segmentation tasks with a single network by training a multi-task system that performs semantic and instance segmentation on top of a shared feature representation [11]. This reduces the number of parameters, the computational complexity, and the training time. To further improve the panoptic quality, newer approaches propose a differentiable fusion of semantic and instance segmentation instead of a heuristic combination. In this way, they learn to combine the individual predictions and optimize directly for the final objective in an end-to-end manner. For example, UPSNet [28] introduces a parameter-free merging technique to generate panoptic predictions using a single network.
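The contrast between heuristic and differentiable fusion can be illustrated with a pure-Python sketch loosely modeled on the idea behind UPSNet's parameter-free panoptic head: stuff logits are kept as they are, each instance gets its own channel formed by adding its mask logits to the semantic logits of its class inside its bounding box, and a per-pixel softmax over all channels yields the panoptic prediction. This is a simplified illustration, not UPSNet's implementation; the constant used to suppress instance logits outside the box, the omission of additional channels UPSNet uses (such as its unknown-class prediction), and all names are our assumptions.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (y0, x0, y1, x1), end exclusive
OUTSIDE = -1e4  # simplification: suppress an instance channel outside its box


def panoptic_logits(stuff: List[Grid], thing: List[Grid],
                    instances: List[Tuple[int, Box, Grid]]) -> List[Grid]:
    """Stack stuff channels and one channel per instance (no learnable parameters)."""
    h, w = len(stuff[0]), len(stuff[0][0])
    channels = [[row[:] for row in s] for s in stuff]
    for cls, (y0, x0, y1, x1), mask in instances:
        ch = [[OUTSIDE] * w for _ in range(h)]
        for y in range(y0, y1):
            for x in range(x0, x1):
                # semantic logit of the instance's class plus its own mask logit
                ch[y][x] = thing[cls][y][x] + mask[y][x]
        channels.append(ch)
    return channels


def softmax_argmax(channels: List[Grid]) -> Tuple[List[Grid], List[List[int]]]:
    """Per-pixel softmax (differentiable) and argmax (panoptic label at inference)."""
    h, w = len(channels[0]), len(channels[0][0])
    probs = [[[0.0] * w for _ in range(h)] for _ in channels]
    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [c[y][x] for c in channels]
            top = max(vals)
            exps = [math.exp(v - top) for v in vals]  # shift for numerical stability
            z = sum(exps)
            for k, e in enumerate(exps):
                probs[k][y][x] = e / z
            labels[y][x] = vals.index(top)
    return probs, labels
```

A label smaller than the number of stuff channels denotes a stuff class; any other label `k` denotes instance `k - len(stuff)`. Because the softmax is differentiable, a per-pixel cross-entropy on `probs` lets the panoptic objective propagate back into both branches.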
Another strategy to improve accuracy is to exploit mutual information and similarities between the semantic and instance segmentation branches. In this context, AUNet [15] incorporates region proposal information as an attention mechanism in the semantic segmentation branch. As a result, the semantic segmentation focuses more on stuff classes and less on things classes, which are eventually replaced by predicted instance masks. TASCNet [14] enforces L2 consistency between predicted semantic and instance segmentation masks to exploit mutual information. SOGNet [29] addresses the problem of overlapping instances using a scene graph representation that computes a relational embedding for each object based on geometry and appearance.
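The L2-consistency idea can be sketched in pure Python under deliberately simplified assumptions: the semantic branch's things mask is taken as the summed softmax probability of all things classes, the instance branch's things mask is the per-pixel maximum over soft instance masks already pasted to image resolution, and the consistency term is their mean squared difference. This mirrors the idea described for TASCNet but is not its exact formulation; all function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def things_from_semantic(probs: List[Grid], thing_ids: List[int]) -> Grid:
    """Per-pixel probability of 'any things class' from the semantic softmax."""
    h, w = len(probs[0]), len(probs[0][0])
    return [[sum(probs[c][y][x] for c in thing_ids) for x in range(w)] for y in range(h)]


def things_from_instances(masks: List[Grid], height: int, width: int) -> Grid:
    """Per-pixel union of soft instance masks (max over instances)."""
    return [
        [max((m[y][x] for m in masks), default=0.0) for x in range(width)]
        for y in range(height)
    ]


def l2_consistency(a: Grid, b: Grid) -> float:
    """Mean squared difference between the two things masks."""
    diffs = [(va - vb) ** 2 for ra, rb in zip(a, b) for va, vb in zip(ra, rb)]
    return sum(diffs) / len(diffs)
```

Minimizing such a term during training penalizes pixels where the semantic branch and the instance branch disagree about whether a things object is present.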
Similar to our approach, IMP [6], which was developed concurrently, uses predicted instance segmentation masks as additional input for the semantic segmentation branch. In contrast to our approach, IMP uses a different normalization technique and combines the instance masks with the max operator instead of averaging.
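The practical difference between the two combination rules shows up at pixels where several same-class instances overlap. The hypothetical helper below fuses the scores of the instances covering one pixel with either rule; treating a score of 0 as "not covering" is our assumption.

```python
from typing import List


def combine_scores(scores: List[float], mode: str) -> float:
    """Fuse the scores that overlapping same-class instances assign to one pixel."""
    covering = [s for s in scores if s > 0.0]  # instances that cover the pixel
    if not covering:
        return 0.0
    if mode == "max":  # the combination the text attributes to IMP
        return max(covering)
    if mode == "mean":  # averaging, as contrasted with IMP above
        return sum(covering) / len(covering)
    raise ValueError(f"unknown mode: {mode}")
```

For scores 0.9 and 0.3, `"max"` returns 0.9 while `"mean"` returns about 0.6: the max keeps the most confident instance's score, whereas averaging lets a weak overlapping detection pull the value down.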
3. Holistic End-to-End Panoptic Segmentation Network with Interrelations
An overview of our end-to-end trainable panoptic segmentation network with inter-task relations is shown in Figure 1. In Sec. 3.1, we first present our end-to-end trainable architecture, which combines semantic and instance segmentation predictions in a differentiable way. Then, in Sec. 3.2, we introduce our interrelations module, which provides instance segmentation predictions as additional feature input to our semantic segmentation branch.
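A common way to feed an extra prior such as the ISI into a convolutional branch is channel-wise concatenation followed by a learned convolution. The pure-Python sketch below shows this pattern with a 1x1 convolution (a per-pixel linear map over channels). Where and with which layer the interrelations module actually fuses the ISI is described in Sec. 3.2; the fusion point, the weights, and the function names here are illustrative assumptions.

```python
from typing import List

Grid = List[List[float]]
FeatureMap = List[Grid]  # C x H x W


def concat_channels(features: FeatureMap, isi: FeatureMap) -> FeatureMap:
    """Channel-wise concatenation: C_feat + C_isi output channels."""
    return features + isi


def conv1x1(x: FeatureMap, weights: List[List[float]], bias: List[float]) -> FeatureMap:
    """1x1 convolution: a per-pixel linear map from len(x) to len(weights) channels."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [bias[o] + sum(weights[o][c] * x[c][y][px] for c in range(len(x))) for px in range(w)]
            for y in range(h)
        ]
        for o in range(len(weights))
    ]
```

After concatenation, every output channel of the convolution can weigh backbone features against the instance-derived prior at each pixel, which is what allows the semantic branch to learn how much to trust the ISI.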
3.1. End-to-End Panoptic Architecture
Our network architecture builds upon Panoptic Feature Pyramid Networks [11]. Like many recent panoptic segmentation methods, this approach extends the generalized Mask R-CNN framework [8] with a semantic segmentation branch. This results in a multi-task network that predicts a dense semantic segmentation in addition to sparse instance segmentation masks. For our implementation, we use a shared ResNet-101 [9] feature extraction backbone with a Feature Pyramid Network [18] architecture to obtain combined low- and high-level features. These features serve as shared input to our semantic and instance segmentation branches, as shown in Figure 2.
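In a Feature Pyramid Network, each output level is formed by adding a 1x1 lateral projection of the corresponding backbone stage to the 2x upsampled output of the next coarser level, followed by a 3x3 convolution. The pure-Python sketch below illustrates only this top-down merge on single-channel grids: the scalar `lateral_w` is a hypothetical stand-in for the 1x1 lateral convolution, and the 3x3 smoothing convolution is omitted.

```python
from typing import Dict, List

Grid = List[List[float]]


def upsample2x(g: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling (each value becomes a 2x2 block)."""
    return [[v for v in row for _ in range(2)] for row in g for _ in range(2)]


def add(a: Grid, b: Grid) -> Grid:
    return [[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(g: Grid, s: float) -> Grid:
    return [[s * v for v in row] for row in g]


def fpn_top_down(c: Dict[int, Grid], lateral_w: float = 1.0) -> Dict[int, Grid]:
    """Merge backbone maps C_l, where level l has twice the resolution of the next level."""
    levels = sorted(c)  # e.g. [2, 3, 4, 5]
    p = {levels[-1]: scale(c[levels[-1]], lateral_w)}  # coarsest level
    for lvl, upper in zip(reversed(levels[:-1]), reversed(levels[1:])):
        # lateral connection (scalar stand-in for a 1x1 conv) + upsampled coarser level
        p[lvl] = add(scale(c[lvl], lateral_w), upsample2x(p[upper]))
    return p
```

Because each level accumulates the upsampled coarser levels, the finest output combines high-resolution low-level detail with semantically stronger high-level context, which is why both branches can share it.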
For the semantic segmentation branch, we process each stage of the feature pyramid $\{P_2, \dots, P_5\}$ by a series of upsampling modules. These modules con-
- Title: Joint Austrian Computer Vision and Robotics Workshop 2020
- Publisher: Graz University of Technology
- Place: Graz
- Date: 2020
- Language: English
- License: CC BY 4.0
- ISBN: 978-3-85125-752-6
- Dimensions: 21.0 x 29.7 cm
- Pages: 188
- Categories: Computer Science, Technology