Page - 72 - in Joint Austrian Computer Vision and Robotics Workshop 2020

tasks, we provide instance segmentation predictions as additional feature input to our semantic segmentation branch. In particular, we gather predicted instance masks into an initial segmentation image (ISI) which represents a coarse semantic segmentation for things classes. In this way, we exploit a segmentation prior which increases the overall panoptic performance of our system by leveraging similarities between the two previously disjoint subtasks.

We evaluate our method on the challenging Cityscapes dataset [4] for semantic understanding of urban street scenes using the recently introduced panoptic quality [11] metric. We provide an unbiased evaluation and compare four different approaches with an increasing level of entanglement between semantic and instance segmentation. Our experiments show that both end-to-end training and inter-task relations improve panoptic performance in practice.

2. Related Work

Fusing semantic and instance information has a rich history in computer vision [25, 26]. However, only recently [12] formalized the task of panoptic segmentation and introduced a panoptic quality (PQ) metric to assess the performance of complete 2D scene segmentation in an interpretable and unified manner. This formalization and the availability of large datasets with corresponding annotations [19] motivated research on panoptic segmentation.

Early approaches to panoptic segmentation use two highly specialized networks for semantic segmentation [22, 24, 3] and instance segmentation [21, 8, 17, 27] and combine their predictions heuristically [1]. Instead, recent methods address the two segmentation tasks with a single network by training a multi-task system that performs semantic and instance segmentation on top of a shared feature representation [11]. This reduces the number of parameters, the computational complexity, and the time required for training.
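For reference, the panoptic quality (PQ) metric mentioned above can be sketched in a few lines. The following is a simplified illustration, not the official evaluation code: it computes per-class PQ for a single image as the sum of IoUs over matched segments divided by |TP| + 0.5 |FP| + 0.5 |FN|, where a predicted and a ground-truth segment of the same class match if their IoU exceeds 0.5. The function name `panoptic_quality`, the `Segment` type, and the representation of segments as sets of flat pixel indices are assumptions made for this sketch. Handling of void and crowd regions is omitted.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: (class_id, set of flat pixel indices).
Segment = Tuple[int, Set[int]]


def panoptic_quality(pred: List[Segment], gt: List[Segment]) -> Dict[int, float]:
    """Per-class PQ for one image (simplified: no void or crowd handling)."""
    classes = {c for c, _ in pred} | {c for c, _ in gt}
    result: Dict[int, float] = {}
    for c in sorted(classes):
        preds = [px for cls, px in pred if cls == c]
        gts = [px for cls, px in gt if cls == c]
        matched_gt: Set[int] = set()
        iou_sum, tp = 0.0, 0
        for p in preds:
            for j, g in enumerate(gts):
                if j in matched_gt:
                    continue
                union = len(p | g)
                iou = len(p & g) / union if union else 0.0
                # For non-overlapping segments, IoU > 0.5 makes the match unique.
                if iou > 0.5:
                    matched_gt.add(j)
                    iou_sum += iou
                    tp += 1
                    break
        fp = len(preds) - tp  # unmatched predictions
        fn = len(gts) - tp    # unmatched ground-truth segments
        denom = tp + 0.5 * fp + 0.5 * fn
        result[c] = iou_sum / denom if denom else 0.0
    return result


# Toy example with two classes on a tiny image.
gt_segments = [(1, {0, 1, 2, 3}), (1, {4, 5, 6, 7}), (2, {8, 9})]
pred_segments = [(1, {0, 1, 2}), (1, {10, 11}), (2, {8, 9})]
pq_per_class = panoptic_quality(pred_segments, gt_segments)
mean_pq = sum(pq_per_class.values()) / len(pq_per_class)
```

In the toy example, class 1 has one match with IoU 0.75, one false positive, and one false negative, so its PQ is 0.75 / 2 = 0.375. Class 2 is matched perfectly. The overall PQ is the mean over classes.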
To improve the panoptic quality, newer approaches propose a differentiable fusion of semantic and instance segmentation instead of a heuristic combination. In this way, they learn to combine the individual predictions and optimize directly for the final objective in an end-to-end manner. For example, UPSNet [28] introduces a parameter-free merging technique to generate panoptic predictions using a single network.

Another strategy to improve accuracy is to exploit mutual information and similarities between semantic and instance segmentation network branches. In this context, AUNet [15] incorporates region proposal information as an attention mechanism in the semantic segmentation branch. In this way, the semantic segmentation focuses more on stuff classes and less on things classes, which are eventually replaced by predicted instance masks. TASCNet [14] enforces L2-consistency between predicted semantic and instance segmentation masks to exploit mutual information. SOGNet [29] addresses the overlapping issue of instances using a scene graph representation which computes a relational embedding for each object based on geometry and appearance.

Similar to our approach, IMP [6], which was developed at the same time, uses predicted instance segmentation masks as additional input for the semantic segmentation branch. Compared to our approach, IMP uses a different normalization technique and combines the instance masks using the max operator instead of averaging.

3. Holistic End-to-End Panoptic Segmentation Network with Interrelations

An overview of our end-to-end trainable panoptic segmentation network with inter-task relations is shown in Figure 1. We first present our end-to-end trainable architecture, which combines semantic and instance segmentation predictions in a differentiable way, in Sec. 3.1. Then, in Sec. 3.2, we introduce our interrelations module, which provides instance segmentation predictions as additional feature input to our semantic segmentation branch.
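The difference between averaging and max-combining instance masks noted in Section 2 can be illustrated with a small sketch. It gathers soft instance masks into one channel per things class, which is roughly the idea of an initial segmentation image (ISI). This is a hedged illustration only: the function name `build_isi`, the assumption that masks are already pasted into full-image coordinates, and the choice to average only over instances with a non-zero value at a pixel are all assumptions of this sketch. The text above does not specify the normalization actually used.

```python
from typing import List, Tuple

# Hypothetical representation: H x W soft mask in image coordinates, values in [0, 1].
Mask = List[List[float]]


def build_isi(instances: List[Tuple[int, Mask]], num_things: int,
              height: int, width: int, reduce: str = "mean") -> List[Mask]:
    """Combine per-instance soft masks into one channel per things class.

    reduce="mean" averages overlapping masks; reduce="max" takes their maximum.
    """
    if reduce not in ("mean", "max"):
        raise ValueError("reduce must be 'mean' or 'max'")
    isi = [[[0.0] * width for _ in range(height)] for _ in range(num_things)]
    for c in range(num_things):
        masks = [m for cls, m in instances if cls == c]
        for y in range(height):
            for x in range(width):
                # Assumption: only instances that cover the pixel contribute.
                vals = [m[y][x] for m in masks if m[y][x] > 0.0]
                if not vals:
                    continue  # no instance of this class here; channel stays 0
                if reduce == "mean":
                    isi[c][y][x] = sum(vals) / len(vals)
                else:
                    isi[c][y][x] = max(vals)
    return isi


# Two overlapping instances of class 0 and one instance of class 1 on a 1 x 3 image.
toy_instances = [
    (0, [[0.8, 0.6, 0.0]]),
    (0, [[0.0, 0.2, 0.4]]),
    (1, [[0.0, 0.0, 1.0]]),
]
isi_mean = build_isi(toy_instances, num_things=2, height=1, width=3, reduce="mean")
isi_max = build_isi(toy_instances, num_things=2, height=1, width=3, reduce="max")
```

At the middle pixel, where the two class-0 instances overlap, averaging gives (0.6 + 0.2) / 2 = 0.4, while the max operator keeps 0.6. In a full system, the resulting channels would be passed to the semantic segmentation branch as additional feature input.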
3.1. End-to-End Panoptic Architecture

Our network architecture builds upon Panoptic Feature Pyramid Networks [11]. Like many recent panoptic segmentation methods, this approach extends the generalized Mask R-CNN framework [8] with a semantic segmentation branch. This results in a multi-task network that predicts a dense semantic segmentation in addition to sparse instance segmentation masks. For our implementation, we use a shared ResNet-101 [9] feature extraction backbone with a Feature Pyramid Network [18] architecture to obtain combined low- and high-level features. These features serve as shared input to our semantic and instance segmentation branches, as shown in Figure 2.

For the semantic segmentation branch, we process each stage of the feature pyramid {P2, ..., P5} by a series of upsampling modules. These modules con-

Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Computer Science; Technology