Figure 2: Detailed illustration of our end-to-end
panoptic segmentation network with task interrela-
tions. We internally merge predictions from our se-
mantic and instance segmentation branches in a dif-
ferentiable way. In particular, we concatenate stuff
class predictions from our semantic segmentation
branch with things class predictions in the form of
canvas collections from our instance segmentation
branch. Our instance canvas collections can also be
transformed into an initial segmentation image (ISI)
which serves as additional feature input for our se-
mantic segmentation branch.
sists of 3×3 convolutions, batch normalization [10], ReLU [7], and 2× bilinear upsampling. Because the individual stages have different spatial dimensions, we process each stage with a different number of upsampling modules to generate H/4 × W/4 × 128 feature maps, where H and W are the input image dimensions. The resulting outputs of all stages are concatenated and processed using a final 1×1 convolution to reduce the channel dimension to the desired number of classes.
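As a minimal PyTorch sketch of this head, the following builds the described upsampling modules and stage fusion; the stage strides (4, 8, 16, 32), channel counts, and class count are our assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplingModule(nn.Module):
    """One module: 3x3 conv, batch norm, ReLU, 2x bilinear upsampling."""

    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))
        return F.interpolate(x, scale_factor=2, mode="bilinear",
                             align_corners=False)

def make_stage_path(in_ch, stride, out_ch=128):
    """Stack enough 2x modules to bring a stage from `stride` to stride 4."""
    blocks, cur = [], in_ch
    while stride > 4:
        blocks.append(UpsamplingModule(cur, out_ch))
        cur, stride = out_ch, stride // 2
    if not blocks:
        # Stride-4 stage: project channels without changing resolution.
        blocks = [nn.Conv2d(cur, out_ch, 3, padding=1, bias=False),
                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*blocks)

class SemanticHead(nn.Module):
    """Merge multi-scale stages into H/4 x W/4 class logits."""

    def __init__(self, stage_channels=(256, 256, 256, 256),
                 stage_strides=(4, 8, 16, 32), num_classes=19):
        super().__init__()
        self.paths = nn.ModuleList(
            make_stage_path(c, s)
            for c, s in zip(stage_channels, stage_strides))
        # Final 1x1 convolution reduces channels to the number of classes.
        self.classifier = nn.Conv2d(128 * len(stage_channels), num_classes, 1)

    def forward(self, stages):
        feats = [path(s) for path, s in zip(self.paths, stages)]
        return self.classifier(torch.cat(feats, dim=1))

# Example: four backbone stages for a 256x256 input.
stages = [torch.randn(1, 256, 256 // s, 256 // s) for s in (4, 8, 16, 32)]
print(SemanticHead()(stages).shape)  # torch.Size([1, 19, 64, 64])
```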
For the instance segmentation branch, we imple-
mented a Mask R-CNN [8]. We use a region pro-
posal network to detect regions of interest, perform
non-maximum suppression, execute ROI alignment, and predict 28×28 binary masks as well as class probabilities for each detected instance.
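The paper uses its own implementation; as a rough stand-in for experimentation, the torchvision reference Mask R-CNN exposes the same components (RPN, NMS, RoIAlign, a 28×28 mask head). The class count below is hypothetical.

```python
import torch
import torchvision

# Hypothetical: 8 things classes + background (not from the paper).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=9)
model.eval()
with torch.no_grad():
    preds = model([torch.rand(3, 512, 512)])
# preds[0]["boxes"], ["labels"], ["scores"], ["masks"]; torchvision pastes
# the raw 28x28 mask-head outputs back to image resolution.
```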
In order to combine the semantic and instance segmentation outputs, we use an internal differentiable fusion instead of external heuristics. For this purpose, we first select the most likely class label for each detected instance using a differentiable
$$\operatorname{soft\,argmax} = \sum_{i}^{N} \left\lfloor \frac{e^{z_i \cdot \beta}}{\sum_{k}^{N} e^{z_k \cdot \beta}} \right\rceil \cdot i \qquad (1)$$
operation [2], where N is the number of things classes, β is a large constant, and z is the predicted class logit. Using β in the exponent in combination with the round function allows us to squash all non-maximum values to zero. In this way, we approximate the non-differentiable argmax function, allowing us to backpropagate gradients.
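A minimal sketch of Eq. (1), assuming a PyTorch tensor of class logits; since round() has zero gradient almost everywhere, we pass gradients straight through the softmax, which is a common trick and our assumption, not something the paper spells out.

```python
import torch

def soft_argmax(logits: torch.Tensor, beta: float = 100.0) -> torch.Tensor:
    """Differentiable argmax over the last dimension, following Eq. (1).

    Scaling by a large beta pushes the softmax towards a one-hot vector;
    rounding squashes the remaining non-maximum entries to zero.
    """
    probs = torch.softmax(beta * logits, dim=-1)
    # Straight-through estimator: forward pass uses the rounded one-hot,
    # backward pass uses the softmax gradient (our assumption).
    one_hot = probs + (torch.round(probs) - probs).detach()
    indices = torch.arange(logits.shape[-1], dtype=logits.dtype,
                           device=logits.device)
    return (one_hot * indices).sum(dim=-1)

# Example: logits for N = 4 things classes.
print(soft_argmax(torch.tensor([0.1, 2.3, 0.7, -1.0])))  # tensor(1.)
```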
We then resize the predicted 28×28 mask logits for each detected instance according to its predicted 2D bounding box size and place them in empty canvas layers at the predicted 2D location, as shown in Figure 2 (top right). Additionally, we merge the canvas layers for regions of interest with the same class ID and high mask IoU. The resulting canvas collection from the instance segmentation branch is then concatenated with the stuff class logits of the semantic segmentation branch to generate our panoptic output, as illustrated in Figure 2 (bottom). The pixel-wise panoptic segmentation output is attained by applying a softmax layer on top of the stacked semantic and instance segmentation information. The shape of the final output is H × W × (#stuff classes + #detected instances). For stuff classes, the output is a class ID; for things classes, it is an instance ID. The corresponding class ID for each instance can be gathered from our semantic or instance segmentation output.
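A rough sketch of the canvas construction and fusion step; the function names, the integer box format, and the very negative fill value for empty canvas pixels are our assumptions.

```python
import torch
import torch.nn.functional as F

def build_canvases(mask_logits, boxes, out_hw, fill=-1e4):
    """Paste per-instance 28x28 mask logits into empty canvas layers.

    mask_logits: (N, 28, 28) logits from the mask head
    boxes:       (N, 4) integer (x1, y1, x2, y2) boxes, assumed clipped
                 to the output resolution
    Returns a (N, H, W) canvas collection, one layer per instance.
    """
    H, W = out_hw
    # A very negative fill keeps off-instance pixels near zero after softmax.
    canvases = torch.full((mask_logits.shape[0], H, W), fill,
                          dtype=mask_logits.dtype, device=mask_logits.device)
    for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        h, w = max(y2 - y1, 1), max(x2 - x1, 1)
        patch = F.interpolate(mask_logits[i][None, None], size=(h, w),
                              mode="bilinear", align_corners=False)
        canvases[i, y1:y1 + h, x1:x1 + w] = patch[0, 0]
    return canvases

def panoptic_output(stuff_logits, canvases):
    """Stack stuff logits and instance canvases, softmax across channels."""
    stacked = torch.cat([stuff_logits, canvases], dim=0)  # (C_stuff + N, H, W)
    return torch.softmax(stacked, dim=0)

# Example: 5 stuff classes, 3 detected instances on a 64x64 output.
out = panoptic_output(torch.randn(5, 64, 64),
                      build_canvases(torch.randn(3, 28, 28),
                                     torch.tensor([[0, 0, 20, 20],
                                                   [10, 5, 40, 30],
                                                   [30, 30, 64, 64]]),
                                     (64, 64)))
print(out.shape)  # torch.Size([8, 64, 64])
```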
During training, it is important to reorder the de-
tected instances to match the order of the ground
truth instances. For this purpose, we use a ground
truth instance ID lookup table. All parameters of our network are optimized jointly.
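The paper only states that a lookup table is used; a small sketch of one plausible reordering, where each detection has already been matched to a ground-truth instance ID (the matching criterion is our assumption):

```python
import torch

def reorder_to_ground_truth(canvases, matched_gt_ids, gt_order):
    """Permute canvas layers so they follow the ground-truth ordering.

    canvases:       (N, H, W) predicted instance layers
    matched_gt_ids: length-N list of ground-truth instance IDs matched
                    to each detection (matching criterion assumed)
    gt_order:       lookup table mapping a ground-truth ID to its index
    """
    target = torch.tensor([gt_order[g] for g in matched_gt_ids])
    return canvases[torch.argsort(target)]

# Example: detections matched to GT instances 7, 3, 5.
layers = torch.randn(3, 64, 64)
reordered = reorder_to_ground_truth(layers, [7, 3, 5], {3: 0, 5: 1, 7: 2})
# reordered[0] is now the detection matched to GT id 3, and so on.
```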
3.2. Inter-task Relations
Our differentiable fusion of semantic and instance
segmentation predictions allows us to join the out-
puts of our two branches internally for end-to-end
training. However, it also allows us to provide in-
stance predictions as additional feature input to our
semantic segmentation branch, as shown in Figure 3.
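One way this feedback could look in code: the canvas collection is collapsed into a single initial segmentation image (ISI) channel and concatenated with the semantic-branch input. Collapsing via a channel-wise max is our assumption; the paper only states that the ISI is derived from the canvas collection.

```python
import torch
import torch.nn.functional as F

def append_isi(features, canvases):
    """Concatenate an ISI channel to the semantic-branch input features.

    features: (C, H, W) feature maps; canvases: (N, H', W') instance layers.
    """
    isi = canvases.max(dim=0, keepdim=True).values  # (1, H', W')
    isi = F.interpolate(isi[None], size=features.shape[-2:],
                        mode="bilinear", align_corners=False)[0]
    return torch.cat([features, isi], dim=0)        # (C + 1, H, W)
```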