Joint Austrian Computer Vision and Robotics Workshop 2020
Page - 84 -
Figure 4: Qualitative results. A comparison between input data, DeepLab Xception ground truth, single-frame training, and LSTM training on the ESPNet (top to bottom). The horizontal axis represents the time steps (Frame 1 to Frame 5). Areas with inconsistent predictions are shown in detail and highlighted with green dashed boxes. Other inconsistencies are highlighted with orange boxes. The ESPNet with single-frame training (Sgl Train) produces inconsistencies in the right, the left, and the road segmentation. The ESPNet L1b predicts significantly more accurate and consistent results.

similar results. We observe that the hyper-parameter λ_incons = 10 provides a good trade-off between accuracy and consistency when using the squared difference loss function. The increase in consistency by 0.4 percentage points is noticeable when comparing the qualitative results. We set the other hyper-parameter λ_ce = 1 for all of our experiments.

Combining the findings. In order to achieve the best results with ESPNet L1b, we train the model in multiple phases. We use the squared difference inconsistency loss on correctly predicted classes with λ_incons = 10 and a 5×5 convolution inside the ConvLSTM. The quantitative results are shown at the bottom of Table 2. When training with the weighted cross-entropy loss and data augmentations as proposed in [19], the official Cityscapes server reports 60.9% mIoU on the single-frame test set. Our method reaches slightly higher accuracy and significantly better temporal consistency while using a similar number of parameters as Mehta et al. [19].

5. Conclusion

We have shown that we can improve temporal consistency and accuracy of semantic segmentation for two different single-frame architectures by adding feature propagation and a novel inconsistency loss. On the ESPNet, consistency and mIoU improve from 95.5 to 98.7% and from 45.2 to 57.9%, respectively.
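The weighting described above can be sketched as a combined objective. The following NumPy sketch is illustrative only, not the authors' implementation: it assumes per-pixel class probabilities for two consecutive frames and a binary mask of pixels that were predicted correctly; all function and variable names are hypothetical.

```python
import numpy as np

def inconsistency_loss(p_t, p_prev, correct_mask):
    """Squared-difference inconsistency term (illustrative sketch).

    p_t, p_prev  : arrays of shape (H, W, C) with per-pixel class scores
                   for the current and previous frame.
    correct_mask : array of shape (H, W); 1 where the prediction was
                   correct, 0 elsewhere (loss applied only there).
    """
    diff = (p_t - p_prev) ** 2          # squared difference per pixel and class
    per_pixel = diff.sum(axis=-1)       # aggregate over the class channels
    masked = per_pixel * correct_mask   # penalize only correctly predicted pixels
    return masked.sum() / max(correct_mask.sum(), 1)

def total_loss(ce_loss, incons_loss, lam_ce=1.0, lam_incons=10.0):
    # λ_ce = 1 and λ_incons = 10, the weights used in the experiments above
    return lam_ce * ce_loss + lam_incons * incons_loss
```

With identical predictions in both frames the inconsistency term vanishes, so only the cross-entropy term drives the update; λ_incons then controls how strongly temporal disagreement is penalized relative to per-frame accuracy.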
This is equal to a reduction of inconsistencies by 71.1%, which can be observed immediately when watching a video sequence.

Moreover, we found that it is best to forward features at a high level with a standard convolution within the ConvLSTM cell. The hyper-parameter in our novel inconsistency loss function can be used to prioritize between consistency and accuracy. We also improve consistency slightly by adding synthetic data generated by the Carla simulator.

In future experiments we are interested in comparing other methods of adding the information from past frames to the current prediction. We also need to generate synthetic data such that it contains semantics of all validation classes to increase overall consistency and accuracy.
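The 71.1% figure follows directly from the consistency numbers reported above: the share of inconsistent predictions drops from 100 − 95.5 = 4.5% to 100 − 98.7 = 1.3%, so the relative reduction is

```latex
\frac{(100 - 95.5) - (100 - 98.7)}{100 - 95.5}
  = \frac{4.5 - 1.3}{4.5}
  \approx 0.711 = 71.1\%.
```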
Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Computer Science, Engineering