Category                    Experiment            mIoU   Acc    Cons   ConsW
                            ESPNet SingleFrame    45.2   89.6   95.5   3.8
Convolution Type            ESPNet L1a Std.       46.5   89.4   97.6   5.4
                            ESPNet L1a D.S.       45.2   89.0   97.2   5.5
Position with Eq. Params.   ESPNet L1a 7×7        50.3   91.4   98.5   3.1
                            ESPNet L1b 3×3        52.0   91.5   98.7   3.2
                            ESPNet L1c 5×5        49.9   91.4   98.2   3.0
                            ESPNet L1d 9×9        50.1   91.5   98.3   2.9

Table 1: ConvLSTM on ESPNet. Results on the Cityscapes validation set. We compare the ESPNet trained with single-frame images to different ConvLSTM configurations.
sequence has annotated ground truth semantics. This allows for comparability with single frame results.
Synthetic data  Besides Cityscapes data, we also generate synthetic data with the Carla simulator [7]. In total, we create 4680 scenes with 30 frames each. We train the ESPNet L1b using different ratios between Cityscapes and Carla data. After training, we evaluate on the Cityscapes sequence validation set. The quantitative results indicate that using about 10% synthetic data slightly improves frame-to-frame consistency (Cons) from 98.4% (Cityscapes-only training) to 98.5%, while mIoU remains at 48.5%. When using more than 20% synthetic data, mIoU on the Cityscapes validation set declines significantly. We assume the reason for the decline is that only 9 of 19 semantic classes are covered by the Carla data. Nevertheless, we have shown that accurately labeled synthetic video data can improve the consistency of video semantic segmentation. For simplicity, we do not use the synthetic data set in other experiments.
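For illustration, one way such a real-to-synthetic mixing ratio could be realized is with a weighted sampler. The following PyTorch sketch is our own assumption, not the paper's implementation; dataset sizes and tensor shapes are placeholders.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Placeholder tensors standing in for Cityscapes and Carla sequence data;
# the sizes and shapes are illustrative only.
real = TensorDataset(torch.randn(2975, 3, 64, 128))
synth = TensorDataset(torch.randn(4680, 3, 64, 128))
mixed = ConcatDataset([real, synth])

synth_ratio = 0.10  # roughly 10% synthetic samples per epoch
# Uniform weight within each dataset, scaled so that the expected fraction
# of synthetic samples drawn equals synth_ratio.
weights = ([(1.0 - synth_ratio) / len(real)] * len(real)
           + [synth_ratio / len(synth)] * len(synth))

sampler = WeightedRandomSampler(weights, num_samples=len(real), replacement=True)
loader = DataLoader(mixed, batch_size=8, sampler=sampler)
```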
4.2. Feature Propagation Evaluation
First, we compare different ConvLSTM as well as inconsistency loss configurations. Finally, we combine the insights from the comparison to achieve the highest performance.
ConvLSTM on VSSNet  Training the VSSNet with ConvLSTM and inconsistency loss results in 44.6% mIoU, 89.9% Acc, 97.7% Cons, and 4.7% ConsW. The results indicate that we are able to improve accuracy and consistency significantly compared to the SSNet architecture trained with single frames, which only achieves 39.9% mIoU and 94.4% Cons. After showing these improvements on the VSSNet, we implement the following experiments on the ESPNet.
Category                    Experiment            mIoU   Acc    Cons   ConsW
Inconsistency Loss Func.    SqDiff True           48.8   90.9   98.4   3.5
                            AbsDiff True          48.6   90.9   98.6   3.5
Inconsistency λ             λ_incons = 0          49.0   90.9   98.0   3.4
                            λ_incons = 10         48.8   90.9   98.4   3.5
                            λ_incons = 100        46.3   90.4   98.6   3.7
Comb. Results ESPNet L1b    On Val. Set           57.9   93.0   98.7   2.7
                            On Test Set           60.9   -      -      -

Table 2: Top: Inconsistency Loss. We vary parameters of the loss function. Note that the inconsistency loss results cannot be compared directly to Table 1 because we only train the LSTM parameters for faster convergence. Bottom: Combined Results. The last two rows show the best results we are able to produce on the Cityscapes validation and test set by combining the insights of our experiments.
ConvLSTM configurations  We test different convolution types and positions of the ConvLSTM as proposed in Figure 2. Table 1 shows the quantitative results of this comparison in the categories Convolution Type and Position with Equal Parameters. We compare the standard convolution operation with the depth-wise separable convolution inside the ConvLSTM on the ESPNet L1a architecture. Results show that the standard convolution inside the ConvLSTM produces better results for all four metrics.
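To make the two convolution types concrete, the following PyTorch sketch contrasts a standard gate convolution with a depth-wise separable one; the channel sizes and kernel size are our own illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

in_ch, hid_ch, k = 64, 64, 3  # illustrative channel sizes and kernel size
gate_in = in_ch + hid_ch      # ConvLSTM gates see input and hidden state

# Standard convolution computing all four ConvLSTM gates at once.
std_gates = nn.Conv2d(gate_in, 4 * hid_ch, k, padding=k // 2)

# Depth-wise separable variant: a per-channel spatial filter followed by a
# 1x1 pointwise mix; same receptive field with far fewer parameters.
dws_gates = nn.Sequential(
    nn.Conv2d(gate_in, gate_in, k, padding=k // 2, groups=gate_in),
    nn.Conv2d(gate_in, 4 * hid_ch, 1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std_gates), count(dws_gates))  # e.g. 295168 vs 34304
```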
Furthermore, we evaluate the position of the ConvLSTM layer. We choose the filter size such that all experiments have a similar number of parameters for a fair comparison. This also ensures that the size of the receptive field at the layer is large enough to detect motion. The ESPNet L1b architecture clearly outperforms all other architectures in both consistency and accuracy. This suggests that it is more efficient to propagate high-level image features. Additionally, we found that the Parametric ReLU (PReLU) performs better than the tanh activation function inside the ConvLSTM. Therefore, all reported results use the PReLU activation function.
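A minimal sketch of a ConvLSTM cell with PReLU in place of tanh is shown below. The gate layout follows the standard ConvLSTM formulation; the exact cell used in the paper may differ, and the shared PReLU is a simplification of our own.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell; the gate layout follows the standard formulation."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        # PReLU in place of the usual tanh; a single shared PReLU is used
        # here purely for brevity.
        self.act = nn.PReLU()

    def forward(self, x, h, c):
        # One convolution produces all four gates; split along channels.
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * self.act(g)
        h = torch.sigmoid(o) * self.act(c)
        return h, c

cell = ConvLSTMCell(64, 64)
x = torch.randn(1, 64, 32, 64)
h = c = torch.zeros(1, 64, 32, 64)
for _ in range(4):  # unroll over a short frame sequence
    h, c = cell(x, h, c)
```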
Inconsistency loss  We test different inconsistency loss configurations on the ESPNet L1b architecture because this model delivered the best results in previous experiments. Table 2 contains the quantitative results. We only train the LSTM parameters to allow for fast comparison of multiple models. The other parameters of the model are pretrained but do not receive updates after the LSTM cell is added. Consequently, the scores are slightly lower than in Table 1. Substituting the squared difference loss inside Equation (2) with the absolute difference produces
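To illustrate the two variants, the following sketch computes a frame-to-frame inconsistency term on consecutive softmax outputs. This is our own simplification: it omits any warping or motion masking, and the exact form of Equation (2) may differ.

```python
import torch

def inconsistency_loss(pred_t, pred_prev, squared=True):
    """Penalize changes between consecutive per-pixel class probabilities.

    pred_t, pred_prev: (B, C, H, W) softmax outputs for frames t and t-1.
    A full implementation might warp pred_prev into frame t and mask truly
    moving pixels first; both steps are omitted in this sketch.
    """
    diff = pred_t - pred_prev
    per_pixel = diff.pow(2) if squared else diff.abs()
    return per_pixel.mean()

lambda_incons = 10.0  # loss weight, as in the ablation of Table 2
pred_t = torch.softmax(torch.randn(1, 19, 32, 64), dim=1)
pred_p = torch.softmax(torch.randn(1, 19, 32, 64), dim=1)
# total_loss = seg_loss + lambda_incons * inconsistency_loss(pred_t, pred_p)
print(inconsistency_loss(pred_t, pred_p, squared=False))
```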