Category                    Experiment            mIoU   Acc    Cons   ConsW
                            ESPNet SingleFrame    45.2   89.6   95.5   3.8
Convolution Type            ESPNet L1a Std.       46.5   89.4   97.6   5.4
                            ESPNet L1a D.S.       45.2   89.0   97.2   5.5
Position with Eq. Params.   ESPNet L1a 7×7        50.3   91.4   98.5   3.1
                            ESPNet L1b 3×3        52.0   91.5   98.7   3.2
                            ESPNet L1c 5×5        49.9   91.4   98.2   3.0
                            ESPNet L1d 9×9        50.1   91.5   98.3   2.9

Table 1: ConvLSTM on ESPNet. Results on the Cityscapes validation set. We compare the ESPNet trained with single-frame images to different ConvLSTM configurations.
sequence has annotated ground truth semantics. This allows for comparability with single frame results.
Synthetic data  Besides Cityscapes data, we also generate synthetic data with the Carla simulator [7]. In total, we create 4680 scenes with 30 frames each. We train the ESPNet L1b using different ratios between Cityscapes and Carla data. After training, we evaluate on the Cityscapes sequence validation set. The quantitative results indicate that using about 10% synthetic data slightly improves frame-to-frame consistency (Cons) from 98.4% (Cityscapes-only training) to 98.5%, while mIoU remains at 48.5%. When using more than 20% synthetic data, mIoU on the Cityscapes validation set declines significantly. We assume the reason for the decline is that only 9 of 19 semantic classes are covered by the Carla data. Nevertheless, we have shown that accurately labeled synthetic video data can improve the consistency of video semantic segmentation. For simplicity, we do not use the synthetic data set in other experiments.
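For illustration, one way such a real-to-synthetic mixing ratio could be realized is with a weighted sampler. The following PyTorch sketch is our own assumption, not the paper's implementation; dataset sizes and tensor shapes are placeholders.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Placeholder tensors standing in for Cityscapes and Carla sequence data;
# the sizes and shapes are illustrative only.
real = TensorDataset(torch.randn(2975, 3, 64, 128))
synth = TensorDataset(torch.randn(4680, 3, 64, 128))
mixed = ConcatDataset([real, synth])

synth_ratio = 0.10  # roughly 10% synthetic samples per epoch
# Uniform weight within each dataset, scaled so that the expected fraction
# of synthetic samples drawn equals synth_ratio.
weights = ([(1.0 - synth_ratio) / len(real)] * len(real)
           + [synth_ratio / len(synth)] * len(synth))

sampler = WeightedRandomSampler(weights, num_samples=len(real), replacement=True)
loader = DataLoader(mixed, batch_size=8, sampler=sampler)
```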
4.2. Feature Propagation Evaluation
First, we compare different ConvLSTM as well as inconsistency loss configurations. Finally, we combine the insights from the comparison to achieve the highest performance.
ConvLSTM on VSSNet  Training the VSSNet with ConvLSTM and inconsistency loss results in 44.6% mIoU, 89.9% Acc, 97.7% Cons, and 4.7% ConsW. The results indicate that we are able to improve accuracy and consistency significantly compared to the SSNet architecture trained with single frames, which only achieves 39.9% mIoU and 94.4% Cons. After showing these improvements on the VSSNet, we implement the following experiments on the ESPNet.
Category                    Experiment            mIoU   Acc    Cons   ConsW
Inconsistency Loss Func.    SqDiff True           48.8   90.9   98.4   3.5
                            AbsDiff True          48.6   90.9   98.6   3.5
Inconsistency λ             λ_incons = 0          49.0   90.9   98.0   3.4
                            λ_incons = 10         48.8   90.9   98.4   3.5
                            λ_incons = 100        46.3   90.4   98.6   3.7
Comb. Results ESPNet L1b    On Val. Set           57.9   93.0   98.7   2.7
                            On Test Set           60.9   -      -      -

Table 2: Top: Inconsistency Loss. We vary parameters of the loss function. Note that the inconsistency loss results cannot be compared directly to Table 1 because we only train the LSTM parameters for faster convergence. Bottom: Combined Results. The last two rows show the best results we are able to produce on the Cityscapes validation and test set by combining the insights of our experiments.
ConvLSTM configurations  We test different convolution types and positions of the ConvLSTM as proposed in Figure 2. Table 1 shows the quantitative results of this comparison in the categories Convolution Type and Position with Equal Parameters. We compare the standard convolution operation with the depth-wise separable convolution inside the ConvLSTM on the ESPNet L1a architecture. Results show that the standard convolution inside the ConvLSTM produces better results for all four metrics.
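To make the two convolution types concrete, the following PyTorch sketch contrasts a standard gate convolution with a depth-wise separable one; the channel sizes and kernel size are our own illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

in_ch, hid_ch, k = 64, 64, 3  # illustrative channel sizes and kernel size
gate_in = in_ch + hid_ch      # ConvLSTM gates see input and hidden state

# Standard convolution computing all four ConvLSTM gates at once.
std_gates = nn.Conv2d(gate_in, 4 * hid_ch, k, padding=k // 2)

# Depth-wise separable variant: a per-channel spatial filter followed by a
# 1x1 pointwise mix; same receptive field with far fewer parameters.
dws_gates = nn.Sequential(
    nn.Conv2d(gate_in, gate_in, k, padding=k // 2, groups=gate_in),
    nn.Conv2d(gate_in, 4 * hid_ch, 1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std_gates), count(dws_gates))  # e.g. 295168 vs 34304
```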
Furthermore, we evaluate the position of the ConvLSTM layer. We choose the filter size such that all experiments have a similar number of parameters for a fair comparison. This also ensures that the size of the receptive field at the layer is large enough to detect motion. The ESPNet L1b architecture clearly outperforms all other architectures in both consistency and accuracy. This suggests that it is more efficient to propagate high-level image features. Additionally, we found that the Parametric ReLU (PReLU) performs better than the tanh activation function inside the ConvLSTM. Therefore, all reported results use the PReLU activation function.
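A minimal sketch of a ConvLSTM cell with PReLU in place of tanh is shown below. The gate layout follows the standard ConvLSTM formulation; the exact cell used in the paper may differ, and the shared PReLU is a simplification of our own.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell; the gate layout follows the standard formulation."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        # PReLU in place of the usual tanh; a single shared PReLU is used
        # here purely for brevity.
        self.act = nn.PReLU()

    def forward(self, x, h, c):
        # One convolution produces all four gates; split along channels.
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * self.act(g)
        h = torch.sigmoid(o) * self.act(c)
        return h, c

cell = ConvLSTMCell(64, 64)
x = torch.randn(1, 64, 32, 64)
h = c = torch.zeros(1, 64, 32, 64)
for _ in range(4):  # unroll over a short frame sequence
    h, c = cell(x, h, c)
```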
Inconsistency loss  We test different inconsistency loss configurations on the ESPNet L1b architecture because this model delivered the best results in previous experiments. Table 2 contains the quantitative results. We only train the LSTM parameters to allow for fast comparison of multiple models. The other parameters of the model are pretrained but do not receive updates after the LSTM cell is added. Consequently, the scores are slightly lower than in Table 1. Substituting the squared difference loss inside Equation (2) with the absolute difference produces
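To illustrate the two variants, the following sketch computes a frame-to-frame inconsistency term on consecutive softmax outputs. This is our own simplification: it omits any warping or motion masking, and the exact form of Equation (2) may differ.

```python
import torch

def inconsistency_loss(pred_t, pred_prev, squared=True):
    """Penalize changes between consecutive per-pixel class probabilities.

    pred_t, pred_prev: (B, C, H, W) softmax outputs for frames t and t-1.
    A full implementation might warp pred_prev into frame t and mask truly
    moving pixels first; both steps are omitted in this sketch.
    """
    diff = pred_t - pred_prev
    per_pixel = diff.pow(2) if squared else diff.abs()
    return per_pixel.mean()

lambda_incons = 10.0  # loss weight, as in the ablation of Table 2
pred_t = torch.softmax(torch.randn(1, 19, 32, 64), dim=1)
pred_p = torch.softmax(torch.randn(1, 19, 32, 64), dim=1)
# total_loss = seg_loss + lambda_incons * inconsistency_loss(pred_t, pred_p)
print(inconsistency_loss(pred_t, pred_p, squared=False))
```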