Figure 4: Qualitative Results. A comparison between input data, DeepLab Xception ground truth, single frame training, and LSTM training on the ESPNet (top to bottom). The horizontal axis represents the time steps (Frame 1 to Frame 5). Areas with inconsistent predictions are shown in detail and highlighted with green dashed boxes. Other inconsistencies are highlighted with orange boxes. The ESPNet with single frame training (Sgl Train) produces inconsistencies in the right, left, and road segmentation. The ESPNet L1b predicts significantly more accurate and consistent results.
similar results. We observe that the hyper-parameter λ_incons = 10 provides a good trade-off between accuracy and consistency when using the squared difference loss function. The increase in consistency by 0.4 percentage points is noticeable when comparing the qualitative results. We set the other hyper-parameter λ_ce = 1 for all of our experiments.
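To make the weighting concrete, the combined objective can be sketched as below. This is a minimal PyTorch interpretation under stated assumptions: the squared-difference inconsistency term compares softmax outputs of consecutive frames and is masked to correctly predicted pixels; all names (combined_loss, logits_t, logits_prev) are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits_t, logits_prev, targets_t,
                  class_weights, lambda_ce=1.0, lambda_incons=10.0):
    """Weighted cross-entropy plus a squared-difference inconsistency
    term between consecutive frames, applied only where the current
    prediction is correct (a sketch, not the paper's exact code)."""
    # Standard weighted cross-entropy on the current frame.
    ce = F.cross_entropy(logits_t, targets_t, weight=class_weights)

    # Class probabilities for the current and previous frame.
    probs_t = torch.softmax(logits_t, dim=1)
    probs_prev = torch.softmax(logits_prev, dim=1)

    # Mask: pixels whose current prediction matches the ground truth,
    # so consistency is not enforced on wrong predictions.
    correct = (probs_t.argmax(dim=1) == targets_t).unsqueeze(1).float()

    # Squared difference between consecutive predictions, masked.
    incons = (correct * (probs_t - probs_prev) ** 2).mean()

    return lambda_ce * ce + lambda_incons * incons
```

Raising lambda_incons pushes the optimum toward temporally stable predictions; lowering it favors per-frame accuracy, which matches the trade-off described above.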
Combining the findings. In order to achieve the best results with ESPNet L1b, we train the model in multiple phases. We use the squared difference inconsistency loss on correctly predicted classes with λ_incons = 10 and a 5×5 convolution inside the ConvLSTM. The quantitative results are shown at the bottom of Table 2. When training with the weighted cross-entropy loss and data augmentations as proposed in [19], the official Cityscapes server reports 60.9% mIoU on the single frame test set. Our method reaches slightly higher accuracy and significantly better temporal consistency while using a similar number of parameters as Mehta et al. [19].
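For reference, a generic ConvLSTM cell with a 5×5 kernel, as used here for feature propagation, can be sketched as follows. This is a textbook formulation, not the authors' exact implementation; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell with a 5x5 kernel for propagating
    high-level features between frames (a sketch under stated
    assumptions, not the authors' exact implementation)."""
    def __init__(self, in_ch, hidden_ch, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        # One convolution computes all four gates at once.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=pad)

    def forward(self, x, state):
        h, c = state  # hidden and cell state from the previous frame
        i, f, o, g = torch.chunk(
            self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```

Compared with the common 3×3 choice, the 5×5 kernel lets the recurrent state aggregate information over a wider spatial neighborhood at a modest parameter cost.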
5. Conclusion
We have shown that we can improve temporal consistency and accuracy of semantic segmentation for two different single frame architectures by adding feature propagation and a novel inconsistency loss. On the ESPNet, consistency and mIoU improve from 95.5 to 98.7% and from 45.2 to 57.9%, respectively. This is equal to a reduction of inconsistencies by 71.1% (the inconsistent fraction drops from 4.5% to 1.3%), which can be observed immediately when watching a video sequence.
Moreover, we found that it is best to forward features at a high level with a standard convolution within the ConvLSTM cell. The hyper-parameter in our novel inconsistency loss function can be used to prioritize between consistency and accuracy. We also improve consistency slightly by adding synthetic data generated by the Carla simulator.
In future experiments we are interested in comparing other methods of adding the information from past frames to the current prediction. We also need to generate synthetic data such that it contains semantics of all validation classes to increase overall consistency and accuracy.
References
[1] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2009.