Page - 79 - in Joint Austrian Computer Vision and Robotics Workshop 2020

Image of the Page - 79 -

Text of the Page - 79 -

Frame-To-Frame ConsistentSemanticSegmentation ManuelRebol PatrickKno¨belreiter GrazUniversityofTechnology rebol@student.tugraz.at, knoebelreiter@icg.tugraz.at Abstract. In this work, we aim for temporally con- sistentsemanticsegmentationthroughout framesina video. Many semantic segmentation algorithms pro- cess images individually which leads to an inconsis- tentsceneinterpretationduetoilluminationchanges, occlusionsandothervariationsovertime. Toachieve a temporally consistent prediction, we train a con- volutional neural network (CNN) which propagates features through consecutive frames in a video us- ingaconvolutional longshort termmemory(ConvL- STM)cell. Besidesthetemporal featurepropagation, we penalize inconsistencies in our loss function. We show in our experiments that the performance im- proveswhenutilizingvideo informationcompared to single frame prediction. The mean intersection over union(mIoU)metricontheCityscapesvalidationset increases from 45.2% for the single frames to 57.9% for video data after implementing the ConvLSTM to propagate features trough time on the ESPNet. Most importantly, inconsistency decreases from 4.5% to 1.3% which is a reduction by 71.1%. Our results indicate that the added temporal information pro- ducesaframe-to-frameconsistentandmoreaccurate image understanding compared to single frame pro- cessing. 1. Introduction We address the task of semantic segmentation which assigns a semantic class for each pixel in an image. Our focus is on the computation of seman- ticsegmentationformultipleconsecutive images, re- ferredtoasframes, inavideosequence. Consecutive video frames contain similar information, because they capture a scene which only changes slightly. Therefore, the semantic segmentationofconsecutive frames is similar as long as motion between frames does not increase significantly. For example, con- sider a street scene recorded by a camera mounted Frame1 Frame2 Figure 1: Consistent Semantic Segmentation. The trained ESPNet [19] model predicts temporally inconsistent se- mantic segmentation on two consecutive frames of the Cityscapes [6] video data set (second row). The semantic segmentation is color encoded and large inconsistencies are highlighted with orange boxes. The third row shows consistent resultspredictedbyourmodel. Wereducetem- poral inconsistenciesby71%. on a vehicle in which we observe a street sign. If the framerate is largeenough,wewillobserve thestreet sign in multiple images as the vehicle passes by. In this example, the goal of this work would be to con- sistently detect the street sign as such in all frames in which the sign appears. Single frame algorithms often fail at achieving this task. In general, we aim for temporally consistent segmentation of all seman- tic classes throughout avideosequence. Many state of the art computer vision algorithms process images individually[26,17,3]andhenceare not designed for video sequences. They do not con- sider the temporal dependencies which occur when segmenting a video semantically. If single frame convolutional neural networks (CNNs) predict se- mantic segmentationonvideosequences, results can becometemporally inconsistentbecauseof illumina- 79

back to the book Joint Austrian Computer Vision and Robotics Workshop 2020"

Joint Austrian Computer Vision and Robotics Workshop 2020

Title: Joint Austrian Computer Vision and Robotics Workshop 2020
Editor: Graz University of Technology
Location: Graz
Date: 2020
Language: English
License: CC BY 4.0
ISBN: 978-3-85125-752-6
Size: 21.0 x 29.7 cm
Pages: 188
Categories: Informatik; Technik