Page - 79 - in Joint Austrian Computer Vision and Robotics Workshop 2020
Image of the Page - 79 -
Text of the Page - 79 -
Frame-To-Frame ConsistentSemanticSegmentation
ManuelRebol PatrickKno¨belreiter
GrazUniversityofTechnology
rebol@student.tugraz.at, knoebelreiter@icg.tugraz.at
Abstract. In this work, we aim for temporally con-
sistentsemanticsegmentationthroughout framesina
video. Many semantic segmentation algorithms pro-
cess images individually which leads to an inconsis-
tentsceneinterpretationduetoilluminationchanges,
occlusionsandothervariationsovertime. Toachieve
a temporally consistent prediction, we train a con-
volutional neural network (CNN) which propagates
features through consecutive frames in a video us-
ingaconvolutional longshort termmemory(ConvL-
STM)cell. Besidesthetemporal featurepropagation,
we penalize inconsistencies in our loss function. We
show in our experiments that the performance im-
proveswhenutilizingvideo informationcompared to
single frame prediction. The mean intersection over
union(mIoU)metricontheCityscapesvalidationset
increases from 45.2% for the single frames to 57.9%
for video data after implementing the ConvLSTM to
propagate features trough time on the ESPNet. Most
importantly, inconsistency decreases from 4.5% to
1.3% which is a reduction by 71.1%. Our results
indicate that the added temporal information pro-
ducesaframe-to-frameconsistentandmoreaccurate
image understanding compared to single frame pro-
cessing.
1. Introduction
We address the task of semantic segmentation
which assigns a semantic class for each pixel in an
image. Our focus is on the computation of seman-
ticsegmentationformultipleconsecutive images, re-
ferredtoasframes, inavideosequence. Consecutive
video frames contain similar information, because
they capture a scene which only changes slightly.
Therefore, the semantic segmentationofconsecutive
frames is similar as long as motion between frames
does not increase significantly. For example, con-
sider a street scene recorded by a camera mounted Frame1 Frame2
Figure 1: Consistent Semantic Segmentation. The trained
ESPNet [19] model predicts temporally inconsistent se-
mantic segmentation on two consecutive frames of the
Cityscapes [6] video data set (second row). The semantic
segmentation is color encoded and large inconsistencies
are highlighted with orange boxes. The third row shows
consistent resultspredictedbyourmodel. Wereducetem-
poral inconsistenciesby71%.
on a vehicle in which we observe a street sign. If the
framerate is largeenough,wewillobserve thestreet
sign in multiple images as the vehicle passes by. In
this example, the goal of this work would be to con-
sistently detect the street sign as such in all frames
in which the sign appears. Single frame algorithms
often fail at achieving this task. In general, we aim
for temporally consistent segmentation of all seman-
tic classes throughout avideosequence.
Many state of the art computer vision algorithms
process images individually[26,17,3]andhenceare
not designed for video sequences. They do not con-
sider the temporal dependencies which occur when
segmenting a video semantically. If single frame
convolutional neural networks (CNNs) predict se-
mantic segmentationonvideosequences, results can
becometemporally inconsistentbecauseof illumina-
79
Joint Austrian Computer Vision and Robotics Workshop 2020
- Title
- Joint Austrian Computer Vision and Robotics Workshop 2020
- Editor
- Graz University of Technology
- Location
- Graz
- Date
- 2020
- Language
- English
- License
- CC BY 4.0
- ISBN
- 978-3-85125-752-6
- Size
- 21.0 x 29.7 cm
- Pages
- 188
- Categories
- Informatik
- Technik