tion changes, occlusions and other variations. Figure 1 illustrates the differences between a temporally inconsistent prediction of video frames by a trained ESPNet [19] and our consistent model.
We address this issue by introducing methods
which alter existing single frame CNN architectures
such that their prediction accuracy benefits from having multiple frames of the same scene. Our method is designed such that it can be applied to any single frame CNN architecture. Potential applications include robotics and autonomous vehicles, where video data can be recorded easily. Since we aim for a real-life application scenario, our method does not access future frames. Instead, we only utilize information from past frames to predict the current frame. We implement our online method on the lightweight CNN architecture ESPNet. We include a recurrent neural network (RNN) layer in the ESPNet which allows past image features to be combined with current image features and thus computes consistent semantic segmentation over time. To train the parameters of our novel model for consistency, we introduce an inconsistency error term to our objective function. We verify our methods on a second architecture, which we name Semantic Segmentation Network (SSNet).
The reason for developing SSNet is to ensure that our methods do not work only on a specific CNN. We train the parameters of the two models on street scenes using supervised learning. The data is provided by the Cityscapes sequences [6] and a synthetic data set which we generate with the Carla [7] simulator. To avoid the large effort required to manually label video data, we use the pre-trained Xception model [3] to predict highly accurate video semantic segmentation.
2. Related Work
The best performances on semantic segmentation
benchmark tasks such as PASCAL VOC [8] and
Cityscapes [6] are reached by CNN architectures.
Lightweight CNN architectures [19, 12, 29, 25, 27]
have been developed to achieve high accuracy with
low computational effort. We select the highly effi-
cient ESPNet [19] as a basis for our work because
it predicts semantic segmentation in real-time while
maintaining high prediction accuracy. It uses point-
wise convolutions together with a spatial pyramid
of dilated convolutions [28]. The dilated convolu-
tions allow the network to create a large receptive
field while maintaining a shallow architecture. Although ESPNet processes images quickly and accurately, it lacks temporal consistency when predicting consecutive frames. Therefore, we extend the ESPNet and enforce video consistency.
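As a minimal sketch of this idea (not the exact ESPNet module), the block below reduces the channel count with a pointwise convolution and then applies a pyramid of dilated convolutions; the hierarchical summation before concatenation mirrors the feature fusion ESPNet uses to avoid gridding artifacts. The class name and channel arithmetic are our own simplifications.

```python
import torch
import torch.nn as nn

class DilatedPyramid(nn.Module):
    """ESP-style block sketch: a pointwise convolution reduces the channel
    count, then parallel dilated 3x3 convolutions enlarge the receptive
    field at low cost. Assumes `channels` is divisible by `branches`."""

    def __init__(self, channels, branches=4):
        super().__init__()
        d = channels // branches
        self.reduce = nn.Conv2d(channels, d, kernel_size=1)  # pointwise
        self.pyramid = nn.ModuleList(
            nn.Conv2d(d, d, kernel_size=3, padding=2 ** k, dilation=2 ** k)
            for k in range(branches)  # dilation rates 1, 2, 4, 8
        )

    def forward(self, x):
        r = self.reduce(x)
        outs = [conv(r) for conv in self.pyramid]
        # Hierarchically sum lower-dilation outputs into higher ones
        # before concatenating, so the output keeps `channels` channels.
        for k in range(1, len(outs)):
            outs[k] = outs[k] + outs[k - 1]
        return torch.cat(outs, dim=1)
```

Because the padding of each branch equals its dilation rate, every branch preserves the spatial resolution, so the branch outputs can be fused directly.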
Video Consistency Kundu et al. [16] and Siddhartha et al. [2] base their work on the traditional graph cut [14, 15] approach towards semantic segmentation. They extend the traditional 2D CRF to a 3D CRF by adding a temporal dimension, which allows them to predict temporally consistent semantic segmentation on video. Compared to our approach, additional optical flow information needs to be computed and the size of the temporal dimension must be defined in advance. This results in additional computational complexity and less flexibility when changing parameters such as the frame rate. Therefore, we decided to implement RNNs [11, 23, 4], which offer a more flexible approach towards processing video data.
RNNs are trained to learn which features of past frames are relevant for current [18, 21] or future [24, 22] frames. In general, it is not clear whether the LSTM, the GRU or any other RNN architecture is superior [5, 13, 10]. Depending on the application, one architecture might perform slightly better than the others [5]. Variations obtained by modifying the proposed architectures might work even better in some cases [13]. The work of Jozefowicz et al. [13] shows the importance of the elements inside an RNN cell.
Lu et al. [18] use the plain LSTM to associate objects in a video. To enforce a frame-to-frame consistent prediction, they use an association loss during the training of the LSTM. Similarly, we implement a ConvLSTM and an inconsistency loss to tackle semantic segmentation. We place the ConvLSTM at different image feature levels in our architecture as suggested by [22, 21].
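To make the recurrent extension concrete, the following PyTorch sketch inserts a ConvLSTM cell between a generic single-frame encoder and decoder. ConvLSTMCell, RecurrentSegNet and the encoder/decoder placeholders are illustrative names of our own, not code from the cited works, and the state handling is simplified for online, frame-by-frame inference.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell: all four gates come from one 2D convolution over
    the concatenated input and hidden feature maps."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

class RecurrentSegNet(nn.Module):
    """Single-frame encoder/decoder with a ConvLSTM fusing past features."""

    def __init__(self, encoder, decoder, feat_channels):
        super().__init__()
        self.encoder = encoder            # frame -> image features
        self.decoder = decoder            # fused features -> class scores
        self.cell = ConvLSTMCell(feat_channels, feat_channels)
        self.state = None                 # (h, c), carried across frames

    def forward(self, frame):
        feat = self.encoder(frame)
        if self.state is None:            # first frame: zero-initialised state
            zeros = torch.zeros_like(feat)
            self.state = (zeros, zeros)
        fused, self.state = self.cell(feat, self.state)
        return self.decoder(fused)
```

Since only the hidden state from past frames enters the prediction, the model never needs future frames, which matches the online setting described in the introduction.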
3. Consistent Video Semantic Segmentation
In this section, we introduce our methods towards frame-to-frame consistent semantic segmentation. We present different architectures to propagate features through time. To train the architectures for temporal consistency, we extend the cross entropy loss function with a novel inconsistency error term.
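As an illustration of how such an objective can be assembled, the sketch below adds a penalty on the difference between consecutive soft predictions to the per-frame cross entropy. The specific penalty, the function name and the weighting are assumptions for illustration, not necessarily the exact term we use.

```python
import torch.nn.functional as F

def video_segmentation_loss(logits_t, logits_prev, target_t, weight=0.5):
    """Cross entropy on the current frame plus an inconsistency penalty.
    The penalty below (mean absolute difference between consecutive
    softmax outputs) is an illustrative stand-in for the actual term."""
    ce = F.cross_entropy(logits_t, target_t)
    inconsistency = (F.softmax(logits_t, dim=1)
                     - F.softmax(logits_prev, dim=1)).abs().mean()
    return ce + weight * inconsistency
```

Note that a raw frame-to-frame difference also penalizes legitimate changes caused by scene motion, so the weight of the inconsistency term has to balance temporal stability against responsiveness.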
3.1. Temporal Feature Propagation
The propagation of image features from the past
to the current time step allows the neural network to